ec-geolink / design

Design information about the EarthCube Geolink project.

Create test corpus for person & organization script #45

Closed · amoeba closed this issue 9 years ago

amoeba commented 9 years ago

As of creating this issue, my person/org script processes a very small set of scientific metadata (64 documents). That set is relatively clean and doesn't represent the kind of data we'd expect to see in the rest of the scimeta.

amoeba commented 9 years ago

The types of matches the script should make, each of which should have a corresponding test, include:

TODO:

amoeba commented 9 years ago

First two tasks done. See 748bdf9aca8bc40a081a8ce09fe6ea2abe4fa0d9

amoeba commented 9 years ago

Switched directions here in 5de83323d290491c88c0ed7ddb8fa01d0877c7d0.

For now, the script just pulls out all of the fields relevant to resolving people and organizations, which results in a lot of duplicates. The idea is that we can then perform record linkage on that output. The current approach uses the Python package dedupe, but I still need to download the full DataONE scientific metadata corpus so I can see what we're dealing with before designing a good record linkage method.
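To make the record-linkage idea concrete, here is a minimal sketch of how dedupe could be run over the extracted rows. This is not the project's actual code: the field names (`name`, `email`, `organization`), the input file, and the 0.5 threshold are placeholders, and dedupe's field-definition syntax has changed across versions (the dict form below follows the 2.x documentation).

```python
import csv
import dedupe

# Load the extracted person/org rows, keyed by an arbitrary record id.
# "people_raw.csv" stands in for whatever the extraction step writes out.
with open("people_raw.csv") as f:
    data = {i: row for i, row in enumerate(csv.DictReader(f))}

# Tell dedupe which fields to compare when deciding if two rows are the same person.
fields = [
    {"field": "name", "type": "String"},
    {"field": "email", "type": "String"},
    {"field": "organization", "type": "String"},
]

deduper = dedupe.Dedupe(fields)
deduper.prepare_training(data)   # sample candidate pairs for labeling
dedupe.console_label(deduper)    # interactively label pairs as match / distinct
deduper.train()

# Cluster records the trained model believes refer to the same person.
clusters = deduper.partition(data, threshold=0.5)
for cluster_id, (record_ids, scores) in enumerate(clusters):
    print(cluster_id, [data[rid]["name"] for rid in record_ids])
```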

amoeba commented 9 years ago

@mbjones furnished me with a dump of all of D1 so the test corpus has expanded many times over.

The script lives at https://github.com/ec-geolink/design/tree/d1-people/data/dataone/people/link. It doesn't currently use dedupe, but it does do record linkage for records that are obviously the same. In that folder, I run the following make tasks in order to generate two LOD graphs (people and organizations). Note that these assume you have a folder of scientific metadata somewhere on your local machine.

make dump
make prune
make unique
make graph

(Pardon the poor use of make.)
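For a rough idea of what the final graph step produces, the sketch below builds a small LOD graph of people with rdflib. It is illustrative only: the real `make graph` task uses the project's own vocabulary, whereas FOAF, the example namespace, and the file names here are assumptions.

```python
import csv
from rdflib import Graph, Literal, Namespace, RDF, URIRef
from rdflib.namespace import FOAF

EX = Namespace("http://example.org/person/")   # hypothetical base URI for minted people

g = Graph()
g.bind("foaf", FOAF)

# "people_unique.csv" stands in for the deduplicated output of `make unique`.
with open("people_unique.csv") as f:
    for i, row in enumerate(csv.DictReader(f)):
        person = EX[str(i)]
        g.add((person, RDF.type, FOAF.Person))
        g.add((person, FOAF.name, Literal(row["name"])))
        if row.get("email"):
            g.add((person, FOAF.mbox, URIRef("mailto:" + row["email"])))

# Write the graph out as Turtle.
g.serialize(destination="people.ttl", format="turtle")
```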