Graphlet-AI / graphlet

PyPi module for Graphlet AI Knowledge Graph Factory
https://graphlet.ai
Apache License 2.0

Create a DBLP network with SAME_AS edges as training data for our ER model #9

Closed rjurney closed 1 year ago

rjurney commented 2 years ago

Fixes https://github.com/Graphlet-AI/graphlet/issues/10

We need a dataset to train and test the GAN ER model being created in #3. See the current README.md for a summary (quoted below). The download of the data and the parsing of the XML into type-specific JSON Lines files are now scripted. The next step is to use pandas and networkx to combine the DBLP types into a single graph, then add SAME_AS and NOT_SAME_AS edges using the labels outlined below.
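A minimal sketch of that next step, not the code in this PR. The column names (`paper_id`, `author_id`, `left`, `right`, `same`), the `AUTHORED` edge type, and the `build_graph` helper are all hypothetical, and the three small DataFrames stand in for the parsed JSON Lines files and the ER labels, whose real schemas may differ:

```python
import networkx as nx
import pandas as pd

# Hypothetical stand-ins for the type-specific JSON Lines files.
# In practice these might be loaded with pd.read_json(path, lines=True).
papers = pd.DataFrame([
    {"paper_id": "p1", "title": "Graph Methods"},
    {"paper_id": "p2", "title": "Entity Resolution"},
])
authors = pd.DataFrame([
    {"author_id": "a1", "name": "J. Smith", "paper_id": "p1"},
    {"author_id": "a2", "name": "John Smith", "paper_id": "p2"},
    {"author_id": "a3", "name": "Jane Smith", "paper_id": "p2"},
])

# Hypothetical label format: one row per author pair, with a boolean flag.
labels = pd.DataFrame([
    {"left": "a1", "right": "a2", "same": True},
    {"left": "a1", "right": "a3", "same": False},
])


def build_graph(papers, authors, labels):
    """Combine node types into one graph, then add the label edges."""
    g = nx.MultiDiGraph()
    for row in papers.itertuples(index=False):
        g.add_node(row.paper_id, type="Paper", title=row.title)
    for row in authors.itertuples(index=False):
        g.add_node(row.author_id, type="Author", name=row.name)
        # Assumed relationship type linking an author to a paper
        g.add_edge(row.author_id, row.paper_id, type="AUTHORED")
    for row in labels.itertuples(index=False):
        edge_type = "SAME_AS" if row.same else "NOT_SAME_AS"
        g.add_edge(row.left, row.right, type=edge_type)
    return g


graph = build_graph(papers, authors, labels)
```

A `MultiDiGraph` is used here so that an author pair can carry a label edge alongside any other relationship between the same two nodes. Whether the real graph should be directed or a multigraph is a design choice this sketch does not settle.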

DBLP Training Data

DBLP is a database of scholarly research in computer science.

The datasets we use are the actual DBLP data and a set of labels for entity resolution of authors.

Note that there are additional labels available as XML that we haven't parsed yet at:

Collecting and Preparing the Training Data

The DBLP XML and the 50K ER labels are downloaded, parsed, and transformed into a graph by graphlet.dblp.__main__, which is run with:

python -m graphlet.dblp