We need a dataset to train and test the GAN ER model being created in #3. See the current README.md for a summary (quoted below). Now that the download and the parsing of the XML into type-specific JSON Lines files are scripted, the next step is to use pandas and networkx to build a network that combines the DBLP types into a graph, and then to add SAME_AS and NOT_SAME_AS edges using the labels outlined below.
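As a rough illustration of that next step, here is a minimal sketch, not the project's actual code. The inline JSON Lines samples, their field names (`id`, `name`, `title`, `author_ids`, `left`, `right`, `same`), the `AUTHORED` edge type, and the `load` and `build_graph` helpers are all hypothetical; the real parsed files and label format may look different.

```python
import io

import networkx as nx
import pandas as pd

# Hypothetical samples standing in for the type-specific JSON Lines files.
AUTHORS_JSONL = """\
{"id": "a1", "name": "Jane Doe"}
{"id": "a2", "name": "J. Doe"}
{"id": "a3", "name": "John Smith"}
"""
PAPERS_JSONL = """\
{"id": "p1", "title": "Graph ER", "author_ids": ["a1", "a3"]}
{"id": "p2", "title": "GANs for ER", "author_ids": ["a2"]}
"""
# Hypothetical label format: one author pair per line with a boolean verdict.
LABELS_JSONL = """\
{"left": "a1", "right": "a2", "same": true}
{"left": "a1", "right": "a3", "same": false}
"""


def load(text: str) -> pd.DataFrame:
    """Read JSON Lines text into a DataFrame, one record per line."""
    return pd.read_json(io.StringIO(text), lines=True)


def build_graph(authors: pd.DataFrame, papers: pd.DataFrame, labels: pd.DataFrame) -> nx.Graph:
    """Combine typed records into one graph, then add the ER label edges."""
    graph = nx.Graph()
    for rec in authors.to_dict("records"):
        graph.add_node(rec["id"], type="Author", name=rec["name"])
    for rec in papers.to_dict("records"):
        graph.add_node(rec["id"], type="Paper", title=rec["title"])
        # AUTHORED is an assumed edge type linking authors to their papers.
        for author_id in rec["author_ids"]:
            graph.add_edge(author_id, rec["id"], type="AUTHORED")
    for rec in labels.to_dict("records"):
        edge_type = "SAME_AS" if rec["same"] else "NOT_SAME_AS"
        graph.add_edge(rec["left"], rec["right"], type=edge_type)
    return graph


graph = build_graph(load(AUTHORS_JSONL), load(PAPERS_JSONL), load(LABELS_JSONL))
```

A plain `nx.Graph` is enough here because the label edges connect author pairs while the authorship edges connect authors to papers, so no node pair needs more than one edge. If the real data ever needs parallel edges between the same two nodes, `nx.MultiGraph` would be the natural swap.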
DBLP Training Data
DBLP is a database of scholarly research in computer science.
The datasets we use are the actual DBLP data and a set of labels for entity resolution of authors.
Fixes https://github.com/Graphlet-AI/graphlet/issues/10
Note that there are additional labels available as XML that we haven't parsed yet at:
Collecting and Preparing the Training Data
The DBLP XML and the 50K ER labels are downloaded, parsed and transformed into a graph by graphlet.dblp.__main__ via: