Fixes https://github.com/Graphlet-AI/graphlet/issues/10

We need a dataset to train and test the GAN ER model being created in #3. See the current README.md for a summary (quoted below). The data download and the parsing of the XML into type-specific JSON Lines files are now scripted. The next step is to use pandas and networkx to build a network that combines the DBLP types into a graph, then add SAME_AS and NOT_SAME_AS edges using the labels outlined below.
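A minimal sketch of that next step, assuming the parsed records can be loaded into pandas DataFrames (for example with `pd.read_json(path, lines=True)`). The column names (`author_id`, `full_name`, `paper_id`, `title`, `label`), the `AUTHORED` edge type, the `build_graph` helper and the tiny inline records are all hypothetical placeholders, not the actual graphlet schema or DBLP data:

```python
import networkx as nx
import pandas as pd

# Hypothetical stand-ins for the type-specific JSON Lines files.
authors = pd.DataFrame(
    {"author_id": ["a1", "a2", "a3"],
     "full_name": ["A. Example", "Alice Example", "Bob Sample"]}
)
papers = pd.DataFrame(
    {"paper_id": ["p1", "p2"], "title": ["First Paper", "Second Paper"]}
)
authorship = pd.DataFrame(
    {"author_id": ["a1", "a2", "a3"], "paper_id": ["p1", "p2", "p2"]}
)
# Hypothetical ER labels: 1 means the pair is the same person, 0 means not.
labels = pd.DataFrame(
    {"author_id_1": ["a1", "a1"], "author_id_2": ["a2", "a3"], "label": [1, 0]}
)


def build_graph(authors, papers, authorship, labels):
    """Combine typed records into one graph, then add the ER label edges."""
    # A MultiGraph lets a label edge coexist with any other edge on the same pair.
    graph = nx.MultiGraph()
    for row in authors.itertuples(index=False):
        graph.add_node(row.author_id, type="author", name=row.full_name)
    for row in papers.itertuples(index=False):
        graph.add_node(row.paper_id, type="paper", title=row.title)
    for row in authorship.itertuples(index=False):
        graph.add_edge(row.author_id, row.paper_id, key="AUTHORED", type="AUTHORED")
    for row in labels.itertuples(index=False):
        edge_type = "SAME_AS" if row.label == 1 else "NOT_SAME_AS"
        graph.add_edge(row.author_id_1, row.author_id_2, key=edge_type, type=edge_type)
    return graph


graph = build_graph(authors, papers, authorship, labels)
same_as = [(u, v) for u, v, t in graph.edges(data="type") if t == "SAME_AS"]
not_same_as = [(u, v) for u, v, t in graph.edges(data="type") if t == "NOT_SAME_AS"]
```

Storing the edge type both as the edge key and as a `type` attribute is one possible design. It makes it easy to filter label edges out of the graph when building model inputs, so the labels do not leak into the features.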

DBLP Training Data

DBLP is a database of scholarly research in computer science.

The datasets we use are the actual DBLP data and a set of labels for entity resolution of authors.

DBLP Dataset is available at https://dblp.org/xml/dblp.xml.gz.
DBLP Dataset 2 by Prof. Dr. Felix Naumann, available in DBLP10k.csv, is a set of 10K labels (5K true, 5K false) for pairs of authors. We use it to train our entity resolution model.
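Because the labels are balanced between true and false pairs, a train/test split can preserve that balance by sampling within each class. The sketch below is only an illustration: the column names (`author_1`, `author_2`, `label`) and the inline rows are made up, and the real DBLP10k.csv layout may differ.

```python
import io

import pandas as pd

# Made-up rows standing in for DBLP10k.csv; the real column names may differ.
csv_text = """author_1,author_2,label
A. Example,Alice Example,1
B. Sample,Bob Sample,1
A. Example,Bob Sample,0
Alice Example,B. Sample,0
"""
pairs = pd.read_csv(io.StringIO(csv_text))

# Sample half of each class for the test set so both splits stay balanced.
test_pairs = pairs.groupby("label").sample(frac=0.5, random_state=0)
train_pairs = pairs.drop(test_pairs.index)
```

With the real file, `pd.read_csv("DBLP10k.csv")` would replace the inline text, and the sampling fraction would be a modeling choice rather than the 50% used here.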

Note that additional labels, which we haven't parsed yet, are available as XML:

Felix Naumann's DBLP Dataset 1 is available in dblp50000.xml

Collecting and Preparing the Training Data

The DBLP XML and the 50K ER labels are downloaded, parsed and transformed into a graph by graphlet.dblp.__main__, which is run via:

python -m graphlet.dblp

Create a DBLP network with SAME_AS edges as training data for our ER model #9