Section | Description |
---|---|
Installing | Installing the requirements |
Downloading | Downloading the data |
Preparing | Preparing the JSONs |
Running | Running RML |
Querying | Linked Data Fragments endpoint |
Analyzing | Knowledge Graph Applications |
Generating the COVID KG requires a few dependencies. Firstly, Python 3 should be installed with the following libraries:
SPARQLWrapper
nltk
numpy
pandas
requests
scipy
sklearn
tqdm
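Further, to run RML, you will need Java with the rmlmapper jar (referred to below as /path/to/rmlmapper.jar) and the YARRRML parser, which can be installed with: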
npm i @rmlio/yarrrml-parser -g
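Before moving on, it can help to confirm that the requirements are actually available. The following is a minimal sketch, not part of the repository: it checks the import names listed above and looks for the command-line tools on the PATH. The helper names `missing_modules` and `missing_tools` are hypothetical, and `java` is assumed to be needed only because the mapper is started with `java -jar`.

```python
import importlib.util
import shutil

# Import names of the Python libraries listed above
# (sklearn is the import name of the scikit-learn package).
REQUIRED_MODULES = ["SPARQLWrapper", "nltk", "numpy", "pandas",
                    "requests", "scipy", "sklearn", "tqdm"]
# Command-line tools used by the RML step (assumption: java is on the PATH).
REQUIRED_TOOLS = ["yarrrml-parser", "java"]


def missing_modules(names):
    """Return the top-level module names the import system cannot find."""
    return [n for n in names if importlib.util.find_spec(n) is None]


def missing_tools(names):
    """Return the executables that are not found on the PATH."""
    return [n for n in names if shutil.which(n) is None]


if __name__ == "__main__":
    print("Missing Python modules:", missing_modules(REQUIRED_MODULES))
    print("Missing tools:", missing_tools(REQUIRED_TOOLS))
```

Two empty lists mean that everything in this sketch was found.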
The dataset can be retrieved from Kaggle. After downloading the dataset, re-arrange the files to have the following directory structure:
data
|-metadata.csv
|-papers
|  |-<PAPER_1>.json
|  |-...
|  |-<PAPER_N>.json
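A quick way to verify the layout before running the scripts is sketched below. It is an illustration only: the function name `check_layout` is hypothetical, and it checks just the structure described above (a metadata.csv file and a papers directory containing JSON files).

```python
from pathlib import Path


def check_layout(data_dir):
    """Return a list of problems found in the expected data layout."""
    root = Path(data_dir)
    problems = []
    if not (root / "metadata.csv").is_file():
        problems.append("missing metadata.csv")
    papers = root / "papers"
    if not papers.is_dir():
        problems.append("missing papers directory")
    elif not any(papers.glob("*.json")):
        problems.append("no JSON files in papers")
    return problems
```

Calling `check_layout("data")` returns an empty list when the directory matches the structure above.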
Make sure you have a directory called sample with a papers directory inside it. Then run:
python3 scripts/generate_sample_data
To create a bag of words for the title, abstract and body, run:
python3 scripts/create_bow.py <INPUT_DIR> <OUTPUT_DIR>
As an example, you could run:
python3 scripts/create_bow.py sample output
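To illustrate what a bag of words per section looks like, here is a minimal sketch. It is not the implementation of create_bow.py: it uses a plain regular-expression tokenizer instead of nltk, and it assumes a CORD-19 style paper dictionary in which `metadata.title` is a string and `abstract` and `body_text` are lists of `{"text": ...}` paragraphs. The function names are hypothetical.

```python
import re
from collections import Counter

# Lowercase alphanumeric runs count as tokens in this sketch.
TOKEN = re.compile(r"[a-z0-9]+")


def bag_of_words(text):
    """Count the tokens in a string."""
    return Counter(TOKEN.findall(text.lower()))


def paper_bow(paper):
    """Build one bag of words each for title, abstract and body.

    Assumes the CORD-19 style structure described in the lead-in.
    """
    def join(section):
        # Concatenate the text of all paragraphs in one section.
        return " ".join(p.get("text", "") for p in paper.get(section, []))

    return {
        "title": bag_of_words(paper.get("metadata", {}).get("title", "")),
        "abstract": bag_of_words(join("abstract")),
        "body": bag_of_words(join("body_text")),
    }
```

Each value is a `Counter`, so it can be serialized with `json.dump` after converting it with `dict(...)`.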
Run python3 scripts/map_entities.py <INPUT_DIR> <OUTPUT_DIR> to generate several pickled dictionaries, each with the following structure: {string: URI}.
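The {string: URI} dictionaries can be written and read back with the standard pickle module. The sketch below shows the round trip; the helper names and the file name used in the usage note are hypothetical.

```python
import pickle


def save_mapping(mapping, path):
    """Write a {string: URI} dictionary to a pickle file."""
    with open(path, "wb") as fh:
        pickle.dump(mapping, fh)


def load_mapping(path):
    """Read a {string: URI} dictionary back from a pickle file."""
    with open(path, "rb") as fh:
        return pickle.load(fh)
```

For example, `save_mapping({"Belgium": "http://dbpedia.org/resource/Belgium"}, "countries.pkl")` followed by `load_mapping("countries.pkl")` returns the same dictionary. Only unpickle files you created yourself, since loading a pickle can execute arbitrary code.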
Run python3 scripts/get_db_resources.py <INPUT_DIR> <OUTPUT_DIR> to get the DBpedia N-Triples files of the known resources.
To add the information about the known resources to each paper's JSON representation, run:
python3 scripts/country_institution_json.py <INPUT_DIR> <PICKLE_DIR> <OUTPUT_DIR>
This script iterates over all files in INPUT_DIR and adds the country and institution external links to their JSON dictionaries.
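The enrichment step can be pictured as a dictionary lookup per author. The sketch below is an illustration under stated assumptions, not the script itself: it assumes CORD-19 style authors with `affiliation.institution` and `affiliation.location.country`, and the keys `institution_uri` and `country_uri` are hypothetical names chosen for this example.

```python
import copy


def add_external_links(paper, countries, institutions):
    """Return a copy of the paper with external links added per author.

    countries and institutions are {string: URI} dictionaries.
    """
    enriched = copy.deepcopy(paper)  # leave the input dictionary untouched
    for author in enriched.get("metadata", {}).get("authors", []):
        aff = author.get("affiliation") or {}
        institution = aff.get("institution")
        country = (aff.get("location") or {}).get("country")
        if institution in institutions:
            aff["institution_uri"] = institutions[institution]
        if country in countries:
            aff["country_uri"] = countries[country]
    return enriched
```

Authors whose institution or country is not in the dictionaries are left unchanged.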
To add the external links to the metadata.csv file, run:
python3 scripts/csv_transform.py <INPUT_DIR> <PICKLE_DIR> <OUTPUT_DIR>
This script adds an extra column containing the journal's DBpedia link.
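Adding a link column to a CSV file can be sketched with the standard csv module. This is not the code of csv_transform.py: the function name, the new column name `journal_uri` and the assumption that the input has a `journal` column are all illustrative.

```python
import csv


def add_journal_links(in_path, out_path, journals, column="journal_uri"):
    """Copy a CSV file and append a column with the journal's URI.

    journals is a {string: URI} dictionary; unknown journals get an
    empty value in the new column.
    """
    with open(in_path, newline="", encoding="utf-8") as src, \
            open(out_path, "w", newline="", encoding="utf-8") as dst:
        reader = csv.DictReader(src)
        fields = list(reader.fieldnames or []) + [column]
        writer = csv.DictWriter(dst, fieldnames=fields)
        writer.writeheader()
        for row in reader:
            row[column] = journals.get(row.get("journal", ""), "")
            writer.writerow(row)
```

Writing to a separate output file keeps the original metadata.csv intact, which makes the step easy to re-run.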
After preparing the JSONs, we can convert them to RDF using RML.
The scripts/loop.py script, run as python3 scripts/loop.py <INPUT_DIR> <JOBS>, shows how this transformation can be performed from Python by calling external commands:
yarrrml-parser -i rules.yml -o rules.rml.ttl
java -jar /path/to/rmlmapper.jar -m rules.rml.ttl
In this script, all JSON files from INPUT_DIR are first copied to the tmp/ folder. This is the source entry point defined by our YARRRML rules. You can change this location by changing the sources in the rules.yml file.
This conversion can be executed in parallel.
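Calling the two external commands from Python could look roughly like the sketch below. It is not the code of loop.py: the function names are hypothetical, and how loop.py actually distributes the work over jobs is not described here, so the thread pool is only one possible approach. Only the command-line options that appear in this README (-i, -o, -m) are used.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor


def build_commands(rules_yml, rml_ttl, mapper_jar, output_nt=None):
    """Return the parser and mapper commands as argument lists."""
    parse = ["yarrrml-parser", "-i", rules_yml, "-o", rml_ttl]
    mapper = ["java", "-jar", mapper_jar, "-m", rml_ttl]
    if output_nt:
        mapper += ["-o", output_nt]
    return [parse, mapper]


def run_pipeline(commands):
    """Run the commands in order; raise on the first failure."""
    for cmd in commands:
        subprocess.run(cmd, check=True)


def run_parallel(pipelines, jobs):
    """Run several independent pipelines with at most `jobs` at a time."""
    with ThreadPoolExecutor(max_workers=jobs) as pool:
        # list() forces evaluation so that exceptions are re-raised here.
        list(pool.map(run_pipeline, pipelines))
```

Threads are sufficient in this sketch because the heavy work happens in the external processes, not in Python.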
Analogously, metadata.csv and bow.json can be transformed to RDF by using the corresponding yml files in the RML folder.
yarrrml-parser -i mapping-csv.yml -o csv.rml.ttl
java -jar /path/to/rmlmapper.jar -m csv.rml.ttl -o <DIR>/metadata.nt
yarrrml-parser -i mapping-bow.yml -o bow.rml.ttl
java -jar /path/to/rmlmapper.jar -m bow.rml.ttl -o <DIR>/bow.nt
Executing all these rmlmapper commands results in a large set of .nt files. All of them were combined into one single file to represent the KG.
Simply concatenate them using the following bash command:
for i in *.nt; do [ "$i" = kg.nt ] || cat "$i" >> kg.nt; done
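The same concatenation can be sketched in Python for platforms without bash. The function name `concat_nt` is hypothetical. Unlike the shell loop, this version overwrites kg.nt instead of appending to it and never reads kg.nt as an input, so it can be re-run safely. Plain concatenation is valid because N-Triples is a line-based format, provided each input file ends with a newline.

```python
import shutil
from pathlib import Path


def concat_nt(directory, output_name="kg.nt"):
    """Concatenate all .nt files in a directory into one output file.

    Returns the number of input files that were combined.
    """
    root = Path(directory)
    target = root / output_name
    # Collect the inputs first, skipping a previous output file.
    parts = sorted(p for p in root.glob("*.nt") if p.name != output_name)
    with open(target, "wb") as out:
        for part in parts:
            with open(part, "rb") as src:
                shutil.copyfileobj(src, out)
    return len(parts)
```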
We are hosting an endpoint that can be used for querying here. The corresponding repository for this can be found here.
A paper on this work has been accepted to the resource track of ISWC 2020! Our paper will be made available soon. If you use the COVID-KG in scientific work, we would appreciate citations:
"Steenwinckel B., Vandewiele G., Rausch I., Heyvaert P., Taelman R., Colpaert P., Simoens P., Dimou A., De Turck F. and Ongenae F. Facilitating COVID-19 Meta-analysis Through a Literature Knowledge Graph. In Proc. of 19th International Semantic Web Conference (ISWC), 2-6 November 2020 (accepted)"
or
@inproceedings{covid_kg,
title={Facilitating COVID-19 Meta-analysis Through a Literature Knowledge Graph},
author={Bram Steenwinckel and Gilles Vandewiele and
Ilja Rausch and Pieter Heyvaert and
Ruben Taelman and Pieter Colpaert and Pieter Simoens and
Anastasia Dimou and Filip De Turck and
Femke Ongenae},
booktitle={Accepted in Proc. of 19th International Semantic Web Conference (ISWC)},
year={2020}
}
This has been a collaboration between a lot of people: