GillesVandewiele / COVID-KG

Creating a Knowledge Graph of 44000 COVID-19 scholarly articles
https://gillesvandewiele.github.io/COVID-KG/

Issue with the Ntriple file from Kaggle #1

Open vemonet opened 4 years ago

vemonet commented 4 years ago

First I would like to thank you for this KG and its documentation!

I tried to deploy your notebooks on my infrastructure (in JupyterLab, as the root user).

I faced issues when loading the provided N-Triples file from Kaggle: https://www.kaggle.com/group16/covid19-literature-knowledge-graph

I am not sure whether the encoding issue is due to my environment (running Ubuntu 18.04).

I found a rather clean way to solve those issues:

Or keep langString and use the English tag as the default:

find ugent-covid-kg.ttl -type f -exec sed -i "s/\^\^rdf:langString/@en/g" {} +
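For anyone without `sed` at hand, the same substitution can be sketched in Python. This is only a minimal sketch of the command above: the function names `retag_langstring` and `retag_file` are made up for illustration, and it assumes the file is UTF-8 encoded. Unlike `sed -i`, it writes to a separate output file instead of editing in place.

```python
import re
from pathlib import Path

# Matches the datatype suffix targeted by the sed command above.
LANGSTRING_SUFFIX = re.compile(r"\^\^rdf:langString")


def retag_langstring(line: str, lang: str = "en") -> str:
    """Replace every ^^rdf:langString suffix in a line with a language tag."""
    return LANGSTRING_SUFFIX.sub("@" + lang, line)


def retag_file(src: Path, dst: Path, lang: str = "en") -> None:
    """Stream src to dst line by line so a large dump is never fully in memory.

    Assumption: the input is UTF-8 encoded.
    """
    with src.open(encoding="utf-8") as fin, dst.open("w", encoding="utf-8") as fout:
        for line in fin:
            fout.write(retag_langstring(line, lang))
```

Calling `retag_file(Path("ugent-covid-kg.ttl"), Path("ugent-covid-kg.fixed.ttl"))` would produce a corrected copy; the output file name is hypothetical.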



I uploaded the Notebooks to this GitHub repository and detailed the process to download the ntriples: https://github.com/MaastrichtU-IDS/covid-kg-notebooks/#download-data

I loaded the graph in a GraphDB triplestore, it can be browsed and URI resolved using this web browser:
http://trek.semanticscience.org/describe?uri=http://idlab.github.io/covid19#ffe663e4ef5018da41f057533520b9d85ec86e18&endpoint=https://graphdb.dumontierlab.com/repositories/covid-kg

I will add a search index and [HCLS descriptive metadata](https://www.w3.org/TR/hcls-dataset/) soon if you are interested.

GillesVandewiele commented 4 years ago

Legend! Thanks for writing this out, we will try to integrate this in our pipeline so that the issue is resolved for next versions.

vemonet commented 4 years ago

Hi, I noticed that the latest version available on Kaggle seems to have solved those encoding issues, thanks!

The version 11 file is half the size (500M) of version 9 (1G).

I cannot find the MeSH keywords in the latest version (previously defined using http://idlab.github.io/covid19#paragraphEntities )

I can only find DBpedia mappings, defined using http://idlab.github.io/covid19#hasConcept

Is this expected?
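One way to check which relations a given version contains is to count the predicates in the dump. The following is a rough sketch, not the check actually used here: `count_predicates` is a hypothetical helper, and it assumes one triple per line in N-Triples form, where subject and predicate contain no whitespace.

```python
from collections import Counter
from typing import Iterable


def count_predicates(lines: Iterable[str]) -> Counter:
    """Count predicate IRIs in N-Triples lines (subject predicate object .)."""
    counts: Counter = Counter()
    for line in lines:
        line = line.strip()
        # Skip blank lines and comments.
        if not line or line.startswith("#"):
            continue
        # Subject and predicate never contain whitespace in N-Triples,
        # so the second whitespace-separated token is the predicate.
        parts = line.split(None, 2)
        if len(parts) == 3 and parts[1].startswith("<") and parts[1].endswith(">"):
            counts[parts[1][1:-1]] += 1
    return counts
```

Passing an open file handle for the dump and looking up `http://idlab.github.io/covid19#paragraphEntities` in the result shows whether that relation is still present; a missing predicate simply has a count of 0.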

GillesVandewiele commented 4 years ago

Hi, @bsteenwi made some changes to the final version to reduce the size. He did indeed remove some of the relations, but I am not sure which ones exactly...

bsteenwi commented 4 years ago

Hi, the last version of the KG does indeed miss some links. We have recreated the KG with concepts extracted from DBpedia Spotlight and tried to find correlations between papers based on these concepts.

I will update the mapping scripts in this repository, so it is easier to see which relations are available.

vemonet commented 4 years ago

Ok, we were planning to integrate your KG with the MeSH vocabulary and complementary resources (other publication KGs about COVID, drug and pathway databases, etc.). We are less interested in the DBpedia mappings (mainly due to data quality issues).

Do you plan to make the MeSH annotations available again soon? I could take a look into re-executing the code you wrote to generate them, but if you plan to put them back, that would be even better :)

A small note also: for MeSH URIs you are using HTTPS (e.g. https://id.nlm.nih.gov/mesh/D007251), while the MeSH vocabulary and prefix.cc use HTTP (http://id.nlm.nih.gov/mesh/).
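Until that is changed upstream, the scheme can be normalized when loading the data. A minimal sketch, assuming a plain string rewrite over each line is acceptable; `normalize_mesh_uri` is a hypothetical helper name:

```python
# Namespace currently used in the KG, and the form used by the MeSH
# vocabulary and prefix.cc, as noted above.
MESH_HTTPS = "https://id.nlm.nih.gov/mesh/"
MESH_HTTP = "http://id.nlm.nih.gov/mesh/"


def normalize_mesh_uri(text: str) -> str:
    """Rewrite HTTPS MeSH URIs to the HTTP namespace; leave other text as is."""
    return text.replace(MESH_HTTPS, MESH_HTTP)
```

Applied line by line to the dump, this leaves non-MeSH URIs and MeSH URIs that already use HTTP unchanged.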

Thanks!