dsi-bdi / biokg

A Knowledge Graph for Relational Learning On Biological Data

Loading BioKG in Neo4j #4

Open DimitrisAlivas opened 2 years ago

DimitrisAlivas commented 2 years ago

Hey folks,

First of all, I'd like to thank you for this contribution. Having a unified biomedical KG is an essential resource for research in this domain.

I would like to use BioKG in my work. Specifically, we would like to train a link predictor to perform the task of drug-target interaction prediction and utilise the benchmarks you so thoughtfully include, in order to compare the performance of our DTI approach vs others.

For this, I thought it would be useful to have the final BioKG data (in /data/biokg/) loaded into a Neo4j property graph store, to enable querying for specific benchmarks (for example using hyper-relations: a relation DTI with qualifier (benchmark: 'FDA') or DDI with qualifier (benchmark: 'MINERAL')). Furthermore, having BioKG as a Neo4j-ready graph could increase usability and visibility, so I plan on making it public once I manage to get it done.
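To make the qualifier idea concrete, here is a minimal Python sketch that builds a parameterised Cypher statement storing the benchmark name as a relationship property. The function name `benchmark_edge_statement`, the node property `id`, and the relationship property `benchmark` are all assumptions made for illustration; none of them come from BioKG itself.

```python
import re

def benchmark_edge_statement(relation, head, tail, benchmark):
    """Build a Cypher statement and parameters for one benchmark-tagged edge.

    Hypothetical helper: the `id` node property and the `benchmark`
    relationship property are assumed names, not part of BioKG.
    """
    # The relationship type is interpolated into the query text here,
    # so restrict it to a plain identifier before doing so.
    if not re.fullmatch(r"[A-Za-z_][A-Za-z0-9_]*", relation):
        raise ValueError(f"unsafe relationship type: {relation!r}")
    query = (
        "MATCH (h {id: $head}), (t {id: $tail}) "
        f"MERGE (h)-[r:{relation} {{benchmark: $benchmark}}]->(t)"
    )
    params = {"head": head, "tail": tail, "benchmark": benchmark}
    return query, params

# Toy identifiers, not real BioKG entities.
query, params = benchmark_edge_statement("DTI", "drug_a", "protein_x", "FDA")
```

Putting the benchmark inside the MERGE pattern keeps one edge per (head, tail, benchmark) combination, so a pair that appears in two benchmarks gets two edges instead of one overwritten property. Adding node labels to the MATCH pattern would let Neo4j use an index rather than scanning all nodes.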

The 2 issues I'm facing:

1) The number of unique entities/relations that I see after loading the .tsv data in Pandas differs from the numbers reported in the paper, so I've been looking into what could have gone wrong.
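For reference, a count of this kind can be sketched as below. It assumes the links file is a headerless, tab-separated file with three columns (head, relation, tail); that layout and the function name `count_entities_and_relations` are assumptions, and the rows are toy data rather than real BioKG triples.

```python
import io
import pandas as pd

def count_entities_and_relations(source):
    """Count distinct triples, entities and relation types in a 3-column TSV."""
    triples = pd.read_csv(
        source,
        sep="\t",
        header=None,                       # assumed: no header row
        names=["head", "relation", "tail"],
        dtype=str,
        keep_default_na=False,             # keep IDs such as "NA" as strings
    ).drop_duplicates()
    # An entity is anything that appears as a head or as a tail.
    entities = set(triples["head"]) | set(triples["tail"])
    return {
        "triples": len(triples),
        "entities": len(entities),
        "relations": triples["relation"].nunique(),
    }

# Toy rows; the second line is an exact duplicate of the first.
sample = io.StringIO(
    "drug_a\tDTI\tprotein_x\n"
    "drug_a\tDTI\tprotein_x\n"
    "drug_b\tDTI\tprotein_x\n"
    "drug_a\tDDI\tdrug_b\n"
)
stats = count_entities_and_relations(sample)
# stats == {"triples": 3, "entities": 3, "relations": 2}
```

Whether duplicates are dropped, and whether entities are counted over heads and tails together, both change the totals, so it is worth checking which convention the paper's figures follow before comparing.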

2) The way I create the Neo4j graph is as follows:

Following the logic above, everything runs smoothly up to the point where I try to load the links that include COMPLEXES + PATHWAYs, for which I cannot find any matches.

If I understand the data model correctly, complex_ids exist only as part of the LINKS file and do not appear in the properties + metadata files (?).
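One way to check this is to list, per relation type, the link endpoints that never appear among the node identifiers collected from the properties and metadata files. The sketch below assumes the links are already in a DataFrame with `head`, `relation` and `tail` columns; the function name `unmatched_endpoints`, the relation names, and the identifiers are hypothetical toy values.

```python
import pandas as pd

def unmatched_endpoints(triples, known_ids):
    """Map each relation type to the endpoint IDs missing from known_ids."""
    # Stack heads and tails into a single column of node IDs.
    endpoints = pd.concat(
        [
            triples[["relation", "head"]].rename(columns={"head": "node_id"}),
            triples[["relation", "tail"]].rename(columns={"tail": "node_id"}),
        ],
        ignore_index=True,
    )
    missing = endpoints[~endpoints["node_id"].isin(list(known_ids))]
    grouped = {}
    for relation, node_id in zip(missing["relation"], missing["node_id"]):
        grouped.setdefault(relation, set()).add(node_id)
    return {relation: sorted(ids) for relation, ids in grouped.items()}

# Toy data: node IDs gathered from the properties/metadata files (assumed).
known_ids = {"protein_x", "pathway_1"}
triples = pd.DataFrame(
    [
        ("protein_x", "PROTEIN_PATHWAY", "pathway_1"),
        ("complex_9", "COMPLEX_PATHWAY", "pathway_1"),
        ("complex_9", "COMPLEX_PATHWAY", "pathway_2"),
    ],
    columns=["head", "relation", "tail"],
)
report = unmatched_endpoints(triples, known_ids)
# report == {"COMPLEX_PATHWAY": ["complex_9", "pathway_2"]}
```

If the report only ever lists complex identifiers, that would be consistent with complexes being defined solely in the links file.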

Which identifiers are the ones that I should use to create the unique Complex nodes?

Apologies for the lengthy post and for potential inaccuracies on my end.

Minor comment: A typo I found while reading your documentation: https://github.com/dsi-bdi/biokg/blob/92a71e71fa11411f44bc9d5abfdeb0ef9ec986d4/links_description.txt#L93

The relation should be PROTEIN_DISEASE if I'm not mistaken.

Thank you again for your great contribution! I would greatly appreciate any help :-)

Cheers!

samehkamaleldin commented 2 years ago

Hi Dimitris,

Thanks a lot for the great description and the details in this issue. It has been a good year since I last touched this project, and I have changed jobs a few times since. However, I want to provide you with some support on the issues you have.

I am going to give you a very lazy answer now and probably in a few days I can look at this more carefully and give you a better answer.

In relation to issue 1, have you tried the pre-built KG in the releases section? It should be the same as in the paper. My quick guess is that this problem is caused by a change in a source dataset. I vaguely remember, for example, that DrugBank and Reactome made some changes after we published, which changed the output of our script relative to what was reported in the paper.

I did not fully understand issue 2, so I will look at it again later and try to give you an answer.

In relation to the typo, thanks for noticing that. It is a small thing, I know, but would you be kind enough to fix it and open a pull request? I will accept it immediately.

Thanks a lot. Sameh

DimitrisAlivas commented 2 years ago

Hey Sameh,

Thank you for taking the time and for your answer! I think it makes a lot of sense given how frequently many of the integrated data sources are updated (e.g. DrugBank, as you mentioned).

I followed the instructions in the repository's README to compile BioKG, to make sure it includes the latest versions of the sources. I will also check the version in the releases, as you suggest. The goal here is to get the most accurate, and hence most up-to-date, graph for our experiments.

Regarding issue 2, it's related to the semantics of pathways and complexes in relation to proteins, because they affect the way I convert the data for Neo4j. Thanks for taking some more time to look into it.

Best, Dimitrios