OpenBioLink / OpenBioLink

OpenBioLink is a resource and evaluation framework for evaluating link prediction models on heterogeneous biomedical graph data.
MIT License

Improve data formats for network download #4

Closed dhimmel closed 4 years ago

dhimmel commented 4 years ago

I wasn't able to find much documentation of the contents of HQ_DIR.zip (or the other three benchmark datasets) in the readme or in the zip archive.

The head of graph_files/nodes.csv is:

DRUG_53326634   DRUG
DRUG_11538251   DRUG
DRUG_174597 DRUG

The head of graph_files/edges.csv is:

DRUG_774        DRUG_REACTION_GENE      GENE_150        899
DRUG_5815       DRUG_REACTION_GENE      GENE_5739       899
DRUG_6022       DRUG_REACTION_GENE      GENE_101929876  989
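For anyone wanting to inspect these files, here is a minimal sketch of parsing the two excerpts above with Python's `csv` module. It assumes the files are tab-separated despite the `.csv` extension and carry no header row. The column labels (`node_id`, `node_type`, `source`, `edge_type`, `target`, `value`) are hypothetical, since the archive does not document them; in particular, the meaning of the fourth edge column is not stated in this thread.

```python
import csv
import io

# Sample rows copied from the excerpts above; a tab delimiter is assumed.
NODES_SAMPLE = "DRUG_53326634\tDRUG\nDRUG_11538251\tDRUG\nDRUG_174597\tDRUG\n"
EDGES_SAMPLE = (
    "DRUG_774\tDRUG_REACTION_GENE\tGENE_150\t899\n"
    "DRUG_5815\tDRUG_REACTION_GENE\tGENE_5739\t899\n"
    "DRUG_6022\tDRUG_REACTION_GENE\tGENE_101929876\t989\n"
)

# Hypothetical column labels: the files themselves have no header row.
NODE_COLUMNS = ["node_id", "node_type"]
EDGE_COLUMNS = ["source", "edge_type", "target", "value"]


def read_headerless_tsv(handle, columns):
    """Parse a header-less tab-separated file into a list of dicts."""
    reader = csv.DictReader(handle, fieldnames=columns, delimiter="\t")
    return list(reader)


nodes = read_headerless_tsv(io.StringIO(NODES_SAMPLE), NODE_COLUMNS)
edges = read_headerless_tsv(io.StringIO(EDGES_SAMPLE), EDGE_COLUMNS)
```

To read the real file instead of the inline sample, pass `open("graph_files/nodes.csv", newline="")` as the handle.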

Given that you want this network to become a benchmark, it's important to document every file that users may need to interact with. There are some ways to improve the situation regarding the TSVs:

Furthermore, I think the TSV format could be improved:

matthias-samwald commented 4 years ago

Comment: We chose CSV/TSV because these are the input formats that link prediction packages usually expect (JSON seems very uncommon here). I'm a bit reluctant to add more large files to the distribution. But if you think this adds a lot of value, we can add another, more explicitly structured data format. We'll also discuss internally.

cthoyt commented 4 years ago

I would like to see the nodes and edges files using CURIEs that are resolvable with Identifiers.org (see Juty et al., "Identifiers.org and MIRIAM Registry: community resources to provide persistent identification"). Besides annotating the entity types, is there any advantage of the current form for the identifiers over using CURIEs? The entity annotations seem to be present in the nodes.csv file for lookup later either way.

As far as adding several data formats: I agree with @dhimmel and would go as far as saying that adding several formats is a must.

As a downstream user of Hetionet, I just wanted to use the network that was already prepared for me. If those formats hadn't been available, I would have had to duplicate a huge amount of effort by other scientists using the same data to get it into a more useful format.

dhimmel commented 4 years ago

Besides annotating the entity types, is there any advantage of the current form for the identifiers over using CURIEs?

@cthoyt are CURIEs the format you can view at https://registry.identifiers.org/ under Sample Compact identifier? Examples include sider.drug:2244, go:0006915, reactome:R-HSA-201451, bgee.organ:EHDAA:2185.

I agree these are ideal for identifying nodes. It's especially cool that you can resolve them so easily, like https://identifiers.org/bgee.organ:EHDAA:2185. I think it'd be useful to have these as a node property, if not the primary identifier. I did notice that some resources like DrugCentral weren't in the registry. I think the nodes TSV could have extra columns to accommodate this information, like node_type, curie. Not sure how that would affect compatibility with the link prediction packages @matthias-samwald is thinking of. @matthias-samwald are you primarily thinking of PyKEEN?
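A rough sketch of what such a nodes TSV with `node_type` and `curie` columns could look like. The prefix-to-namespace mapping below is a hypothetical illustration, not something taken from OpenBioLink, and the sketch assumes the type prefix is everything before the first underscore. `OTHER_1` is a made-up ID standing in for a resource that is not in the registry.

```python
import csv
import io

# Hypothetical mapping from internal type prefixes to Identifiers.org namespaces.
# These pairings are illustrative assumptions, not taken from OpenBioLink.
PREFIX_TO_NAMESPACE = {
    "DRUG": "pubchem.compound",
    "GENE": "ncbigene",
}


def to_row(internal_id):
    """Split an internal ID like 'GENE_150' into [node_id, node_type, curie]."""
    # Assumes the type prefix is everything before the first underscore.
    node_type, _, local_id = internal_id.partition("_")
    namespace = PREFIX_TO_NAMESPACE.get(node_type)
    # Leave the CURIE empty when no registry namespace is known for the type.
    curie = f"{namespace}:{local_id}" if namespace else ""
    return [internal_id, node_type, curie]


def write_nodes_tsv(internal_ids):
    """Return the text of a nodes TSV with a header row and one row per ID."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, delimiter="\t", lineterminator="\n")
    writer.writerow(["node_id", "node_type", "curie"])
    for internal_id in internal_ids:
        writer.writerow(to_row(internal_id))
    return buffer.getvalue()


# "OTHER_1" is a made-up ID with no mapped namespace.
tsv_text = write_nodes_tsv(["DRUG_774", "GENE_150", "OTHER_1"])
```

Keeping the original identifier in the first column means tools that only read the first column (or the edge file) would be unaffected by the extra columns.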

I'm a bit reluctant to add more large files to the distribution.

One thing you could look into is using git LFS for large file storage. This would allow you to version large files as part of the repo and allow users to browse the data contents from the GitHub UI rather than having to download and extract the zip archive. Git LFS is pricey on GitHub, but my experience is that you can request an education discount which provides coupons that make it free of charge. Combined with compression, the storage should be manageable.

cthoyt commented 4 years ago

Yes @dhimmel that's right. CURIEs are awesome. Here are some examples to make it more obvious how they're formed:

Namespace    Identifier  CURIE             Identifiers.org
hgnc         6893        hgnc:6893         https://identifiers.org/hgnc:6893
hgnc.symbol  MAPT        hgnc.symbol:MAPT  https://identifiers.org/hgnc.symbol:MAPT
pubmed       28936969    pubmed:28936969   https://identifiers.org/pubmed:28936969
GO           0006915     GO:0006915        https://identifiers.org/GO:0006915

All of these namespaces can be looked up in the registry at https://registry.identifiers.org/ too. It's not a complete resource, but it's always possible to suggest more stuff! EBI is pretty good about feedback.
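The pattern in these examples can be sketched in a few lines of Python. The helper names (`make_curie`, `identifiers_org_url`, `split_curie`) are hypothetical, and the sketch only does string formatting: it does not check that a namespace actually exists in the registry.

```python
def make_curie(namespace, identifier):
    """Join a namespace and a local identifier into a CURIE."""
    return f"{namespace}:{identifier}"


def identifiers_org_url(curie):
    """Build the Identifiers.org resolver URL for a CURIE."""
    return f"https://identifiers.org/{curie}"


def split_curie(curie):
    """Split a CURIE at its first colon into (namespace, identifier)."""
    # Splitting on the first colon keeps identifiers that themselves
    # contain colons intact, e.g. 'bgee.organ:EHDAA:2185'.
    namespace, _, identifier = curie.partition(":")
    return namespace, identifier


curie = make_curie("hgnc", "6893")
url = identifiers_org_url(curie)
```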

I think internally we have PyKEEN reindex everything, so you can stick in any strings as identifiers you want
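To illustrate why the identifier format shouldn't matter to a link prediction package, here is a generic sketch of the kind of reindexing meant here: arbitrary string labels are mapped to consecutive integers before training. This is not PyKEEN's actual implementation or API; `build_index` is a hypothetical helper.

```python
def build_index(triples):
    """Map string entities and relations to consecutive integer IDs.

    Returns (entity_to_id, relation_to_id, indexed_triples). IDs are
    assigned in order of first appearance.
    """
    entity_to_id = {}
    relation_to_id = {}
    indexed = []
    for head, relation, tail in triples:
        # setdefault assigns the next free integer only for unseen labels.
        h = entity_to_id.setdefault(head, len(entity_to_id))
        r = relation_to_id.setdefault(relation, len(relation_to_id))
        t = entity_to_id.setdefault(tail, len(entity_to_id))
        indexed.append((h, r, t))
    return entity_to_id, relation_to_id, indexed


# Works the same whether the labels are internal IDs or CURIEs.
triples = [
    ("DRUG_774", "DRUG_REACTION_GENE", "GENE_150"),
    ("DRUG_5815", "DRUG_REACTION_GENE", "GENE_5739"),
]
entity_to_id, relation_to_id, indexed = build_index(triples)
```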

matthias-samwald commented 4 years ago

We have updated the data documentation and added RDF as an additional data format that can be readily imported into a wide variety of tools and graph database systems. The choice of RDF also eases the integration of OpenBioLink resources with other Linked Data resources that we and some of our collaborators have created or worked with in the past.

dhimmel commented 4 years ago

We have updated the data documentation

Confirming that I see improved documentation in the README under TSV Writer.

Switch from internal prefixes to CURIEs for main TSV file

Great to see this migration to CURIEs! I'm excited to start using them for my work. Finally a system for interoperability without too much overhead.