Improve data formats for network download

dhimmel commented 4 years ago

I wasn't able to find much documentation of the contents of HQ_DIR.zip (or the other three benchmark datasets) in the readme or in the zip archive.

The head of graph_files/nodes.csv is:

DRUG_53326634   DRUG
DRUG_11538251   DRUG
DRUG_174597     DRUG

The head of graph_files/edges.csv is:

DRUG_774        DRUG_REACTION_GENE      GENE_150        899
DRUG_5815       DRUG_REACTION_GENE      GENE_5739       899
DRUG_6022       DRUG_REACTION_GENE      GENE_101929876  989

Given that you want this network to become a benchmark, it's important to document every file that users may need to interact with. There are some ways to improve the situation regarding the TSVs:

add column headers for self-documentation
add a readme that includes the head of each file and explains what the rows correspond to
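
As a rough illustration of the header suggestion, here is a minimal sketch that reads the headerless edge rows quoted above and writes them back out with a header line. The column names `head`, `relation`, `tail`, and `score` are guesses made for this example; the files do not say what the fourth column means, which is exactly what a header row would settle.

```python
import csv
import io

# The three edge rows quoted above, tab-separated as in a TSV file.
RAW_EDGES = (
    "DRUG_774\tDRUG_REACTION_GENE\tGENE_150\t899\n"
    "DRUG_5815\tDRUG_REACTION_GENE\tGENE_5739\t899\n"
    "DRUG_6022\tDRUG_REACTION_GENE\tGENE_101929876\t989\n"
)

# Hypothetical column names: assumptions for illustration only.
EDGE_COLUMNS = ["head", "relation", "tail", "score"]


def read_edges(handle):
    """Read headerless TSV rows into dicts keyed by EDGE_COLUMNS."""
    reader = csv.DictReader(handle, fieldnames=EDGE_COLUMNS, delimiter="\t")
    return list(reader)


def write_edges_with_header(rows, handle):
    """Write the same rows back out, preceded by a header line."""
    writer = csv.DictWriter(
        handle, fieldnames=EDGE_COLUMNS, delimiter="\t", lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(rows)


edges = read_edges(io.StringIO(RAW_EDGES))
out = io.StringIO()
write_edges_with_header(edges, out)
print(out.getvalue())
```

Tools that cannot handle a header row could still skip the first line, so the two needs are not necessarily in conflict.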

Furthermore, I think the TSV format could be improved:

an extra column could indicate the source database, important to comply with upstream licenses
additional formats could be added that are more expressive. For Hetionet, we release the data in four formats: JSON, Neo4j, TSV (similar to what OpenBioLink does now), and matrix. The JSON was created using the hetnetpy package and the matrices using hetmatpy. I think at least the JSON format would be a valuable addition for OpenBioLink, as it's more self-documenting, allows everything to be stored in a single file, and allows for storing node/edge properties like source and confidence scores.
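
To make the single-file idea concrete, here is a small sketch of what a JSON export might look like. The layout (`nodes` and `edges` keys, a `properties` object with `score` and `source`) is made up for this example and is not the hetnetpy format. Treating the fourth edge column as a score is an assumption, and `source` is left as `None` because the current files do not record it.

```python
import json

# Rows based on the samples quoted above; node types are read off the
# identifier prefixes, which is an assumption for this sketch.
node_rows = [("DRUG_774", "DRUG"), ("GENE_150", "GENE")]
edge_rows = [("DRUG_774", "DRUG_REACTION_GENE", "GENE_150", "899")]


def build_graph(nodes, edges):
    """Combine node and edge rows into one JSON-serializable dict."""
    return {
        "nodes": [{"id": node_id, "type": node_type} for node_id, node_type in nodes],
        "edges": [
            {
                "head": head,
                "relation": relation,
                "tail": tail,
                # Hypothetical property names; source is unknown here.
                "properties": {"score": int(score), "source": None},
            }
            for head, relation, tail, score in edges
        ],
    }


graph = build_graph(node_rows, edge_rows)
text = json.dumps(graph, indent=2)
print(text)
```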

matthias-samwald commented 4 years ago

Comment: The CSV/TSV format was chosen to match the file formats that link prediction packages usually take as input (JSON seems very uncommon here). I'm a bit reluctant to add more large files to the distribution. But if you think this adds a lot of value, we can add another, more explicitly structured data format. We'll also discuss internally.

cthoyt commented 4 years ago

I would like to see the nodes and edges files using CURIEs that are resolvable with Identifiers.org (see Juty et al.; Identifiers.org and MIRIAM Registry: Community resources to provide persistent identification). Besides annotating the entity types, is there any advantage of the current form for the identifiers over using CURIEs? The entity annotations seem to be present in the nodes.csv file for lookup later either way.

As far as adding several data formats: I agree with @dhimmel and would go as far as saying that adding several formats is a must.

As a downstream user of Hetionet, I just wanted to use the network that was already prepared for me. If those formats weren't available, I would have had to duplicate a huge amount of effort by other scientists using the same data to get it into a more useful format.

dhimmel commented 4 years ago

> Besides annotating the entity types, is there any advantage of the current form for the identifiers over using CURIEs?

@cthoyt are CURIEs the format you can view at https://registry.identifiers.org/ under Sample Compact identifier? Examples include sider.drug:2244, go:0006915, reactome:R-HSA-201451, bgee.organ:EHDAA:2185.

I agree these are ideal for identifying nodes. Especially cool that you can resolve them so easily, like https://identifiers.org/bgee.organ:EHDAA:2185. I think it'd be useful to have these as a node property, if not the primary identifier. I did notice that some resources like DrugCentral weren't in the registry. I think the nodes TSV could have extra columns to accommodate this information, like node_type, curie. Not sure how that would affect compatibility with the link prediction packages @matthias-samwald is thinking of. @matthias-samwald are you primarily thinking of PyKEEN?
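
The extra-column idea could look roughly like the sketch below, which derives `node_type` and `curie` columns from the current internal identifiers. The prefix mapping (`DRUG` to `pubchem.compound`, `GENE` to `ncbigene`) is an assumption for illustration, since the files do not say which databases the numeric identifiers come from. `UNMAPPED_1` is a made-up identifier standing in for a resource that is not in the registry.

```python
# Hypothetical mapping from internal prefixes to Identifiers.org namespaces.
PREFIX_TO_NAMESPACE = {
    "DRUG": "pubchem.compound",
    "GENE": "ncbigene",
}


def to_node_row(internal_id):
    """Split e.g. 'DRUG_774' into (internal_id, node_type, curie)."""
    node_type, _, local_id = internal_id.partition("_")
    namespace = PREFIX_TO_NAMESPACE.get(node_type)
    # Leave the curie column empty when no registry namespace is known.
    curie = f"{namespace}:{local_id}" if namespace else ""
    return internal_id, node_type, curie


rows = [to_node_row(i) for i in ["DRUG_774", "GENE_150", "UNMAPPED_1"]]
for row in rows:
    print("\t".join(row))
```

Keeping the internal identifier in the first column means existing consumers that read only that column would be unaffected.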

> I'm a bit reluctant to add more large files to the distribution.

One thing you could look into is using Git LFS (Large File Storage). This would allow you to version large files as part of the repo and allow users to browse the data contents from the GitHub UI rather than having to download and extract the zip archive. Git LFS is pricey on GitHub, but my experience is that you can request an education discount which provides coupons that make it free of charge. Combined with compression, the storage should be manageable.

cthoyt commented 4 years ago

Yes @dhimmel that's right. CURIEs are awesome. Here are some examples to make it more obvious how they're formed:

Namespace	Identifier	CURIE	Identifiers.org
hgnc	6893	hgnc:6893	https://identifiers.org/hgnc:6893
hgnc.symbol	MAPT	hgnc.symbol:MAPT	https://identifiers.org/hgnc.symbol:MAPT
pubmed	28936969	pubmed:28936969	https://identifiers.org/pubmed:28936969
GO	0006915	GO:0006915	https://identifiers.org/GO:0006915

All of these namespaces can be looked up in the registry at https://registry.identifiers.org/ too. It's not a complete resource, but it's always possible to suggest more stuff! EBI is pretty good about feedback.
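
For readers who want to handle these programmatically, here is a minimal sketch of splitting a CURIE and building its Identifiers.org URL. The function names are made up for this example. It splits on the first colon only, because identifiers such as `EHDAA:2185` contain a colon themselves; it does not check whether the namespace actually exists in the registry.

```python
IDENTIFIERS_ORG = "https://identifiers.org/"


def split_curie(curie):
    """Split 'namespace:identifier' at the first colon."""
    namespace, sep, identifier = curie.partition(":")
    if not sep or not namespace or not identifier:
        raise ValueError(f"not a CURIE: {curie!r}")
    return namespace, identifier


def to_url(curie):
    """Build the resolvable Identifiers.org URL for a CURIE."""
    split_curie(curie)  # raises ValueError if the string is malformed
    return IDENTIFIERS_ORG + curie


for curie in ["hgnc:6893", "hgnc.symbol:MAPT", "bgee.organ:EHDAA:2185"]:
    print(split_curie(curie), to_url(curie))
```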

I think internally we have PyKEEN reindex everything, so you can use any strings you want as identifiers.
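
To show why arbitrary strings are not a problem, here is a generic sketch of the kind of reindexing a link prediction package might do. This is plain Python written for this example, not PyKEEN's actual code or API: every distinct entity and relation string gets an integer index in order of first appearance.

```python
def build_index(triples):
    """Map string triples to integer triples, indexing labels on first sight."""
    entity_to_id = {}
    relation_to_id = {}
    indexed = []
    for head, relation, tail in triples:
        # setdefault returns the existing index or stores the next free one.
        h = entity_to_id.setdefault(head, len(entity_to_id))
        r = relation_to_id.setdefault(relation, len(relation_to_id))
        t = entity_to_id.setdefault(tail, len(entity_to_id))
        indexed.append((h, r, t))
    return entity_to_id, relation_to_id, indexed


# Edge rows quoted earlier in this issue, without the fourth column.
triples = [
    ("DRUG_774", "DRUG_REACTION_GENE", "GENE_150"),
    ("DRUG_5815", "DRUG_REACTION_GENE", "GENE_5739"),
]
entity_to_id, relation_to_id, indexed = build_index(triples)
print(indexed)
```

Since only the integer indices reach the model, switching the strings from internal prefixes to CURIEs would not change the result of this step.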

matthias-samwald commented 4 years ago

[x] Improve documentation of TSV file columns in README file (adding a header row might be problematic for some standard link prediction packages I think)
[x] Switch from internal prefixes to CURIEs for main TSV file
[x] Additional download option: RDF serialized as N3, using Identifiers.org URIs
[x] Add data source as fifth column in TSV file
[x] Update documentation of file formats in README
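
As a rough sketch of how a five-column TSV row with CURIEs might map onto an RDF statement with Identifiers.org URIs, consider the example below. Everything specific here is an assumption: the column order, the sample row, the placeholder source name `EXAMPLE_DB`, and the `https://example.org/relation/` base URI for predicates. It is not the project's actual RDF writer. Each row becomes one line in the simple `<subject> <predicate> <object> .` form, which N3 accepts.

```python
IDENTIFIERS_ORG = "https://identifiers.org/"
# Hypothetical base URI for relation types, for illustration only.
RELATION_BASE = "https://example.org/relation/"


def row_to_triple(row):
    """Turn (head, relation, tail, score, source) into one RDF statement.

    The score and source columns are ignored in this sketch.
    """
    head, relation, tail, _score, _source = row
    return (
        f"<{IDENTIFIERS_ORG}{head}> "
        f"<{RELATION_BASE}{relation}> "
        f"<{IDENTIFIERS_ORG}{tail}> ."
    )


# Made-up sample row: CURIEs in columns 1 and 3, data source in column 5.
row = ("pubchem.compound:774", "DRUG_REACTION_GENE", "ncbigene:150", "899", "EXAMPLE_DB")
triple = row_to_triple(row)
print(triple)
```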

matthias-samwald commented 4 years ago

→ We have updated the data documentation and added RDF as an additional data format that can be readily imported into a wide variety of tools and graph database systems. The choice of RDF also eases the integration of OpenBioLink resources with other Linked Data resources that we and some of our collaborators have created or worked with in the past.

dhimmel commented 4 years ago

> We have updated the data documentation

Confirming that I see improved documentation in the README under TSV Writer.

> Switch from internal prefixes to CURIEs for main TSV file

Great to see this migration to CURIEs! I'm excited to start using them for my work. Finally a system for interoperability without too much overhead.

OpenBioLink / OpenBioLink

Improve data formats for network download #4