Closed — dhimmel closed this issue 4 years ago
Comment: The CSV/TSV file format choice is oriented toward the file formats that link prediction packages usually take as input (JSON seems very uncommon here). I'm a bit reluctant to add more large files to the distribution. But if you think this adds a lot of value, we can add another, more explicitly structured data format. We'll also discuss internally.
I would like to see the nodes and edges files using CURIEs that are resolvable with Identifiers.org (see Juty et al., "Identifiers.org and MIRIAM Registry: community resources to provide persistent identification"). Besides annotating the entity types, is there any advantage of the current form of the identifiers over using CURIEs? The entity type annotations seem to be present in the nodes.csv file for lookup later either way.
As far as adding several data formats: I agree with @dhimmel and would go as far as saying that adding several formats is a must.
As a downstream user of Hetionet, I just wanted to use the network that was already prepared for me. If those files weren't available, I would have had to duplicate the huge efforts of other scientists using the same data to get it into a more useful format.
> Besides annotating the entity types, is there any advantage of the current form for the identifiers over using CURIEs?
@cthoyt are CURIEs the format you can view at https://registry.identifiers.org/ under "Sample Compact identifier"? Examples include `sider.drug:2244`, `go:0006915`, `reactome:R-HSA-201451`, and `bgee.organ:EHDAA:2185`.
I agree these are ideal for identifying nodes. It's especially cool that you can resolve them so easily, like https://identifiers.org/bgee.organ:EHDAA:2185. I think it'd be useful to have these as a node property, if not the primary identifier. I did notice that some resources like DrugCentral weren't in the registry. I think the nodes TSV could have extra columns to accommodate this information, like `node_type` and `curie`. Not sure how that would affect compatibility with the link prediction packages @matthias-samwald is thinking of. @matthias-samwald are you primarily thinking of pyKEEN?
> I'm a bit reluctant to add more large files to the distribution.
One thing you could look into is using git LFS for large file storage. This would allow you to version large files as part of the repo and allow users to browse the data contents from the GitHub UI rather than having to download and extract the zip archive. Git LFS is pricey on GitHub, but my experience is that you can request an education discount which provides coupons that make it free of charge. Combined with compression, the storage should be manageable.
Yes @dhimmel that's right. CURIEs are awesome. Here are some examples to make it more obvious how they're formed:
Namespace | Identifier | CURIE | Identifiers.org |
---|---|---|---|
hgnc | 6893 | hgnc:6893 | https://identifiers.org/hgnc:6893 |
hgnc.symbol | MAPT | hgnc.symbol:MAPT | https://identifiers.org/hgnc.symbol:MAPT |
pubmed | 28936969 | pubmed:28936969 | https://identifiers.org/pubmed:28936969 |
GO | 0006915 | GO:0006915 | https://identifiers.org/GO:0006915 |
All of these namespaces can be looked up in the registry at https://registry.identifiers.org/ too. It's not a complete resource, but it's always possible to suggest more stuff! EBI is pretty good about feedback.
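The pattern in the table can be sketched in a few lines: a CURIE is `namespace:identifier`, and the resolver URL is simply the CURIE appended to `https://identifiers.org/`. Splitting on the first colon only matters for namespaces like `bgee.organ`, whose local identifiers themselves contain a colon (e.g. `EHDAA:2185`). The helper names below are illustrative, not from any particular library.

```python
def parse_curie(curie: str) -> tuple[str, str]:
    # Split on the FIRST colon only: the local identifier may itself
    # contain colons, as in bgee.organ:EHDAA:2185.
    namespace, identifier = curie.split(":", 1)
    return namespace, identifier

def resolver_url(curie: str) -> str:
    # Identifiers.org resolves the bare CURIE appended to its base URL.
    return f"https://identifiers.org/{curie}"

print(parse_curie("hgnc.symbol:MAPT"))       # ('hgnc.symbol', 'MAPT')
print(parse_curie("bgee.organ:EHDAA:2185"))  # ('bgee.organ', 'EHDAA:2185')
print(resolver_url("GO:0006915"))            # https://identifiers.org/GO:0006915
```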
I think internally we have PyKEEN reindex everything, so you can stick in any strings as identifiers that you want.
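That reindexing idea can be sketched as follows: arbitrary string identifiers (CURIEs or anything else) are mapped to contiguous integer indices before training. This is a hedged illustration of the concept, not PyKEEN's actual API; the function name and return shape are invented here.

```python
def reindex(triples):
    """Map string (head, relation, tail) triples to integer index triples."""
    entity_to_id, relation_to_id = {}, {}
    indexed = []
    for head, relation, tail in triples:
        # Assign the next free index to any entity/relation not yet seen.
        for entity in (head, tail):
            entity_to_id.setdefault(entity, len(entity_to_id))
        relation_to_id.setdefault(relation, len(relation_to_id))
        indexed.append(
            (entity_to_id[head], relation_to_id[relation], entity_to_id[tail])
        )
    return indexed, entity_to_id, relation_to_id

triples = [
    ("hgnc:6893", "participates_in", "GO:0006915"),
    ("hgnc:6893", "mentioned_in", "pubmed:28936969"),
]
indexed, e2i, r2i = reindex(triples)
print(indexed)  # [(0, 0, 1), (0, 1, 2)]
```

Because the mapping is built on the fly, the identifier format (internal prefixes vs. CURIEs) is irrelevant to the embedding model itself.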
→ We have updated the data documentation and added RDF as an additional data format that can be readily imported into a wide variety of tools and graph database systems. The choice of RDF also eases the integration of OpenBioLink resources with other Linked Data resources that we and some of our collaborators have created or worked with in the past.
> We have updated the data documentation
Confirming that I see improved documentation in the README under TSV Writer.
Switch from internal prefixes to CURIEs for main TSV file
Great to see this migration to CURIEs! I'm excited to start using them for my work. Finally a system for interoperability without too much overhead.
I wasn't able to find much documentation of the contents of `HQ_DIR.zip` (or the other three benchmark datasets) in the readme or in the zip archive. The head of `graph_files/nodes.csv` is:

The head of `graph_files/edges.csv` is:

Given that you want this network to become a benchmark, it's important to document every file that users may need to interact with. There are some ways to improve the situation regarding the TSVs:
Furthermore, I think the TSV format could be improved: