RTXteam / RTX-KG2

Build system for the RTX-KG2 biomedical knowledge graph, part of the ARAX reasoning system (https://github.com/RTXTeam/RTX)
MIT License

Add data description and provenance for "SMPDB PubMed annotations" (`SMPDB_pubmed_IDs.csv`) #366

Open d33bs opened 5 months ago

d33bs commented 5 months ago

This issue highlights the need to provide a description and provenance for `SMPDB_pubmed_IDs.csv`, a file that appears to be required for a full workflow run of the RTX-KG2 pipeline. I believe it is referenced in the RTX-KG2 article in Table 6, row 4, and in the Acknowledgements section: "We thank David Wishart and Carin Li for providing a download link for the SMPDB PubMed annotations ...". The file could benefit from being added to the list of data sources, along with how it was generated (any additional data sources or code) and how it may be requested or used (for example, whether any specific licensing applies). Apologies in advance if I misunderstand the nature of this data as it pertains to RTX-KG2.

saramsey commented 4 months ago

Hi @d33bs, thank you for pointing this out. The use of `SMPDB_pubmed_IDs.csv` is indeed not ideal from the standpoints of reproducibility and transparency. I can get you a copy of that file if you like (reach out to me by email and I will set it up). Your suggestion to better document what this file is and how it can be obtained is a good one; we will add that information to the RTX-KG2 documentation.

d33bs commented 4 months ago

Thank you @saramsey !