ecohealthalliance / HP3

Repository for Host-Pathogen Phylogeny Project. Paper DOI: 10.1038/nature22975
https://dx.doi.org/10.1038/nature22975
MIT License

suggest to introduce direct link between associations.csv and references. #15

Closed jhpoelen closed 4 years ago

jhpoelen commented 6 years ago

Hi @arw36 - I took some time to review the data files associations.csv, references.txt, hosts.csv and viruses.csv. In associations.csv, short citations are used (e.g., Alexander et al. 1994), but only full citations can be found in references.txt. I'd like to suggest applying a naming scheme similar to the one used for virus/host names, so that associations.csv explicitly links to entries in references.txt.

Perhaps something like a two column references.csv:

citationkey citation
alexander_et_al_1994 Alexander KA (1994). Serologic survey of selected canine pathogens among free-ranging jackals in Kenya. Journal of wildlife diseases 30:486-491.

where alexander_et_al_1994 would be mentioned in the "Reference" column of associations.csv .
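A minimal sketch of how such a key could be derived and used to link the two files. The helper names `citation_key` and `link_references`, the `Virus` and `Host` column names, and the placeholder rows are assumptions for illustration only; they are not taken from the HP3 files.

```python
import re


def citation_key(short_citation):
    """Turn 'Alexander et al. 1994' into 'alexander_et_al_1994'."""
    # Lowercase, then collapse each run of non-alphanumeric characters
    # into a single underscore.
    key = re.sub(r"[^a-z0-9]+", "_", short_citation.lower())
    return key.strip("_")


# Proposed two-column references table, keyed by citation key.
references = {
    "alexander_et_al_1994": (
        "Alexander KA (1994). Serologic survey of selected canine pathogens "
        "among free-ranging jackals in Kenya. "
        "Journal of wildlife diseases 30:486-491."
    ),
}

# Hypothetical association rows; virus/host values are placeholders and the
# second reference is made up to show an unmatched citation.
associations = [
    {"Virus": "virus_a", "Host": "host_a", "Reference": "Alexander et al. 1994"},
    {"Virus": "virus_b", "Host": "host_b", "Reference": "Placeholder et al. 2000"},
]


def link_references(associations, references):
    """Split rows into those with a matching reference entry and orphans."""
    linked, orphans = [], []
    for row in associations:
        key = citation_key(row["Reference"])
        if key in references:
            linked.append({**row, "Reference": key, "citation": references[key]})
        else:
            orphans.append(row)
    return linked, orphans


linked, orphans = link_references(associations, references)
```

Rows that end up in `orphans` are the ones whose short citation has no counterpart in the references table, which is the kind of gap an explicit key makes visible.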

In my opinion, introducing such an explicit link would make the citations a bit friendlier for analysis or integration into projects like https://globalbioticinteractions.org .

arw36 commented 6 years ago

Thanks, @jhpoelen. Totally agree, though this highlights some data cleaning that will need to be done before the data files are linked properly. I am slowly working on this.

jhpoelen commented 5 years ago

Related to https://github.com/jhpoelen/eol-globi-data/issues/375.

jhpoelen commented 4 years ago

Hi @arw36 - I was hoping to take a stab at indexing your virus-host data via GloBI as part of a COVID-19 effort. Is there any way I can convince you to include a single table that joins all the info in the virus, host, reference and associations files? This would make indexing your data much easier. For an example of such a single table, please have a look at:

https://github.com/globalbioticinteractions/template-dataset/blob/master/interactions.tsv
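As a rough sketch of what producing such a joined table could involve, the example below merges four small in-memory tables into one tab-separated table. The key columns `virus`, `host` and `reference_key`, the helper names, and the taxonomy values are assumptions for illustration; the real HP3 files and the GloBI template may use different column names.

```python
import csv
import io

# Tiny in-memory stand-ins for the real files (column names are assumed).
ASSOCIATIONS = "virus,host,reference_key\nAndes virus,Abrothrix longipilis,medina_et_al_2009\n"
VIRUSES = "virus,vOrder\nAndes virus,Bunyavirales\n"  # illustrative value
HOSTS = "host,hOrder\nAbrothrix longipilis,Rodentia\n"
REFERENCES = (
    "reference_key,citation\n"
    'medina_et_al_2009,"Medina RA, Torres-Perez F, Galeno H, Navarrete M, '
    "Vial PA, Palma RE, et al. (2009). Ecology, Genetic Diversity, and "
    "Phylogeographic Structure of Andes Virus in Humans and Rodents in Chile. "
    'Journal of virology 83:2446-2459."\n'
)


def read_rows(text):
    """Parse CSV text into a list of dicts, one per data row."""
    return list(csv.DictReader(io.StringIO(text)))


def index_by(rows, column):
    """Build a lookup from the value in `column` to the full row."""
    return {row[column]: row for row in rows}


def join_tables(associations, viruses, hosts, references):
    """Attach virus, host and reference columns to each association row."""
    viruses_by_name = index_by(viruses, "virus")
    hosts_by_name = index_by(hosts, "host")
    refs_by_key = index_by(references, "reference_key")
    joined = []
    for assoc in associations:
        row = dict(assoc)
        # Missing lookups contribute nothing, so unmatched rows are kept.
        row.update(viruses_by_name.get(assoc["virus"], {}))
        row.update(hosts_by_name.get(assoc["host"], {}))
        row.update(refs_by_key.get(assoc["reference_key"], {}))
        joined.append(row)
    return joined


def to_tsv(rows):
    """Serialize rows as tab-separated text with a header line."""
    fieldnames = list(dict.fromkeys(key for row in rows for key in row))
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=fieldnames, delimiter="\t", lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return out.getvalue()


joined = join_tables(
    read_rows(ASSOCIATIONS), read_rows(VIRUSES), read_rows(HOSTS), read_rows(REFERENCES)
)
tsv = to_tsv(joined)
```

Association rows without a matching virus, host or reference are kept rather than dropped, so unresolved links stay visible in the output instead of silently disappearing.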

If you do not have time, please do let me know, so I can consider alternate approaches.

arw36 commented 4 years ago

Hi @jhpoelen,

In spring of 2018, I started adding a reference_key field to link the references and associations datasets in the Spring_DataClean branch, and started tackling orphan or incomplete reference data as outlined in #17, #16, and #14. The time needed to resolve these issues, or whether resolution is even possible, is tough to estimate. @noamross or @kevinolival may be better positioned to speak to the status of the data cleaning.

Stay well, Anna

arw36 commented 4 years ago

I went ahead and made the interactions sheet in my fork of the database (as I have moved on from EHA and am no longer actively working on this project). The script to create this file is here. Let me know if this suffices as a quick fix. I guess we could use clarity on the ability to update the interactions in the GloBI database if further data cleaning occurs (e.g., currently 152 need to be reviewed). However, data link issues are limited to a small share (~5% of the database), and I would hate for these stragglers to hold this up. Of course, permissions with the original authors should be confirmed before integration.

jhpoelen commented 4 years ago

@arw36 thank you for sharing! Hoping to have a look at it in the next couple of days. Apologies for the delay, I've been distracted by other activities.

jhpoelen commented 4 years ago

@arw36 just to let you know that I haven't forgotten about this. Hope to update soon.

jhpoelen commented 4 years ago

@arw36 I took a first stab at indexing the data you've prepared. See https://github.com/globalbioticinteractions/olival2017 for the indexing configuration. The first pass worked like a charm thanks to your help in preprocessing the data!

Some notes:

  1. taxon hierarchies - I noticed that hosts.csv and viruses.csv contain pretty neat taxonomic hierarchies. With some additional joining and column mappings (e.g., vOrder -> sourceTaxonOrder, vGenus -> sourceTaxonGenus, hOrder -> targetTaxonOrder), GloBI can also take these hierarchies into account.
  2. I noticed that the interaction term "parasite of" is used. I would suggest using the term "pathogen of" with IRI http://purl.obolibrary.org/obo/RO_0002556
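The two notes above could be applied to an already joined row roughly as follows. This is a sketch under assumptions: the `interactionTypeName` and `interactionTypeId` column names, the helper `to_globi_row`, and the sample taxonomy values are illustrative and not copied from the HP3 files or the indexing configuration.

```python
# Column renames suggested in note 1.
COLUMN_MAP = {
    "vOrder": "sourceTaxonOrder",
    "vGenus": "sourceTaxonGenus",
    "hOrder": "targetTaxonOrder",
}

# Interaction term suggested in note 2, with its IRI.
PATHOGEN_OF = {
    "interactionTypeName": "pathogen of",
    "interactionTypeId": "http://purl.obolibrary.org/obo/RO_0002556",
}


def to_globi_row(row):
    """Rename taxonomy columns and swap 'parasite of' for 'pathogen of'."""
    # Columns without a mapping keep their original name.
    out = {COLUMN_MAP.get(name, name): value for name, value in row.items()}
    if out.get("interactionTypeName") == "parasite of":
        out.update(PATHOGEN_OF)
    return out


# Hypothetical joined row; the taxonomy values are illustrative only.
example = {
    "sourceTaxonName": "Andes virus",
    "vOrder": "Bunyavirales",
    "vGenus": "Orthohantavirus",
    "interactionTypeName": "parasite of",
    "targetTaxonName": "Abrothrix longipilis",
    "hOrder": "Rodentia",
}
mapped = to_globi_row(example)
```

Keeping the renames in a single dictionary makes it easy to extend the mapping to further taxonomic ranks without touching the function.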

Please let me know if you can help with implementing the suggestions above. If not, I can try to prepare a pull request whenever I have time to do so. Thanks!

arw36 commented 4 years ago

@jhpoelen Sure thing - seems doable. I will update this week.

Separately, I was wondering how you ensure that datasets are cited properly. You give detailed directions for citing GloBI, but no such directions for citing the data sources and providers (though they are listed if a separate reference query is given). Data aggregators notoriously lead to improper attribution. You can read the Escribano paper below on how publishers to GBIF have typically not been given credit. GBIF has started to make amazing strides to remedy this, the first of which is a citation guideline for data, data providers, and GBIF itself (https://www.gbif.org/citation-guidelines). Do you have plans to establish stricter guidelines for this? I am not sure if this is covered elsewhere; if appropriate, I can open a separate issue on the GloBI page.

Escribano N, Galicia D, Ariño AH. The tragedy of the biodiversity data commons: a data impediment creeping nigher?. Database. 2018 Apr 9;2018:bay033.

jhpoelen commented 4 years ago

Hi @arw36 - Thanks for your quick response, and I appreciate that you bring up the inadequate citation practices within the biodiversity data commons. I am sorry to hear that the instructions on how GloBI cites the data sources and providers of indexed datasets were not clear. Perhaps you can help me figure out a better way to emphasize that it is more important to cite the data providers and data references than it is to cite GloBI.

Currently, for each indexed interaction, GloBI includes the citation of the dataset (where the data came from "physically") and the reference (the authority that makes the claim). For Olival 2017, the dataset citation would be: Olival, K. J., Hosseini, P. R., Zambrana-Torrelio, C., Ross, N., Bogich, T. L., & Daszak, P. (2017). Host and viral traits predict zoonotic spillover from mammals. Nature, 546(7660), 646–650. doi:10.1038/nature22975. Along with this dataset reference, individual species interaction references are included: e.g., the reference Medina RA, Torres-Perez F, Galeno H, Navarrete M, Vial PA, Palma RE, et al. (2009). Ecology, Genetic Diversity, and Phylogeographic Structure of Andes Virus in Humans and Rodents in Chile. Journal of virology 83:2446-2459. is associated with the interaction claim that Andes virus is a pathogen of Abrothrix longipilis.

I've included an example of what that would look like on the basic GloBI search pages (from https://www.globalbioticinteractions.org/?accordingTo=globi%3Ahurlbertlab%2Fdietdatabase). All of this information is also included in the GloBI data products.

However, I realize that having the dataset and reference citations available does not mean that folks will cite the original data. If you have suggestions on how to emphasize that citing original datasets and associated references is not just common courtesy, but also important for tracing chains of evidence, please let me know - I am happy to make changes to the GloBI website or associated digital artifacts.

[Image: Screenshot from 2020-04-28 10-31-32]

PS Tangentially, I am working on another project that enables citing entire data networks (including associated datasets and citations) in a single, machine-readable citation. This may work towards a more systematic approach to preserving the provenance of data, and counter the claim that hundreds of thousands of citations cannot be included in a regular paper. This related work is documented in a preprint: Elliott, M. J., Poelen, J. H., & Fortes, J. (2020, January 3). Toward Reliable Biodiversity Dataset References. https://doi.org/10.32942/osf.io/mysfp, which was accepted for publication with minor revisions last week.

jhpoelen commented 4 years ago

@arw36 PS please feel free to open a separate issue at https://github.com/globalbioticinteractions/globalbioticinteractions/issues/new to document your proposal to make the citation guidelines for GloBI clearer.

jhpoelen commented 4 years ago

@arw36 Please note that your work has contributed to the following data publication:

CETAF-DiSCCo/COVID19-TAF biodiversity-related knowledge hub working group: indexed biotic interactions and review summary (Version 0.2) [Data set]. Zenodo. http://doi.org/10.5281/zenodo.3839098

Please review the data publication (sort of like a data progress report) and let me know if you'd like to be added as an author or have other ideas on how to acknowledge the contributions you and your colleagues have made.

arw36 commented 4 years ago

Thanks Jorrit, I will move to email to discuss the knowledge hub, and to the GloBI repo to create an issue.

noamross commented 4 years ago

Thanks so much for making this happen, @arw36 and @jhpoelen! On Anna's recommendation I'm closing this issue.