ArctosDB / arctos

Arctos is a museum collections management system
https://arctos.database.museum

Genbank lacks virus sequence links to Arctos host specimens (if available, harvestable host relationship by GloBI) #3550

Closed jhpoelen closed 2 years ago

jhpoelen commented 3 years ago

Hi!

As I was looking into indexing a recently published host-virus dataset via https://www.pnas.org/content/118/15/e2002324118 and https://github.com/globalbioticinteractions/globalbioticinteractions/issues/644 , I stumbled across https://www.ncbi.nlm.nih.gov/nuccore/EU241637 and their link to https://arctos.database.museum/guid/MSB:Mamm:210229 (see attached screenshot).

Very neat to see how all the links are pointing back and forth across the various systems (e.g., genbank <-> Arctos).

Currently, Arctos captures the links to genbank in associatedSequences. However, from the data provided, it is not clear what was sequenced. In this case, a virus (hantavirus) was extracted from the host specimen.

When dealing with associated sequences, do you keep track of the kind of association between the host specimen and the sequence, like you do with the host-parasite relations?

Ideally, I'd like to extract species interactions records from the associatedSequences, but only if the sequence documents anything other than the host itself.

Thanks for all your hard work in keeping Arctos going!

related to https://github.com/ArctosDB/arctos/issues/2121 .

[Attached screenshots: 2021-04-06 09-35-35, 2021-04-06 09-35-22]

dustymc commented 3 years ago

Arctos captures the links to genbank in associatedSequences.

For clarity: we "capture" the link in OtherIdentifiers (same as relationships and collector numbers and such), we share via associatedSequences.

In this case, a virus (hantavirus) was extracted from the host specimen.

I think that's just a case of failing to catalog the item of scientific interest. The virus should have been cataloged and related to the mammal. That of course doesn't always happen, and my "GenBank numbers are 'self.'" statement in #2121 seems to be wrong in this case.

do you keep track of the kind of association between the host specimen and the sequence

All identifiers carry a value from https://arctos.database.museum/info/ctDocumentation.cfm?table=ctid_references; perhaps we need a way to express this situation, which probably isn't as rare as it really should be.

campmlc commented 3 years ago

Yes, it is unfortunate that the virus community is not better at providing cataloged "voucher specimens" that we can link to. It has been very difficult to get most virologists to identify, designate, or archive a host voucher, or even when these exist, to link to them on GenBank. Much of the MSB's effort at tracking viruses extracted from mammal specimens has occurred over the past decade or more, prior to our having a parasite collection, so there are GenBank links for viruses as well as parasites that are attached directly to the mammal host with "self" relationships. Now that we have the capacity to catalog the parasites separately, that should be done and those GenBank sequences moved over to the parasite record, but that is a process that would consume quite a bit of staff time and resources. I'd be happy to try if we can identify those samples, but unfortunately this may require going record by record based on which mammals have virus-associated publications or citations. We can look for "symbiotype" in the citation, but that was not always available for legacy records. We also need a way to designate relationships in citations to alternate taxa, e.g., "symbiotype of ... Taxon A (virus name)".

jhpoelen commented 3 years ago

@dustymc @campmlc thanks for your prompt reply and for sharing background.

Great to hear that genbank numbers can have association types just like specimens do.

I can imagine that going back and identifying the association types for existing genbank numbers with their specimen can be quite laborious. However, through GloBI, I can perhaps provide an exhaustive list of genbank numbers associated with viruses. That said, I realize that it'll take time and effort to cross reference and double check . . . so perhaps something to do when the time is right?

It might be worth mentioning that many researchers are unaware of these rich linkages that you keep. . . I am doing my best to communicate the good work on associations. . . I guess it'll take time for it to take hold.

jldunnum commented 3 years ago

Hey Jorrit, the vast majority of our host/virus relationships/linkages are for those for which we have the symbiotype specimen here at MSB. These were done manually, based on our knowledge of the relationships and an effort to get virologists doing descriptions to include host info going forward. The attached paper has the recommendations for this. Best, Jon


Jonathan L. Dunnum Ph.D. Senior Collection Manager Division of Mammals, Museum of Southwestern Biology University of New Mexico Albuquerque, NM 87131 (505) 277-9262 Fax (505) 277-1351

MSB Mammals website: http://www.msb.unm.edu/mammals/index.html Facebook: http://www.facebook.com/MSBDivisionofMammals

Shipping Address: Museum of Southwestern Biology Division of Mammals University of New Mexico CERIA Bldg 83, Room 204 Albuquerque, NM 87131


jhpoelen commented 3 years ago

@jldunnum thanks for sharing. Can you please share a citation string for the paper? Github issues does not keep the attachment alive.

Also, just in case y'all are feeling ambitious, I've attached a (partial) list of virus genbank numbers extracted from the indexed Grange et al. 2021 dataset using:

$ elton interactions globalbioticinteractions/grange2021 | grep -P -o "https://[^\t]+nuccore[^\t]+" | sort | uniq > virus_genbank_numbers.txt

The first 10 are:

$ cat virus_genbank_numbers.txt | head
https://www.ncbi.nlm.nih.gov/nuccore/AB010730
https://www.ncbi.nlm.nih.gov/nuccore/AB010731
https://www.ncbi.nlm.nih.gov/nuccore/AB010732
https://www.ncbi.nlm.nih.gov/nuccore/AB010733
https://www.ncbi.nlm.nih.gov/nuccore/AB010734
https://www.ncbi.nlm.nih.gov/nuccore/AB010735
https://www.ncbi.nlm.nih.gov/nuccore/AB010736
https://www.ncbi.nlm.nih.gov/nuccore/AB010737
https://www.ncbi.nlm.nih.gov/nuccore/AB010738
https://www.ncbi.nlm.nih.gov/nuccore/AB010739

virus_genbank_numbers.txt (https://github.com/ArctosDB/arctos/files/6266489/virus_genbank_numbers.txt)
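For anyone who would rather post-process the tab-separated interaction rows in Python than in a shell pipeline, here is a minimal sketch of the same extract, de-duplicate, and sort steps. The sample rows are hypothetical stand-ins, not real `elton` output, and the column layout is an assumption.

```python
import re

# Same idea as the grep pattern above: a tab-free https URL containing "nuccore".
NUCCORE_URL = re.compile(r"https://[^\t\n]+nuccore[^\t\n]+")


def extract_nuccore_urls(lines):
    """Return the sorted, de-duplicated nuccore URLs found in tab-separated lines."""
    urls = set()
    for line in lines:
        urls.update(NUCCORE_URL.findall(line))
    return sorted(urls)


# Hypothetical rows; the real column layout of the elton output is not shown in this thread.
sample_rows = [
    "Host A\thasPathogen\tVirus X\thttps://www.ncbi.nlm.nih.gov/nuccore/AB010731",
    "Host B\thasPathogen\tVirus Y\thttps://www.ncbi.nlm.nih.gov/nuccore/AB010730",
    "Host A\thasPathogen\tVirus X\thttps://www.ncbi.nlm.nih.gov/nuccore/AB010731",
]
nuccore_urls = extract_nuccore_urls(sample_rows)
print("\n".join(nuccore_urls))
```

Python's `sorted` orders by code point, which may differ from a locale-aware shell `sort` for mixed-case input.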

campmlc commented 3 years ago

@jhpoelen this would be most helpful: "an exhaustive list of genbank numbers associated with viruses"

jhpoelen commented 3 years ago

@campmlc I shared a partial list; other GloBI-indexed datasets can be used to complement this list if needed.

dustymc commented 3 years ago

perhaps something to do when the time is right

Potentially a fun project for an intern/CS student/etc.

I am doing my best to communicate the good work on associations. . . I guess it'll take time for it to take hold.

It's appreciated! We obviously aren't great at communicating what we do. We've been talking to and working with GenBank since ~2000; I'm (obviously!) not sure how to do better, but I think it'll involve more than just time.

@jldunnum your attachment didn't come through.

Related:

https://github.com/ArctosDB/arctos/issues/2151 https://github.com/ArctosDB/arctos/issues/1257

jldunnum commented 3 years ago

Dunnum, Jonathan L., Richard Yanagihara, Karl M. Johnson, Blas Armien, Nyamsuren Batsaikhan, Laura Morgan, and Joseph A. Cook. "Biospecimen repositories and integrated databases as critical infrastructure for pathogen discovery and pathobiology research." PLoS Neglected Tropical Diseases 11, no. 1 (2017): e0005133.


campmlc commented 3 years ago

Great! I don't suppose it would be possible to identify GenBank accessions that have a non-mammalian organism or taxon name but an MSB:Mamm specimen voucher or LinkOut?

jldunnum commented 3 years ago

Another issue is that many pathogen/parasite papers that actually did cite a host used our field/tissue number "NK" and not our MSB catalog number.


jhpoelen commented 3 years ago

Great! I don't suppose it would be possible to identify GenBank accessions that have a non-mammalian organism or taxon name but an MSB:Mamm specimen voucher or LinkOut?

Given a little time, this can surely be done, especially because of the excellent informatics resources that GenBank and Arctos provide. Also, datasets already indexed by GloBI provide a starting point. https://github.com/globalbioticinteractions/virus-host-db comes to mind.
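A minimal sketch of the filter asked about above: keep records whose voucher points at a mammal collection while the sequenced organism is not a mammal. The `SequenceRecord` type, its field names, and the sample accessions are all hypothetical; a real implementation would have to populate them from GenBank, and the thread notes that many records lack a voucher entirely.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass
class SequenceRecord:
    # Hypothetical, simplified view of a sequence record (not a real GenBank schema).
    accession: str
    organism: str
    lineage: Tuple[str, ...]      # taxonomic lineage of the sequenced organism
    specimen_voucher: str = ""    # e.g. "MSB:Mamm:210229"; empty when absent


def non_mammal_with_mammal_voucher(records, voucher_prefix="MSB:Mamm:"):
    """Keep records vouchered in a mammal collection whose organism is not a mammal."""
    return [
        r for r in records
        if r.specimen_voucher.startswith(voucher_prefix)
        and "Mammalia" not in r.lineage
    ]


# Placeholder records for illustration only.
records = [
    SequenceRecord("EXAMPLE0001", "Example hantavirus", ("Viruses",), "MSB:Mamm:000001"),
    SequenceRecord("EXAMPLE0002", "Example rodent", ("Eukaryota", "Mammalia"), "MSB:Mamm:000002"),
    SequenceRecord("EXAMPLE0003", "Example virus", ("Viruses",), ""),
]
flagged = [r.accession for r in non_mammal_with_mammal_voucher(records)]
print(flagged)
```

Records without any voucher (the third placeholder) fall through this filter, so they would still need the manual or publication-based matching discussed earlier.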

dustymc commented 3 years ago

MSB catalog number

https://en.wikipedia.org/wiki/Money_services_business - right?!?

We've been avoiding really embracing https://handbook.arctosdb.org/how_to/cite-specimens.html forever. "MSB 210229" (and the infinite variations thereof) could mean just about anything, and digging it out of a publication is never going to be foolproof. "https://arctos.database.museum/guid/MSB:Mamm:210229" and "http://dx.doi.org/10.7299/X7ZK5H0X" are completely unambiguous. Demanding those kinds of identifiers from users would eliminate any confusion going forward, and sort of accidentally save you a whole bunch of work (which might be redirected to dealing with the legacy stuff) in the process.
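To illustrate why the URL form is easier to work with than free text, here is a minimal sketch that splits an Arctos GUID URL into parts. It assumes the three-part, colon-separated shape seen in the example above (institution, collection, catalog number); other GUID shapes may exist and would simply return `None` here.

```python
from typing import NamedTuple, Optional
from urllib.parse import urlparse


class ArctosGuid(NamedTuple):
    institution: str
    collection: str
    catalog_number: str


def parse_arctos_guid(url: str) -> Optional[ArctosGuid]:
    """Split an Arctos GUID URL into three colon-separated parts, or return None."""
    parsed = urlparse(url)
    if parsed.netloc != "arctos.database.museum" or not parsed.path.startswith("/guid/"):
        return None
    parts = parsed.path[len("/guid/"):].split(":")
    # Assumption: exactly three non-empty parts, as in MSB:Mamm:210229.
    if len(parts) != 3 or not all(parts):
        return None
    return ArctosGuid(*parts)


guid = parse_arctos_guid("https://arctos.database.museum/guid/MSB:Mamm:210229")
print(guid)
print(parse_arctos_guid("MSB 210229"))  # free-text citation: nothing reliable to parse
```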

Given a little time

Yep! Arctos has an API, GenBank has an API, doing more in that intersection is just a matter of time. (I'm not so sure about "little" though...)

campmlc commented 3 years ago

We have attempted "Demanding those kinds of identifiers" from GenBank as a required field/controlled vocab, most recently at the ASM meeting the summer before covid, but there still seems to be some reluctance or lack of awareness of the problem, at least from representatives designated to attend that meeting. There is also extreme reluctance to allow the collections that actually hold the specimens to make edits to fields that were incorrectly filled out by the researchers who submitted the sequences.

dustymc commented 3 years ago

from GenBank

I can understand their reluctance to that; it's not really their job, and most of the data they see won't ever have that level of information. I obviously don't KNOW anything, but I think this would be between loan-ers and loan-ees (and/or perhaps part of your internal licensing).

Arctos contains a genbank publisher tool (IDK if it's functional, it doesn't get any use so it doesn't get any attention) which completely eliminates any ambiguity there. It even deals with barcodes, so if you have those you can tie sequences to specific parts and not just catalog records.

GenBank is special in regard to identifiers; they are one of two systems (Arctos is the other) in which "MSB:Mamm:210229" is NOT ambiguous, because we worked out the specimen_voucher field and registry with them. 65 of the current 215 collections in Arctos claim to have registered with GenBank - we as a community could certainly do better.

I believe that everything we can currently do with GenBank was worked out with Scott Federhen, and not much has changed since he died. He at least was willing to allow edits by "owning institutions" if the submitter could not be convinced to make updates; I don't know if anyone else might be inclined to allow that or even who you'd ask. (I wonder if an agreement regarding future edits to GenBank might also be part of loan agreements?) Might be worth knocking on the door if you're ever in DC - we could certainly use another interested insider.

debpaul commented 3 years ago

Some GenBank fields are moving from optional to highly recommended, and some new fields are coming too, for specifying the connections between a host SEQ and the vouchered specimen, and between a related viral SEQ and the Sample the viral SEQ came from. Stay tuned. Paper in progress. This work was made possible by the #metadataregisteringpractices subgroup of the CETAF-DiSSCO Covid 19 Task Force. Pam Soltis at UF/iDigBio and Jerry Lanfear of ELIXIR can answer questions. See https://twitter.com/mcourtot/status/1376902192410603525 on Twitter and https://github.com/pha4ge/SARS-CoV-2-Contextual-Data-Specification for hints.

jldunnum commented 3 years ago

Thanks Deb


campmlc commented 3 years ago

That's good news!

jhpoelen commented 2 years ago

hey @mkoo - can you share why you are closing this issue?

mkoo commented 2 years ago

We're doing a cleanup of old stale issues (more than 90-100 days old). If there are still pending problems, we need a new issue. If there are new export lists to share, a new issue would make sure people see them too. But let me know if I missed a specific to-do item for the Arctos dev list, as I can reopen.

jhpoelen commented 2 years ago

As far as I can tell, these links from Arctos specimens to their sequenced viruses are not yet indexed by GloBI. And GenBank only seems to record the host name, not the voucher. I've created a separate issue at https://github.com/globalbioticinteractions/globalbioticinteractions/issues/755 .

jhpoelen commented 2 years ago

@mkoo Thanks for sharing the reason for closing this issue. I understand it is nice to clean up issues as spring is just around the corner.

I do have to say that some of these issues may be old, but that doesn't mean they are stale in my mind, just something that hasn't been addressed yet. I think it'd be a bummer to have these valuable ideas and observations disappear in a long list of closed issues.

Perhaps you have some ideas on how to keep track of these non-trivial but potentially innovative ideas to better link our data records and infrastructures.

mkoo commented 2 years ago

Yes I agree it's a danger to close -- but it's not deleted! I am trying to clean up open issues by transferring issues to either an internal repo for more discussion and action, or reassigning issues to a different milestone, or adding labels to help us harvest open and closed issues for past discussions and ideas when searching on topics. Some get closed and I transfer their resources to working group teams. So a bunch of different tactics. Not sure if there's a universal solution but I am open to ideas!

Some actions are simply nagging a third party-- maybe I need a new label and project for that.....

jhpoelen commented 2 years ago

@mkoo I just opened a new issue in the GloBI issue tracker to keep this thread active, and noticed that GloBI knows about the linked genbank records, but the (valuable) connection to their Arctos specimen is not yet known. https://github.com/globalbioticinteractions/globalbioticinteractions/issues/755#issuecomment-1029509362 .

I agree that the issue is not deleted, but marking it as "closed" does make it appear to have been resolved.

mkoo commented 2 years ago

ok I hear you, Jorrit! I'll reopen and change the issue title so it's clearer what's going on.

Reopening issue originally entitled: "[CONTACT] association type of the associated sequences related to host vouchers (e.g., https://arctos.database.museum/guid/MSB:Mamm:210229 https://www.ncbi.nlm.nih.gov/nuccore/EU241637)"

campmlc commented 2 years ago

As mentioned earlier, this would be an excellent task for an intern or graduate student, or possibly a fundable grant proposal?

campmlc commented 2 years ago

@jhpoelen can you distinguish the direct references to specimen voucher in GenBank from the linkouts? We frequently use the latter to create relationships that the author failed to provide or provided incorrectly.

mkoo commented 2 years ago

well, it's not just viruses... there are broken or absent links everywhere in GenBank.. more genbank input is needed (they could fund the intern!)

jhpoelen commented 2 years ago

@campmlc I keep track of the source of the references, so I can distinguish them accordingly.

Question - is there any association rule I can apply to the Arctos -> GenBank relations?

E.g., all MSB specimens with GenBank IDs are host -> virus relations.

Or, all Arctos specimens with GenBank IDs are host -> virus relations.

Or, all Arctos specimens with GenBank IDs are host-virus relations only if the related GenBank record notes the same host name as the Arctos specimen classification.

Alternatively, I can mark relations as "ecologically related to".
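The third candidate rule, with the generic label as a fallback, might be sketched as follows. The function name, its arguments, and the exact label strings are assumptions for illustration; name matching here is a plain case- and whitespace-insensitive comparison, which would miss synonyms and subspecies.

```python
def classify_association(specimen_taxon: str, genbank_host: str, genbank_organism: str) -> str:
    """Label an Arctos specimen -> GenBank sequence link (hypothetical labels)."""

    def norm(name: str) -> str:
        # Lowercase and collapse whitespace so trivial differences do not block a match.
        return " ".join(name.lower().split())

    specimen = norm(specimen_taxon)
    if specimen and norm(genbank_organism) == specimen:
        return "self"  # the sequence documents the specimen itself
    if specimen and norm(genbank_host) == specimen:
        return "host of"  # GenBank names the specimen's taxon as the host
    return "ecologically related to"  # nothing more specific can be asserted


# Placeholder names, not real taxa.
print(classify_association("Hostus exemplaris", "Hostus exemplaris", "Example virus"))
print(classify_association("Hostus exemplaris", "", "Hostus  exemplaris"))
print(classify_association("Hostus exemplaris", "", "Example virus"))
```

Checking the organism before the host keeps a host's own sequence from being mislabeled as a host-virus relation.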

jhpoelen commented 2 years ago

ps. Money is probably better spent if they fund Arctos / GloBI ; )**

**Disclaimer, I am a contributor to GloBI...

debpaul commented 2 years ago

We do have some great contacts now for Elixir (Jerry Lanfear) and GenBank (via Ruth Timme) for working with them to make changes. See Thompson CW, Phelps KL, Allard MW, Cook JA, Dunnum JL, Ferguson AW, Gelang M, Khan FAA, Paul DL, Reeder DM, Simmons NB, Vanhove MPM, Webala PW, Weksler M, Kilpatrick CW. Preserve a Voucher Specimen! The Critical Need for Integrating Natural History Collections in Infectious Disease Studies. mBio. 2021 Jan 12;12(1):e02698-20. doi: 10.1128/mBio.02698-20. PMID: 33436435; PMCID: PMC7844540.

jhpoelen commented 2 years ago

@debpaul Great! What can they contribute to solving this issue?

I also noted your earlier comment from about a year ago

Some GenBank fields are moving from optional to highly recommended, and some new fields are coming too, for specifying the connections between a host SEQ and the vouchered specimen, and between a related viral SEQ and the Sample the viral SEQ came from. Stay tuned. Paper in progress. This work was made possible by the #metadataregisteringpractices subgroup of the CETAF-DiSSCO Covid 19 Task Force. Pam Soltis at UF/iDigBio and Jerry Lanfear of ELIXIR can answer questions. See https://twitter.com/mcourtot/status/1376902192410603525 on Twitter and https://github.com/pha4ge/SARS-CoV-2-Contextual-Data-Specification for hints.

Anything change since then?

debpaul commented 2 years ago

Anything change since then?

@jhpoelen two things changed (loosely speaking): more fields moved from optional to recommended, AND we found connections to people who are willing to sit at the table to discuss / work on needed changes. Of course there's much work to be done at different levels. But it's good to keep these partners in the loop to make maximum impact and work on inclusion.

jhpoelen commented 2 years ago

@debpaul thanks for elaborating. Sounds like the stars are aligning.

campmlc commented 2 years ago

@jhpoelen @jldunnum - following up on this. Regarding below,

"Question - is there any association rule I can apply to the Arctos -> GenBank relations?"

Possibly the following: all Arctos specimens with GenBank IDs are host-virus relations only if the related GenBank records are viral sequences and the specimen voucher or host field refers to a different taxonomic group? This is awkward - you would need to know that "MSB:Mamm" is a mammal collection, etc.

From the Arctos end, we can find a lot of these via a search on "symbiotype" - but this does not distinguish symbiotype of "what taxon". @dustymc we have previously discussed some way of allowing a taxon name to be entered into the "symbiotype of" field - right now, it just refers to the host, not the parasite or pathogen. Ideas to fix?

In the meantime, I'm going through the symbiotype records with links to GenBank and adding the "host of" references, which should give @jhpoelen something to start with. First example: https://arctos.database.museum/guid/MSB:Mamm:148558

campmlc commented 2 years ago

Also @dustymc note that the reciprocal linkouts for the GenBank virus sequences are still not working in this example.

campmlc commented 2 years ago

Here is another with relationships added. @jhpoelen can you use these examples to find others? https://arctos.database.museum/guid/MSB:Mamm:148794

campmlc commented 2 years ago

I just created this relationship: https://arctos.database.museum/guid/MSB:Mamm:135531 with a taxon name (new species) as an OrganismID. Should actually change to a "host of" relationship to proposed new field "TaxonID" which would link to the taxonomy table. The publication of this n.sp. did not give the HWML catalog numbers - I will try to track them down. But using TaxonID would allow linkage between a catalog record and a taxonomic name, which could help solve the problem of the symbiotype relationship mentioned above. Possible?

dustymc commented 2 years ago

Ideas to fix?

Catalog the stuff that seems to be important and make the correct assertions.

debpaul commented 2 years ago

@campmlc you wrote:

Yes, it is unfortunate that the virus community is not better at providing cataloged "voucher specimens" that we can link to. It has been very difficult to get most virologists to identify, designate, or archive a host voucher, or even when these exist, to link to them on GenBank.

Are you intrigued by the idea of a panel discussion / webinar about the above topic with members of the virus community joining us? We could discuss changes (that occurred as a result of Covid) and changes in standard-of-practice still needed or needing to be adopted -- both by collections and virologists? Pam Soltis and I could possibly arrange such a thing.

campmlc commented 2 years ago

@debpaul Yes, that would be fantastic! @jldunnum

campmlc commented 2 years ago

Ideas to fix?

Catalog the stuff that seems to be important and make the correct assertions.

@dustymc this would require that we create new virus collections, fungal collections, bacterial collections, etc. for things we do not have vouchers for, in order to say that this "host" record is related to this "pathogen" record. And which institution will manage these? Right now we can do this for our integrated host and parasite collections at the institutional level, if we add in all the taxonomy (big can of worms, there), but what about things in external repositories? At a minimum, we need to be able to say this "host" was tested for this "pathogen" by this method/citation on this date with results positive/negative and quantitative values of results. We could do this with specimen attributes, or possibly part attributes, or maybe a separate "tested for" module, but we would still need the taxonomy linked here if possible. That was my suggestion above.
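As a rough sketch of the minimum "tested for" record described above, the fields could be laid out like this. Every name below is hypothetical and mirrors only the wording of the comment; it is not an existing Arctos table or attribute, and the sample values are placeholders.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass(frozen=True)
class PathogenTest:
    # Hypothetical field names; none of these are existing Arctos structures.
    host_guid: str                            # catalog record of the host
    pathogen_taxon: str                       # name that would ideally link to taxonomy
    method: str
    citation: str
    tested_on: date
    positive: bool
    quantitative_value: Optional[float] = None
    quantitative_unit: Optional[str] = None

    def __post_init__(self):
        # A number without a unit is hard to interpret later, so reject it up front.
        if self.quantitative_value is not None and not self.quantitative_unit:
            raise ValueError("quantitative_value requires quantitative_unit")


# Placeholder values for illustration only.
example_test = PathogenTest(
    host_guid="MSB:Mamm:000001",
    pathogen_taxon="Example virus",
    method="example assay",
    citation="example citation",
    tested_on=date(2000, 1, 1),
    positive=False,
)
print(example_test)
```

Keeping `pathogen_taxon` as a plain string is the text-based option; linking it to a taxonomy table is the structured alternative debated in this thread.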

dustymc commented 2 years ago

If those things exist then of course they can be linked to.

If they don't but structured data are necessary, a Host collection could be used. That's of course more work for all the reasons you point out, but I don't think there's a lesser cost which leads to those kinds of data.

If structured data aren't critical (or critical enough to inspire someone to manage a Host collection, anyway!), then things like verbatim host ID provide a text-based alternative.

I don't think any amount of shoehorning will much change that, but it might break other things.

jhpoelen commented 2 years ago

Hey y'all - coming to the conversation a bit late, but please note that GloBI is now resolving the NCBI records as reported in the Arctos records. This means that GloBI also pulls in the taxonomic information (and more) from the NCBI GenBank records and enables taxonomic searches for either host or hostee.

E.g.,

https://arctos.database.museum/guid/MSB:Mamm:148794

https://www.globalbioticinteractions.org/?accordingTo=http%3A%2F%2Farctos.database.museum%2Fguid%2FMSB%3AMamm%3A148794&interactionType=interactsWith

has already been indexed by GloBI (see attached screenshots).

Screenshot from 2022-05-02 13-26-57 Screenshot from 2022-05-02 13-26-44

For this specific example, you can find specimen to specimen links via "download csv sample" link or

https://api.globalbioticinteractions.org/interaction.csv?type=csv&interactionType=interactsWith&accordingTo=http%3A%2F%2Farctos.database.museum%2Fguid%2FMSB%3AMamm%3A148794&limit=4096&offset=0&refutes=false&includeObservations=true&field=source_taxon_id&field=source_taxon_name&field=source_taxon_path&field=source_taxon_path_ids&field=source_specimen_occurrence_id&field=source_specimen_institution_code&field=source_specimen_collection_code&field=source_specimen_catalog_number&field=source_specimen_life_stage_id&field=source_specimen_life_stage&field=source_specimen_physiological_state_id&field=source_specimen_physiological_state&field=source_specimen_body_part_id&field=source_specimen_body_part&field=source_specimen_sex_id&field=source_specimen_sex&field=source_specimen_basis_of_record&field=interaction_type&field=target_taxon_id&field=target_taxon_name&field=target_taxon_path&field=target_taxon_path_ids&field=target_specimen_occurrence_id&field=target_specimen_institution_code&field=target_specimen_collection_code&field=target_specimen_catalog_number&field=target_specimen_life_stage_id&field=target_specimen_life_stage&field=target_specimen_physiological_state_id&field=target_specimen_physiological_state&field=target_specimen_body_part_id&field=target_specimen_body_part&field=target_specimen_sex_id&field=target_specimen_sex&field=target_specimen_basis_of_record&field=latitude&field=longitude&field=collection_time_in_unix_epoch&field=study_citation&field=study_url&field=study_source_citation&field=study_source_archive_uri .

from arctos-genbank-link.csv

source_taxon_name source_taxon_path source_taxon_path_ids source_specimen_occurrence_id source_specimen_institution_code source_specimen_collection_code source_specimen_catalog_number source_specimen_life_stage_id source_specimen_life_stage source_specimen_physiological_state_id source_specimen_physiological_state source_specimen_body_part_id source_specimen_body_part source_specimen_sex_id source_specimen_sex source_specimen_basis_of_record interaction_type target_taxon_name target_taxon_path target_taxon_path_ids target_specimen_occurrence_id target_specimen_institution_code target_specimen_collection_code target_specimen_catalog_number target_specimen_life_stage_id target_specimen_life_stage target_specimen_physiological_state_id target_specimen_physiological_state target_specimen_body_part_id target_specimen_body_part target_specimen_sex_id target_specimen_sex target_specimen_basis_of_record latitude longitude event_date study_citation study_url study_source_citation study_source_archive_uri
Sorex roboratus Animalia | Chordata | Mammalia | Soricomorpha | Soricidae | Sorex | Sorex roboratus EOL:1 | EOL:694 | EOL:1642 | EOL:8711 | EOL:8714 | EOL:10807 | EOL:323674 http://arctos.database.museum/guid/MSB:Mamm:148794?seid=40658 MSB Mamm MSB:Mamm:148794   young       lung   male PreservedSpecimen hostOf Kenkeme virus root | Viruses | Riboviria | Orthornavirae | Negarnaviricota | Polyploviricotina | Ellioviricetes | Bunyavirales | Hantaviridae | Mammantavirinae | Orthohantavirus | Kenkeme orthohantavirus | Kenkeme virus NCBI:1 | NCBI:10239 | NCBI:2559587 | NCBI:2732396 | NCBI:2497569 | NCBI:2497571 | NCBI:2497576 | NCBI:1980410 | NCBI:1980413 | NCBI:2560074 | NCBI:1980442 | NCBI:1980474 | NCBI:765147 http://www.ncbi.nlm.nih.gov/nuccore/GQ306150                       PreservedSpecimen 62.07003 128.93831 2006-08-19T00:00:00Z http://arctos.database.museum/guid/MSB:Mamm:148794 http://arctos.database.museum/guid/MSB:Mamm:148794 Natural History Collections managed by Arctos (https://arctosdb.org) accessed via https://vertnet.org . https://github.com/globalbioticinteractions/vertnet/archive/411bd21192e50ddccd51381a731444f74b032ffb.zip
Sorex roboratus Animalia | Chordata | Mammalia | Soricomorpha | Soricidae | Sorex | Sorex roboratus EOL:1 | EOL:694 | EOL:1642 | EOL:8711 | EOL:8714 | EOL:10807 | EOL:323674 http://arctos.database.museum/guid/MSB:Mamm:148794?seid=40658 MSB Mamm MSB:Mamm:148794   young       lung   male PreservedSpecimen hostOf Kenkeme virus root | Viruses | Riboviria | Orthornavirae | Negarnaviricota | Polyploviricotina | Ellioviricetes | Bunyavirales | Hantaviridae | Mammantavirinae | Orthohantavirus | Kenkeme orthohantavirus | Kenkeme virus NCBI:1 | NCBI:10239 | NCBI:2559587 | NCBI:2732396 | NCBI:2497569 | NCBI:2497571 | NCBI:2497576 | NCBI:1980410 | NCBI:1980413 | NCBI:2560074 | NCBI:1980442 | NCBI:1980474 | NCBI:765147 http://www.ncbi.nlm.nih.gov/nuccore/GQ306149                       PreservedSpecimen 62.07003 128.93831 2006-08-19T00:00:00Z http://arctos.database.museum/guid/MSB:Mamm:148794 http://arctos.database.museum/guid/MSB:Mamm:148794 Natural History Collections managed by Arctos (https://arctosdb.org) accessed via https://vertnet.org . https://github.com/globalbioticinteractions/vertnet/archive/411bd21192e50ddccd51381a731444f74b032ffb.zip
Kenkeme virus root | Viruses | Riboviria | Orthornavirae | Negarnaviricota | Polyploviricotina | Ellioviricetes | Bunyavirales | Hantaviridae | Mammantavirinae | Orthohantavirus | Kenkeme orthohantavirus | Kenkeme virus NCBI:1 | NCBI:10239 | NCBI:2559587 | NCBI:2732396 | NCBI:2497569 | NCBI:2497571 | NCBI:2497576 | NCBI:1980410 | NCBI:1980413 | NCBI:2560074 | NCBI:1980442 | NCBI:1980474 | NCBI:765147 http://www.ncbi.nlm.nih.gov/nuccore/GQ306148                       PreservedSpecimen hasHost Sorex roboratus Animalia | Chordata | Mammalia | Soricomorpha | Soricidae | Sorex | Sorex roboratus EOL:1 | EOL:694 | EOL:1642 | EOL:8711 | EOL:8714 | EOL:10807 | EOL:323674 http://arctos.database.museum/guid/MSB:Mamm:148794?seid=40658 MSB Mamm MSB:Mamm:148794   young       lung   male PreservedSpecimen 62.07003 128.93831 2006-08-19T00:00:00Z http://arctos.database.museum/guid/MSB:Mamm:148794 http://arctos.database.museum/guid/MSB:Mamm:148794 Natural History Collections managed by Arctos (https://arctosdb.org) accessed via https://vertnet.org . https://github.com/globalbioticinteractions/vertnet/archive/411bd21192e50ddccd51381a731444f74b032ffb.zip
Kenkeme virus root | Viruses | Riboviria | Orthornavirae | Negarnaviricota | Polyploviricotina | Ellioviricetes | Bunyavirales | Hantaviridae | Mammantavirinae | Orthohantavirus | Kenkeme orthohantavirus | Kenkeme virus NCBI:1 | NCBI:10239 | NCBI:2559587 | NCBI:2732396 | NCBI:2497569 | NCBI:2497571 | NCBI:2497576 | NCBI:1980410 | NCBI:1980413 | NCBI:2560074 | NCBI:1980442 | NCBI:1980474 | NCBI:765147 http://www.ncbi.nlm.nih.gov/nuccore/GQ306150                       PreservedSpecimen hasHost Sorex roboratus Animalia | Chordata | Mammalia | Soricomorpha | Soricidae | Sorex | Sorex roboratus EOL:1 | EOL:694 | EOL:1642 | EOL:8711 | EOL:8714 | EOL:10807 | EOL:323674 http://arctos.database.museum/guid/MSB:Mamm:148794?seid=40658 MSB Mamm MSB:Mamm:148794   young       lung   male PreservedSpecimen 62.07003 128.93831 2006-08-19T00:00:00Z http://arctos.database.museum/guid/MSB:Mamm:148794 http://arctos.database.museum/guid/MSB:Mamm:148794 Natural History Collections managed by Arctos (https://arctosdb.org) accessed via https://vertnet.org . https://github.com/globalbioticinteractions/vertnet/archive/411bd21192e50ddccd51381a731444f74b032ffb.zip
Sorex roboratus Animalia | Chordata | Mammalia | Soricomorpha | Soricidae | Sorex | Sorex roboratus EOL:1 | EOL:694 | EOL:1642 | EOL:8711 | EOL:8714 | EOL:10807 | EOL:323674 http://arctos.database.museum/guid/MSB:Mamm:148794?seid=40658 MSB Mamm MSB:Mamm:148794   young       lung   male PreservedSpecimen hostOf Kenkeme virus root | Viruses | Riboviria | Orthornavirae | Negarnaviricota | Polyploviricotina | Ellioviricetes | Bunyavirales | Hantaviridae | Mammantavirinae | Orthohantavirus | Kenkeme orthohantavirus | Kenkeme virus NCBI:1 | NCBI:10239 | NCBI:2559587 | NCBI:2732396 | NCBI:2497569 | NCBI:2497571 | NCBI:2497576 | NCBI:1980410 | NCBI:1980413 | NCBI:2560074 | NCBI:1980442 | NCBI:1980474 | NCBI:765147 http://www.ncbi.nlm.nih.gov/nuccore/GQ306148                       PreservedSpecimen 62.07003 128.93831 2006-08-19T00:00:00Z http://arctos.database.museum/guid/MSB:Mamm:148794 http://arctos.database.museum/guid/MSB:Mamm:148794 Natural History Collections managed by Arctos (https://arctosdb.org) accessed via https://vertnet.org . https://github.com/globalbioticinteractions/vertnet/archive/411bd21192e50ddccd51381a731444f74b032ffb.zip
Kenkeme virus root | Viruses | Riboviria | Orthornavirae | Negarnaviricota | Polyploviricotina | Ellioviricetes | Bunyavirales | Hantaviridae | Mammantavirinae | Orthohantavirus | Kenkeme orthohantavirus | Kenkeme virus NCBI:1 | NCBI:10239 | NCBI:2559587 | NCBI:2732396 | NCBI:2497569 | NCBI:2497571 | NCBI:2497576 | NCBI:1980410 | NCBI:1980413 | NCBI:2560074 | NCBI:1980442 | NCBI:1980474 | NCBI:765147 http://www.ncbi.nlm.nih.gov/nuccore/GQ306149                       PreservedSpecimen hasHost Sorex roboratus Animalia | Chordata | Mammalia | Soricomorpha | Soricidae | Sorex | Sorex roboratus EOL:1 | EOL:694 | EOL:1642 | EOL:8711 | EOL:8714 | EOL:10807 | EOL:323674 http://arctos.database.museum/guid/MSB:Mamm:148794?seid=40658 MSB Mamm MSB:Mamm:148794   young       lung   male PreservedSpecimen 62.07003 128.93831 2006-08-19T00:00:00Z http://arctos.database.museum/guid/MSB:Mamm:148794 http://arctos.database.museum/guid/MSB:Mamm:148794 Natural History Collections managed by Arctos (https://arctosdb.org) accessed via https://vertnet.org . https://github.com/globalbioticinteractions/vertnet/archive/411bd21192e50ddccd51381a731444f74b032ffb.zip
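
The long API URL above spells out every field. A shorter query for a different specimen could be built along these lines. This is only a sketch: `globi_interaction_url` is a made-up helper name, it keeps just the `type`, `interactionType` and `accordingTo` parameters from the URL above, and which columns the API returns when the `field` parameters are left out is not confirmed here. The percent-encoding is deliberately partial and covers only the characters expected in an Arctos GUID URL.

```shell
# Hypothetical helper: print a GloBI API URL for one Arctos record.
# Only parameters that appear in the URL quoted above are used.
globi_interaction_url() {
  guid_url=$1
  # Percent-encode the reserved characters we expect to meet ('%' must go first).
  encoded=$(printf '%s' "$guid_url" | sed \
    -e 's/%/%25/g' -e 's/:/%3A/g' -e 's|/|%2F|g' \
    -e 's/?/%3F/g' -e 's/&/%26/g' -e 's/=/%3D/g' -e 's/ /%20/g')
  printf '%s\n' "https://api.globalbioticinteractions.org/interaction.csv?type=csv&interactionType=interactsWith&accordingTo=${encoded}"
}

# Usage (needs network access):
#   curl "$(globi_interaction_url 'http://arctos.database.museum/guid/MSB:Mamm:148794')"
```

Building the URL separately from fetching it makes it easy to check the encoded `accordingTo` value against the one in the URL above before sending any request.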

jhpoelen commented 2 years ago

To find all arctos - genbank links known to GloBI, you could use something like:

$ curl https://depot.globalbioticinteractions.org/snapshot/target/data/tsv/interactions.tsv.gz\
 | gunzip\
 | grep 'arctos[.]database'\
 | grep nuccore\
 | tee arctos-genbank-links.tsv

which is bash linux speak for saying: get me the latest indexed interactions via GloBI's interactions.tsv. Then select only rows that contain "arctos.database" and "nuccore" terms. Finally, put the results in the file arctos-genbank-links.tsv .
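
The same filter can be wrapped in small functions so it can be tried on a few sample lines before downloading the full snapshot. This is a sketch under assumptions: the function names `arctos_genbank_links` and `count_claims` are invented for illustration, and it treats each matching row as one interaction claim.

```shell
# Keep only rows that mention both an Arctos record and a GenBank nuccore link.
# Reads interactions.tsv rows on stdin and writes the matching rows to stdout.
arctos_genbank_links() {
  grep 'arctos[.]database' | grep 'nuccore'
}

# Count the rows that pass the filter (assumed: one row per interaction claim).
count_claims() {
  arctos_genbank_links | wc -l | tr -d ' '
}

# Usage against the snapshot named above (a large download):
#   curl -L https://depot.globalbioticinteractions.org/snapshot/target/data/tsv/interactions.tsv.gz \
#     | gunzip | count_claims
```

Reading from stdin keeps the filter independent of where the data comes from, so the same functions work on a saved copy of interactions.tsv.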

jhpoelen commented 2 years ago

According to a recent interactions.tsv, this yields 444 interaction claims. See attached zip for csv/tsv versions of these claims.

arctos-genbank-links.zip

Curious to hear whether this is at all useful.
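
Because each link shows up twice in the sample rows above (once as hostOf and once as hasHost), the number of distinct specimen-to-sequence links is smaller than the number of claims. A rough way to list the distinct pairs is sketched here. `distinct_links` is a hypothetical name, and the sketch assumes every row carries one Arctos GUID and one nuccore accession, as in the sample rows; rows with several of either would need more care.

```shell
# Print distinct "Arctos GUID <tab> GenBank accession" pairs from rows on stdin.
# Assumption: one Arctos guid URL and one nuccore URL per row (first match wins).
distinct_links() {
  awk '
    {
      guid = ""; acc = ""
      # "arctos.database.museum/guid/" is 28 characters long; keep what follows it.
      if (match($0, /arctos[.]database[.]museum\/guid\/[A-Za-z0-9:._-]+/))
        guid = substr($0, RSTART + 28, RLENGTH - 28)
      # "nuccore/" is 8 characters long; keep the accession that follows it.
      if (match($0, /nuccore\/[A-Za-z0-9_.]+/))
        acc = substr($0, RSTART + 8, RLENGTH - 8)
      if (guid != "" && acc != "")
        print guid "\t" acc
    }
  ' | sort -u
}

# Usage on the file written by the pipeline above:
#   distinct_links < arctos-genbank-links.tsv
```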

campmlc commented 2 years ago

Fantastic! Thanks @jhpoelen ! I'll look over this list and see what else we can add.

debpaul commented 2 years ago

To find all arctos - genbank links known to GloBI, you could use something like:

$ curl https://depot.globalbioticinteractions.org/snapshot/target/data/tsv/interactions.tsv.gz\
 | gunzip\
 | grep 'arctos[.]database'\
 | grep nuccore\
 | tee arctos-genbank-links.tsv

which is bash linux speak for saying: get me the latest indexed interactions via GloBI's interactions.tsv. Then select only rows that contain "arctos.database" and "nuccore" terms. Finally, put the results in the file arctos-genbank-links.tsv .

@jhpoelen may I say how much I love the above "translation"! Thank you!

jhpoelen commented 2 years ago

@debpaul you are welcome! Please do let me know if other things need translating.