ArctosDB / arctos

Arctos is a museum collections management system
https://arctos.database.museum

Code Table Request - Genome ID #3652

Closed campmlc closed 1 year ago

campmlc commented 3 years ago

[ Code Table Documentation is https://handbook.arctosdb.org/how_to/How-to-Use-Code-Tables.html ]

Goal: Make it possible to find genomes through a search of OtherIDs = Genome ID.

Context: The genomics research community has no centralized repository for whole genomes, and genome data may currently be entered and accessed through a variety of different portals with differing levels of consistency and permanence in their URLs. These include NCBI Assemblies, Biosamples, and other resources. Quotes from researchers asked about this:

"NCBI is a pain, but if I were to be searching for a reference genome I would search in the assemblies database as these are unique to an individual sample and experiment."

"I'd used NCBI Assembly, NCBI BioSample, and NCBI BioProject as key terms for NCBI-associated genomic data. Honestly I archive my data with NCBI through SRA, but I use ENA to query/search for genomes and they use "Study", "Experiment", "Run", "Submission", "Accession", and "Taxon" IDs to identify genomes. You could integrate those labels as "ENA Study #", "ENA Experiment #" etc. or just link to "Genomic reads" or "Complete or partial genome assemblies". Raw reads are typically more valuable for reproducing or extending genomic research, whereas assembled genomes are used for reference-guided mapping assemblies. NCBI SRA numbers are included in ENA as "Submission" IDs. Here's an example and the reads for that example."

Given the current confusion, Arctos could provide identifiers for each of these links independently, but a researcher would have to know a priori which one to search on, or search an ever-longer list of potential URLs. We should certainly add these as OtherIDs (a later issue). But this request is to add an identifier = Genome ID where any possible link to genomic data could be entered, which would allow researchers to search on a single identifier to locate any genomic info across a variety of platforms. This would have to be free text, and of course prone to error, which is why adding the other identifiers with real linkable URLs to the record is advisable. This ID is primarily a search tool or flag that such info exists.

Table: https://arctos.database.museum/Admin/CodeTableEditor.cfm?action=editCollOIDT&tbl=ctcoll_other_id_type

Value: Genome ID

Definition: An identifier, preferably a URL, which references the external repository for genomic data for this record.

Collection type: Mamm, Bird, Herp, Amph, Rept, Fish, Ento, Inv, Para, Env, Herb, Mala, Zoo

Attribute data type: free text

Part tissue flag: yes

Priority: Very High

Jegelewicz commented 3 years ago

You know what would be nice? To have this ID be magically populated by any other "genome" ID that gets added....OR maybe we just need a flag in the code table "this other ID is a genome" so that anyone could search across all of them?

Sorry to throw a wrench in! None of the above makes this addition a bad idea - just thinking that it could be magical....

campmlc commented 3 years ago

I absolutely agree with a "genome" flag . . . possible? Or should we just move forward with this for now to make something that can work.

campmlc commented 3 years ago

But we still need a way in the interface to search for "genomic data"

Jegelewicz commented 3 years ago

But we still need a way in the interface to search for "genomic data"

I think the ID proposed in the issue would do that IF it is consistently applied (EVERY record with a current GenBank ID ALSO gets one of these). Which seems like duplication of effort. AND people searching KNOW to search for that particular OtherID, which is highly unlikely. If we can just flag otherIDs in the code table as "genomic", then the work is done for us and IDs only need to be recorded once. Add "only search records with genomic identifiers" (like the require tissues button) and you get what you want.
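A minimal sketch of the flag idea, for illustration only. It assumes a simplified stand-in for `ctcoll_other_id_type` with a hypothetical `is_genomic` column, plus a hypothetical `record_other_id` table holding the identifiers; the real Arctos schema is not shown in this thread, and the identifier values are placeholders.

```python
import sqlite3

# Hypothetical, simplified schema. The is_genomic column and the
# record_other_id table are assumptions made for this sketch.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE ctcoll_other_id_type (
        other_id_type TEXT PRIMARY KEY,
        is_genomic    INTEGER NOT NULL DEFAULT 0  -- the proposed flag
    );
    CREATE TABLE record_other_id (
        record_id     INTEGER NOT NULL,
        other_id_type TEXT NOT NULL REFERENCES ctcoll_other_id_type,
        display_value TEXT NOT NULL
    );
""")

# The flag is set once per identifier type, in the code table.
conn.executemany(
    "INSERT INTO ctcoll_other_id_type VALUES (?, ?)",
    [("GenBank", 1), ("NCBI BioSample", 1), ("collector number", 0)],
)

# Identifiers are recorded once per record, with no duplicate "Genome ID".
conn.executemany(
    "INSERT INTO record_other_id VALUES (?, ?, ?)",
    [
        (1, "GenBank", "ID-A"),
        (1, "collector number", "42"),
        (2, "collector number", "43"),
        (3, "NCBI BioSample", "ID-B"),
    ],
)


def records_with_genomic_ids(conn):
    """Return IDs of records with at least one identifier of a flagged type."""
    rows = conn.execute("""
        SELECT DISTINCT o.record_id
        FROM record_other_id o
        JOIN ctcoll_other_id_type t ON t.other_id_type = o.other_id_type
        WHERE t.is_genomic = 1
        ORDER BY o.record_id
    """)
    return [row[0] for row in rows]


genomic_records = records_with_genomic_ids(conn)
print(genomic_records)
```

Under these assumptions, flagging a new identifier type in the code table would bring its records into the search without touching any record data.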

campmlc commented 3 years ago

Can we put "Find all records with tissues", "Find all records with genomic data", "Find all records with sequence data" into some obvious search place, like in the Catalog Record box on search, but visible without "show more options"? Not just a tiny little check box hiding at top of page only for people who know where to look?

dustymc commented 3 years ago

The objectives are not clear, or perhaps have shifted. I'm not sure if this is a UI issue or a data issue.

One of the proposed solutions is not consistent with https://github.com/ArctosDB/arctos/issues/3593, while it seems that the data are mostly identical (there's an external resource of a certain type but in no particular place or format indicating a particular type of usage).

https://www.ncbi.nlm.nih.gov/genome/ exists but I have no idea how it ties in here.

I am adamantly opposed to any denormalization. "EVERY record with a current GenBank ID ALSO gets one of these" will simply not happen, cannot be necessary, and inevitably results in users finding only partial datasets.

KyndallH commented 3 years ago

I agree with having a flag that tags individuals with genetic data. I do not want "genomic id" as an ID since that is so vague. I know it adds more on the id list but I want "Genbank", "NCBI BioSample", "BoLD", "Sequence Read Archive", and all the future ways they identify genetic information on outside databases.

dustymc commented 3 years ago

but I want "Genbank",

https://arctos.database.museum/info/ctDocumentation.cfm?table=ctcoll_other_id_type#genbank

"NCBI BioSample",

https://arctos.database.museum/info/ctDocumentation.cfm?table=ctcoll_other_id_type#biosample

BoLD

https://arctos.database.museum/info/ctDocumentation.cfm?table=ctcoll_other_id_type#bold_barcode_id

"Sequence Read Archive"

https://arctos.database.museum/info/ctDocumentation.cfm?table=ctcoll_other_id_type#ncbi_sequence_read_archive_run_id

One possibly stupid idea: group those by adding some common prefix ("GenBank" becomes "genetic junk: GenBank"). We've done something similar with other data (eg https://arctos.database.museum/info/ctDocumentation.cfm?table=ctcoll_other_id_type#cmnh__carnegie_museum_of_natural_history), so that's not an entirely new flavor of weird. The search is (and probably will remain) a select multiple, users can just pick all options they're interested in. (They can do that now, but they're scattered out.)

Alternate maybe equally stupid idea: The code table has a sort order column, it could also group those things - but the not-so-alphabetical sort makes me twitchy.

KyndallH commented 3 years ago

Oh, I know they already exist! And we use them! What I'm understanding from the discussion is that they want to get rid of those in favor of a "Genomic ID" identifier to make searching for the data easier. I prefer the more descriptive identifiers.

dustymc commented 3 years ago

get rid of those for a "Genomic ID" identifier to make searching for the data easier.

Oh - yea, that would make things like creating the reciprocals on genbank somewhere between painful and impossible, I'm not a fan.

Jegelewicz commented 3 years ago

Alternate maybe equally stupid idea: The code table has a sort order column, it could also group those things - but the not-so-alphabetical sort makes me twitchy.

Radical idea - add a column to the code table, "Other ID group". I bet that there are other things that could be grouped together for purposes like this. For instance, MSB could group NK with all of their other "MSB" type identifiers.

campmlc commented 3 years ago

I agree we need some way to "require genomes" in the same way we "require tissues" or find vouchers.

acdoll commented 3 years ago

add a column to the code table, "Other ID group".

I like this idea. That Other Identifier Type list is getting pretty unwieldy (fortunately most of mine are near the top). Group ideas:

  • General object IDs (collector number, field number, ear tag...)
  • Arctos Institution IDs (internal IDs used by our collections) - would this need a subgroup for each institution?
  • Extraneous Institution IDs (IDs used by non-Arctos institutions: rehab centers, government agency IDs...)
  • Online data repositories/aggregators (GBIF, Dryad, Genbank...)

Jegelewicz commented 3 years ago

Arctos Institution IDs (internal IDs used by our collections) - would this need a subgroup for each institution?

I would skip this and just set up the institutional groups.

Online data repositories/aggregators (GBIF, Dryad, Genbank...)

defeats the purpose of putting all of the "genome" ids together but maybe we need to be able to assign IDs to multiple groups? Are we going overboard there?
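A rough sketch of what "multiple groups" could look like, assuming a hypothetical `other_id_type_group` membership table instead of a single column on the code table. The table name and the group names are made up for illustration and are not part of Arctos.

```python
import sqlite3

# Hypothetical membership table: one row per (identifier type, group) pair,
# so a type can belong to any number of groups.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE other_id_type_group (
        other_id_type TEXT NOT NULL,
        group_name    TEXT NOT NULL,
        PRIMARY KEY (other_id_type, group_name)
    );
""")
conn.executemany(
    "INSERT INTO other_id_type_group VALUES (?, ?)",
    [
        ("GenBank", "genetic data"),
        ("GenBank", "online repository"),
        ("GBIF", "online repository"),
        ("NK", "issued by MSB"),
        ("MSB:Mamm", "issued by MSB"),
    ],
)


def types_in_group(conn, group_name):
    """List the identifier types assigned to one group."""
    rows = conn.execute(
        "SELECT other_id_type FROM other_id_type_group "
        "WHERE group_name = ? ORDER BY other_id_type",
        (group_name,),
    )
    return [row[0] for row in rows]


def groups_for_type(conn, other_id_type):
    """List every group a single identifier type belongs to."""
    rows = conn.execute(
        "SELECT group_name FROM other_id_type_group "
        "WHERE other_id_type = ? ORDER BY group_name",
        (other_id_type,),
    )
    return [row[0] for row in rows]


print(types_in_group(conn, "online repository"))
print(groups_for_type(conn, "GenBank"))
```

In this sketch "GenBank" sits in both the genetic group and the repository group, which a single "Other ID group" column could not express.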

dustymc commented 3 years ago

add a column

Given the uses of this, how's that functionally different than an embedded prefix?

Or are there uses beyond "pick from the list..."?

Are we saying that identifier types are somehow data objects in their own right, or is this some UI-thing, or ????

General object IDs (collector number, field number, ear tag...)

I'd not lump field number in there - it's (usually) for a different kind of thing (lot, sorta-I-think, rather than item).

(internal IDs used by our collections) - would this need a subgroup for each institution?

  1. Some of those are functionally pre-printed collector numbers
  2. Some of them find their way across institutions (due to collaborative projects and etc.)

I'm not seeing clear categories in the data, adding arbitrary classifications seems like it would just add confusion. "This is an MSB number" and it's attached to a DMNS record and users pull their hair out and run away screaming.....

Jegelewicz commented 3 years ago

identifier types are somehow data objects in their own right

I believe so, but I could be convinced that I am wrong

I'm not seeing clear categories in the data

Here is one - these are all identifiers ISSUED by Museum of Southwestern Biology. All but one of them group together. I don't think MSB will want to change NK to MSB:NK, but I could be wrong. @campmlc

| ID | Definition | URL |
|---|---|---|
| NK | "New Mexico Karyotype Number," a frozen tissue collection number for the Museum of Southwestern Biology. | |
| MSB:Arth | Museum of Southwestern Biology, University of New Mexico, Arthropod Collection catalog number. | |
| MSB:Bird | Museum of Southwestern Biology, University of New Mexico, Bird Collection catalog number | http://arctos.database.museum/guid/MSB:Bird: |
| MSB:Fish | Museum of Southwestern Biology, University of New Mexico, Fish Collection catalog number. | http://arctos.database.museum/guid/MSB:Fish: |
| MSB Fish Lot ID | Museum of Southwestern Biology, University of New Mexico, Fish Collection lot identifier. | |
| MSB:Herp | Museum of Southwestern Biology, University of New Mexico, Herpetology Collection catalog number | http://arctos.database.museum/guid/MSB:Herp: |
| MSB:Host | Museum of Southwestern Biology, University of New Mexico, Host Collection catalog number | http://arctos.database.museum/guid/MSB:Host: |
| MSB:Inv | Museum of Southwestern Biology, University of New Mexico, Invertebrate Collection catalog number | http://arctos.database.museum/guid/MSB:Inv: |
| MSB:Mamm | Museum of Southwestern Biology, University of New Mexico, Mammal Collection catalog number | http://arctos.database.museum/guid/MSB:Mamm: |
| MSB: Museum of Southwestern Biology | Museum of Southwestern Biology, Albuquerque, New Mexico. Arctos Agent | |
| MSBObs:Mamm | Museum of Southwestern Biology, University of New Mexico, Mammal Observation Collection catalog number | http://arctos.database.museum/guid/MSBObs:Mamm: |
| MSB:Para | Museum of Southwestern Biology, University of New Mexico, Parasite Collection catalog number | http://arctos.database.museum/guid/MSB:Para: |

dustymc commented 3 years ago

ISSUED by

I don't think that can address the original request.

MSB:NK

They (or someone in a similar situation) won't, and that's maybe even weirder for that not-MSB record that's wearing an "MSB-issued" NK because someone ran off with a data sheet....

Jegelewicz commented 3 years ago

There are tissues in Alaska that have NK numbers because they came from MSB

Jegelewicz commented 3 years ago

I don't think that can address the original request.

Why not? We can group these IDs in many different ways. By issuer, by type (GenomeID), and I'm sure there are other ways we might think of in the future.....

dustymc commented 3 years ago

group these IDs in many different ways

That is not compatible with "add a column"!

https://handbook.arctosdb.org/how_to/How-to-Use-Issues-in-Arctos.html#issue-protips

So, we end up with a whole bunch of THINGS hanging off of "NK" - now what? I'm not seeing how that's useful. I have to select a category first, or ??

And I'm still not seeing how this addresses the original - I think some (Most? All?) of these resources (eg ID types) have both whole genome and other data; certainly the github.io URL from the original represents more than genomes.

Jegelewicz commented 3 years ago

Which is maybe this?

identifier types are somehow data objects in their own right

I believe so, but I could be convinced that I am wrong

I think we are asking a lot of these things and maybe we should be looking at them as more complex entities than we do now. See also #2847 #2216 #1902 and maybe some I am missing?

campmlc commented 3 years ago

It would be good if identifiers had determiners and dates and remarks, for example, to tie a GenBank number to a particular citation. Or even better, if we could tie the GenBank number to the actual tissue part that was sampled - which means cataloging MaterialSamples

dustymc commented 3 years ago

Some of those discussions are in regards to assertions that use the types, and some in regards to the types themselves. I think those are very different things (think taxonomy vs. identification) and that those discussions should not be confounded with each other, but I'm also open to the idea that we should be doing something radically different.

That's probably best discussed in a new/dedicated issue, BUT....

I'm sorta wondering if we need types at all. GenBank number (and maybe lots more) isn't necessarily a homogeneous thing, it's just a common place (url, API endpoint, format, etc. - maybe those are attributes of the type after all....) to store a fairly broad category of data. I'm not sure that any label we apply to the type (and so to all assertions using the type) can be adequate. Maybe we need some way to say "the data are at GenBank, and this is a [mitochondrial | whole genome | whatever] sequence" or "this NK number is a squirrel that some grad student ran over and dumped in the local museum, it has nothing to do with MSB or New Mexico or karyotypes."

That discussion really needs to start with big-picture goals; I'm not sure that sniping at the current model is going to lead anywhere useful. What, not how (for now), do we want to do with identifiers?

tie a GenBank number to a particular citation

https://github.com/ArctosDB/arctos/issues/1257

which means cataloging MaterialSamples

No....

KyndallH commented 3 years ago

I think a flag would be ideal plus a heck of a lot simpler than grouping all the different identifiers we have. The point is to be able to search for specimens with genetic data (has a Genbank, BioSample, BoLD number, etc.) without having to select all the different IDs under Other ID.

@campmlc Though are you wanting to find records that just have GENOMIC data (whole genome sequencing) or any genetic data (partial cytb)?

campmlc commented 3 years ago

The original request was to be able to find specimens with whole or partial genomes. Whatever tool could also be used to flag specimens that have CT scans, for example, or other future data categories that may have multiple different identifiers or urls to related repositories.

dustymc commented 3 years ago

(has a Genbank, BioSample, BoLD number, etc.)

That could be "just UI" - it's not great (updating the list used in the query and updating the identifiers would be completely separate, for example) but I think it's workable.

flag specimens that have CT scans,

https://github.com/ArctosDB/arctos/issues/3652#issuecomment-858714163

KyndallH commented 3 years ago

"whole or partial genomes" so in my opinion, this request would exclude Genbank numbers. Yes or no?

"just UI" makes it sound easy.

dustymc commented 3 years ago

"just UI" makes it sound easy.

Yep, I'd just need

campmlc commented 3 years ago

The idea was that a genome flag would be distinct from a GenBank flag. But if we have the option of a variety of flags, we could have an "ncbi" flag or even more specific - nucleotide, protein, or even specific gene, cytb or CO1. We could have "any genetic info" flag . . . That is, if it is easy to do this in the UI

dustymc commented 3 years ago

nucleotide, protein, or even specific gene, cytb or CO1. We could have "any genetic info" flag . . .

That's not UI, that's something on the order of https://github.com/ArctosDB/arctos/issues/3652#issuecomment-868703425

Jegelewicz commented 3 years ago

tie the GenBank number to the actual tissue part that was sampled - which means cataloging MaterialSamples

see https://github.com/ArctosDB/arctos/issues/3630#issuecomment-868693446

dustymc commented 2 years ago

Maybe this is better addressed in https://github.com/ArctosDB/arctos/issues/4101? https://github.com/ArctosDB/arctos/issues/3630 is definitely related. Both seem abandoned.

If there's something actionable in this, please clarify. If not, please close.

Jegelewicz commented 2 years ago

This kinda seems like a saved search? Which identifiers make something have a "Genome ID"? Create the Arctos wide search, save it with a name, modify it if new identifiers show up, somehow share it from the main search page?
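A minimal sketch of that saved-search idea, assuming a hypothetical `SavedSearch` object that is nothing more than a name plus an editable set of identifier types. None of these names come from Arctos, and the record data are placeholders.

```python
from dataclasses import dataclass, field


@dataclass
class SavedSearch:
    """A named, shareable search defined by a set of identifier types (hypothetical)."""

    name: str
    other_id_types: set[str] = field(default_factory=set)

    def matches(self, record_id_types) -> bool:
        # A record matches if it carries at least one of the listed types.
        return not self.other_id_types.isdisjoint(record_id_types)


# Placeholder records: record ID -> identifier types attached to it.
records = {
    1: {"GenBank", "collector number"},
    2: {"collector number"},
    3: {"NCBI BioSample"},
    4: {"NCBI Sequence Read Archive Run ID"},
}

genome_search = SavedSearch("Genome ID", {"GenBank", "NCBI BioSample"})
before = sorted(rid for rid, types in records.items() if genome_search.matches(types))

# "Modify it if new identifiers show up": only the saved search changes,
# the records themselves are untouched.
genome_search.other_id_types.add("NCBI Sequence Read Archive Run ID")
after = sorted(rid for rid, types in records.items() if genome_search.matches(types))

print(before, after)
```

Which identifier types belong in such a list is exactly the open question above; the sketch only shows that the list could live in the search rather than in the records.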