ArctosDB / arctos

Arctos is a museum collections management system
https://arctos.database.museum

Linking to whole genome identifiers #3068

Closed ccicero closed 2 years ago

ccicero commented 4 years ago

@Jegelewicz @campmlc @dustymc

I am trying to link MVZ:Bird:191951 to its genome sequence.

The full specifications and sequence are available at https://vgp.github.io/genomeark/Chiroxiphia_lanceolata/ Can/should we add GenomeArk as an other identifier? If not, how else to do this?

Also, there are links to primary and alternate assemblies in GenBank. https://www.ncbi.nlm.nih.gov/assembly/GCA_009829145.1 https://www.ncbi.nlm.nih.gov/assembly/GCA_009829205.1

I entered the URLs in Arctos in the prefix/string field for other IDs, but they are not resolving to the correct pages. What am I doing wrong? https://arctos.database.museum/guid/MVZ:Bird:191951

campmlc commented 4 years ago

Interesting - these are in Assemblies - not in Nucleotide. I wonder if the repository makes a difference?

I support adding GenomeArk as an identifier. Do we have a contact there so we can see if we can set up the reciprocal?

Jegelewicz commented 4 years ago

I wonder if the repository makes a difference?

It does. The base URL for Genbank nucleotides is http://www.ncbi.nlm.nih.gov/nuccore/ but for these assemblages it is https://www.ncbi.nlm.nih.gov/assembly/ (I think - but I'll need to double check that.)

Jegelewicz commented 4 years ago

I don't know what an "assemblage" is, so I don't know if we should create a new other ID for it.

The Assembly database has information about the structure of assembled genomes as represented in an AGP file or as a collection of completely sequenced chromosomes. The database provides a versioned Assembly accession number that tracks changes to assemblies as they are updated by submitting groups over time. The web resource provides meta-data about assemblies such as assembly names (and alternate names), simple statistical reports of the assembly (type and number of contigs, scaffolds; N50s) and a history view of updates. It also tracks the relationship between an assembly submitted to the International Nucleotide Sequence Database Collaboration (INSDC), i.e. DDBJ, ENA or GenBank, and the assembly represented in the NCBI Reference Sequence (RefSeq) project.

Doesn't sound like "this stuff came from this specimen", but I am not sure....

ccicero commented 4 years ago

I don't understand how this works, and don't know anyone at GenomeArk but it would be worth investigating. I just know that we have the voucher and tissue for this genome, so we need to link it somehow.

campmlc commented 4 years ago

An assemblage is the reconstruction of the order of an entire genome, not just the sequence of a single gene, and may include gene positions on chromosomes. It comprises lots of different subfiles and is its own database of sorts. Perhaps add both a "GenomeArk" identifier and also a "Genome/Assemblage ID" that could cover other databases? I'll ask around here.

dustymc commented 4 years ago

I greatly dislike https://vgp.github.io/genomeark/Chiroxiphia_lanceolata/ as an OtherID - it's a weird github-generated URL with no specific identifiers. Strongly suggest using Media to link.

All base-url-having IDs want the "suffix" as the identifier, not the whole URL. I'm still not quite sure what a genbank assemblage is, maybe we should have it as a type. Sorta looks like another view of https://www.ncbi.nlm.nih.gov/biosample/SAMN12620979/, which is a type, to me, but ???

I can't find a "normal" genbank accession for MVZ:Bird:191951 - should there be?
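The "suffix, not the whole URL" rule above can be sketched roughly as follows. This is only an illustration, not the Arctos implementation: the base URLs are inferred from links quoted in this thread, and the dictionary labels and both function names are hypothetical.

```python
# Base URLs inferred from links quoted in this thread. The keys are
# illustrative labels, not necessarily the exact Arctos code-table values.
BASE_URLS = {
    "GenBank": "http://www.ncbi.nlm.nih.gov/nuccore/",
    "NCBI Biosample": "https://www.ncbi.nlm.nih.gov/biosample/",
    "NCBI Assembly": "https://www.ncbi.nlm.nih.gov/assembly/",
}


def resolve_other_id(id_type: str, value: str) -> str:
    """Hypothetical resolver: the link is simply base URL + stored value."""
    return BASE_URLS[id_type] + value


def to_suffix(id_type: str, value: str) -> str:
    """Reduce a pasted full URL to the bare suffix the resolver expects."""
    base = BASE_URLS[id_type]
    # Compare without the scheme so http/https differences do not matter.
    bare_base = base.split("://", 1)[1]
    bare_value = value.strip().split("://", 1)[-1]
    if bare_value.startswith(bare_base):
        return bare_value[len(bare_base):].strip("/")
    return value.strip()


pasted = "https://www.ncbi.nlm.nih.gov/assembly/GCA_009829145.1"
# Storing the whole URL doubles the base and produces a dead link.
broken = resolve_other_id("NCBI Assembly", pasted)
# Storing only the suffix resolves to the intended page.
fixed = resolve_other_id("NCBI Assembly", to_suffix("NCBI Assembly", pasted))
```

Under these assumptions, `broken` contains the base URL twice, which would explain why the pasted URLs did not resolve, while `fixed` is the original assembly link.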

ccicero commented 3 years ago

I'm revisiting this issue so we can figure out how to properly link vouchers to reference genomes. There are three issues here.

1) How do we link to the whole reference genome information on GenomeArk e.g., https://vgp.github.io/genomeark/Chiroxiphia_lanceolata/

I added a media per Dusty's suggestion, but the description doesn't show, so one would not know that it's a reference genome. We need a better way of displaying that.

2) How do we link to the assemblies (primary and alternate - not sure what the diff is)? https://www.ncbi.nlm.nih.gov/assembly/GCA_009829145.1 https://www.ncbi.nlm.nih.gov/assembly/GCA_009829205.1

3) Should we also link to the sequences on individual chromosomes? In this case there are 35, here's the first one: e.g., https://www.ncbi.nlm.nih.gov/nuccore/CM020533.1

Discussion for an issues meeting?

ewommack commented 3 years ago

What about to the whole genome info (https://www.ncbi.nlm.nih.gov/genome/?term=txid296741[orgn])? That seems to have links to each assembly and the chromosomes.

Can you share what the Arctos entry looks like with the link?

And feel free to add to the AWG Issues meeting. The draft agenda is made up in the Arctos Shared folder. Let me know if you need a link.

ccicero commented 3 years ago

I didn't try that, but I'm not sure how to link that either.

The MVZ specimen is https://arctos.database.museum/guid/MVZ:Bird:191951

I'm trying different ways but can't get it to work.

Also, I noted that there's no reference back to the voucher! I will work on that once I figure out how we best link to the GenBank data.

dustymc commented 3 years ago

I'm certainly more comfortable with another NCBI identifier than with some random temporary instaurl.

Is https://www.ncbi.nlm.nih.gov/genome/?term=txid296741%5Borgn%5D an "acquire dead bird, create thing" situation, or is that some sort of "project" involving components created from dead birds?

@ccicero please undo whatever you've done with the genbank links on https://arctos.database.museum/guid/MVZ:Bird:191951 - I do not sanity check those data, and if we ship that to GenBank they will complain.

ccicero commented 3 years ago

@dustymc I just deleted the two GenBank links until we get this figured out.

I'm not sure of the question. They collected a bird, sequenced its genome, sent us the specimen and tissue to archive. We need to link the specimen to the genome data, and they need to link back to the specimen record. Right now, there is no linkage either way.

dustymc commented 3 years ago

sequenced

Which is generally in nucleotide, which we already communicate with. I suspect that https://www.ncbi.nlm.nih.gov/genome/?term=txid296741%5Borgn%5D is an assemblage of nucleotide-bits, not something that exists on its own.

https://www.ncbi.nlm.nih.gov/genome/?term=txid296741%5Borgn%5D links to https://www.ncbi.nlm.nih.gov/nuccore/CM020535.1 (what we normally link to), it links to https://www.ncbi.nlm.nih.gov/genome/86579 which looks a lot like the first page to me.

ccicero commented 3 years ago

Yes, an assemblage of nucleotide bits.

https://www.ncbi.nlm.nih.gov/nuccore/CM020535.1 is one chromosome of many. We could link to each one separately, but it would be better (I think) to link to the assembly. https://www.ncbi.nlm.nih.gov/genome/?term=txid296741%5Borgn%5D

Or the original page: https://www.ncbi.nlm.nih.gov/assembly/GCA_009829145.1

dustymc commented 3 years ago

It seems to me that the nucleotide links are more directly "things you can do with a dead bird," and they're certainly more in line with everything else that gets linked to GB. If I'm understanding the situation, my vote would be for the normal links to GB plus the assembly (and whatever other cool things have been done with the sequences) as Media. (Or maybe it's a Publication? Those are also cool things that happen from sequences.)

campmlc commented 3 years ago

Whole genomes are not nucleotide sequences of a few hundred base pairs or a single gene generated by Sanger sequencing. They are assemblages/reconstructions of an organism's entire genome (x billion base pairs) produced by next-generation sequencing, which generates orders of magnitude more data than can be visualized on the nucleotide sequence page at NCBI. Hence a separate repository and a lot more data storage are needed. There are many such repositories in the works now outside of GenBank, and we will need to figure out a way to link to these.

ccicero commented 3 years ago

True, but this particular assembly is in GenBank (along with other repositories) - so we need to figure that linkage too.

And whatever is done, there needs to be a linkage back to the voucher as well.

campmlc commented 3 years ago

Reviving this issue - we need a valid solution. Is there a paper that cites https://vgp.github.io/genomeark/Chiroxiphia_lanceolata/? Vertebrates Genome Project = GenomeArk ID?

ccicero commented 3 years ago

I have not yet seen a publication, but here is a response from author Emily Duval who provided the following info from Chris Balakrishnan and Mike Braun:

If you usually link to the BioSample, the link would be: https://www.ncbi.nlm.nih.gov/biosample/SAMN12620979/ and should be stable.

If you want to link to the assembly itself, link to: https://www.ncbi.nlm.nih.gov/assembly/GCF_009829145.1

Mike Braun recommends doing both! And says "Really, the museum community should be working with B10K and NCBI to establish best practices."

campmlc commented 3 years ago

So we need different "GenBank" urls in addition to nucleotide - we need biosample and assembly.

Jegelewicz commented 3 years ago

So we need different "GenBank" urls in addition to nucleotide - we need biosample and assembly.

NCBI Biosample is already in the code table.

Table https://arctos.database.museum/info/ctDocumentation.cfm?table=ctcoll_other_id_type

Value NCBI Assembly

Definition A database providing information on the structure of assembled genomes, assembly names and other meta-data, statistical reports, and links to genomic sequence data.

Other ID BaseURL

Base URL = https://www.ncbi.nlm.nih.gov/assembly/

Functional example = https://www.ncbi.nlm.nih.gov/assembly/GCF_009829145.1
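As a rough check of the proposed entry, the sketch below validates an accession before appending it to the base URL. The accession pattern (GCA_ or GCF_, nine digits, a dot, a version number) is an assumption based on the examples in this thread, and `assembly_url` is a hypothetical helper, not an Arctos or NCBI function.

```python
import re

# Base URL from the proposed code-table entry.
ASSEMBLY_BASE_URL = "https://www.ncbi.nlm.nih.gov/assembly/"

# Assumed accession shape: GCA_ (GenBank) or GCF_ (RefSeq) prefix,
# nine digits, a dot, then a version number.
ASSEMBLY_ACCESSION = re.compile(r"^GC[AF]_\d{9}\.\d+$")


def assembly_url(accession: str) -> str:
    """Build an Assembly link, rejecting anything that is not a bare accession."""
    accession = accession.strip()
    if not ASSEMBLY_ACCESSION.match(accession):
        raise ValueError(f"not a bare assembly accession: {accession!r}")
    return ASSEMBLY_BASE_URL + accession


url = assembly_url("GCF_009829145.1")
```

A check like this would reject a pasted full URL instead of silently storing an identifier that cannot resolve.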

KyndallH commented 3 years ago

Not sure how this plays into this all but we have specimens that have the NCBI Biosample along with the NCBI Sequence Read Archive Run ID linked in Arctos.

Screen Shot 2021-06-03 at 3 09 59 PM

UAM:Bird:11856

campmlc commented 3 years ago

I still think we should add all these separately but create a new ID for GenomeArk.

dustymc commented 3 years ago

See above, I still don't think a "github sandbox" URI that they don't own or control meets the criteria to be called an identifier. Media can form the link and won't be mistaken for something it cannot be.

campmlc commented 3 years ago

I guess I don't understand how media can be used in this context. Can you explain?

Here are other repositories for genomes: SRA: https://www.ncbi.nlm.nih.gov/sra (this is the genome version of GenBank, I would think they would be connected, but you never know with NCBI...)

Ensembl: https://uswest.ensembl.org/index.html

ENA: https://www.ebi.ac.uk/ena/browser/home

dustymc commented 3 years ago

how media can be used in this context

Carla's record has been a good example for a while, I added https://handbook.arctosdb.org/how_to/How-to-Create-Media-Images.html#attach-anything-with-a-url-to-any-media-node to the docs.

Looks like even the authors don't recommend the github-thing, can we close this?

campmlc commented 3 years ago

Can we add Carla's example to the documentation? Also, I would never think to look in How to Create Media Images for a method to link to a non-image URL such as genomic data. Perhaps we make this searchable under a separate topic, and link to it from the Images page?

ccicero commented 3 years ago

I agree, seems strange to me to use media for this sort of thing. How would anyone know to look there for related genomic data?

I still don't really get why genomeark is different from other types of identifiers.

In any case, there are some other issues going on that should be addressed.

The Biosample references the Arctos GUID for the MVZ Bird record, but there is no link back to Arctos. Is this something we need to address with NCBI? https://www.ncbi.nlm.nih.gov/biosample/SAMN12620979/

Ideally we also want to link to the assembly at NCBI - how do we do that? https://www.ncbi.nlm.nih.gov/assembly/GCF_009829145.1

I don't see the Biosample linked to the Assembly either, seems like that should be done on the GenBank side of things?

Jegelewicz commented 3 years ago

The Biosample references the Arctos GUID for the MVZ Bird record, but there is no link back to Arctos. Is this something we need to address with NCBI?

My gut feeling is yes. I don't know who or how the reciprocal links for sequences were arranged, but whoever did that should probably contact whoever helped at GenBank and add these other resources to the protocol.

campmlc commented 3 years ago

I believe our original contact person there has passed away? We need to find a new contact.

ccicero commented 3 years ago

Yes, that is correct - Scott Federhen was our contact but he passed away several (?) years ago.

I don't know who the current contact might be. @dustymc ?

ccicero commented 3 years ago

Also, if we could get GenBank to link the Biosample to the Assembly, then we'd just need to worry about linking to the Biosample - correct?

dustymc commented 3 years ago

seems strange to me to use media for this sort of thing.

You have an external resource, and Media is built to include those. A decent thumbnail wouldn't hurt, but I don't think it's hard to discover now.

really get why genomeark is different from other types of identifiers.

I don't see any similarities! That record is but one component of something that's designed not to be stable or to grow, it doesn't do ANYTHING that other IDs do, why are we looking at identifiers?!

no link back to Arctos.

There are about 600 others in Arctos, they do link, it takes a while.

don't see the Biosample linked to the Assembly

It's there, top right of the form (at least in my browser).

then we'd just need to worry about linking to the Biosample - correct?

I agree, I don't think it's necessary to link from Arctos to all of the derivatives; you're just setting yourself up for more work and confusing data when something you can't control changes.

know who the current contact might be.

I just use the linkout contact, they're good at telling me how to do existing things, but it's nothing like the relationship was when Scott was around. A higher-up or more dev-oriented contact would be very useful.

ccicero commented 3 years ago

OK, so I just added the NCBI Biosample identifier to the record and that linkout works:

https://arctos.database.museum/guid/MVZ:Bird:191951

One thing that's not clear from Arctos is that the Biosample represents the whole genome. Here's the definition of Biosample from NCBI, which is also in the Arctos CT - pretty broad, and could include samples used in any experiments. It would be nice to somehow flag Arctos records associated with whole genomes, so we should discuss how to do that.

I won't worry about downstream links as Dusty pointed out that the Assembly is linked, I just missed that.

I have the GenomeArk record linked as media, but it's ugly. I can email Emily for permission to use her image, but (probably a different issue) the description gets cut off. Maybe we need a 'short title' for media that has limited number of characters and will show completely on the detail page, then a more complete description?

ccicero commented 3 years ago

Also @dustymc , are you saying that there is a lag and the NCBI record will be linked back to Arctos? It's been a long time since this went up, so I don't think that's the issue (?).

campmlc commented 3 years ago

Agree that we need a simple way to find "genomes" - something equivalent to or better than our "tissues" flag at the top of the page.

dustymc commented 3 years ago

Biosample represents the whole genome.

I don't think it (necessarily) does, right? Biosamples produce sequences, those can be assembled into things like whole genome.

somehow flag Arctos records associated with whole genomes,

Are Assemblies anything else? If not, that might be sufficient reason to denormalize.

'short title' for media

https://github.com/ArctosDB/arctos/issues/2813

long time

https://github.com/ArctosDB/arctos/issues/1929

campmlc commented 3 years ago

Remind me why we can't just add an identifier called "Genome ID" with a URL to whatever site as the ID? We could do this in addition to NCBI Biosample etc., but someone could search on Genome IDs and get all the different flavors, varieties, and URLs.

ccicero commented 3 years ago

Correct, Biosample is not necessarily the entire genome.

I like Mariel's suggestion of a 'Genome ID' with whatever identifier goes with it.

My email to Emily Duval re: the manakin voucher is from February, and I think that's been changed for at least two months, maybe more? So still not sure that's the issue. I can submit a query to NCBI about that.

@ewommack Media issues re: how descriptions display on record detail page: can we add to one of the next issues meetings?

Jegelewicz commented 3 years ago

Media issues re: how descriptions display on record detail page: can we add to one of the next issues meetings?

Please see #2813

dustymc commented 3 years ago

"Genome ID"

So Media, but without any of the tools - that's more palatable than creating a dedicated type for a temporary URI, but I can't say I love it either.

Paralleling https://github.com/ArctosDB/arctos/issues/3593 would provide a more consistent approach - I'm not sure that's critical, but it usually finds a way to turn out to be a Good Thing.

query to NCBI

My janitor scripts have cleaned out what I actually sent them, but I can confirm that the scripts currently build a file with your record...

query: SAMN12620979
base: &base.url;
rule: MVZ:Bird:191951
name: MVZ:Bird:191951

... and that NCBI claims to have received something 11 days ago.

Screen Shot 2021-06-07 at 6 13 02 PM

There's no nucleotide entry on your bird - I think there should be, maybe that's somehow confusing some script somewhere?
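The four-line entry quoted above could be produced by something like the following. This is only a sketch that mirrors the fields exactly as they appear in this comment; it is not the actual Arctos script, it makes no claim about the rest of the LinkOut file format, and `linkout_stanza` is a hypothetical name.

```python
def linkout_stanza(query: str, guid: str, base: str = "&base.url;") -> str:
    """Return one entry shaped like the stanza quoted in this comment.

    Only the four fields shown in the thread (query, base, rule, name)
    are emitted; anything else a real file might need is out of scope.
    """
    lines = [
        f"query: {query}",  # the NCBI accession being linked from
        f"base: {base}",    # placeholder copied verbatim from the thread
        f"rule: {guid}",    # the Arctos GUID, as in the quoted entry
        f"name: {guid}",    # display name, the same GUID in the quoted entry
    ]
    return "\n".join(lines)


stanza = linkout_stanza("SAMN12620979", "MVZ:Bird:191951")
```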

campmlc commented 3 years ago

Using Genome ID this way makes the link searchable by anyone. I could find all records in all collections with an identifier that = Genome ID. We could still do a media linkage to get to the tools - no reason not to have both. But using an identifier is where people EXPECT to find this info - just like GenBank accessions.

campmlc commented 3 years ago

This also begs the question that comes with GenBank Accessions as well - is there any way to add a date and determiner to Other IDs? We get multiple GenBank accessions from multiple papers/citations over time on well researched specimens - yet they are all just lumped in a random list in identifiers.

dustymc commented 3 years ago

no reason not to have both

There are few things you could do to make the data less accessible.

ccicero commented 3 years ago

@dustymc I don't understand how the NCBI scripts work, but it sounds like something is not processing correctly on their end so I will contact them. Can you please forward that linkout email to my email? Thanks.

I agree with @campmlc that one would expect to look under identifiers, not media, for genome-related information.

Jegelewicz commented 3 years ago

I suggest that you guys either change the title of this issue and request an OtherID = Genome ID with a definition and all the appropriate stuff, including making it free text so that you can add whatever-the-heck, including GitHub URLs, or open another issue and do that.

ccicero commented 3 years ago

@Jegelewicz I just changed the title of this issue to make it more generic. If we agree that an OtherID of 'Genome ID' is what we want, I can create a new issue for that.

Re: the linking issue, here is the case and response from NCBI re: linking to the manakin sequence associated with: https://arctos.database.museum/guid/MVZ:Bird:191951 https://www.ncbi.nlm.nih.gov/biosample/SAMN12620979


From: linkout@ncbi.nlm.nih.gov Date: Thu, May 27, 2021 at 4:43 PM Subject: LinkOut FT files sent to NCBI - acknowledgement To: dustymc@gmail.com

Dear staff at Arctos Specimen Database,

We received your LinkOut Simple Text (FT) file "biosample_1.ft", date "2021-05-27,06:10:02. It will be processed and the links will be available to users in 48 hours.


Case Information: Case #: CAS-734413-W7Y8C0 Customer Name: Carla Cicero Customer Email: ccicero@berkeley.edu Case Created: 6/8/2021, 12:34:11 PM

Summary: Biosample linkout issue

Details: I am writing re: linkouts from NCBI Biosamples to the Arctos database. This Arctos record is linked to Biosample SAMN12620979, and the Biosample should link back to the Arctos record through either the Specimen Voucher or Voucher URL field as part of an automated script. We received this message, but the linkout still has not happened. We want to be sure that the script is working correctly. Can you please check the linkout script issue and resolve it so that the Biosample links back to Arctos? Thank you.


RESPONSE 17 Jun 2021: I reviewed the file uploaded on May 27th for Biosample: biosample_1.ft and I found that there isn’t an entry for SAMN12620979.

Lidia Hutcherson LinkOut Development Team


Question for @dustymc - it sounds like it's on the Arctos end in the LinkOut Simple Text (FT) file "biosample_1.ft" that gets submitted by Arctos - ??? Does that make sense to you?
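One way to settle that question on the Arctos side would be to scan the generated file for the accession before it is sent. The sketch below assumes the file consists of "key: value" lines like the entry quoted earlier in this thread; the sample text and the function name `queries_in_ft` are hypothetical, and the real biosample_1.ft may be laid out differently.

```python
def queries_in_ft(text: str) -> set:
    """Collect the value of every 'query:' line in LinkOut-style text."""
    found = set()
    for line in text.splitlines():
        # Split on the first colon only, so GUIDs such as MVZ:Bird:191951
        # stay intact in the value.
        key, sep, value = line.partition(":")
        if sep and key.strip().lower() == "query":
            found.add(value.strip())
    return found


# Hypothetical file content, shaped like the entry quoted in this thread.
sample = """query: SAMN12620979
base: &base.url;
rule: MVZ:Bird:191951
name: MVZ:Bird:191951
"""

has_entry = "SAMN12620979" in queries_in_ft(sample)
```

Running the same check against the file actually uploaded on May 27th would show whether the entry was missing at build time or was added only in a later build.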

dustymc commented 3 years ago

I think maybe this should be merged with https://github.com/ArctosDB/arctos/issues/3652?

See https://github.com/ArctosDB/arctos/issues/3068#issuecomment-856367588, I think the scripts are working properly and suspect it was just bad timing. The next push is scheduled to happen in 3 days; suggest we see if that works, and if it doesn't I'll use the data to figure out why with GenBank.

ccicero commented 3 years ago

@dustymc OK, let's see what happens in the next push. How often does that push occur?

dustymc commented 3 years ago

How often

https://github.com/ArctosDB/arctos/issues/3068#issuecomment-856304921

https://github.com/ArctosDB/arctos/issues/1929

monthly

dustymc commented 3 years ago

Confirmed linkouts are working

Screen Shot 2021-06-28 at 7 11 32 AM