ArctosDB / arctos

Arctos is a museum collections management system
https://arctos.database.museum

Error bulkloading other identifiers #6064

Closed campmlc closed 8 months ago

campmlc commented 1 year ago

I'm trying to bulkload other identifiers that include some new flavors of GenBank identifiers that are not actually GenBank, but rather come from different databases at NCBI, such as BioSample (which is in Arctos) and Sequence Read Archive (SRA) (which is not). I get the following error message. Not sure if it is me doing something weird, or that I included a URL in remarks (since we don't have a URL field in the other identifier bulkloader yet), or something else. Help?

data file trying to load: jbi14362-sup-0001-tabels1_GenBank IDs bulkload.zip

Screenshot 2023-03-24 21 14 48

dustymc commented 1 year ago

This is like campaign material for what I've been trying to say the last couple weeks...

"Sequence Read Archive (SRA) (which is not)."

Of course it is, https://arctos.database.museum/info/ctDocumentation.cfm?table=ctcoll_other_id_type#ncbi_sequence_read_archive_run_id

BUT - IDK what you've got here, but it's not that - and what the heck is a "read archive run ID"?? Anyway, whatever we have doesn't seem to like whatever you have.

We've tried to figure it all out a few times with a few people and maybe everyone else understands it, but I do not. The really great news is, I don't have to (and neither do you, nor does the next person). We don't even have to care if this is an independent identifier (I think maybe one of the conversations decided it was not??) or the CORRECT identifier (vague memories of perhaps being able to get there from nucleotide and this is all redundant??). None of that matters: choose some easy-to-understand type and enter the full identifier - https://www.ncbi.nlm.nih.gov/sra/SRS11217939, never SRS11217939 - and you've done something that's functionally identical to understanding all of GenBank, understanding all of Arctos, choosing the correct type (or requesting a new one), tearing the identifiers apart, and putting them back together properly for load. (And if that is redundant or wrong then the next person should still have no trouble back-tracking, finding, and perhaps creating what they wish you had done.)
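To make the "full identifier, never the bare accession" point concrete, here is a minimal sketch of a pre-load check. The helper names (`is_full_url`, `as_full_identifier`) are hypothetical and not part of Arctos, and the base URL is simply lifted from the example above; this assumes nothing about how other NCBI databases resolve.

```python
from urllib.parse import urlparse

# Base taken from the example URL in the comment above (an assumption for
# illustration); other NCBI databases would need their own base, which this
# sketch does not attempt to guess.
SRA_BASE = "https://www.ncbi.nlm.nih.gov/sra/"


def is_full_url(value: str) -> bool:
    """Return True when the value already looks like an http(s) URL with a host."""
    parts = urlparse(value.strip())
    return parts.scheme in ("http", "https") and bool(parts.netloc)


def as_full_identifier(value: str, base: str = SRA_BASE) -> str:
    """Return the value unchanged if it is a URL, otherwise prefix it with base."""
    value = value.strip()
    if not value:
        raise ValueError("empty identifier")
    return value if is_full_url(value) else base + value
```

Run over the identifier column before loading, something like this would turn SRS11217939 into the resolvable form and leave values that are already URLs alone.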

The ONLY difference between these and the nonresolvable types which have caused so much anxiety over the last few weeks is that when something is wrong here, it's immediately noticeable - the system KNOWS it's wrong and tells you about it. Without that, you have to hope that whatever you've done makes sense to the next user (in a decade or century, and very likely with a wildly different background and assumptions), and clearly the more compartments there are, the less likely that is to be true.

I would still use identifier (which is still not a thing because https://github.com/ArctosDB/arctos/issues/6005 is somehow stuck) for these (they're self-documenting through function and really don't need much metadata), but some sort of "genetic identifier" probably isn't an unrecoverable overcategorization, and we do seem to love pigeonholes.

Remarks is never an appropriate place for identifiers. It should be reserved for remarkable information - "tag faded maybe that's a three?"

You (fortunately, perhaps) have a bunch of blank columns at the right and a bunch of blank rows at the bottom of your CSV.
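If it helps, blank rows and blank trailing columns like these can be stripped before upload. This is only a sketch: `trim_blank_cells` is a hypothetical helper, not an Arctos tool, and it assumes a plain comma-delimited file held in memory as text.

```python
import csv
import io


def trim_blank_cells(text: str) -> str:
    """Drop all-blank rows and all-blank trailing columns from CSV text."""
    rows = list(csv.reader(io.StringIO(text)))
    # Drop every row whose cells are all empty or whitespace
    # (this removes blank rows anywhere, not only at the bottom).
    rows = [r for r in rows if any(c.strip() for c in r)]
    if not rows:
        return ""
    # Width = position of the right-most cell that holds a value in any row.
    width = max(
        (i + 1 for r in rows for i, c in enumerate(r) if c.strip()),
        default=0,
    )
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    for r in rows:
        # Pad short rows, then cut to the common width.
        writer.writerow((r + [""] * width)[:width])
    return out.getvalue()
```

For example, `trim_blank_cells("a,b,,\n1,2,,\n,,,\n")` keeps only the two populated columns and the two populated rows.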

Jegelewicz commented 1 year ago

Here is what I recommend - but I don't understand why the biosamples are "same individual as" and the others are "self"....

jbi14362-sup-0001-tabels1_GenBank IDs bulkload.csv

dustymc commented 1 year ago

"same individual as"

This is a reference to a different record. "Same alleged source organism, but we didn't supply the material."

"self"

"current cataloged item" - the ID belongs to "this" record because the parts which lead to the identifier came from this record.

Jegelewicz commented 1 year ago

These are really confusing concepts and I sorta dislike the "self" thing because two parts of one "self" can be cataloged at two different institutions (and I think this goes for sequences derived from a "self"). Why are we treating sequences cataloged at NCBI differently from skins cataloged at a different institution than a skeleton from the same organism?