Closed ebraker closed 1 year ago
- MorphoSource ARK - autoassigned (considered by MS team to be most stable identifier along with DOI) base path: https://n2t.net/ark:/ (FYI we already have other_ID=ARK that uses this path)
- MorphoSource DOI - assigned upon request (considered by MS team to be most stable identifier along with ARK) base path: https://doi.org/
Seems like those should just be
Especially since "FYI we already have other_ID=ARK that uses this path"
Agreed, but perhaps we create a 'MorphoSource ARK' value so that we can easily retrieve records linked to MorphoSource? I'm currently using ARKs for other things and would like to be able to narrow MS-relevant queries.
It just means we will end up with an infinite number of XXX ARK and XXX DOI. Dusty already hates that table....
Yeah...I figured as much. I don't love it either but I think it would promote consistency in users grabbing the preferred MorphoSource identifier for linking 3D media.
I've said my piece, others can weigh in.
hates that table...
Just the parts of it without a good base_url.
I sorta lean towards including https://arctos.database.museum/info/ctDocumentation.cfm?table=ctcoll_other_id_type#ark in 'not good' - it DOES STUFF which is pretty great, but users can't have any idea what it's going to do until they click it. Image? Text? Related? 50 terabytes of cat pictures? Click here to find out....
If we can agree on that much, then we need to talk about what the identifier is. "65665/339ec2704-9d3b-4ee6-a0b1-ddb05dfff745" SORTA is, but not really - I'm not so sure the components are separable, http://n2t.net/ark:/65665/339ec2704-9d3b-4ee6-a0b1-ddb05dfff745 is the actual ID, we don't need a base at all.
BUT of course we use base for all sorts of things so that's - well, mildly inconvenient, maybe.
If they're actually generating ARKs for everything, that's definitely what should be used - they seem to get a new URL every few weeks, we can't deal with that, ARKs can (assuming they're updating the metadata....).
As for what to do with it - I don't know. I'd probably lean towards:
- type: MorphoSource Thingee ID
- base_url: NULL
- expected value: something that starts with, and does not end with, http://n2t.net/ark: (which I could check via trigger - I already do that for some types)
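A minimal sketch of the "starts with and does not end with" check described above, written in Python rather than as the database trigger mentioned in the comment. The function name is hypothetical, and the only rule encoded is the one stated here; nothing about the real trigger is assumed.

```python
# Resolver prefix quoted in the comment above.
ARK_RESOLVER_PREFIX = "http://n2t.net/ark:"


def looks_like_full_ark(value: str) -> bool:
    """Return True if value starts with the resolver prefix and has more after it.

    Hypothetical helper: it mirrors only the stated rule (starts with, and
    does not end with, the prefix). It does not verify that the ARK resolves.
    """
    value = value.strip()
    return value.startswith(ARK_RESOLVER_PREFIX) and not value.endswith(ARK_RESOLVER_PREFIX)


# The full identifier from the discussion passes; the bare prefix and the
# bare "NAAN/name" fragment do not.
print(looks_like_full_ark("http://n2t.net/ark:/65665/339ec2704-9d3b-4ee6-a0b1-ddb05dfff745"))
print(looks_like_full_ark("http://n2t.net/ark:"))
print(looks_like_full_ark("65665/339ec2704-9d3b-4ee6-a0b1-ddb05dfff745"))
```

Note that a check this loose still accepts a value such as the prefix followed by a single slash, so it catches empty or base-less values rather than validating the ARK itself.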
I was just about to comment that this works for me, but now I realize we may want this to be the best practice for MEDIA linked to MorphoSource rather than for identifiers in the catalog record. 'MorphoSource Specimen ID' might be best linked with other_IDs, since the Specimen ID takes users to a landing page for the catalog record in MorphoSource, to which any number of media may be linked (see example). Unfortunately the Specimen ID page does not have an ARK identifier.
Therefore, proposed values:
best practice for MEDIA linked
It really depends on what you're trying to do. (https://handbook.arctosdb.org/how_to/How-to-Use-Issues-in-Arctos.html#issue-protips) If this is Step One in plugging into an API or similar, then Media is the wrong tool; if not, then maybe Media makes this a lot more approachable.
We'd talked about using MS as a sort of "viewer" at one point, I'm not sure how viable that still is. My normal recommendation would be to set those up as catalog record--->whatever's nicest in a browser--->archival bits. That includes a presumption that the visible/middle piece is stable, and that doesn't look like a sure thing from here - always your call, but perhaps it's better to link everything directly to the catalog record.
Media does not need any sort of predictable formula, you can just link to whatever URL you want people to access.
MorphoSource Specimen ID;
Can that be reconciled with https://arctos.database.museum/info/ctDocumentation.cfm?table=ctcoll_other_id_type#morphosource ? I really don't want two of these, there's no way that will be used properly, hopefully they won't push us into that corner....
@Jegelewicz @ebraker @dustymc @atrox10 @mkoo
Revisiting this for discussion at the Code Table meeting 1/20/2022. Carol, Michelle, and I met with Doug and Julie from Morphosource last week to discuss a number of things. Relevant to Arctos is that the base URL in our code table is not correct - it worked for MS1 (legacy records are being redirected) but is not the best URL for MS2. Also, specimen IDs in MS2 require 9 characters.
Current base URL in Arctos: https://www.morphosource.org/Detail/SpecimenDetail/Show/specimen_id/
As noted above and confirmed by Julie, the base URL to use for MS2 is https://www.morphosource.org/biological_specimens/
For example, MVZ Herp 100404 has the identifier number in Arctos as 3745. That is being redirected to https://www.morphosource.org/biological_specimens/0000S3745. The padded '0's and 'S' were added in MS2 to satisfy the 9 character requirement.
The issue is that '3745' is being passed to data aggregators (iDigBio, GBIF), and those aren't linked so if you search just '3745' in MS you get a different specimen. You need to search 0000S3745 in MS to get the correct specimen. Not intuitive!
Required action: 1) change base URL to https://www.morphosource.org/biological_specimens/ in the Arctos code table 2) Enter the full specimen ID with padded '0's and 'S' to linkout correctly and to pass the correct identifier number to data aggregators.
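A rough sketch of the padding step, assuming the rule can be inferred from the single example above (3745 becomes 0000S3745): prefix the legacy number with 'S', then left-pad with zeros to 9 characters. Whether every legacy MS1 ID follows that rule is an assumption, and the function names are hypothetical.

```python
# Base URL confirmed for MS2 in the comment above.
BASE_URL = "https://www.morphosource.org/biological_specimens/"
ID_LENGTH = 9  # MS2 specimen IDs require 9 characters


def pad_legacy_specimen_id(legacy_number: str) -> str:
    """Turn a legacy numeric ID such as '3745' into '0000S3745'.

    Assumption: the 'S' goes directly before the number and zeros fill
    the left side, as in the one example given in the thread.
    """
    digits = legacy_number.strip()
    if not (digits.isascii() and digits.isdigit()):
        raise ValueError(f"expected a plain integer, got {legacy_number!r}")
    padded = ("S" + digits).rjust(ID_LENGTH, "0")
    if len(padded) > ID_LENGTH:
        raise ValueError(f"{legacy_number!r} is too long to fit in {ID_LENGTH} characters")
    return padded


def specimen_url(legacy_number: str) -> str:
    """Build the MS2 biological specimen URL for a legacy numeric ID."""
    return BASE_URL + pad_legacy_specimen_id(legacy_number)


print(specimen_url("3745"))
```

Any ID produced this way should still be checked against MorphoSource before it is stored, since the rule is inferred rather than documented here.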
From Julie:
"A good canonical specimen URL formula would be https://www.morphosource.org/biological_specimens/
Sounds good. I asked Julie if MS would create ARKs for the biological specimen page as they've done with Media pages, since it is more stable than a base URL that may change over time. I'm holding off on linking MorphoSource identifiers until then (though I'm linking to media ARKs when I create media).
You should be able to change the base_url, and the existing otherID bulkloaders should not have any problem updating (removing and adding) the data.
You should be able to change the base_url
Will this mess anyone up? I will change it if not.
I have an Arctos question on behalf of a volunteer back in Seattle working on scanning and trying to attach scans to records.
Do 3-D media need to be uploaded to someplace else (eg Morphosource) before attaching in Arctos? I assumed there was a direct shortcut to upload 1 file at a time, direct from our desktop to the record, as there is for images. However our failures are making me think that 3-D media might be required to have a URI from someone else before they can be linked.
Is this correct? Is a good workaround to create a Morphosource account for us and upload media there first?
Thanks-
Jeffrey E. Bradley
Jeff,
I am guessing it is the size of the files that is the issue. The size of CT scans means the transfer is likely to time out when completed via a browser in Arctos. As a community we have been discussing our shared storage space at TACC and how best to ensure we have enough available. See the Arctos Digital Media Policy that we recently created. There are also discussions of creating links to Morphosource in Github.
MorphoSource links that work like GenBank?
Code Table Request - MorphoSource Identifiers
From my perspective, I think that uploading these to Morphosource allows for greater discoverability (lots of people look for CT scans there) and provides a place to record details about the scan in a purpose-built repository, which is preferable to simply putting them at TACC in an Arctos directory. Once they are at Morphosource, they can be linked in Arctos to the appropriate catalog record using the other identifier "Morphosource" ID, allowing people to easily get to the scan from the Arctos catalog record. I would recommend that in the Morphosource record you put the URL for the catalog item in the "external object url" field.
This will allow users in Morphosource to quickly get to the Arctos record that includes details about the physical specimen.
I am copying Dusty here because he may disagree or have a better solution. I am also going to add this to one of the Github issues because it is an important question that others will probably have as well!
Adios,
Teresa J. Mayfield-Meyer
@ebraker @ccicero thoughts on Jeff's question?
Yes the UI-based tool has restrictions.
Arctos certainly allows treating MorphoSource (or anything else) as the primary/only Media data. I don't disagree with anything said, but if they were my data I'd try to find a place for another copy. (Tape at TACC is stable and relatively cheap, but there are long-term costs associated with any stable long-term storage.)
I agree that at this point in time, loading CT media and linking to MorphoSource from Arctos is the most straightforward approach. I think eventually many of us with CT media would like to host through Arctos via TACC, but need to figure out storage costs and file transfer methods and download actions. My CT scans are generally around 20 GB (up to 40 GB) each - I'm not sure what it would look like for an end user to access a file of that size. The benefit of MorphoSource is they have a download module which gives the institution the ability to control how users access files - free download vs. must request permission to download (with fields to enter a brief description of their project), vs. private media simply hosted at MorphoSource. I think we can replicate some sort of layered download permissions approach in Arctos...right now if you load media at TACC and link to Arctos, it is publicly accessible (which is great!), but you may want to have a more restrictive model to better track usage via loans and also know the specific users downloading media, since there may be licensing issues to watch (limits on 3D printing, commercial use, etc.).
MorphoSource also allows outside users to manipulate downloaded media and reupload and link the derivatives to the parent media which is nice (e.g., segmenting out a snake skull from an original full body CT scan, or creating a surface mesh, etc.).
@jebrad see the responses here and let me know if I can help your student!
This is very helpful indeed, thanks all especially @Jegelewicz - We are working on the morphosource approach but keeping Dusty's concerns in mind. Jeff
@ebraker is this request still open? What do we need to do?
Kinda sorta. My wish is that MorphoSource would mint ARKs for the biological specimen records in addition to their media records, but I'm not sure it will happen. The biological specimen pages summarize ALL media linked to an individual, which would be ideal for other_IDs. Closing.
https://github.com/ArctosDB/arctos/issues/3847#issuecomment-1015567362 wasn't completed and this should not have been closed.
I'll change https://arctos.database.museum/info/ctDocumentation.cfm?table=ctcoll_other_id_type#morphosource to use https://www.morphosource.org/biological_specimens/
I will then set up a bot to find Arctos stuff in MS, create the otherID if it doesn't exist, and potentially add eg https://n2t.net/ark:/87602/m4/469370 as Media.
This will not fix existing broken links, and I think every single link in Arctos is broken (identifiers don't seem to be the right flavor).
Collections will need to grant access to the bot if they want the automation (and AFAIK the one existing bot has never been used, despite all the discussion that led to it existing).
Here's existing data - the bulk unloader/loader should be capable of fixing them, or you can just unload and turn the bot on.
-- existing 'Morphosource' other IDs, one row per linked catalog record
select
  guid,
  other_id_prefix,
  other_id_number,
  other_id_suffix
from
  flat
  inner join coll_obj_other_id_num on
    coll_obj_other_id_num.collection_object_id=flat.collection_object_id and
    other_id_type='Morphosource'
order by guid;
guid | other_id_prefix | other_id_number | other_id_suffix
-----------------+-----------------+-----------------+-----------------
MSB:Mamm:309389 | | 32504 |
MSB:Mamm:94268 | S | 45218 |
MVZ:Bird:127163 | | 18488 |
MVZ:Bird:127168 | | 18489 |
MVZ:Bird:138066 | | 18490 |
MVZ:Herp:100404 | | 3745 |
MVZ:Mamm:106555 | | 11318 |
MVZ:Mamm:114370 | | 10788 |
MVZ:Mamm:116834 | | 11415 |
MVZ:Mamm:132535 | | 11315 |
MVZ:Mamm:144306 | | 11312 |
MVZ:Mamm:144307 | | 11314 |
MVZ:Mamm:174521 | | 11317 |
MVZ:Mamm:179796 | | 11443 |
MVZ:Mamm:183386 | | 11295 |
MVZ:Mamm:42613 | | 11409 |
UCM:Herp:41065 | | 46656 |
UCM:Herp:41070 | | 46657 |
UCM:Herp:45006 | | 46658 |
UCM:Herp:67231 | | 45486 |
UTEP:Ento:11024 | | 44755 |
UTEP:Ento:1425 | | 44734 |
UTEP:Ento:17094 | | 44756 |
UTEP:Ento:17095 | | 44757 |
UTEP:Ento:17096 | | 44759 |
UTEP:Ento:17097 | | 44760 |
UTEP:Ento:3317 | | 44735 |
UTEP:Ento:3336 | | 44738 |
UTEP:Ento:4702 | | 44741 |
UTEP:Ento:5818 | | 44742 |
(30 rows)
Came here to find this issue: Can you send me an example record with the new MS as media ark?
I'm not sure what's "new" and what's not, but the example above - eg https://n2t.net/ark:/87602/m4/469370 - came from prepending a resolver to an ark from https://www.morphosource.org/api/media?physical_object_id=000469363
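The two URLs in that comment can be sketched like this. It only builds strings: the shape of the API response is not shown in this thread, so nothing is fetched or parsed. The function names are hypothetical, and the assumption that the ARK arrives in the form "ark:/..." is mine.

```python
from urllib.parse import urlencode

# Both URLs appear in the comment above.
MEDIA_API = "https://www.morphosource.org/api/media"
RESOLVER = "https://n2t.net/"


def media_query_url(physical_object_id: str) -> str:
    """Build the media lookup URL for one MorphoSource physical object ID."""
    return MEDIA_API + "?" + urlencode({"physical_object_id": physical_object_id})


def resolvable_ark(ark: str) -> str:
    """Prepend the resolver to a bare ARK (assumed to look like 'ark:/...')."""
    ark = ark.strip()
    if not ark.startswith("ark:/"):
        raise ValueError(f"not a bare ARK: {ark!r}")
    return RESOLVER + ark


print(media_query_url("000469363"))
print(resolvable_ark("ark:/87602/m4/469370"))
```

Keeping the physical object ID as a string throughout preserves its leading zeros.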
Chiming in that I like using the MS specimen ID too, since it's a link that consolidates views of many files (with ARKs).
We were adding that as another Identifier (Morphosource), but they use alphanumerics and only integers are currently allowed (can that be changed?)--> below will generate an error
Just use prefix - casting to any kind of numeric strips the leading zeroes (in everything other than whatever they're using, maybe....)
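A quick illustration of why these IDs have to be stored as text (for example in the prefix field) rather than in a numeric field. The helper name is hypothetical; it simply checks whether an identifier survives a round trip through an integer.

```python
def survives_integer_cast(identifier: str) -> bool:
    """Return True only if casting to int and back gives the same string."""
    try:
        return str(int(identifier)) == identifier
    except ValueError:
        # Alphanumeric IDs such as '0000S3745' cannot be cast at all.
        return False


print(survives_integer_cast("3745"))       # plain integer, round-trips
print(survives_integer_cast("000469363"))  # leading zeros are stripped by the cast
print(survives_integer_cast("0000S3745"))  # not a number
```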
Code table updated
The linker bot is functional and can be released to production at any time.
I'm going to split the media creation off into a separate bot for various reasons - it'll spread the load/work better with our resources, it'll give collections better control (eg to manually create MS links but allow automagic media creation), it'll handle new media showing up later, etc., just better architecture with a very minimal cost (grant access to a second bot in order to get the full package).
I think I will unload the malformed links and leave them here as CSV before releasing the bot - MS has added a 'fix the padding' handler so the existing links (except the one MSB record that has an 'S' prefix for some reason) do work, but they are not valid identifiers and cannot be used for things like fetching Media (and I'm not sure if they're going to the correct place or not, this environment doesn't seem particularly stable, they should be checked). Mixing those in with what the bot will do seems like a recipe for a giant mess.
@jldunnum @AdrienneRaniszewski @campmlc @ccicero @mkoo @cjconroy @atrox10 @ebraker @Jegelewicz
the malformed records (and SQL to find them) are a couple comments up if you want to fix them, otherwise I'll delete them with next release (and you can set the bot to re-create them or re-create them from the CSV I'll leave here).
@dustymc UCM is happy to unleash the bot on our dataset...I'll add the bot agent to UCM collections once it is in production
@ebraker want media too? I can turn one or both on and run them manually for you, it's always nice to have a real-world test of these things.
Calling this next release.
Granting morphosource_bot access to your collection will result in...
... and morphosource_media_bot will use that to....
from https://arctos-test.tacc.utexas.edu/guid/MVZ:Herp:127623
The media loader check does fail fairly often (I think it's Morphosource but could be n2t), those will be in the media bulkloader as...
... set them to autoload if you see them, or they will try again the next time around.
I'm not sure of the schedule yet, maybe monthly for the identifier link and recheck for more media every 6 months - I'm very open to better suggestions.
@dustymc Great! Let's do it. I've created media for our existing MS records - will the media bot duplicate these established ARKs? I definitely want the biological specimen bot, and if there isn't a risk of duplication, the media bot will be great moving forward since it will save me from doing my own MS media bulkloads every month.
yes, MVZ is in! how do we enable the bots ourselves? (just wondering) Feel free to run for us @dustymc Thanks!
@ebraker thanks, yes that would've made a mess - you used http://n2t.net/ark:..., I used https://n2t.net/ark:... I'll fix that and file more issues.
BUT...
...the mess would have been easily attributed to agent morphosource_media_bot - nuke everything, fix the bot, let it try again - no problem, and why I'm now happy to set scripts to go bash around in your collections.
@mkoo https://handbook.arctosdb.org/documentation/bot.html - I'll grant MVZ and get things started, should be tonight unless I break something especially 'interesting' today.
Hmm, that page doesn't answer my question. I guess this new bot is too new to see in Arctos-prod, or to see any details of what it does... I'll look to test later.
@mkoo to add a bot, grant it access to your collection like any other agent you grant access. To find the bot's username look for the bots in agents as agent type = bot.
Click the little [ Arctos user ] link
to grant the bot access to a collection
@dustymc there is currently only 1 bot that we can select from....
too new
Yup, bots are paranoid, I have to get the big password out to make them. Tonight...
No more issues because generated columns are awesome, next release will include a unique index on protocol-stripped media URIs, morphosource_media_bot will use it to avoid making messes.
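The idea behind the protocol-stripped index, sketched in Python rather than as the generated column and unique index described above (the actual database definition is not shown in this thread). Function names are hypothetical, and the second candidate URI in the usage example is a made-up placeholder, not a real ARK.

```python
import re

# Matches a leading http:// or https://, case-insensitively.
_PROTOCOL = re.compile(r"^https?://", re.IGNORECASE)


def strip_protocol(uri: str) -> str:
    """Return the URI without its leading http:// or https://."""
    return _PROTOCOL.sub("", uri.strip())


def find_new_uris(existing: list[str], candidates: list[str]) -> list[str]:
    """Return candidates whose protocol-stripped form is not already present."""
    seen = {strip_protocol(uri) for uri in existing}
    new_uris = []
    for uri in candidates:
        key = strip_protocol(uri)
        if key not in seen:
            seen.add(key)
            new_uris.append(uri)
    return new_uris


existing = ["http://n2t.net/ark:/87602/m4/469370"]
candidates = [
    "https://n2t.net/ark:/87602/m4/469370",   # same media, different protocol
    "https://n2t.net/ark:/87602/m4/EXAMPLE",  # placeholder for a new media ARK
]
print(find_new_uris(existing, candidates))
```

Comparing on the stripped form means a manually loaded http:// link and a bot-generated https:// link count as the same media record.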
The pair of bots working together sort of accidentally uncovered something that might need more investigation. There's a loan, https://arctos-test.tacc.utexas.edu/guid/MVZ:Herp:111742 was shipped to 'straya 8 years ago, a morphosource record was created, then - ??? Specimens sold on ebay? There are amazing images on floppy disks in some grad student desk?? Who knows, but there's no media in MS so the record ends up with a link and without media which was unexpected and caught my attention. @atrox10
Nuked data: temp_morphosource_malformed.csv.zip
@dustymc how often do these bots run? I just loaded some MorphoSource records this morning and want to know when I can double check that they've been ingested and the corresponding media created in Arctos (this is the point where I'd make a media bulkloader, but I will hold off and trust the magic...)
If you've got the superpowers, the real answer is...
and you (probably, I hope, usually...) can't break anything by clicking links in there if you don't feel like waiting for the bot.
Let me know how to find a record if something magic didn't happen.
@dustymc I never edited the task since the bot_morphosource_media runs daily (and that is plenty), however, none of the media posted to MS on 2022-10-05 have been pulled in:
https://arctos.database.museum/guid/UCM:Herp:64759 https://arctos.database.museum/guid/UCM:Herp:67223 https://arctos.database.museum/guid/UCM:Herp:52560 https://arctos.database.museum/guid/UCM:Herp:52515 https://arctos.database.museum/guid/UCM:Herp:48325 https://arctos.database.museum/guid/UCM:Herp:58230 https://arctos.database.museum/guid/UCM:Herp:47446 https://arctos.database.museum/guid/UCM:Herp:41273 https://arctos.database.museum/guid/UCM:Herp:40117 https://arctos.database.museum/guid/UCM:Herp:39997 https://arctos.database.museum/guid/UCM:Herp:39912 https://arctos.database.museum/guid/UCM:Herp:39609 https://arctos.database.museum/guid/UCM:Herp:24616
These should have new 3D meshes alongside the existing CT TIFFs that are already created in Arctos (see MS project for corresponding media IDs).
BUT, one thing I wanted to ask is if we can make this bot add some relevant metadata from MS. Usually I add the following:
I assume it is not possible to customize the bot? I imagine it may be possible to pull in agent and date, but I'd also like to link a Project and generate a description, so I may end up just continuing with media bulkloads...
I'm checking MS by collection, which works out to be
and those aren't in there.
I'm very open to better ideas if you happen to know something I don't!
The bots can definitely be made different (maybe even smarter) but yea, magicking Agents from current data seems a bit optimistic. Hopefully as https://www.gbif.org/new-data-model matures those identifiers will be more shared and MS will come to believe that idigbio occurrences aren't real identifiers, but that's not much of a now-solution.
Strange, I'm not sure why it's not finding these media. Anyhow, for now I'll stick with bulkloads so I can populate the desired media metadata. I've revoked access for the media bot for UCM collections, but will keep the morphosource identifier bot so at least arks will be pulled into identifiers in addition to my manual media loads.
It's not finding the "physical-object" (which is in turn used to find Media). This bit fails:
and this bit (note it's a different bot) is never attempted without that identifier:
There are several ways to link to data in MorphoSource. We should update our code table base URI and definitions so that we are correctly providing EITHER a specimen ID or a media ID under other_ID=MorphoSource (or simply create two values - MorphoSource Media ID and MorphoSource Specimen ID)...otherwise broken links are likely.