ArctosDB / arctos

Arctos is a museum collections management system
https://arctos.database.museum
Apache License 2.0

Code Table Request - MorphoSource Identifiers #3847

Closed ebraker closed 1 year ago

ebraker commented 2 years ago

There are several ways to link to data in MorphoSource. We should update our code table base URI and definitions so that we are correctly providing EITHER a specimen ID or a media ID under other_ID=MorphoSource (or simply create two values - MorphoSource Media ID and MorphoSource Specimen ID)...otherwise broken links are likely.

Jegelewicz commented 2 years ago
  • MorphoSource ARK - autoassigned (considered by MS team to be most stable identifier along with DOI) base path: https://n2t.net/ark:/ (FYI we already have other_ID=ARK that uses this path)
  • MorphoSource DOI - assigned upon request (considered by MS team to be most stable identifier along with ARK) base path: https://doi.org/

Seems like those should just be

Especially since "FYI we already have other_ID=ARK that uses this path"

ebraker commented 2 years ago

Agreed but perhaps we create a 'MorphoSource ARK' value so that we can easily retrieve records linked to MorphoSource? I'm currently using arks for other things and would like to be able to narrow MS-relevant queries.

Jegelewicz commented 2 years ago

It just means we will end up with an infinite number of XXX ARK and XXX DOI. Dusty already hates that table....

ebraker commented 2 years ago

Yeah...I figured as much. I don't love it either but I think it would promote consistency in users grabbing the preferred MorphoSource identifier for linking 3D media.

Jegelewicz commented 2 years ago

I've said my piece, others can weigh in.

dustymc commented 2 years ago

hates that table...

Just the parts of it without a good base_url.

I sorta lean towards including https://arctos.database.museum/info/ctDocumentation.cfm?table=ctcoll_other_id_type#ark in 'not good' - it DOES STUFF which is pretty great, but users can't have any idea what it's going to do until they click it. Image? Text? Related? 50 terabytes of cat pictures? Click here to find out....

If we can agree on that much, then we need to talk about what the identifier is. "65665/339ec2704-9d3b-4ee6-a0b1-ddb05dfff745" SORTA is, but not really - I'm not so sure the components are separable, http://n2t.net/ark:/65665/339ec2704-9d3b-4ee6-a0b1-ddb05dfff745 is the actual ID, we don't need a base at all.

BUT of course we use base for all sorts of things so that's - well, mildly inconvenient, maybe.

If they're actually generating ARKs for everything, that's definitely what should be used - they seem to get a new URL every few weeks, we can't deal with that, ARKs can (assuming they're updating the metadata....).

As for what to do with it - I don't know. I'd probably lean towards

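type==>MorphoSource Thingee ID
base_url====>NULL
expected value: something that starts with and does not end with http://n2t.net/ark: (which I could check via trigger - I already do that for some types)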

ebraker commented 2 years ago

As for what to do with it - I don't know. I'd probably lean towards type==>MorphoSource Thingee ID base_url====>NULL expected value: something that starts with and does not end with http://n2t.net/ark: (which I could check via trigger - I already do that for some types)
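The "expected value" rule quoted above can be sketched in a few lines. This is only an illustration of the check as worded in the comment, not the actual Arctos trigger; the function and constant names are hypothetical.

```python
# Hypothetical sketch of the "starts with and does not end with" rule.
# ARK_RESOLVER_PREFIX is taken from the comment; nothing here is Arctos code.
ARK_RESOLVER_PREFIX = "http://n2t.net/ark:"


def looks_like_resolvable_ark(value: str) -> bool:
    """Return True if value starts with the resolver prefix and carries
    something after it (that is, it does not also end with the prefix)."""
    value = value.strip()
    return value.startswith(ARK_RESOLVER_PREFIX) and not value.endswith(
        ARK_RESOLVER_PREFIX
    )


if __name__ == "__main__":
    full = "http://n2t.net/ark:/65665/339ec2704-9d3b-4ee6-a0b1-ddb05dfff745"
    print(looks_like_resolvable_ark(full))
    print(looks_like_resolvable_ark("65665/339ec2704-9d3b-4ee6-a0b1-ddb05dfff745"))
    print(looks_like_resolvable_ark(ARK_RESOLVER_PREFIX))
```

Under this reading, the full resolvable URL is accepted, while a bare "65665/..." fragment or the prefix on its own is rejected.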

I was just about to comment that this works for me, but now I realize we may want to have this be the best practice for MEDIA linked to MorphoSource rather than identifiers in the catalog record. 'MorphoSource Species ID' might be best linked with other_IDs since Species ID takes users to a landing page for the catalog record in MorphoSource, to which any number of media may be linked (see example). Unfortunately the Species ID page does not have an ARK identifier. image

Therefore, proposed values:

dustymc commented 2 years ago

best practice for MEDIA linked

It really depends on what you're trying to do. (https://handbook.arctosdb.org/how_to/How-to-Use-Issues-in-Arctos.html#issue-protips) If this is Step One in plugging into an API or similar then Media is the wrong tool, if no then maybe Media makes this a lot more approachable.

We'd talked about using MS as a sort of "viewer" at one point, I'm not sure how viable that still is. My normal recommendation would be to set those up as catalog record--->whatever's nicest in a browser--->archival bits. That includes a presumption that the visible/middle piece is stable, and that doesn't look like a sure thing from here - always your call, but perhaps it's better to link everything directly to the catalog record.

Media does not need any sort of predictable formula; you can just link to whatever URL you want people to access.

MorphoSource Specimen ID;

Can that be reconciled with https://arctos.database.museum/info/ctDocumentation.cfm?table=ctcoll_other_id_type#morphosource ? I really don't want two of these, there's no way that will be used properly, hopefully they won't push us into that corner....

ccicero commented 2 years ago

@Jegelewicz @ebraker @dustymc @atrox10 @mkoo

Revisiting this for discussion at the Code Table meeting 1/20/2022. Carol, Michelle, and I met with Doug and Julie from Morphosource last week to discuss a number of things. Relevant to Arctos is that the base URL in our code table is not correct - it worked for MS1 (legacy records are being redirected) but is not the best URL for MS2. Also, specimen IDs in MS2 require 9 characters.

Current base URL in Arctos: https://www.morphosource.org/Detail/SpecimenDetail/Show/specimen_id/

As noted above and confirmed by Julie, the base URL to use for MS2 is https://www.morphosource.org/biological_specimens/

For example, MVZ Herp 100404 has the identifier number in Arctos as 3745. That is being redirected to https://www.morphosource.org/biological_specimens/0000S3745. The padded '0's and 'S' were added in MS2 to satisfy the 9 character requirement.

The issue is that '3745' is being passed to data aggregators (iDigBio, GBIF), and those aren't linked so if you search just '3745' in MS you get a different specimen. You need to search 0000S3745 in MS to get the correct specimen. Not intuitive!

Required action: 1) Change the base URL to https://www.morphosource.org/biological_specimens/ in the Arctos code table. 2) Enter the full specimen ID, with the padded '0's and 'S', so that links resolve correctly and the correct identifier number is passed to data aggregators.

From Julie: "A good canonical specimen URL formula would be https://www.morphosource.org/biological_specimens/ As an example, with ID 000S43823 it would be https://www.morphosource.org/biological_specimens/000S43823"
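The padding described in this comment can be sketched as follows. It assumes only what the two examples show (3745 becomes 0000S3745, 43823 becomes 000S43823): an 'S' before the legacy number, left-padded with zeroes to 9 characters. The function names are hypothetical, and whether every MS2 specimen ID follows this pattern is not established in this thread.

```python
# Hypothetical sketch: rebuild an MS2-style specimen ID from a legacy MS1 number.
# The base URL is the one given in the comment above.
MS2_SPECIMEN_BASE = "https://www.morphosource.org/biological_specimens/"


def pad_legacy_specimen_id(legacy_number: int, width: int = 9) -> str:
    """Prefix the legacy number with 'S' and left-pad with zeroes to `width`."""
    token = f"S{legacy_number}"
    if legacy_number < 0 or len(token) > width:
        raise ValueError(f"cannot pad {legacy_number!r} to {width} characters")
    return token.rjust(width, "0")


def specimen_url(legacy_number: int) -> str:
    """Build the specimen landing-page URL for a legacy number."""
    return MS2_SPECIMEN_BASE + pad_legacy_specimen_id(legacy_number)


if __name__ == "__main__":
    print(pad_legacy_specimen_id(3745))   # 0000S3745
    print(specimen_url(43823))
```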

ebraker commented 2 years ago

Sounds good. I asked Julie if MS would create ARKs for the biological specimen page as they've done with Media pages, since it is more stable than a base URL that may change over time. I'm holding off on linking MorphoSource identifiers until then (though I'm linking to media ARKs when I create media).

dustymc commented 2 years ago

You should be able to change the base_url, and the existing otherID bulkloaders should not have any problem updating (removing and adding) the data.

Jegelewicz commented 2 years ago

You should be able to change the base_url

Will this mess anyone up? I will change it if not.

Jegelewicz commented 2 years ago

I have an Arctos question on behalf of a volunteer back in Seattle working on scanning and trying to attach scans to records.

Do 3-D media need to be uploaded to someplace else (eg Morphosource) before attaching in Arctos? I assumed there was a direct shortcut to upload 1 file at a time, direct from our desktop to the record, as there is for images. However our failures are making me think that 3-D media might be required to have a URI from someone else before they can be linked.

Is this correct? Is a good workaround to create a Morphosource account for us and upload media there first?

Thanks-


Jeffrey E. Bradley

Jegelewicz commented 2 years ago

Jeff,

I am guessing it is the size of the files that is the issue. The size of CT scans means the transfer is likely to time out when completed via a browser in Arctos. As a community we have been discussing our shared storage space at TACC and how best to ensure we have enough available. See the Arctos Digital Media Policy that we recently created. There are also discussions of creating links to Morphosource in Github.

MorphoSource links that work like GenBank?

Code Table Request - MorphoSource Identifiers

From my perspective, I think that uploading these to Morphosource allows for greater discoverability (lots of people look for CT scans there) and provides a place to record details about the scan in a purpose-built repository, which is preferable to simply putting them at TACC in an Arctos directory. Once they are at Morphosource, they can be linked in Arctos to the appropriate catalog record using the other identifier type "Morphosource", allowing people to easily get to the scan from the Arctos catalog record. I would recommend that in the Morphosource record you put the URL for the catalog item in the "external object url".

image

This will allow users in Morphosource to quickly get to the Arctos record that includes details about the physical specimen.

I am copying Dusty here because he may disagree or have a better solution. I am also going to add this to one of the Github issues because it is an important question that others will probably have as well!

Adios,

Teresa J. Mayfield-Meyer

Jegelewicz commented 2 years ago

@ebraker @ccicero thoughts on Jeff's question?

dustymc commented 2 years ago

Yes the UI-based tool has restrictions.

Arctos certainly allows treating MorphoSource (or anything else) as the primary/only Media data. I don't disagree with anything said, but if they were my data I'd try to find a place for another copy. (Tape at TACC is stable and relatively cheap, but there are long-term costs associated with any stable long-term storage.)

ebraker commented 2 years ago

I agree that at this point in time, loading CT media and linking to MorphoSource from Arctos is the most straightforward approach. I think eventually many of us with CT media would like to host through Arctos via TACC, but need to figure out storage costs and file transfer methods and download actions. My CT scans are generally around 20 GB (upwards to 40 GB) each - I'm not sure what it would look like for an end user to access that size of a file. The benefit of MorphoSource is they have a download module which gives the institution the ability to control how users access files - free download vs. must request permission to download (with fields to enter brief description of their project), vs. private media simply hosted at MorphoSource. I think we can replicate some sort of layered download permissions approach in Arctos...right now if you load media at TACC and link to Arctos, it is publicly accessible (which is great!), but you may want to have a more restrictive model to better track usage via loans and also know the specific users downloading media since there may be licensing issues to watch (limits on 3D printing, commercial use, etc.).

MorphoSource also allows outside users to manipulate downloaded media and reupload and link the derivatives to the parent media which is nice (e.g., segmenting out a snake skull from an original full body CT scan, or creating a surface mesh, etc.).

Jegelewicz commented 2 years ago

@jebrad see the responses here and let me know if I can help your student!

jebrad commented 2 years ago

This is very helpful indeed, thanks all especially @Jegelewicz - We are working on the morphosource approach but keeping Dusty's concerns in mind. Jeff

Jegelewicz commented 1 year ago

@ebraker is this request still open? What do we need to do?

ebraker commented 1 year ago

Kinda sorta. My wish is that MorphoSource would mint ARKs for the biological specimen records in addition to their media records, but I'm not sure it will happen. The biological specimen pages summarize ALL media linked to an individual, which would be ideal for other_IDs. Closing.

image

dustymc commented 1 year ago

https://github.com/ArctosDB/arctos/issues/3847#issuecomment-1015567362 wasn't completed and this should not have been closed.

I'll change https://arctos.database.museum/info/ctDocumentation.cfm?table=ctcoll_other_id_type#morphosource to use https://www.morphosource.org/biological_specimens/

I will then set up a bot to find Arctos stuff in MS, create the otherID if it doesn't exist, and potentially add eg https://n2t.net/ark:/87602/m4/469370 as Media.

This will not fix existing broken links, and I think every single link in Arctos is broken (identifiers don't seem to be the right flavor).

Collections will need to grant access to the bot if they want the automation (and AFAIK the one existing bot has never been used, despite all the discussion that led to it existing).

Here's existing data - the bulk unloader/loader should be capable of fixing them, or you can just unload and turn the bot on.

-- Existing MorphoSource other IDs, one row per identifier on a catalog record
select
    guid,
    other_id_prefix,
    other_id_number,
    other_id_suffix
from
    flat
    inner join coll_obj_other_id_num
        on coll_obj_other_id_num.collection_object_id = flat.collection_object_id
        and other_id_type = 'Morphosource'
order by guid;

      guid       | other_id_prefix | other_id_number | other_id_suffix 
-----------------+-----------------+-----------------+-----------------
 MSB:Mamm:309389 |                 |           32504 | 
 MSB:Mamm:94268  | S               |           45218 | 
 MVZ:Bird:127163 |                 |           18488 | 
 MVZ:Bird:127168 |                 |           18489 | 
 MVZ:Bird:138066 |                 |           18490 | 
 MVZ:Herp:100404 |                 |            3745 | 
 MVZ:Mamm:106555 |                 |           11318 | 
 MVZ:Mamm:114370 |                 |           10788 | 
 MVZ:Mamm:116834 |                 |           11415 | 
 MVZ:Mamm:132535 |                 |           11315 | 
 MVZ:Mamm:144306 |                 |           11312 | 
 MVZ:Mamm:144307 |                 |           11314 | 
 MVZ:Mamm:174521 |                 |           11317 | 
 MVZ:Mamm:179796 |                 |           11443 | 
 MVZ:Mamm:183386 |                 |           11295 | 
 MVZ:Mamm:42613  |                 |           11409 | 
 UCM:Herp:41065  |                 |           46656 | 
 UCM:Herp:41070  |                 |           46657 | 
 UCM:Herp:45006  |                 |           46658 | 
 UCM:Herp:67231  |                 |           45486 | 
 UTEP:Ento:11024 |                 |           44755 | 
 UTEP:Ento:1425  |                 |           44734 | 
 UTEP:Ento:17094 |                 |           44756 | 
 UTEP:Ento:17095 |                 |           44757 | 
 UTEP:Ento:17096 |                 |           44759 | 
 UTEP:Ento:17097 |                 |           44760 | 
 UTEP:Ento:3317  |                 |           44735 | 
 UTEP:Ento:3336  |                 |           44738 | 
 UTEP:Ento:4702  |                 |           44741 | 
 UTEP:Ento:5818  |                 |           44742 | 
(30 rows)

mkoo commented 1 year ago

Came here to find this issue: Can you send me an example record with the new MS as media ark?

dustymc commented 1 year ago

I'm not sure what's "new" and not but above - eg https://n2t.net/ark:/87602/m4/469370 - came from prepending a resolver to an ark from https://www.morphosource.org/api/media?physical_object_id=000469363
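"Prepending a resolver to an ark" can be sketched like this. It assumes the ARK arrives as a string of the form ark:/87602/m4/469370 (inferred from the example URL, not from MorphoSource API documentation); the function name is hypothetical.

```python
# Hypothetical sketch: turn a bare ARK into a resolvable URL via n2t.net.
# The input format "ark:/NAAN/name" is an assumption based on the example above.
RESOLVER = "https://n2t.net/"


def ark_to_url(ark: str) -> str:
    """Prepend the resolver to a bare ARK string."""
    ark = ark.strip()
    if not ark.startswith("ark:"):
        raise ValueError(f"not a bare ARK: {ark!r}")
    return RESOLVER + ark


if __name__ == "__main__":
    print(ark_to_url("ark:/87602/m4/469370"))
```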

mkoo commented 1 year ago

Chiming in that I like using the MS specimen ID too since then it's a link that consolidates views of many files (with arks). We were adding that as another Identifier (Morphosource) but they use alphanumerics [screenshot: Firefox_Screenshot_2022-09-26T23-13-07 187Z] and only integers are currently allowed (can that be changed?)--> below will generate an error

Firefox_Screenshot_2022-09-26T22-21-36 905Z

dustymc commented 1 year ago

Just use prefix - casting to any kind of numeric strips the leading zeroes (in everything other than whatever they're using, maybe....)
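A short illustration of why the zeroes need to live in the prefix: converting the identifier to a number drops them, while a prefix/number split round-trips. The split rule below (everything before the trailing integer goes into the prefix) is an assumption for illustration, not how Arctos parses identifiers.

```python
import re
from typing import Tuple


def split_identifier(identifier: str) -> Tuple[str, int]:
    """Split an identifier into (prefix, number), keeping leading zeroes
    and any letters in the prefix so the number can be stored as an integer."""
    match = re.fullmatch(r"(.*?)([1-9][0-9]*)", identifier)
    if match is None:
        raise ValueError(f"no trailing integer in {identifier!r}")
    return match.group(1), int(match.group(2))


if __name__ == "__main__":
    # Casting the whole value loses the padding.
    print(int("000469363"))
    # Splitting keeps it recoverable: prefix + number rebuilds the original.
    for raw in ("000469363", "0000S3745"):
        prefix, number = split_identifier(raw)
        print(repr(prefix), number, prefix + str(number) == raw)
```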

dustymc commented 1 year ago

Code table updated

The linker bot is functional and can be released to production at any time.

I'm going to split the media creation off into a separate bot for various reasons - it'll spread the load/work better with our resources, it'll give collections better control (eg to manually create MS links but allow automagic media creation), it'll handle new media showing up later, etc., just better architecture with a very minimal cost (grant access to a second bot in order to get the full package).

I think I will unload the malformed links and leave them here as CSV before releasing the bot - MS has added a 'fix the padding' handler so the existing links (except the one MSB record that has an 'S' prefix for some reason) do work, but they are not valid identifiers and cannot be used for things like fetching Media (and I'm not sure if they're going to the correct place or not, this environment doesn't seem particularly stable, they should be checked). Mixing those in with what the bot will do seems like a recipe for a giant mess.

@jldunnum @AdrienneRaniszewski @campmlc @ccicero @mkoo @cjconroy @atrox10 @ebraker @Jegelewicz

the malformed records (and SQL to find them) are a couple comments up if you want to fix them, otherwise I'll delete them with next release (and you can set the bot to re-create them or re-create them from the CSV I'll leave here).

ebraker commented 1 year ago

@dustymc UCM is happy to unleash the bot on our dataset...I'll add the bot agent to UCM collections once it is in production

dustymc commented 1 year ago

@ebraker want media too? I can turn one or both on and run them manually for you, it's always nice to have a real-world test of these things.

dustymc commented 1 year ago

Calling this next release.

Granting morphosource_bot access to your collection will result in...

Screen Shot 2022-09-28 at 7 45 35 AM

... and morphosource_media_bot will use that to....

Screen Shot 2022-09-28 at 7 46 45 AM

from https://arctos-test.tacc.utexas.edu/guid/MVZ:Herp:127623

The media loader check does fail fairly often (I think it's Morphosource but could be n2t), those will be in the media bulkloader as...

Screen Shot 2022-09-28 at 7 48 47 AM

... set them to autoload if you see them, or they will try again the next time around.

I'm not sure of the schedule yet, maybe monthly for the identifier link and recheck for more media every 6 months - I'm very open to better suggestions.

ebraker commented 1 year ago

@dustymc Great! Let's do it. I've created media for our existing MS records - will the media bot duplicate these established ARKs? I definitely want the biological specimen bot, and if there isn't a risk of duplication, the media bot will be great moving forward since it will save me from doing my own MS media bulkloads every month.

mkoo commented 1 year ago

yes, MVZ is in! how do we enable the bots ourselves? (just wondering) Feel free to run for us @dustymc Thanks!

dustymc commented 1 year ago

@ebraker thanks, yes that would've made a mess - you used http://n2t.net/ark:..., I used https://n2t.net/ark:... I'll fix that and file more issues.

BUT...

Screen Shot 2022-09-28 at 9 30 49 AM

...the mess would have been easily attributed to agent morphosource_media_bot - nuke everything, fix the bot, let it try again - no problem, and why I'm now happy to set scripts to go bash around in your collections.

@mkoo https://handbook.arctosdb.org/documentation/bot.html - I'll grant MVZ and get things started, should be tonight unless I break something especially 'interesting' today.

mkoo commented 1 year ago

hmm, that page doesn't answer my question. I guess this new bot is too new to see it in Arctos-prod, or to see any of the details of what it does... I'll look to test later

Jegelewicz commented 1 year ago

@mkoo to add a bot, grant it access to your collection like any other agent you grant access. To find the bot's username look for the bots in agents as agent type = bot. image

image

Click the little [ Arctos user ] link

to grant the bot access to a collection

Jegelewicz commented 1 year ago

@dustymc there is currently only 1 bot that we can select from....

dustymc commented 1 year ago

too new

Yup, bots are paranoid, I have to get the big password out to make them. Tonight...

dustymc commented 1 year ago

No more issues because generated columns are awesome, next release will include a unique index on protocol-stripped media URIs, morphosource_media_bot will use it to avoid making messes.
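The idea of comparing media URIs with the protocol stripped can be sketched outside the database as below. This is not the generated column or the unique index mentioned in the comment, just the same comparison done in application code; the function names and the second ARK in the usage lines are made up for illustration.

```python
import re
from typing import Iterable, List


def strip_protocol(uri: str) -> str:
    """Drop a leading http:// or https:// so both forms compare equal."""
    return re.sub(r"^https?://", "", uri.strip(), flags=re.IGNORECASE)


def find_new_media(existing: Iterable[str], candidates: Iterable[str]) -> List[str]:
    """Return candidates whose protocol-stripped URI is not already present."""
    seen = {strip_protocol(uri) for uri in existing}
    new_uris = []
    for uri in candidates:
        key = strip_protocol(uri)
        if key not in seen:
            seen.add(key)
            new_uris.append(uri)
    return new_uris


if __name__ == "__main__":
    existing = ["http://n2t.net/ark:/87602/m4/469370"]
    candidates = [
        "https://n2t.net/ark:/87602/m4/469370",  # same media, different protocol
        "https://n2t.net/ark:/87602/m4/000001",  # hypothetical new media
    ]
    print(find_new_media(existing, candidates))
```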

The pair of bots working together sort of accidentally uncovered something that might need more investigation. There's a loan, https://arctos-test.tacc.utexas.edu/guid/MVZ:Herp:111742 was shipped to 'straya 8 years ago, a morphosource record was created, then - ??? Specimens sold on ebay? There are amazing images on floppy disks in some grad student desk?? Who knows, but there's no media in MS so the record ends up with a link and without media which was unexpected and caught my attention. @atrox10

dustymc commented 1 year ago

Nuked data: temp_morphosource_malformed.csv.zip

ebraker commented 1 year ago

@dustymc how often do these bots run? I just loaded some MorphoSource records this morning and want to know when I can double check that they've been ingested and the corresponding media created in Arctos (this is the point where I'd make a media bulkloader, but I will hold off and trust the magic...)

dustymc commented 1 year ago

If you've got the superpowers, the real answer is...

Screen Shot 2022-10-05 at 10 06 06 AM Screen Shot 2022-10-05 at 10 06 48 AM

and you (probably, I hope, usually...) can't break anything by clicking links in there if you don't feel like waiting for the bot.

Let me know how to find a record if something magic didn't happen.

ebraker commented 1 year ago

@dustymc I never edited the task since the bot_morphosource_media runs daily (and that is plenty), however, none of the media posted to MS on 2022-10-05 have been pulled in:

https://arctos.database.museum/guid/UCM:Herp:64759 https://arctos.database.museum/guid/UCM:Herp:67223 https://arctos.database.museum/guid/UCM:Herp:52560 https://arctos.database.museum/guid/UCM:Herp:52515 https://arctos.database.museum/guid/UCM:Herp:48325 https://arctos.database.museum/guid/UCM:Herp:58230 https://arctos.database.museum/guid/UCM:Herp:47446 https://arctos.database.museum/guid/UCM:Herp:41273 https://arctos.database.museum/guid/UCM:Herp:40117 https://arctos.database.museum/guid/UCM:Herp:39997 https://arctos.database.museum/guid/UCM:Herp:39912 https://arctos.database.museum/guid/UCM:Herp:39609 https://arctos.database.museum/guid/UCM:Herp:24616

These should have new 3D meshes alongside the existing CT TIFFs that are already created in Arctos (see MS project for corresponding media IDs).

BUT, one thing I wanted to ask is if we can make this bot add some relevant metadata from MS. Usually I add the following: image

I assume it is not possible to customize the bot? I imagine it may be possible to pull in agent and date, but I'd also like to link a Project and generate a description, so I may end up just continuing with media bulkloads...

dustymc commented 1 year ago

I'm checking MS by collection, which works out to be

https://www.morphosource.org/api/physical-objects?q=http://arctos.database.museum/guid/UCM:Herp&per_page=10000

and those aren't in there.
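For anyone wanting to reproduce that per-collection lookup, the request URL can be assembled as sketched below. The endpoint and the q and per_page parameters are copied from the URL above; the function name is hypothetical, and the sketch only builds the URL (it makes no request and assumes nothing about the response format). Note that urlencode percent-encodes the q value, which differs cosmetically from the hand-written URL.

```python
from urllib.parse import urlencode

# Endpoint and GUID base as they appear in the comment above.
PHYSICAL_OBJECTS_API = "https://www.morphosource.org/api/physical-objects"
ARCTOS_GUID_BASE = "http://arctos.database.museum/guid/"


def physical_objects_query(guid_prefix: str, per_page: int = 10000) -> str:
    """Build the search URL for one collection, e.g. guid_prefix='UCM:Herp'."""
    params = {"q": ARCTOS_GUID_BASE + guid_prefix, "per_page": per_page}
    return PHYSICAL_OBJECTS_API + "?" + urlencode(params)


if __name__ == "__main__":
    print(physical_objects_query("UCM:Herp"))
```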

I'm very open to better ideas if you happen to know something I don't!

The bots can definitely be made different (maybe even smarter) but yea, magicking Agents from current data seems a bit optimistic. Hopefully as https://www.gbif.org/new-data-model matures those identifiers will be more shared and MS will come to believe that idigbio occurrences aren't real identifiers, but that's not much of a now-solution.

ebraker commented 1 year ago

Strange, I'm not sure why it's not finding these media. Anyhow, for now I'll stick with bulkloads so I can populate desired media metadata. I've revoked access for the media bot for UCM collections, but will keep the morphosource identifier bot so at least arks will be pulled into identifiers in addition to my manual media loads.

dustymc commented 1 year ago

It's not finding the "physical-object" (which is in turn used to find Media). This bit fails:

Screen Shot 2022-10-10 at 3 29 22 PM

and this bit (note it's a different bot) is never attempted without that identifier:

Screen Shot 2022-10-10 at 3 30 26 PM