ArctosDB / arctos

Arctos is a museum collections management system
https://arctos.database.museum

Add GrSciColl institution/collection identifiers to datasets #3953

Closed acdoll closed 2 years ago

acdoll commented 3 years ago

GBIF has some issue flags related to our institution and collection codes not matching directly to identifiers in GrSciColl. E.g.:

(image)

Speaking with Marie Grosjean at GBIF, she suggested including 'institutionid' and 'collectionid' identifiers in our data exports, which will ensure that GBIF records are appropriately linked to the correct institution and collection. She pointed me to this FAQ for determining what values to use for these fields. The values for these identifiers can be found on the institution/collection pages on GrSciColl:

(image)

Conversely, she said I could work with her to ensure DMNS collections are properly identified on their end, but it seems like something we could do easily enough for all Arctos collections. Could it be part of the initial portal setup for new collections? Thoughts?

dustymc commented 3 years ago

Thoughts

WHY is this not a simple webservice?! We're making EMLs, why's GrSciColl apparently not using them? We have persistent resolvable collection identifiers (sorta...), I'd be happy to pass those on and they should be sufficient to stitch everything together.

And what precisely do they DO for us; why would we put any effort into this, what's the return? (A major original goal was unambiguous links with GenBank, but that never became functional so GenBank made their own registry - if any serious discussions come of this, it would be immensely useful to rectify that.)

I don't have any objection to adding some more collection IDs and pushing them to the EMLs, and it's easy enough to do so, but expecting collections to maintain three (at least and probably counting...) completely independent registries is not reasonable, and sharing data via IPT is apparently a lot more complex and manual than seems necessary. Is there any way we can leverage this to make things better for everyone, rather than adding some identifiers just because they exist?

@tucotuco was at the meeting that spawned what's now GrSciColl, and @dbloom is (painfully, apparently!) doing whatever it takes to push DWC to the world - please feel free to jump in here.

See also https://groups.google.com/g/gbif-na/c/7DJWqkMNhYQ/m/VxeZslWUAAAJ

Jegelewicz commented 3 years ago

We're making EMLs, why's GrSciColl apparently not using them? We have persistent resolvable collection identifiers (sorta...),

I think @dbloom might be able to chime in here too.

acdoll commented 3 years ago

We have persistent resolvable collection identifiers (sorta...),

That's what I thought, but apparently what they're getting now doesn't match up to their GrSciColl list. Clearly their lists are not great, check out DMNS; lots of repeats of 'DMNS' where each collection code should be unique (should they all be like 'DMNS:Mamm', or just 'Mamm', 'Bird', ...etc?). I could just get them to fix that for DMNS, but it would be great to get this straight for all collections. I would be happy to work with Marie to find a solution based on the data they are getting now, from EMLs or in the IPT dataset (or other?). It looks like they may have gotten 'DMNS:Mamm' from the "collectionIdentifier" field in the EML file, but the other collections have similar values (e.g., 'DMNS:Bird') that don't appear to have made their list.

Jegelewicz commented 3 years ago

See #3955

dustymc commented 3 years ago

doesn't match up to their GrSciColl list.

http://arctos.database.museum/guid/DMNS:Bird is (sorta...) your "collection identifier." I'm not sending that in - I thought they were demanding "DWC Triplets" (which can't be unique so messes should be expected). I'd not be at all surprised to find that I'm putting the wrong data in LOTS of wrong places, especially in the EML which has no documentation that I can find so is reverse-engineered. Explicit instructions very much appreciated.....

Jegelewicz commented 3 years ago

https://www.gbif.org/ipt

We probably need to watch this and review the manual.....

dustymc commented 3 years ago

Maybe we need non-"sorta" collection IDs: http://test.arctos.database.museum/collection/dmns:inv will work in next release.

I also added that to the detail page (which is just "details" from home.cfm).

Screen Shot 2021-09-23 at 7 33 31 AM

I could redirect the guid version I've been tossing around above as well, but I think it's "cleaner" if /guid/ does one thing and other things use some other /urlbit/.

Jegelewicz commented 3 years ago

detail page (which is just "details" from home.cfm

I'm beginning to feel like that page needs some love and also that you should be able to get to it from any record in the collection rather than it being kinda hidden the way it is now.

Jegelewicz commented 3 years ago

So I watched the video and

  1. How does all of the IPT stuff translate (or does it) to GRSciColl?
  2. I think there is both an eml AND the need to complete all the stuff in the form at the IPT?
  3. I think we should just start having people register and complete this stuff directly at the IPT. People need to understand the level of work required and also be able to make their own updates when necessary. It seems crazy that we tried to make this easier, but apparently just made more work for everyone. If we can't re-generate the eml when changes are made to manage collection, then that process seems a waste of time and effort. Is there any way we can make that not so?

Jegelewicz commented 3 years ago

With regard to 1

there appears to be zero relationship.

DMNS Institution at GBIF - https://www.gbif.org/publisher/a2ef6dd1-8886-48c9-8025-c62bac973cc7

DMNS at GRSciColl - https://www.gbif.org/grscicoll/institution/1757f021-01b8-4d20-a11a-1da09db2d8b2

WHY?!

dustymc commented 3 years ago

or does it

Not that I can tell.

AND the need to complete all the stuff

I'm more or less aware of how it works, but I'm also aware of what was intended and is possible.

stuff directly at the IPT

We demonstrably have trouble getting folks to update the thing they use every day; I still don't see adding to that as practical (and don't forget GenBank). And see below.

we can't re-generate the eml

What?! It's built on demand, the problem isn't making EML, it's that nobody seems able to DO STUFF with the EML - it's apparently just for show. That should change! That's my entire point! There are tools, The Community just isn't using them! I have no idea if that's just ignorance (eg I'm not building the EML correctly), or if there's some sort of development needed (I don't think so???), or ???????????????????

zero relationship...WHY

People typing not-quite-the-same-thing into a whole bunch of forms!

The other usual answer is "cruddy identifiers," and it looks like we may be contributing to that; I can't see any relationship between the EML we're generating and the DWC data we're pushing. Unless someone stops me now-ish, I'm going to change collectionCode in the DWC and collectionIdentifier in the EML to use/share {baseurl}/collection/{guid_prefix}.
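
As a rough illustration of the shared identifier described above, the sketch below builds {baseurl}/collection/{guid_prefix} in one place, so the same string could feed both collectionCode in the DWC and collectionIdentifier in the EML. The function name and the use of Python are assumptions made for illustration; the actual Arctos generator code is not shown in this thread.

```python
from urllib.parse import quote


def collection_identifier(base_url: str, guid_prefix: str) -> str:
    """Return {baseurl}/collection/{guid_prefix} (hypothetical helper).

    The colon in a guid_prefix such as "DMNS:Bird" is kept as-is;
    any other unsafe characters are percent-encoded.
    """
    return f"{base_url.rstrip('/')}/collection/{quote(guid_prefix, safe=':')}"


# Build the value once and reuse it in both outputs so they cannot drift apart.
identifier = collection_identifier("https://arctos.database.museum", "DMNS:Bird")
dwc_fields = {"collectionCode": identifier}
eml_fields = {"collectionIdentifier": identifier}
print(identifier)
```

Deriving both values from a single function is the point of the sketch: if the URL pattern ever changes, the DWC and EML outputs change together.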

Jegelewicz commented 3 years ago

we can't re-generate the eml

What?! It's built on demand, the problem isn't making EML, it's that nobody seems able to DO STUFF with the EML - it's apparently just for show. That should change! That's my entire point! There are tools, The Community just isn't using them! I have no idea if that's just ignorance (eg I'm not building the EML correctly), or if there's some sort of development needed (I don't think so???), or ???????????????????

Apparently, @dbloom makes significant changes to what we generate, that he then has to repeat if we generate a new eml.

But yes, I only use the EML once. I could download an updated EML from Arctos and upload it into an existing resource, but I would lose two thirds of all of the metadata. The stuff I get from Arctos is a good start, but it's only a start. For example, you provide me with a contact name, email, etc..., but much of that stuff doesn't go into the correct fields in the IPT, so I have to move stuff around manually. Then I need to make sure that same info is in three, possibly four, locations throughout the metadata, plus I usually have to add specific information (web pages, phone numbers, etc). Furthermore, if I replace the existing eml with an updated Arctos eml I would have to redo all of the metadata that I don't get from Arctos, such as the mappings to the GBIF publisher, the CC designation, formatting of the map, the taxonomic and temporal scopes, and a whole host of other things. So, yeah, I only use the Arctos EML the first time I create a resource. If that resource was published prior to going into Arctos I don't use the Arctos EML at all.

David Bloom

dustymc commented 3 years ago

Tell me what it should look like and I'll make that happen....

Jegelewicz commented 3 years ago

I don't know; @dbloom will have to tell us.

Jegelewicz commented 3 years ago

Also, doesn't @mkoo have permission to edit at GBIF?

dbloom commented 3 years ago

@Jegelewicz What do you mean "have permission to edit at GBIF?" There are many points of entry through which one could edit GBIF related materials.

As for the EML, here is a sample of metadata that is completed and published: http://ipt.vertnet.org:8080/ipt/eml.do?r=uwymv_bird&v=30.59 (if it doesn't open in the browser as XML you should be able to "view page source" to see it properly, or I can send you a document separately). When @dustymc and I discussed this the last time we recognized that there is a bunch of stuff in there that doesn't necessarily have a correlate in Arctos, not all collections will have the same metadata fields/content, some of this content is generated by the IPT, and some other fields I will probably need to update manually regardless of what we do, but here it is. Happy to discuss more as needed.

Of course, this has nothing to do with the institutional/publisher metadata in the GBIF Registry. I think the idea is to get the Registry and IPT files to work together, but right now, they are separate sets of metadata.

dustymc commented 3 years ago

permission to edit at GBIF

That's no solution, whatever might be intended....

@dbloom I'm not seeing what's functionally different between what I generate (http://test.arctos.database.museum/info/ipt.cfm?guid_prefix=UWYMV%3ABird) and your example. If there's something missing it should be added to Arctos where it can be shared, or added to the generator if it's there, and I think we're all onboard with that (right!?).

I'll go add the orcid, otherwise can you tell me more about what's problematic?

nothing to do with the institutional/publisher metadata

That's the core of what I'm asking to fix, I think (but I can't say I really understand how this all fits together so ???). We've got a fair bit of time in generating EMLs, finding out they don't do anything useful isn't what I had in mind!

Jegelewicz commented 3 years ago

If there's something missing it should be added to Arctos where it can be shared, or added to the generator if it's there, and I think we're all onboard with that (right!?).

I am because

We've got a fair bit of time in generating EMLs, finding out they don't do anything useful isn't what I had in mind!

However, if the idea is that we get everything in GRSciColl correct and that will be the single source of truth, then let's shoot for that. I think the problem right now is that there is no direction and we are left completing information in at least three different places. Maybe we should get GBIF in on this conversation? But who?

dustymc commented 3 years ago

GRSciColl correct and that will be the single source of truth,

So to add an address to Arctos, you'd go to GRSciColl, edit stuff there, then - what? I can't pull any more than I can push....

left completing information in at least three different places

If I could pull from GRSciColl then maybe they could just somehow act as part of the agent UI for Arctos, but I don't think that kind of use is on anyone's radar. I definitely agree that we should be doing this in one place, but I don't think that's GRSciColl.

dustymc commented 3 years ago

EML generator is now picking up orcid. Example:

    <creator>
        <individualName>
            <givenName>Elizabeth</givenName>
            <surName>Wommack</surName>
        </individualName>
        <organizationName>University of Wyoming Museum of Vertebrates</organizationName>
        <positionName>Staff Curator</positionName>
        <address>
            <deliveryPoint>Berry Biodiversity Conservation Center, 1000 E. University Ave.</deliveryPoint>
            <city>Laramie</city>
            <administrativeArea>WY</administrativeArea>
            <postalCode>82071</postalCode>
            <country>USA</country>
        </address>
        <electronicMailAddress>ewommack@uwyo.edu</electronicMailAddress>
        <electronicMailAddress>ravenseyes@gmail.com</electronicMailAddress>
        <userId directory="http://orcid.org/">https://orcid.org/0000-0002-9172-0120</userId>
    </creator>
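
For readers who want to experiment with a fragment like the one above, here is a minimal sketch that produces a creator element with an ORCID userId using Python's standard library. It is not the Arctos EML generator (which is not shown in this thread); the function name and parameters are hypothetical, and only a subset of the fields from the example is included.

```python
import xml.etree.ElementTree as ET


def build_creator(given_name, sur_name, organization, orcid):
    """Build an EML-style <creator> element (hypothetical helper).

    `orcid` is the bare identifier, e.g. "0000-0002-9172-0120".
    """
    creator = ET.Element("creator")
    name = ET.SubElement(creator, "individualName")
    ET.SubElement(name, "givenName").text = given_name
    ET.SubElement(name, "surName").text = sur_name
    ET.SubElement(creator, "organizationName").text = organization
    # Mirror the example above: directory attribute plus the full ORCID URL as text.
    user_id = ET.SubElement(creator, "userId", directory="http://orcid.org/")
    user_id.text = f"https://orcid.org/{orcid}"
    return creator


creator = build_creator(
    "Elizabeth",
    "Wommack",
    "University of Wyoming Museum of Vertebrates",
    "0000-0002-9172-0120",
)
ET.indent(creator)  # pretty-print; requires Python 3.9+
print(ET.tostring(creator, encoding="unicode"))
```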

mkoo commented 3 years ago

Did they change something at the GBIF registry? I have only been editing/adding new collections but maybe need to review all the arctos ones... also can you see a Suggest link on the site if you log on? I think anyone can do that. Should we get in touch with GBIF regarding these fuzzy matches?

Jegelewicz commented 3 years ago

Did they change something at the GBIF registry?

Probably as they ingested stuff from iDigBio and Index Herbariorum and did some smashing together with the stuff they already had.

Jegelewicz commented 3 years ago

https://drive.google.com/file/d/19w6vAgjuZSSSF71UgYB7U5uGc_DK63mu/view

Jegelewicz commented 3 years ago

"sorta" collection IDs: http://test.arctos.database.museum/collection/dmns:inv

Better to use the number? CollectionID: 74

Less readable by humans, but also no reason to change it - ever?

I have no opinion, just thought I would bring it up.

dustymc commented 3 years ago

" non-"sorta" "!!

I think those are good identifiers.

Guid_Prefix should be seen as our most sacred possession; there's nothing more stable.

Collection_id is just a key; like all keys, "slightly easier than the alternative" is sufficient reason to change it.

Jegelewicz commented 3 years ago

@dustymc how do I find out what the id number is for a collection?

dustymc commented 3 years ago

Hu?

Jegelewicz commented 3 years ago

Asking for https://github.com/ArctosDB/new-collections/issues/404#issuecomment-915510941 - need to do this for UTEP:Herb

dustymc commented 3 years ago

Test or prod? If prod, what's broken? If test, why? Does that have something to do with this issue??

You can get collection_id from collection, but that approach is going to melt something "interesting" if there are very many records involved.

Jegelewicz commented 3 years ago

test - nothing to do with this except that it is the ID I need. Trying to get records entered by UA herbarium tester to show up so they can see what they did.

Jegelewicz commented 3 years ago

I just did this for one of the UTEP collections - https://registry.gbif.org/collection/d3957974-8fb6-49b2-8983-37b4b5824381?suggestionId=38

But who has time for that times 215?

dustymc commented 3 years ago

But who has time for that times 215?

That's my whole point here!! (And I thought you were arguing that we just have to find the time?!?)

And FWIW "UTEP:Herb" has about zero chance of being unique and doesn't align with what's in the DWC data nor what the EML generator will suggest - it's just not a useful identifier, suggest using the value from https://arctos.database.museum/collection/UTEP:Herb (which happens to be https://arctos.database.museum/collection/UTEP:Herb)

Screen Shot 2021-09-24 at 1 48 08 PM

dustymc commented 3 years ago

@Jegelewicz UTEP:Herb at test seems to have updated - I told ~40K other records that they were current, which was difficult - that's just not an environment which can support the background tasks. You can update singles with eg select update_flat_row (collection_object_id) from flat where guid='UTEP:Herp:123'

Jegelewicz commented 3 years ago

And FWIW "UTEP:Herb" has about zero chance of being unique and doesn't align with what's in the DWC data nor what the EML generator will suggest - it's just not a useful identifier, suggest using the value from https://arctos.database.museum/collection/UTEP:Herb (which happens to be https://arctos.database.museum/collection/UTEP:Herb)

Yeah - but I think we need to agree on that across all collections and be consistent. Kinda waiting to see what falls out here.

dustymc commented 3 years ago

agree on that across all collections

Nobody stopped me....

Jegelewicz commented 3 years ago

Sent to scientific-collections@gbif.org

I am the project coordinator for Arctos and I would like to discuss how we might directly populate entries in GRSciColl for all of the collections in Arctos. We already hold the information included in GRSciColl in our system and we would prefer that our users have the ability to maintain their information in their collection management system and not need to duplicate effort by copying it to GRSciColl.

For example, the University of Texas at El Paso Biodiversity Collections Herbarium:

Arctos Page GRSciColl Page

There is really no reason these two pages should contain significantly different information and we would like to see if we can make the process of keeping them in sync easier for Arctos collection managers.

I'd be happy to meet and discuss possibilities.

Thank you,

Teresa J. Mayfield-Meyer

Jegelewicz commented 3 years ago

agree on that across all collections

Nobody stopped me....

I think that we need to revisit this. Given the definitions, I think we should do this:

NMMNH:Paleo as example

Institution Code - NMMNHS
Institution ID - https://www.gbif.org/grscicoll/institution/bcc1478b-1409-43c3-a013-69586aa98753
Collection Code - NMMNH:Paleo
Collection ID - https://arctos.database.museum/collection/NMMNH:Paleo
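
To make the proposed mapping concrete, the sketch below assembles the four values into Darwin Core terms (institutionCode, institutionID, collectionCode, collectionID). It only illustrates the proposal in this comment, not what Arctos currently exports; the function name, its parameters, and the idea of storing the GRSciColl institution key per collection are assumptions.

```python
def collection_terms(institution_code, grscicoll_institution_key, guid_prefix,
                     base_url="https://arctos.database.museum"):
    """Return the proposed Darwin Core institution/collection terms (hypothetical helper)."""
    return {
        "institutionCode": institution_code,
        # GRSciColl institution page URL, built from the registry key (a UUID).
        "institutionID": f"https://www.gbif.org/grscicoll/institution/{grscicoll_institution_key}",
        "collectionCode": guid_prefix,
        # Arctos collection URL, as proposed earlier in the thread.
        "collectionID": f"{base_url}/collection/{guid_prefix}",
    }


terms = collection_terms(
    "NMMNHS",
    "bcc1478b-1409-43c3-a013-69586aa98753",
    "NMMNH:Paleo",
)
for term, value in terms.items():
    print(f"{term}: {value}")
```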

dustymc commented 3 years ago

Collection ID

That would break Dave's scripts, and we're not paying him enough for that.

We're providing a good identifier now, I don't see any point in arbitrarily shuffling more things around. If someone wants to talk to us or throw up an API or something - well, we're easy to find....

Jegelewicz commented 3 years ago

That would break Dave's scripts, and we're not paying him enough for that.

We aren't paying him enough anyway, but that's not an excuse for putting data in the wrong bucket. As it is, our records will still not get matched to a collection. We really need a discussion with GBIF, @dbloom and some Arctos people to decide what should go where because I feel that we are not putting our best foot forward.

For instance:

recordedBy - A list (concatenated and separated) of names of people, groups, or organizations responsible for recording the original Occurrence. The primary collector or observer, especially one who applies a personal identifier (recordNumber), should be listed first.

Currently we put not only the collector's name there but also preparators, and we add text that is not expected, such as a role label:

Collector(s): Paul L. Sealey

and we are missing an opportunity by not passing recordedByID - A list (concatenated and separated) of the globally unique identifier for the person, people, groups, or organizations responsible for recording the original Occurrence.

where we could pass

https://arctos.database.museum/agent/21310396

but even better

https://orcid.org/0000-0002-6440-1634
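
A minimal sketch of what that could look like: names go into recordedBy without any role label, and identifiers go into recordedByID, each joined with " | " (the separator Darwin Core recommends for lists). The function name and input shape are hypothetical, and pairing the ORCID above with this collector is an assumption based on how the example reads.

```python
SEPARATOR = " | "  # Darwin Core's recommended list separator


def recorded_by_fields(collectors):
    """Build recordedBy / recordedByID from (name, identifier) pairs (hypothetical helper).

    `collectors` is ordered with the primary collector first; `identifier`
    may be None when no ORCID or agent URL is known.
    """
    names = [name for name, _ in collectors]
    identifiers = [identifier for _, identifier in collectors if identifier]
    return {
        "recordedBy": SEPARATOR.join(names),
        "recordedByID": SEPARATOR.join(identifiers),
    }


# Assumption: the ORCID from the comment above belongs to this collector.
fields = recorded_by_fields(
    [("Paul L. Sealey", "https://orcid.org/0000-0002-6440-1634")]
)
print(fields["recordedBy"])
print(fields["recordedByID"])
```

One caveat with this simple approach: when some collectors lack an identifier, the two lists no longer line up position by position, so a consumer cannot tell which name each identifier belongs to.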