ArctosDB / arctos

Arctos is a museum collections management system
https://arctos.database.museum

Feature Request - Pull in data citation DOIs from GBIF #3482

Closed by ebraker 2 years ago

ebraker commented 3 years ago

It would be neat if Arctos harvested DOIs from GBIF where Arctos data have been cited in the literature.

For instance, here are 131 citations linked to UCM Herp data: https://www.gbif.org/resource/search?contentType=literature&gbifDatasetKey=8935e64a-f762-11e1-a439-00145eb45e9a

Many of these studies involve analyzing large datasets from GBIF, and would likely produce links to many Arctos collections (e.g., the first pub on the list involves UCM, MVZ, and UTEP specimen data).

I'm open to multiple possibilities for the ingestion method and would be curious to know what others think:

  1. Arctos autogenerates Publications for GBIF DOIs and links associated specimens as "vouchers" (or possibly a new citation type, "data voucher"?)
  2. Arctos pulls GBIF DOIs into Reports so that users could "opt in" to creating a publication to cite their specimens. Since many of these studies involve multiple Arctos collections, if one collection moves forward and creates the Publication, the Report notification would simply alert other users that X number of their specimens could also be linked to that particular Publication_ID.

FYI, GBIF tracks data citations this way: https://www.gbif.org/literature-tracking
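For either option, the harvesting step might look roughly like the sketch below. It only builds the search URL for one dataset key and pulls DOIs out of an already-decoded JSON page. The endpoint path (`/v1/literature/search`), the `gbifDatasetKey` parameter and the `results` / `identifiers.doi` response fields are assumptions that should be checked against the GBIF API documentation, and the function names are made up for illustration.

```python
from urllib.parse import urlencode

# ASSUMPTION: endpoint path and parameter names are not confirmed in this thread;
# verify them against the GBIF API documentation before relying on this.
LITERATURE_SEARCH = "https://api.gbif.org/v1/literature/search"


def literature_url(dataset_key, limit=20, offset=0):
    """Build a literature search URL for one GBIF dataset key (hypothetical helper)."""
    params = {"gbifDatasetKey": dataset_key, "limit": limit, "offset": offset}
    return f"{LITERATURE_SEARCH}?{urlencode(params)}"


def extract_dois(page):
    """Return the DOIs found in one decoded JSON page.

    ASSUMPTION: each record sits under "results" and carries its DOI at
    record["identifiers"]["doi"]. Records without a DOI are skipped.
    """
    dois = []
    for record in page.get("results", []):
        doi = (record.get("identifiers") or {}).get("doi")
        if doi:
            dois.append(doi.lower())  # DOIs are case-insensitive, so normalize
    return dois
```

Fetching `literature_url("8935e64a-f762-11e1-a439-00145eb45e9a")` with any HTTP client and passing the decoded JSON to `extract_dois` would give the DOI list for UCM Herp; paging would repeat the call with a larger `offset`.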

Jegelewicz commented 3 years ago

This would be very useful, BUT

Just looking at the first DOI - which included 34 UTEP:Herp records, I can't see any clear path to a citation of individual records in the publication, and there really aren't any other than this:

Map of the continental US showing established populations of Mediterranean House Geckos (Hemidactylus turcicus). Data were assembled from a review of the literature (R. E. Espinoza and G. B. Pauly unpubl. data), and vouchered (specimen or photo) records from HerpMapper, iNaturalist, and GBIF with an end date of January 2020

I wasn't able to download the data from the DOI (got a secure connection timeout), but I guess that would be the way to determine which catnos from UTEP:Herp were involved? There could be a fair bit of processing to get a list of GUIDs out, but seems like it could be doable and given the above, the idea of a different kind of citation - "data voucher" is appealing. How should we define that?
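A rough sketch of that processing step, assuming the download is a tab-delimited file with an `occurrenceID` column and that Arctos records carry their GUID in that column as `.../guid/<institution>:<collection>:<catalog number>`. Both the column layout and the GUID pattern are assumptions to verify against a real download, and `guids_from_download` is a hypothetical name.

```python
import csv
import re

# ASSUMPTION: occurrenceID looks like ".../guid/UTEP:Herp:<catno>", optionally
# followed by a query string. Adjust the pattern if real downloads differ.
GUID_PATTERN = re.compile(r"/guid/([^:/?\s]+:[^:/?\s]+:[^?\s]+)")


def guids_from_download(lines, collection_prefix):
    """Return sorted, de-duplicated GUIDs for one collection.

    `lines` is any iterable of text lines (for example an open file) holding a
    tab-delimited table with a header row that includes "occurrenceID".
    """
    reader = csv.DictReader(lines, delimiter="\t", quoting=csv.QUOTE_NONE)
    guids = set()
    for row in reader:
        match = GUID_PATTERN.search(row.get("occurrenceID") or "")
        if match and match.group(1).startswith(collection_prefix + ":"):
            guids.add(match.group(1))
    return sorted(guids)
```

Called as `guids_from_download(open_file, "UTEP:Herp")`, this would yield the list of catalog records that a "data voucher" citation could point at.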

dustymc commented 3 years ago

@Jegelewicz @campmlc @ccicero @mkoo and I talked about this at some point - my part of that's below.


On Thu, Apr 30, 2020 at 12:09 PM Mariel Campbell campbell@carachupa.org wrote:

Can we go to GBIF to COA portal and grab whatever GBIF says for the number of COA citations and cite that? :)

For a proposal, heck yea! If GBIF gets away with bragging about made-up numbers I don't see why we can't too!

https://www.gbif.org/literature-tracking is a bit about where GBIF comes up with these data.

I can't find any way to query publications in their API. https://www.gbif.org/resource/search?contentType=literature&literatureType=journal&relevance=GBIF_USED&publishingOrganizationKey=3988de20-0560-11d8-b851-b8a03c50a862&peerReview=true is "Arctos"

You can clear that and set dataset to select a specific collection

An Annotated Checklist of Fishes of Amami-oshima Island, the Ryukyu Islands, Japan "used" an MVZ:Bird....

You can click on any DOI to see what was downloaded - eg, https://www.gbif.org/occurrence/download/0003822-180508205500799

215,046,618 occurrences downloaded

I can't find one that "cites" less than a million records.

Not sure if this is fun or just depressing.

campmlc commented 3 years ago

I so wish we could figure this out, as we would then have found the holy grail of museum bioinformatics and maybe could get some funding! I always feel like we are saying "yawp" as loud as we can from the top of a dandelion - but nobody hears.


dustymc commented 3 years ago

figure this out

There is a great reluctance to just demand good identifiers from the community (see the conversations linked from https://github.com/ArctosDB/internal/issues/86), and GBIF's current licensing scheme seems completely incompatible with "us" doing so. It's hard to see how the larger community might progress while GBIF is encouraging "215,046,618 plants" (https://doi.org/10.15468/dl.yubndf) as a citation (and it's heavily condensed data for those from Arctos).

Sharing only bare-bones data with GBIF (because they disallow any license that includes a "cite this thusly" clause) might be worth exploring, but it would probably result in users just not finding your stuff.

Arctos absolutely allows "cite this thusly" licenses and loan agreements and whatever else we can come up with. The new transaction form is an interesting way to see publications, and it's HARD to find one that cites anything in there. Perhaps we should do much better closer to home before we worry too much about what's happening at GBIF. (And a "developing best practices for loan-junk" workshop, which I think would help with that, still seems fundable.)

dustymc commented 3 years ago

Quick quantification of "HARD":


select 
  count(distinct(loan.transaction_id)) as numberofloans,
  count(distinct(project_trans.project_id)) as numberofprojects,
  count(distinct(project_publication.publication_id)) as numberofpubs,
  count(distinct(citation.publication_id)) as numberofcitingpubs
 from loan
 left outer join project_trans on loan.transaction_id=project_trans.transaction_id
 left outer join project_publication on project_trans.project_id=project_publication.project_id
 left outer join citation on project_publication.publication_id=citation.publication_id
 ;

 numberofloans | numberofprojects | numberofpubs | numberofcitingpubs 
---------------+------------------+--------------+--------------------
         10903 |             2017 |          887 |                334

Yikes.
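To put those counts in proportion, here is a small arithmetic sketch using only the four numbers from the query result. The ratios are rates over distinct counts, not the share of loans that led to a publication, since one project can have several publications. The variable and function names are illustrative only.

```python
# Distinct counts copied from the query result above.
counts = {"loans": 10903, "projects": 2017, "pubs": 887, "citing_pubs": 334}


def per_hundred(part, whole):
    """Return `part` per 100 of `whole`, rounded to one decimal place."""
    return round(100 * part / whole, 1)


projects_per_100_loans = per_hundred(counts["projects"], counts["loans"])
pubs_per_100_loans = per_hundred(counts["pubs"], counts["loans"])
citing_share_of_pubs = per_hundred(counts["citing_pubs"], counts["pubs"])

print(f"projects per 100 loans:     {projects_per_100_loans}")
print(f"publications per 100 loans: {pubs_per_100_loans}")
print(f"% of pubs with a citation:  {citing_share_of_pubs}")
```

By this count, well under half of the publications reachable from a loan cite any specimen at all.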

dustymc commented 2 years ago

Very tentatively tabling this. I don't see a workable automagic approach, but I'm absolutely interested in better capturing "secondary usage" if possible, so please recategorize if I'm missing something or new tools come along.

Perhaps we need some sort of "maybe you want to periodically go crawl around in GBIF and see if you can make sense of any alleged citations" documentation or something??

Jegelewicz commented 2 years ago

I'm gonna reopen this - I've just been on a Tweet storm with David Shorthouse, Rod Page, Deb Paul, Beckett Sterner, Donat Agosti and Tim Robertson (GBIF) about this very thing. See https://twitter.com/dpsSpiders/status/1491809266423971846?s=20&t=qCk1ROe_49JB6_SrstqL7w

A FAIROS grant was brought into the conversation by Beckett

@mkoo @campmlc