SpeciesFileGroup / taxonworks

Workbench for biodiversity informatics.
http://taxonworks.org
MIT License

DwC-Importer: Import taxonID/occurrenceID as identifiers for checklist/occurrence #2339

Open LocoDelAssembly opened 3 years ago

LocoDelAssembly commented 3 years ago

Add UI to pick the namespace, both from the settings button and most likely when uploading the dataset as well. Possibly allow the user to opt for globally unique IDs (which Identifier::Global subtype is suitable for this? Do we need a new one? @mjy)

Note: For both taxonID and occurrenceID it is recommended to use a globally unique ID, but uniqueness within the dataset is the only desired (or hard?) requirement (not to be confused with the record ID which MUST BE unique within the dataset).
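The within-dataset uniqueness requirement can be checked before import. The following is a minimal sketch in plain Python, not TaxonWorks code: the function name `duplicate_ids` and the `occ-*` values are made up for illustration, and it assumes a tab-separated file with a header row.

```python
import csv
import io
from collections import Counter


def duplicate_ids(tsv_text, column="occurrenceID"):
    """Return the values of `column` that occur more than once, sorted.

    Hypothetical helper: assumes tab-separated text with a header row.
    Empty values are ignored rather than counted as duplicates.
    """
    reader = csv.DictReader(
        io.StringIO(tsv_text), delimiter="\t", quoting=csv.QUOTE_NONE
    )
    counts = Counter(row[column] for row in reader if row.get(column))
    return sorted(value for value, n in counts.items() if n > 1)


# Made-up sample: "occ-1" is used twice, so the dataset is not valid.
sample = (
    "occurrenceID\tcatalogNumber\n"
    "occ-1\tCASENT0101694\n"
    "occ-2\tCASENT0101695\n"
    "occ-1\tCASENT0101696\n"
)
print(duplicate_ids(sample))
```

The same call with `column="taxonID"` would cover the checklist case.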

mjy commented 3 years ago

@LocoDelAssembly we need example values to confirm types. We'll get @bpescador to summarize today and maybe here. Only leaf classes of identifiers can be used (not Identifier::Application). There are decent validations on all global identifier types to constrain their use.

We probably need an option to select from allowed/recommended types?

bpescador commented 3 years ago

Also note that the script developed by Dash is in the repo and can be tested on antweb data. If you want a fresh dataset, let me know how to send it.

LocoDelAssembly commented 3 years ago

To prevent pollution of global namespaces I want to support local identifiers first.

To facilitate the import, which values would work best to create a namespace for the dataset when it is first uploaded? (Let's think about UI to override suggested values and/or pick existing namespaces later.)

When the DwC import dataset is created the user is required to provide a description (unique within project), so it might be used as part of the namespace name?

LordFlashmeow commented 3 years ago

@LocoDelAssembly Is there a way to specify only one of (catalogNumber|occurrenceID)?

It crashes (messily) if occurrenceID isn't provided, but only catalogNumber is parsed for the namespace.

When I provide both, it creates two identifiers: (Identifier::Local::CatalogNumber | CASENT0101694) and (Identifier::Local::Import | occurrenceID:CASENT0101694).

mjy commented 3 years ago

@LocoDelAssembly I think perhaps we should allow catalogNumber without occurrenceID. The semantics of occurrenceID in TW likely don't align with the vast majority of existing occurrenceIDs, unless they are also UUIDs.

LocoDelAssembly commented 3 years ago

catalogNumber is optional. occurrenceID was required to detect which type of DwC-A the file is, but it is now also required for the Identifier::Local::Import identifier (although the importer was supposed to forgo setting that up if occurrenceID is empty).

Can you provide an example that triggers bad behavior?

LordFlashmeow commented 3 years ago

Here's the same sample file as in the other issue.

ant_small.txt

image

image


And when only the occurrenceID is given, this is what it shows: image

LocoDelAssembly commented 3 years ago

The second one is because TW shows the first identifier (also the reason why in the code I set up occurrenceID AFTER catalogNumber).

Tried your example, and after fixing the non-ISO date in dateIdentified I get this: image (Quotes appear in the scientific name authorship because the source file has them; the importer is not expecting quoted strings, and TAB indicates end-of-column.)

Desired behavior would be to show no identifier other than catalog numbers in the blue box at the left of "det."?
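The quoted authorship mentioned above follows from treating TAB as the only column delimiter and not interpreting quotes. A small sketch with Python's standard `csv` module shows the difference; it illustrates the parsing rule only, not the importer's actual code, and the sample row (`Aus bus`, `Smith, 1900`) is invented.

```python
import csv
import io

# Invented sample row: the authorship column is wrapped in double quotes.
line = 'Aus bus\t"(Smith, 1900)"\n'


def split_row(text, quoting):
    """Split one tab-separated line using the given csv quoting mode."""
    return next(csv.reader(io.StringIO(text), delimiter="\t", quoting=quoting))


# QUOTE_NONE: quotes are ordinary characters, so they survive into the value.
raw = split_row(line, csv.QUOTE_NONE)

# QUOTE_MINIMAL (the csv default): quotes are treated as field wrappers and removed.
unquoted = split_row(line, csv.QUOTE_MINIMAL)

print(raw)
print(unquoted)
```

Removing the quotes from the source file before upload avoids the issue without any change to the importer.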

LordFlashmeow commented 3 years ago

Correct. It would be ideal if the occurrenceID:CASENT identifier was hidden. We have no use for it, and it just clutters things up. Would it be possible to use the catalogNumber as the occurrenceID when none is given, so we don't need to duplicate it?

(We have the namespace CASENT registered with the separator set to NONE, so the catalogNumber displays properly instead of CAS-ENT-CASENT#.)

LocoDelAssembly commented 3 years ago

occurrenceID is supposed to be kind of primary key for DwC records and would be important to have one to aid future data updates. At least some identifier should be available. occurrenceID need not be equal to catalogNumber, just unique within the dataset (but should be something predictable that you'd use again on a future dataset if referring to the same occurrence).

@mjy perhaps evaluate which types of identifiers are eligible to be shown in the title? Identifier::Local::Import should likely be one of the types to be hidden (but still show them in the identifiers lists and as given identifiers in browse collection object)

LordFlashmeow commented 3 years ago

Our SpecimenCodes are unique within the dataset (and refer to the same specimen across all imports).

I guess it makes sense to have a field for storing the original value, since the namespace mapping could change in the future and the resulting catalogNumber would no longer match the original one.

I can't think of a simple way to avoid duplicating the data (that works in our case and others).

Thoughts? Maybe a checkbox in settings Create occurrenceIDs from collectionCodes? Maybe format the Identifier::Local::Import record differently in the collection object views? Make it obvious that it's the original, and not necessarily "official" or "preferred".

LordFlashmeow commented 3 years ago

As discussed in the video call today: can we detect if the occurrenceID and collectionCode are identical, and if so don't create a Local::Import record?
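A rough sketch of that rule, in plain Python rather than the importer's Ruby: the function name `identifiers_to_create` is hypothetical, the identifier type names are the ones quoted earlier in this thread, and the comparison field defaults to catalogNumber as an assumption (pass `compare_field="collectionCode"` if that is what was meant).

```python
def identifiers_to_create(row, compare_field="catalogNumber"):
    """Return (identifier_type, value) pairs for one DwC row.

    Hypothetical helper, not the TaxonWorks implementation. The import
    identifier is skipped when occurrenceID is empty or identical to the
    value in `compare_field`.
    """
    catalog_number = (row.get("catalogNumber") or "").strip()
    occurrence_id = (row.get("occurrenceID") or "").strip()
    compare_value = (row.get(compare_field) or "").strip()

    result = []
    # Catalog number first, so it is the identifier shown in the title.
    if catalog_number:
        result.append(("Identifier::Local::CatalogNumber", catalog_number))
    if occurrence_id and occurrence_id != compare_value:
        result.append(("Identifier::Local::Import", occurrence_id))
    return result


same = identifiers_to_create(
    {"catalogNumber": "CASENT0101694", "occurrenceID": "CASENT0101694"}
)
different = identifiers_to_create(
    {"catalogNumber": "CASENT0101694", "occurrenceID": "occ-1"}
)
print(same)
print(different)
```

The trade-off is that rows skipped this way carry no separate record of the original occurrenceID, which matters if the namespace mapping changes later.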

mjy commented 3 years ago

@LocoDelAssembly - remind me, are we adding an Import identifier that references the namespace for this import to all objects created, or just the CollectionObject?

LocoDelAssembly commented 3 years ago

CollectionObject with occurrenceID.

debpaul commented 3 years ago

@LocoDelAssembly wrote:

occurrenceID is supposed to be kind of primary key for DwC records and would be important to have one to aid future data updates. At least some identifier should be available. occurrenceID need not be equal to catalogNumber, just unique within the dataset (but should be something predictable that you'd use again on a future dataset if referring to the same occurrence).

To clarify (for the future): not only would occurrenceID be a sort of primary key, it ought to be globally unique as well (likely a UUID).

LocoDelAssembly commented 3 years ago

mmmmhh, yes, the wording leans more strongly towards globally unique than it does for taxonID and eventID. In practice GBIF's validator checks uniqueness within the dataset.

Summary of ID fields:

occurrenceID
  Identifier: http://rs.tdwg.org/dwc/terms/occurrenceID
  Definition: An identifier for the Occurrence (as opposed to a particular digital record of the occurrence). In the absence of a persistent global unique identifier, construct one from a combination of identifiers in the record that will most closely make the occurrenceID globally unique.
  Comments: Recommended best practice is to use a persistent, globally unique identifier.
  Examples: http://arctos.database.museum/guid/MSB:Mamm:233627, 000866d2-c177-4648-a200-ead4007051b9, urn:catalog:UWBM:Bird:89776

eventID
  Identifier: http://rs.tdwg.org/dwc/terms/eventID
  Definition: An identifier for the set of information associated with an Event (something that occurs at a place and time). May be a global unique identifier or an identifier specific to the data set.
  Comments: (none)
  Examples: INBO:VIS:Ev:00009375

taxonID
  Identifier: http://rs.tdwg.org/dwc/terms/taxonID
  Definition: An identifier for the set of taxon information (data associated with the Taxon class). May be a global unique identifier or an identifier specific to the data set.
  Comments: (none)
  Examples: 8fa58e08-08de-4ac1-b69c-1235340b7001, 32567, https://www.gbif.org/species/212

locationID
  Identifier: http://rs.tdwg.org/dwc/terms/locationID
  Definition: An identifier for the set of location information (data associated with dcterms:Location). May be a global unique identifier or an identifier specific to the data set.
  Comments: (none)
  Examples: https://opencontext.org/subjects/768A875F-E205-4D0B-DE55-BAB7598D0FD1

identificationID
  Identifier: http://rs.tdwg.org/dwc/terms/identificationID
  Definition: An identifier for the Identification (the body of information associated with the assignment of a scientific name). May be a global unique identifier or an identifier specific to the data set.
  Comments: (none)
  Examples: 9992
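
The occurrenceID definition above suggests constructing an identifier from other fields in the record when no persistent one exists. One possible sketch follows, modelled on the urn:catalog:UWBM:Bird:89776 example: the function name `fallback_occurrence_id` is hypothetical, and the choice of institutionCode, collectionCode and catalogNumber as the combination is an assumption, not something the importer is known to do.

```python
def fallback_occurrence_id(row):
    """Return the row's occurrenceID, or build one from other DwC fields.

    Hypothetical helper. Assumes institutionCode + collectionCode +
    catalogNumber is an acceptable combination; returns None when any of
    the three is missing, rather than emitting a partial identifier.
    """
    existing = (row.get("occurrenceID") or "").strip()
    if existing:
        return existing

    fields = ("institutionCode", "collectionCode", "catalogNumber")
    parts = [(row.get(name) or "").strip() for name in fields]
    if not all(parts):
        return None
    return "urn:catalog:" + ":".join(parts)


built = fallback_occurrence_id(
    {"institutionCode": "UWBM", "collectionCode": "Bird", "catalogNumber": "89776"}
)
print(built)
```

An identifier built this way is only as stable as its parts, so it stays predictable across re-imports only if those three fields do not change.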