Open LocoDelAssembly opened 3 years ago
@LocoDelAssembly we need example values to confirm types. We'll get @bpescador to summarize today and maybe here. Only leaf classes of identifiers can be used (not Identifier::Application). There are decent validations on all global identifier types to constrain their use.
We probably need an option to select from allowed/recommended types?
Also note that the script developed by Dash is in the repo and can be tested on antweb data. If you want a fresh dataset let me know how to send it.
To prevent pollution of global namespaces I want to support local identifiers first.
To facilitate the import, which values would work best to create a namespace for the dataset when it is first uploaded? (Let's think about UI to override suggested values and/or pick existing namespaces later.)
When the DwC import dataset is created the user is required to provide a description (unique within project), so it might be used as part of the namespace name?
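A minimal sketch of how a suggested namespace could be derived from the dataset description. The function name `suggest_namespace`, the `DwC import:` prefix, the returned keys, and the length limit are all assumptions for illustration, not existing TaxonWorks code.

```python
import re
import unicodedata


def suggest_namespace(description: str, max_short_len: int = 32) -> dict:
    """Suggest namespace values from an import dataset description.

    Hypothetical helper: the key names and formatting rules are assumptions.
    """
    # Fold accented characters to ASCII so the short name stays portable.
    ascii_text = (
        unicodedata.normalize("NFKD", description)
        .encode("ascii", "ignore")
        .decode("ascii")
    )
    # Collapse every run of non-alphanumeric characters into one underscore.
    slug = re.sub(r"[^A-Za-z0-9]+", "_", ascii_text).strip("_").upper()
    if not slug:
        raise ValueError("description must contain at least one letter or digit")
    return {
        "name": f"DwC import: {description.strip()}",
        "short_name": slug[:max_short_len].rstrip("_"),
    }


suggestion = suggest_namespace("AntWeb specimens 2021")
print(suggestion["name"], "/", suggestion["short_name"])
```

Because the description is unique within the project, the suggested name would be too. The truncated short name could still collide, so the user should be able to override it.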
@LocoDelAssembly Is there a way to specify only one of (catalogNumber|occurrenceID)?
It crashes (messily) if occurrenceID isn't provided, but only catalogNumber is parsed for the namespace.
When I provide both, it creates two identifiers: Identifier::Local::CatalogNumber | CASENT0101694 and Identifier::Local::Import | occurrenceID:CASENT0101694
@LocoDelAssembly I think perhaps we should allow catalogNumber without occurrenceID. The semantics of occurrenceID in TW likely don't align with the vast majority of existing occurrenceIDs, unless they are also UUIDs.
catalogNumber is optional. occurrenceID was required to detect which type of DwC-A it is, but is now also required for the Identifier::Local::Import identifier (although the importer was supposed to skip creating it if empty).
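The intended behavior (both fields optional, nothing created for an empty value) could look roughly like the sketch below. The function name `identifiers_for_row` and the tuple representation are hypothetical. The type names and the `occurrenceID:` value prefix are taken from the identifiers reported earlier in this thread.

```python
from typing import Dict, List, Tuple


def identifiers_for_row(row: Dict[str, str]) -> List[Tuple[str, str]]:
    """Return (identifier type, value) pairs for one DwC occurrence row.

    Blank or missing fields are skipped instead of raising an error.
    """
    identifiers: List[Tuple[str, str]] = []
    catalog_number = (row.get("catalogNumber") or "").strip()
    occurrence_id = (row.get("occurrenceID") or "").strip()

    # catalogNumber is added first so it is the first identifier on the record.
    if catalog_number:
        identifiers.append(("Identifier::Local::CatalogNumber", catalog_number))
    if occurrence_id:
        identifiers.append(
            ("Identifier::Local::Import", f"occurrenceID:{occurrence_id}")
        )
    return identifiers


print(identifiers_for_row({"catalogNumber": "CASENT0101694"}))
```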
Can you provide an example that triggers bad behavior?
Here's the same sample file as in the other issue.
And when only the occurrenceID is given, this is what it shows:
The second one is because TW shows the first identifier (also the reason why, in the code, I set up occurrenceID AFTER catalogNumber).
Tried your example, and after fixing the non-ISO date in dateIdentified I get this:
(Quotes appear in the scientific name authorship because the source file has them. The importer is not expecting quoted strings; TAB indicates end-of-column.)
Desired behavior would be to show no identifier other than catalog numbers in the blue box at the left of "det."?
Correct. It would be ideal if the occurrenceID:CASENT identifier was hidden. We have no use for it, and it just clutters things up. Would it be possible to use the catalogNumber as the occurrenceID if not given, so we don't need to duplicate it?
(We have the namespace CASENT registered with the separator set to NONE, so the catalogNumber displays properly instead of CAS-ENT-CASENT#.)
occurrenceID is supposed to be a kind of primary key for DwC records, and it would be important to have one to aid future data updates. At least some identifier should be available. occurrenceID need not be equal to catalogNumber, just unique within the dataset (but it should be something predictable that you'd use again in a future dataset if referring to the same occurrence).
@mjy perhaps evaluate which types of identifiers are eligible to be shown in the title? Identifier::Local::Import should likely be one of the types to be hidden (but still shown in the identifier lists and as given identifiers in browse collection object).
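A rough sketch of that filtering idea: a set of types hidden from the title, with the full list left untouched for the other views. `HIDDEN_IN_TITLE` and `title_identifiers` are hypothetical names, and which types belong in the set is still the open question above.

```python
from typing import List, Tuple

# Assumption: only the import identifier type is hidden from titles.
HIDDEN_IN_TITLE = {"Identifier::Local::Import"}


def title_identifiers(identifiers: List[Tuple[str, str]]) -> List[Tuple[str, str]]:
    """Keep the identifiers eligible for the title, preserving their order."""
    return [pair for pair in identifiers if pair[0] not in HIDDEN_IN_TITLE]


all_identifiers = [
    ("Identifier::Local::CatalogNumber", "CASENT0101694"),
    ("Identifier::Local::Import", "occurrenceID:CASENT0101694"),
]

# The title uses the filtered list; identifier lists keep using all_identifiers.
print(title_identifiers(all_identifiers))
```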
Our SpecimenCodes are unique within the dataset (and refer to the same specimen across all imports).
I guess it makes sense to have a field for storing the original value, since the namespace mapping could change in the future and the resulting catalogNumber would no longer match the original one.
I can't think of a simple way to avoid duplicating the data (that works in our case and others).
occurrenceID records from the collectionCode, it doesn't work for imports that provide the identifier and namespace separately. occurrenceID and creating a collectionCode from that. Thoughts? Maybe a checkbox in settings, "Create occurrenceIDs from collectionCodes"? Maybe format the Identifier::Local::Import record differently in the collection object views? Make it obvious that it's the original, and not necessarily "official" or "preferred".
As discussed in the video call today: can we detect if the occurrenceID and collectionCode are identical, and if so not create a Local::Import record?
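The check could be as small as the sketch below. `needs_import_identifier` is a hypothetical name, and the value to compare against is passed in as a parameter, so the caller decides which field it comes from.

```python
from typing import Optional


def needs_import_identifier(
    occurrence_id: Optional[str], existing_value: Optional[str]
) -> bool:
    """Decide whether a separate import identifier is worth creating.

    Returns False when occurrenceID is blank or merely repeats a value
    that is already stored in another identifier.
    """
    occurrence_id = (occurrence_id or "").strip()
    existing_value = (existing_value or "").strip()
    if not occurrence_id:
        return False
    return occurrence_id != existing_value


print(needs_import_identifier("CASENT0101694", "CASENT0101694"))
```

Whether the comparison should ignore case or surrounding whitespace is a design choice. This sketch trims whitespace only.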
@LocoDelAssembly - remind me, are we adding an Import identifier that references the namespace for this import to all objects created, or just the CollectionObject?
CollectionObject with occurrenceID.
@LocoDelAssembly wrote:
occurrenceID is supposed to be kind of primary key for DwC records and would be important to have one to aid future data updates. At least some identifier should be available. occurrenceID need not be equal to catalogNumber, just unique within the dataset (but should be something predictable that you'd use again on a future dataset if referring to the same occurrence).
To clarify (for the future): not only would occurrenceID be a sort of primary key, it ought to be globally unique as well (likely a UUID).
Mmmmhh, yes, the wording leans more strongly towards globally unique than it does for taxonID and eventID. In practice, GBIF's validator checks uniqueness within the dataset.
Summary of ID fields:
occurrenceID | - |
---|---|
Identifier | http://rs.tdwg.org/dwc/terms/occurrenceID |
Definition | An identifier for the Occurrence (as opposed to a particular digital record of the occurrence). In the absence of a persistent global unique identifier, construct one from a combination of identifiers in the record that will most closely make the occurrenceID globally unique. |
Comments | Recommended best practice is to use a persistent, globally unique identifier. |
Examples | http://arctos.database.museum/guid/MSB:Mamm:233627, 000866d2-c177-4648-a200-ead4007051b9, urn:catalog:UWBM:Bird:89776 |

eventID | - |
---|---|
Identifier | http://rs.tdwg.org/dwc/terms/eventID |
Definition | An identifier for the set of information associated with an Event (something that occurs at a place and time). May be a global unique identifier or an identifier specific to the data set. |
Comments | |
Examples | INBO:VIS:Ev:00009375 |

taxonID | - |
---|---|
Identifier | http://rs.tdwg.org/dwc/terms/taxonID |
Definition | An identifier for the set of taxon information (data associated with the Taxon class). May be a global unique identifier or an identifier specific to the data set. |
Comments | |
Examples | 8fa58e08-08de-4ac1-b69c-1235340b7001, 32567, https://www.gbif.org/species/212 |

locationID | - |
---|---|
Identifier | http://rs.tdwg.org/dwc/terms/locationID |
Definition | An identifier for the set of location information (data associated with dcterms:Location). May be a global unique identifier or an identifier specific to the data set. |
Comments | |
Examples | https://opencontext.org/subjects/768A875F-E205-4D0B-DE55-BAB7598D0FD1 |

identificationID | - |
---|---|
Identifier | http://rs.tdwg.org/dwc/terms/identificationID |
Definition | An identifier for the Identification (the body of information associated with the assignment of a scientific name). May be a global unique identifier or an identifier specific to the data set. |
Comments | |
Examples | 9992 |
Add UI to pick the namespace, both in the settings button and likely when uploading the dataset as well. Possibly allow the user to opt for globally unique IDs (which Identifier::Global subtype is suitable for this? Do we need a new one? @mjy)
Note: For both taxonID and occurrenceID it is recommended to use a globally unique ID, but uniqueness within the dataset is the only desired (or hard?) requirement (not to be confused with the record ID which MUST BE unique within the dataset).
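To illustrate the two levels of uniqueness discussed here, a small sketch that reports values repeated within a dataset and checks whether a value parses as a UUID. The helper names `duplicate_ids` and `is_uuid` are hypothetical. This is not how GBIF's validator or the importer is implemented. The sample values come from the term examples above.

```python
import uuid
from collections import Counter
from typing import Iterable, List


def duplicate_ids(values: Iterable[str]) -> List[str]:
    """Return the non-blank values that occur more than once, sorted."""
    counts = Counter(v.strip() for v in values if v and v.strip())
    return sorted(value for value, count in counts.items() if count > 1)


def is_uuid(value: str) -> bool:
    """True if the value parses as a UUID (one common form of global ID)."""
    try:
        uuid.UUID(value)
    except ValueError:
        return False
    return True


occurrence_ids = [
    "000866d2-c177-4648-a200-ead4007051b9",
    "urn:catalog:UWBM:Bird:89776",
    "urn:catalog:UWBM:Bird:89776",
]
print(duplicate_ids(occurrence_ids))
print([is_uuid(v) for v in occurrence_ids])
```

A value that is unique within the dataset but not a UUID (for example `32567`) would pass the first check and fail the second. That is the case where a local identifier with a dataset namespace fits better than a global one.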