ecotaxa / ecotaxa_front

Front end of the EcoTaxa application

Export identification status (validated, dubious, predicted) to DWCA #764

Open jiho opened 2 years ago

jiho commented 2 years ago

Currently, we export only validated objects in DWCA (@grololo06, can you confirm?)

A proposal is underway (by @PatriciaCabrera) to use the DarwinCore field identificationVerificationStatus to indicate the status: "Verified by human", "Dubious according to human", "Predicted by machine".

This maps directly to the statuses in EcoTaxa. 🥳

But an occurrence in the occurrence.txt file of a DWCA (i.e. a line) can only have one identificationVerificationStatus. This means that, to use this field, the abundances/concentrations/biovolumes would need to be summed by sample + taxon + status. For a taxon that has objects of all three statuses, there would then be three lines in occurrence.txt and three lines in emof.txt, each with the concentration corresponding to the objects with the given status. It would then be the responsibility of the user of the data to decide whether to sum all three (and risk mistakes), keep only the validated ones (and risk underestimating concentration), etc.
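To make the sample + taxon + status summation concrete, here is a minimal sketch in Python. It is not EcoTaxa's export code: the status names, the `STATUS_LABELS` mapping, the function `sum_by_sample_taxon_status` and the record layout are all hypothetical, chosen only to illustrate the grouping.

```python
from collections import defaultdict

# Hypothetical mapping from status names to the proposed
# identificationVerificationStatus labels (an assumption, not EcoTaxa's codes).
STATUS_LABELS = {
    "validated": "Verified by human",
    "dubious": "Dubious according to human",
    "predicted": "Predicted by machine",
}


def sum_by_sample_taxon_status(objects):
    """Sum per-object concentrations by (sample, taxon, status label)."""
    totals = defaultdict(float)
    for obj in objects:
        key = (obj["sample"], obj["taxon"], STATUS_LABELS[obj["status"]])
        totals[key] += obj["concentration"]
    return dict(totals)


# Made-up objects: one taxon in one sample, with all three statuses present.
objects = [
    {"sample": "s1", "taxon": "Copepoda", "status": "validated", "concentration": 2.0},
    {"sample": "s1", "taxon": "Copepoda", "status": "validated", "concentration": 1.0},
    {"sample": "s1", "taxon": "Copepoda", "status": "predicted", "concentration": 0.5},
    {"sample": "s1", "taxon": "Copepoda", "status": "dubious", "concentration": 0.25},
]

rows = sum_by_sample_taxon_status(objects)
for (sample, taxon, status), concentration in rows.items():
    print(sample, taxon, status, concentration)
```

Each key of `rows` would correspond to one line in occurrence.txt and one in emof.txt, so this taxon yields three lines instead of one.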

jiho commented 2 years ago

Also tagging @rubenpp7

PatriciaCabrera commented 2 years ago

Update: in EurOBIS, the DarwinCore field identificationVerificationStatus will only take two values, "Predicted by machine" and "Verified by human". We will not use "Dubious according to human".

grololo06 commented 1 year ago

Indeed, as of today, what is not Verified by human is just filtered out. I guess that the present issue needs to be exposed to users (via API). E.g. do we want to do it always or as a choice? Are there variations in such choice?

grololo06 commented 1 year ago

Code browsing:

grololo06 commented 1 year ago

Doc browsing:

grololo06 commented 1 year ago

An example with a mix of Predicted and Validated occurrences. The corresponding Emofs distinguish the two different occurrences inside the same sample.

jiho commented 1 year ago

  • it looks like identifiedBy field is needed for validated images. I guess it's all people involved in identification of any object in this taxon. Could be quite long.

We decided to mention only the latest validator, who has the authority on the validation. This field is therefore used to "know who to blame" 😉 Previous validators will be "thanked" through the co-authorship of the dataset.

Since one occurrence corresponds to one or more objects in EcoTaxa, this should be the concatenated list of all validators (separated by | )
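As an illustration of the concatenation, a small Python sketch follows. The function name `identified_by`, the input (one latest-validator name per object in the occurrence), the de-duplication and the exact spacing around the `|` separator are assumptions, not something decided in this thread.

```python
def identified_by(latest_validators):
    """Join distinct validator names with ' | ', keeping first-seen order.

    Empty or missing names are skipped; de-duplication is an assumption.
    """
    unique = list(dict.fromkeys(name for name in latest_validators if name))
    return " | ".join(unique)


# Made-up names: the latest validator of each object in one occurrence.
value = identified_by(["A. Validator", "B. Validator", "A. Validator", None])
print(value)
```

With many objects per occurrence the list stays short as long as few distinct people validated them, since each name appears only once.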

  • For not-validated images, identificationReferences has to contain, I guess, some information on the ML used for automatic classification.

When validated, this should be a paper/book. For us it would be the future EcoTaxoGuide. Storing this for each object seems like a waste of bits.

When predicted, the best practices document mentions that it should be a reference to the model. We don't store those and even if we did, they would not guarantee reproducibility.

=> We do not use this field for the moment.

  • associatedMedia is optional but can be filled in for EcoTaxa (url to project+sample)

Giving the links to all images is not realistic. Giving the link to the project is (i) not guaranteed to work forever, (ii) redundant with the link back to EcoTaxa at the level of the whole dataset.

=> We do not use this field for the moment.