SpeciesFileGroup / taxonworks_doc

TaxonWorks (https://taxonworks.org) documentation.
https://docs.taxonworks.org

As a curator I want to know the best practices for asserting "unknowable" #36

Open mjy opened 3 years ago

mjy commented 3 years ago

At present the semantics are to assign a Confidence with the definition along the lines of "I am confident and assert that this attribute on this instance of this class is unknowable". Specific confidence levels that extend this concept to add "why?" are possible, for example:

It is perhaps best to use the fewest possible reasons for why something is unknowable, as it is highly doubtful that curating to a finer granularity will actually result in meaningful broader data integration etc. The principle is: minimize the amount of downstream re-interpretation you are forcing people to do. Downstream consumers of your assertions (e.g. scientists doing science with your data) are going to operate on a few boolean decisions as to whether or not the data are useful for their needs.

debpaul commented 3 years ago

Hm. See if this paper helps with documenting (unambiguously) what is meant by "unknown." Note that #DiSSCo folks are thinking hard about this and want to standardize use of "unknown" across their network if possible. See

Quentin Groom, Mathias Dillen, Helen Hardy, Sarah Phillips, Luc Willemse, Zhengzhe Wu, Improved standardization of transcribed digital specimen data, Database, Volume 2019, 2019, baz129, https://doi.org/10.1093/database/baz129

Table 2 from their paper (regarding unknown and incomplete data):

| Missing data terms | Definition | Example |
| --- | --- | --- |
| unknown | The information is not digitally available. | Empty value in a digital record of unknown provenance |
| unknown:undigitized | The information is not digitally available. No attempt has been made to digitize it. | Empty value in a skeletal record to which data still need to be added from the label |
| unknown:missing | The information is not digitally available. It appeared to be absent during digitization. | A value of S.D. used by transcription platforms to indicate the absence of a date value |
| unknown:indecipherable | The information is not digitally available. It appeared to be present during digitization, but failed to be captured. | An indication made by a transcriber that they failed to transcribe the information |
| known:withheld | The information is digitally available, but it has been withheld by the provider. | A georeferenced record for which coordinate data are available but withheld for conservation considerations |

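
For anyone wanting to consume these terms programmatically, here is a minimal sketch of how the `status:reason` pattern in Table 2 might be parsed and reduced to a single yes/no answer. The names `MissingDataTerm`, `parse_term` and `is_digitally_available` are hypothetical, made up for illustration; they are not part of the paper, DiSSCo, or TaxonWorks. The only thing taken from the source is the list of five terms and their definitions.

```python
from typing import NamedTuple, Optional


class MissingDataTerm(NamedTuple):
    """Hypothetical container for one Table 2 term, split on the colon."""
    status: str             # "unknown" or "known"
    reason: Optional[str]   # e.g. "missing"; None for a bare "unknown"


# The five terms exactly as listed in Table 2 of Groom et al. (2019).
TABLE_2_TERMS = {
    "unknown",
    "unknown:undigitized",
    "unknown:missing",
    "unknown:indecipherable",
    "known:withheld",
}


def parse_term(value: str) -> MissingDataTerm:
    """Split a Table 2 term into status and reason; reject anything else."""
    if value not in TABLE_2_TERMS:
        raise ValueError(f"not a Table 2 term: {value!r}")
    status, _, reason = value.partition(":")
    return MissingDataTerm(status, reason or None)


def is_digitally_available(term: MissingDataTerm) -> bool:
    """Per the Table 2 definitions, only 'known:*' terms are digitally available."""
    return term.status == "known"
```

A consumer would then call something like `is_digitally_available(parse_term("unknown:missing"))` and branch on the result, ignoring the finer-grained reason unless they actually need it.
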
mjy commented 3 years ago

Thanks. All of these are valid assertions, none of these are the assertion of "unknowable" :)

debpaul commented 3 years ago

So, a good one for them to try and add!

> All of these are valid assertions, none of these are the assertion of "unknowable" :)

debpaul commented 3 years ago

Hm. unknown:indecipherable might be why something is "unknowable."

mjy commented 3 years ago

Not the same, I think. That is, the data are present, but computers can't make inferences from them.

I find this somewhat telling. Rather than start with what curators might tell us, and try to get that in the standard, this seems to start with a digital product and its nature. I.e. the most basic assertion a curator on the ground needs is "I cannot do more with this because the physical thing is destroyed". Everything else for them is "bonus".