bio-guoda / preston

a biodiversity dataset tracker
MIT License

why keep track of primary specimen data #49

Closed jhpoelen closed 7 months ago

jhpoelen commented 4 years ago

@dshorthouse via https://github.com/bio-guoda/preston/issues/47#issuecomment-599835705

Playing devil's advocate – why do we care about datasets for primary natural history specimen data? Are these merely a by-product of DwC-A files, a technological response to poor performance when paging through large XML documents? Are datasets for primary specimen data adding value or are they merely artificial, convenient wrappers? A dataset for tertiary or derived data makes sense because it's a citable unit, constructed in support of a publication or sets of publications. But primary specimen data are necessarily volatile. If they remained static, I'd question whether or not there's any activity in the collection or if they have the capacity to respond to user requests. I have yet to see a compelling reason for retaining versions for primary specimen datasets except perhaps to do some digital archaeology and sort out problems.

dshorthouse commented 4 years ago

To be clear, there are of course very good reasons to keep track of primary specimen data – at the level of the specimen. Are there comparable reasons to keep track of primary specimen datasets as a whole – a composite of specimen data – served from those same collections/museums/institutions? What if we lost interest in DwC-A files for whatever reason (e.g., Frictionless Data) and began serving specimen datasets through other structures?

jhpoelen commented 4 years ago

why do we care about datasets for primary natural history specimen data?

I care about datasets (in this case DwC-A zip files) because DwC-A appears to be the main method by which institutions publish and share their data beyond their institutional boundaries. And, if I'd like to discuss particular source data with an institution, I prefer using (and citing!) the original, unaltered data that was shared by that institution.

Are these merely a by-product of DwC-A files, a technological response to poor performance when paging through large XML documents?

I am sure there are many reasons (e.g., technical, social) why DwC-A and its usage turned out the way they did. But perhaps I am not understanding your question. Please elaborate if you feel I didn't address it.

Are datasets for primary specimen data adding value or are they merely artificial, convenient wrappers?

From where I am standing, datasets (in this case DwC-A) are the unit of publication for specimen data records. An analogy might be that an institution publishes a "phonebook" volume of their specimen data at some interval (weekly, monthly, yearly). Individual entries in this phonebook can be referenced as long as the reference to that phonebook is well-defined and unaltered copies can be accessed.
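
To make the analogy concrete, here's a minimal sketch (in Python, not actual Preston code) of looking up a single "phonebook entry": given an unaltered copy of a DwC-A whose sha256 is known, locate one record by its occurrenceID. The core file name occurrence.txt, the tab delimiter, and the occurrenceID column are illustrative assumptions; a real archive declares its structure in meta.xml.

```python
import csv
import hashlib
import io
import zipfile

def find_record(dwca_path, expected_sha256, occurrence_id):
    """Locate a single record inside a DwC-A referenced by its content hash.

    Assumes a core file named 'occurrence.txt' with an 'occurrenceID' column;
    real archives declare their core file and columns in meta.xml.
    """
    with open(dwca_path, "rb") as f:
        content = f.read()
    actual = hashlib.sha256(content).hexdigest()
    if actual != expected_sha256:
        raise ValueError(f"content hash mismatch: expected {expected_sha256}, got {actual}")
    with zipfile.ZipFile(io.BytesIO(content)) as archive:
        with archive.open("occurrence.txt") as core:
            reader = csv.DictReader(io.TextIOWrapper(core, encoding="utf-8"), delimiter="\t")
            for row in reader:
                if row.get("occurrenceID") == occurrence_id:
                    return row
    return None
```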

there are of course very good reasons to keep track of primary specimen data – at the level of the specimen. Are there comparable reasons to keep track of primary specimen datasets as a whole – a composite of specimen data – served from those same collections/museums/institutions?

I can reliably reference a DwC-A version published by an institution. I see no technical limitation to publishing records at the specimen level (e.g., a dataset with a single record), but I notice that current practice is to publish many records at once. Given this, if I'd like to keep track of primary specimen data at the specimen level, I need to keep track of the datasets in which they live.

What if we lost interest in DwC-A files for whatever reason (e.g., Frictionless Data) and began serving specimen datasets through other structures?

If we lost interest in DwC-A, then I imagine that specimen data would be tracked via alternative data formats while keeping their DwC-A ancestors around. This is why a dataset tracking method should be content agnostic, just like git is agnostic to what content is being tracked. Preston offers such a method.
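
To illustrate what "content agnostic" means here, a minimal sketch of a content-addressed store (not Preston's actual implementation): the store only ever sees bytes and their sha256 hash, so it treats a DwC-A zip, a Frictionless Data package, or any future format the same way.

```python
import hashlib
from pathlib import Path

class ContentStore:
    """Minimal content-addressed store: bytes in, hash out. Format agnostic."""

    def __init__(self, root):
        self.root = Path(root)
        self.root.mkdir(parents=True, exist_ok=True)

    def put(self, content: bytes) -> str:
        """Store bytes under their sha256 hash and return a hash URI."""
        sha256 = hashlib.sha256(content).hexdigest()
        (self.root / sha256).write_bytes(content)
        return f"hash://sha256/{sha256}"

    def get(self, hash_uri: str) -> bytes:
        """Retrieve bytes by hash URI, verifying the content on the way out."""
        sha256 = hash_uri.rsplit("/", 1)[-1]
        content = (self.root / sha256).read_bytes()
        assert hashlib.sha256(content).hexdigest() == sha256, "store corrupted"
        return content
```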

dshorthouse commented 4 years ago

How do you mean git is agnostic to what content is being tracked? Is it not the exact opposite? SHA-1 hashes are created from the contents of directories. Change a word in a tracked text file, a bit in an image, metadata in a DwC-A (e.g., contact info), sharpen an OCR file, etc., and the SHA-1 hash changes. If we accept that storage is finite (as is human capacity to make use of versions) and that decisions about what to keep are made in the margins of a budget sheet, is it not better to prescribe why & which changes are mission-critical and worth preserving, and which are less important?

jhpoelen commented 4 years ago

@dshorthouse thanks for your questions. I'll try to address them below:

How do you mean git is agnostic to what content is being tracked? Is it not the exact opposite? SHA-1 hashes are created from the contents of directories. Change a word in a tracked text file, a bit in an image, metadata in a DwC-A (e.g., contact info), sharpen an OCR file, etc., and the SHA-1 hash changes.

git is agnostic to the kind of data that it tracks. Perhaps this is why git's tagline is "the stupid content tracker". And yes, any change, even a single flipped bit or added newline, is recorded because, as you noted, git uses SHA-1 content hashes as opposed to, for instance, semantic hashes.
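
For a plain blob, git's object hash is just the SHA-1 of a short header plus the raw bytes, with no awareness of what those bytes encode; changing a single byte yields an entirely different hash. A minimal sketch:

```python
import hashlib

def git_blob_sha1(content: bytes) -> str:
    """Compute the same SHA-1 that `git hash-object` would for a blob."""
    header = f"blob {len(content)}\0".encode()
    return hashlib.sha1(header + content).hexdigest()

print(git_blob_sha1(b"identifiedBy: Q. Groom\n"))
print(git_blob_sha1(b"identifiedBy: Q. Groom!\n"))  # one extra byte, entirely different hash
```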

If we accept that storage is finite (as is human capacity to make use of versions) and that decisions about what to keep are made in the margins of a budget sheet, is it not better to prescribe why & which changes are mission-critical and worth preserving, and which are less important?

I think the decision of when, what, and where to publish is up to the (dataset) publisher. Similarly, the decision of which dataset to use is up to the consumer of that data. It so happens that our Preston observatories keep track of biodiversity dataset registries like GBIF and iDigBio by taking inventories every month or so. And, by doing this, we've shown that these registered datasets are published using widely varying strategies. The main point, however, is to establish a method to reliably reference datasets and the context (e.g., the data network) they were found in.
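
Conceptually, each inventory pass boils down to statements of the form "at time T, registry R advertised URL U, whose content hashed to H". A minimal sketch of such a record (a simplification for illustration, not Preston's actual provenance format):

```python
import hashlib
from dataclasses import dataclass
from datetime import datetime, timezone

@dataclass(frozen=True)
class InventoryEntry:
    """One observation: registry R pointed at URL U whose content hashed to H at time T."""
    registry: str  # e.g. "GBIF" or "iDigBio"
    url: str       # where the dataset was advertised
    sha256: str    # hash of the bytes actually retrieved
    seen_at: str   # UTC timestamp of the observation

def observe(registry: str, url: str, content: bytes) -> InventoryEntry:
    """Record that this content was found at this URL in this registry, right now."""
    return InventoryEntry(
        registry=registry,
        url=url,
        sha256=hashlib.sha256(content).hexdigest(),
        seen_at=datetime.now(timezone.utc).isoformat(),
    )
```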

Curious to hear any remaining thoughts, questions or comments.

jhpoelen commented 2 years ago

@dshorthouse @qgroom here's an example of why we should keep track of primary specimen data:

https://github.com/jhpoelen/specimen-image-index

and associated discussions:

https://github.com/jhpoelen/specimen-image-index/issues/1 and https://github.com/bio-guoda/preston/issues/168

and associated data publication:

Poelen, Jorrit H., & Groom, Quentin. (2022). Preserved Specimen Records with Still Images Registered Across Biodiversity Data Networks in Period 2019-2022 hash://sha256/da7450941e7179c973a2fe1127718541bca6ccafe0e4e2bfb7f7ca9dbb7adb86 (0.0.1) [Data set]. Zenodo. https://doi.org/10.5281/zenodo.7032574
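
The hash://sha256/... in that citation can be checked independently: anyone holding a copy of the cited content can recompute its sha256 and confirm it is exactly what was cited. A minimal sketch (the local file name is hypothetical):

```python
import hashlib

CITED = "hash://sha256/da7450941e7179c973a2fe1127718541bca6ccafe0e4e2bfb7f7ca9dbb7adb86"

def verify(path: str, cited_uri: str) -> bool:
    """Return True if the file's sha256 matches the cited hash URI."""
    expected = cited_uri.rsplit("/", 1)[-1]
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest() == expected

# e.g. verify("downloaded-copy.zip", CITED)  # file name is hypothetical
```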

with a recent preprint by @mielliott providing some (conceptual/technical) framework:

Elliott, M. J., Poelen, J. H., & Fortes, J. (2022, August 29). Signed Citations: Making Persistent and Verifiable Citations of Digital Scientific Content. https://doi.org/10.31222/osf.io/wycjn

Curious to hear how your thoughts about this have developed over the last couple of years.

qgroom commented 2 years ago

It will be good to get this use case published, and I am looking for other useful applications. It is interesting that you have tagged @dshorthouse, because there are certainly some things we can do with recordedBy and recordedByID. I like the idea of profiling collectors by the observations they make. Using a data streaming approach might be a good way to do this so that it would be easily repeatable.
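
A minimal sketch of that streaming idea, assuming a tab-delimited Darwin Core occurrence file with recordedByID, recordedBy, and scientificName columns: stream the rows once, tally what each collector records, and rerun the same pass against any later snapshot.

```python
import csv
from collections import Counter, defaultdict

def profile_collectors(occurrence_path: str):
    """Stream a tab-delimited occurrence file once, tallying records per collector."""
    records_per_collector = Counter()
    taxa_per_collector = defaultdict(set)
    with open(occurrence_path, encoding="utf-8", newline="") as f:
        for row in csv.DictReader(f, delimiter="\t"):
            collector = row.get("recordedByID") or row.get("recordedBy") or "unknown"
            records_per_collector[collector] += 1
            if row.get("scientificName"):
                taxa_per_collector[collector].add(row["scientificName"])
    return records_per_collector, taxa_per_collector
```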

dshorthouse commented 2 years ago

Apologies for the slow response; I've been stewing on this & @qgroom just triggered a thought. I'd like to see some mechanism to assess how identifiedBy, dateIdentified, and scientificName are being used. An assessment that tracks temporal shifts in combinations of these three terms would offer us a lot of insight into data publisher practices & most certainly would have downstream implications for trustworthiness. So, here are two scenarios that always struck me as problems in either data integrity or communication of intent:

I suppose other combinations of these three terms through many snapshots could be revealing. Do botanists have different practices than do entomologists? What proportion of specimen-based records across all snapshots has had scientificName changed at least once? Are these nomenclatural or taxonomic changes? The unfortunate part of this is the general lack of content in these three terms, which is itself revealing in how we communicate our specimen-based science.
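
A minimal sketch of how such an assessment might start, assuming two snapshots of the same tab-delimited occurrence file keyed by occurrenceID: load the three terms per record and keep the records where any of them changed between snapshots.

```python
import csv

TERMS = ("identifiedBy", "dateIdentified", "scientificName")

def load(path: str) -> dict:
    """Index a tab-delimited occurrence snapshot by occurrenceID."""
    with open(path, encoding="utf-8", newline="") as f:
        return {row["occurrenceID"]: {t: row.get(t, "") for t in TERMS}
                for row in csv.DictReader(f, delimiter="\t")}

def changed_identifications(old_path: str, new_path: str) -> dict:
    """Return {occurrenceID: (old_values, new_values)} where any of the three terms differ."""
    old, new = load(old_path), load(new_path)
    return {oid: (old[oid], new[oid])
            for oid in old.keys() & new.keys()
            if old[oid] != new[oid]}
```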

qgroom commented 2 years ago

@dshorthouse interesting ideas. I wonder how much the results of determination histories would be correlated with the collection management system.

One of the cool things is that one can look at trends over time and, with extrapolation, make predictions about where we are going. Next year we have a project starting in which we have to characterize taxonomic activity in Europe. This will mostly be done with bibliographic databases, but collecting activity is also relevant.

dshorthouse commented 2 years ago

One of the cool things is that one can look at trends over time and, with extrapolation, make predictions about where we are going. Next year we have a project starting in which we have to characterize taxonomic activity in Europe. This will mostly be done with bibliographic databases, but collecting activity is also relevant.

Indeed. "How long does it take for type specimens to be made available on GBIF post-publication, directly from the collections that curate them?"...would be a good question to ask. And secondarily, "How well do the specimen metadata correspond to between the two sources?"

qgroom commented 2 years ago

Indeed. "How long does it take for type specimens to be made available on GBIF post-publication, directly from the collections that curate them?"...would be a good question to ask. And secondarily, "How well do the specimen metadata correspond to between the two sources?"

Good questions, but I wonder if this can be atomized enough that the pinch points can be recognized. I'm guessing that the time is actually quite short when the material citation is mobilised through Plazi.

dshorthouse commented 2 years ago

I'm guessing that the time is actually quite short when the material citation is mobilised through Plazi.

...you mean when the material citation is correctly and completely mobilized through Plazi. I doubt Plazi populates identifiedBy in the data it mobilizes. That would have to be inferred from the authors of the treatment; it's not typically present in the materials examined.

jhpoelen commented 2 years ago

@dshorthouse Yes, to err is human: I would expect mistakes or incomplete references to appear at all stages of data publication (including transcription). And Plazi keeps track of what it mobilizes (see https://github.com/plazi/treatments-xml ), so suspicious records can be traced to their origin and analyzed. I've had some great recent experiences with Plazi folks like @flsimoes and @myrmoteras and collaborators like @ajacsherman, in which the expert Aja found and reported suspicious records and @flsimoes traced their origin and, when possible, applied (versioned) corrections (e.g., https://github.com/jhpoelen/hmw/issues/10 ).

So, yes, incorrect and incomplete citations are expected and it takes a village to help review, find, and address these imperfections.

jhpoelen commented 7 months ago

Closing stale discussion on why keeping digital originals is useful.