cboettig / rfishbase_board

pins board for rfishbase data

provenance of fishbase/sealifebase parquet files #1

Open jhpoelen opened 1 month ago

jhpoelen commented 1 month ago

Hey @cboettig -

Thanks for continuing to provide access to data managed by fishbase/sealifebase.

I noticed that you are now packaging the data as parquet files.

Can you please elaborate on the provenance of these parquet files? Right now, the parquet files are provided, but no method for constructing them is. I am trying to understand the data sources you used to construct these parquet files.

Curious to hear more about your methods... and hope all is well,

-jorrit

cboettig commented 1 month ago

Thanks, definitely. My goal is to have the provenance described in schema.org (or maybe dcat), e.g. like so https://github.com/ropensci/rfishbase/blob/6ebd80e92a93366ce6b159eadd94c5e47f06d31e/inst/prov/fb.prov , but I don't have this working properly yet.

Note that this repo is intended as just one of multiple possible locations to stash those bits and bytes, hopefully the sha hashes in the prov resolve here and to other locations (e.g. with contentid or any similar tool).
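As an illustration of the hash-based lookup described above, here is a minimal sketch of how a content id could be computed for a local file and compared against one recorded in a prov file. The `hash://sha256/...` form follows the identifiers quoted later in this thread; the function names `content_id` and `matches` are hypothetical and are not part of contentid or rfishbase.

```python
import hashlib
from pathlib import Path


def content_id(path, algorithm="sha256", chunk_size=1 << 20):
    """Return a hash URI such as hash://sha256/<hex digest> for the file's bytes."""
    digest = hashlib.new(algorithm)
    with Path(path).open("rb") as handle:
        # Read in chunks so large parquet files do not need to fit in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return f"hash://{algorithm}/{digest.hexdigest()}"


def matches(path, expected_id):
    """Check whether a local copy has the content id recorded in the prov."""
    # "hash://sha256/<hex>" splits into ["hash:", "", "sha256", "<hex>"].
    algorithm = expected_id.split("/")[2]
    return content_id(path, algorithm) == expected_id
```

Because the identifier depends only on the bytes, the same check works no matter which location the file was fetched from.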

Roughly speaking though, the FishBase team provides compressed SQL dumps from their database. We don't have permission to redistribute those in raw form, only to export the tables. Previously this was done as csv (DBI connection -> read data.frame into R -> readr::write_csv()), more recently as parquet (previously via arrow, though I think this whole pipeline can be done quite simply in duckdb now). So on a semantic level, these parquet files are "the same as" the corresponding tables from the FishBase database at the time of the snapshot, with the most salient provenance being that timestamp.
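The table-by-table export described above can be sketched roughly as follows. This is not the actual pipeline, which runs in R: here SQLite stands in for the restored SQL dump and CSV stands in for parquet, purely so the example needs only the Python standard library. The function name `export_tables` is hypothetical.

```python
import csv
import sqlite3
from pathlib import Path


def export_tables(conn, out_dir):
    """Write every user table in the database to <out_dir>/<table>.csv."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    # List user tables from the SQLite catalog, skipping internal ones.
    tables = [row[0] for row in conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' "
        "AND name NOT LIKE 'sqlite_%' ORDER BY name"
    )]
    written = []
    for table in tables:
        # Table names come from the catalog itself, so plain quoting is used.
        cursor = conn.execute(f'SELECT * FROM "{table}"')
        target = out_dir / f"{table}.csv"
        with target.open("w", newline="", encoding="utf-8") as handle:
            writer = csv.writer(handle)
            # Header row from the cursor's column metadata, then all rows.
            writer.writerow([column[0] for column in cursor.description])
            writer.writerows(cursor)
        written.append(target)
    return written
```

Each output file corresponds to exactly one table at the time of the snapshot, which is the sense in which the exported files are "the same as" the database tables.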

You can see in the earlier examples above the prov was being represented as an R script with hash, but in many ways that's possibly not precise enough? e.g. maybe you need to know what version of the arrow package was used. parquet of course supports different compression formats, and arrow libraries can be built with optional support for these, so maybe you also need to know the full provenance of which compression libraries and what versions were used, and so on up to the full docker image. of course the docker image doesn't capture everything either, e.g. iiuc the sha calculations are accelerated with dedicated on-chip hardware on most modern chips these days, so perhaps the full provenance captures these hardware details too. From a scientific standpoint, most researchers would want to resolve the provenance of the scientific statements, which sometimes is and sometimes isn't extractable from the references contained in the data tables themselves....

anyway as you see I've worked myself into a bit of a rhetorical hole about precisely what provenance to trace, so for the moment I have stuck with documenting the parquet files themselves in an effort to minimize friction in maintaining this...

jhpoelen commented 1 month ago

@cboettig thanks for your prompt reply.

  1. you can't distribute the dump, but you can distribute the content id of the dump right? If so, I'd suggest to shove a "derivedFrom" statement in there, ideally with the agent being a software program that took the dump and converted it into parquet.
  2. while I remember parquet being great to work with in 2016/2018, and I imagine that has gotten more support since then, I'd humbly request to also distribute tsv/csv versions along with the parquet. I have a feeling that this format may be a little more approachable for most.

As far as not being precise enough - I agree that hashing the entire universe may require a lot of compute power, however, hashing a little R script as a way to hint to what happened in some derivation would be pretty good in my book.
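To make the "derivedFrom" suggestion concrete, here is a minimal sketch of a provenance record that names the dump, the conversion script, and the output only by content id, so the dump itself never has to be shared. The key names are borrowed loosely from W3C PROV terms and are an assumption; this is not the schema.org/DCAT layout mentioned earlier in the thread, and `derivation_record` is a hypothetical helper.

```python
import hashlib
import json


def content_id(data: bytes) -> str:
    """Return hash://sha256/<hex digest> for the given bytes."""
    return "hash://sha256/" + hashlib.sha256(data).hexdigest()


def derivation_record(dump: bytes, script: bytes, output: bytes) -> dict:
    """Describe one output file as derived from a dump by a script."""
    return {
        "@id": content_id(output),
        # Only the dump's content id is published, never the dump itself.
        "wasDerivedFrom": content_id(dump),
        "wasGeneratedBy": {
            "@type": "Activity",
            "used": content_id(dump),
            # The agent: the script that converted the dump.
            "wasAssociatedWith": content_id(script),
        },
    }


if __name__ == "__main__":
    # Placeholder bytes; real use would read the actual files.
    record = derivation_record(b"<sql dump>", b"<R script>", b"<parquet>")
    print(json.dumps(record, indent=2))
```

Anyone holding the original dump can recompute its content id and confirm it matches the record, without the dump ever being redistributed.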

jhpoelen commented 1 month ago

I'd very much like to add sealifebase/fishbase to Nomer's Corpus of Taxonomic Resources to help pour some digital foundation for all the fun tools that we use - for a recent version see

Poelen, J. H. (ed.). (2024). Nomer Corpus of Taxonomic Resources hash://sha256/b60c0d25a16ae77b24305782017b1a270b79b5d1746f832650f2027ba536e276 hash://md5/17f1363a277ee0e4ecaf1b91c665e47e (0.27) [Data set]. Zenodo. https://doi.org/10.5281/zenodo.12695629