cboettig / contentid

:package: R package for working with Content Identifiers
http://cboettig.github.io/contentid

metadata extension utilities - multipart files #83

Open cboettig opened 2 years ago

cboettig commented 2 years ago

As described in the draft manuscript, a user can combine contentid with any metadata/provenance model, e.g. schema.org or DCAT2, to associate user-friendly names or richer structured descriptions of data objects with the actual content (the content hash), as is now done in the rfishbase and taxadb packages.
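For concreteness, here is a minimal sketch of that pattern. The file name, version string, and metadata structure are purely illustrative; `content_id()` and `store()` are contentid exports:

```r
# Sketch: pairing a schema.org-style metadata record with a content identifier.
# "species.tsv" and the metadata fields below are illustrative assumptions.
library(contentid)
library(jsonlite)

f  <- "species.tsv"
id <- content_id(f)              # e.g. "hash://sha256/9412..."
store(f)                         # optionally cache in the local content store

meta <- list(
  "@context" = "https://schema.org",
  "@type"    = "Dataset",
  name       = "FishBase species table",
  version    = "23.01",
  distribution = list(list(
    "@type"        = "DataDownload",
    identifier     = id,         # content hash instead of a downloadUrl
    encodingFormat = "text/tab-separated-values"
  ))
)
write_json(meta, "species_metadata.json", auto_unbox = TRUE, pretty = TRUE)
```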

This enables workflows in which a user requests "the species table from FishBase" and the software can consult the metadata record to find one or more matches and select, e.g., the most recent version. Note that the metadata record does not include a downloadUrl, since the content address can be 'resolved' to any of a number of locations, including a permanent archive or a previously downloaded copy (which can be cached by contentid's content store). This design pattern provides robust versioning and automatic checksum validation, avoids redundant downloads, and builds on existing metadata standards rather than inventing a new one. Ideally the metadata record could even include identifiers in multiple formats (md5, sha256), giving a further fallback mechanism for locating the data.
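The resolution side might then look roughly like this, again assuming the illustrative metadata layout from the sketch above:

```r
# Sketch of the resolution side: look up the matching record, take its content
# identifier, and let contentid resolve it to a local path (cached copy, local
# store, or a registered remote source), verifying the checksum along the way.
library(contentid)
library(jsonlite)

meta <- read_json("species_metadata.json", simplifyVector = FALSE)
id   <- meta$distribution[[1]]$identifier

path    <- resolve(id, store = TRUE)   # validates the hash, caches the content
species <- read.delim(path)
```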

On the downside, this approach can feel cumbersome to implement with large numbers of files. Additionally, existing metadata standards like schema.org or DCAT lack the expressiveness to handle the case where a single "table" is sharded into multiple "files", a common pattern in tools like arrow::open_dataset(). (The vocabulary can express multiple files, but cannot easily distinguish between alternate serializations of the same data and a single dataset broken into parts, where the filenames or folder structure are sometimes essential as well, e.g. 'hive partitions' in parquet.)
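To make the multi-file case concrete (a sketch using arrow, not contentid code): arrow can write one logical table as many parquet files whose directory layout itself carries meaning, so a flat list of file identifiers loses information needed to reassemble the table.

```r
# One logical table sharded into hive-partitioned parquet files.
library(arrow)

dir <- tempfile("fish_parts_")
write_dataset(mtcars, dir, partitioning = "cyl")
list.files(dir, recursive = TRUE)
# e.g. "cyl=4/part-0.parquet" "cyl=6/part-0.parquet" "cyl=8/part-0.parquet"

# The partition values live in the folder names, not in any single file.
tbl <- dplyr::collect(open_dataset(dir))
```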

Multi-part objects can certainly be zipped into a single object (e.g. with or without bagit), though keeping the files separate is often more desirable (cases where partial access is needed, or where all individual object sizes must stay below a threshold). Grouping these is partly a metadata problem (at least within schema.org/DCAT), but also partly a matter of cumbersome coding. The approach could be considerably facilitated by helper utilities that generate a minimal metadata document for a given list of files, and that generate the code for parsing that metadata to resolve the identifiers back into that file collection. Perhaps these utilities would be a more natural part of prov.
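A rough sketch of what such helpers might look like. Both function names (write_multipart_metadata(), resolve_multipart()) and the metadata layout are hypothetical; only content_id(), store(), and resolve() are contentid exports:

```r
library(contentid)
library(jsonlite)

# Record a content id plus the *relative* path of every file in a collection,
# so folder structure (e.g. hive partitions) can be restored later.
write_multipart_metadata <- function(dir, name, out = "dataset_metadata.json") {
  files <- list.files(dir, recursive = TRUE)
  parts <- lapply(files, function(f) {
    path <- file.path(dir, f)
    store(path)
    list(identifier = content_id(path), relativePath = f)
  })
  meta <- list(name = name, hasPart = parts)
  write_json(meta, out, auto_unbox = TRUE, pretty = TRUE)
  invisible(out)
}

# Resolve each identifier back into a directory tree matching the original
# layout, returning the directory for use with e.g. arrow::open_dataset().
resolve_multipart <- function(metadata_file, dest = tempfile("dataset_")) {
  meta <- read_json(metadata_file, simplifyVector = FALSE)
  for (part in meta$hasPart) {
    target <- file.path(dest, part$relativePath)
    dir.create(dirname(target), recursive = TRUE, showWarnings = FALSE)
    file.copy(resolve(part$identifier, store = TRUE), target)
  }
  dest
}
```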