bio-guoda / preston

a biodiversity dataset tracker
MIT License

support remotes with preston.tar.gz and preston-1.tar.gz archives #285

Closed jhpoelen closed 2 months ago

jhpoelen commented 2 months ago

Currently, Preston allows for discovering resources in tar.gz files on remotes.

For instance, when retrieving the content associated with

hash://sha256/2a5de79372318317a382ea9a2cef069780b852b01210ef59e06b640a3539cb5a

from some remote, Preston digs the content out of

preston-2a.tar.gz

if it exists. The naming convention is preston-[first two characters of the content hash].tar.gz.
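As a rough illustration of the naming convention above, the archive name can be derived from the hash URI like this. This is only a sketch in Python, not Preston's actual code; the function name `archive_name_for` is made up for the example.

```python
def archive_name_for(hash_uri: str) -> str:
    """Return the two-character archive name for a content hash URI.

    Hypothetical helper: mirrors the convention described in this issue,
    not Preston's real implementation.
    """
    # The hex digest is the last path segment of the hash URI,
    # e.g. hash://sha256/2a5d... -> 2a5d...
    hex_digest = hash_uri.rsplit("/", 1)[-1]
    if len(hex_digest) < 2:
        raise ValueError(f"not a usable content hash: {hash_uri!r}")
    # The first two hex characters select the archive.
    return f"preston-{hex_digest[:2]}.tar.gz"


example = archive_name_for(
    "hash://sha256/2a5de79372318317a382ea9a2cef069780b852b01210ef59e06b640a3539cb5a"
)
print(example)
```

For the hash in the example above, this yields preston-2a.tar.gz.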

The reason for this feature is to bundle resources and keep the file count low. For instance, when a remote (like Zenodo) only allows up to 100 files, resources can be bundled into these tarballs.

I suggest also supporting these naming conventions:

preston-2.tar.gz (first content hash character)

as well as

preston.tar.gz (all content to be found in this archive).
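To make the proposal concrete, here is a sketch in Python of the archive names a client could try for a given content hash under all three conventions. The function name `candidate_archive_names` and the lookup order (most specific first) are assumptions for illustration; the issue does not specify an order.

```python
from typing import List


def candidate_archive_names(hash_uri: str) -> List[str]:
    """List archive names that may hold the content for a hash URI.

    Hypothetical helper; the ordering (most specific archive first)
    is an assumption, not something stated in this issue.
    """
    # The hex digest is the last path segment of the hash URI.
    hex_digest = hash_uri.rsplit("/", 1)[-1]
    if not hex_digest:
        raise ValueError(f"not a usable content hash: {hash_uri!r}")
    return [
        f"preston-{hex_digest[:2]}.tar.gz",  # existing: first two characters
        f"preston-{hex_digest[:1]}.tar.gz",  # proposed: first character
        "preston.tar.gz",                    # proposed: single catch-all archive
    ]


names = candidate_archive_names(
    "hash://sha256/2a5de79372318317a382ea9a2cef069780b852b01210ef59e06b640a3539cb5a"
)
print(names)
```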

jhpoelen commented 2 months ago

As of Preston 0.8.5, remotes are queried for "preston.tar.gz" in addition to the "preston-[a-f0-9]{2}.tar.gz" patterns.

For example usage, see

Poelen, J. H. (2024). A biodiversity dataset graph: Biological Associations in TaxonWorks hash://sha256/e4a47c067d6c125da60c9a1b92b5eecdea539cb8666cd3aed99db347ae5b8ed0 hash://md5/686007de79cc2a49ab23fd3debe56e3f (0.3) [Data set]. Zenodo. https://doi.org/10.5281/zenodo.11151783

This enables commands like:

preston clone --remote https://zenodo.org/records/11151783/files

which would use preston.tar.gz to clone the dataset.
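For readers curious how content might be located once such an archive has been downloaded, here is a minimal Python sketch. It does not assume anything about the paths Preston uses inside the archive; instead it hashes every file entry and returns the one whose sha256 matches. The function name `find_content` is hypothetical, and Preston itself may well do this differently (and more efficiently).

```python
import hashlib
import io
import tarfile
from typing import Optional


def find_content(archive_bytes: bytes, hash_uri: str) -> Optional[bytes]:
    """Return the bytes of the archive entry matching a sha256 hash URI.

    Hypothetical helper: scans all entries and verifies by hashing,
    so it makes no assumption about the archive's internal layout.
    """
    prefix = "hash://sha256/"
    if not hash_uri.startswith(prefix):
        raise ValueError("this sketch only handles sha256 hash URIs")
    wanted = hash_uri[len(prefix):]
    with tarfile.open(fileobj=io.BytesIO(archive_bytes), mode="r:gz") as tar:
        for member in tar:
            if not member.isfile():
                continue  # skip directories, links, etc.
            extracted = tar.extractfile(member)
            if extracted is None:
                continue
            data = extracted.read()
            # Content-addressed check: the hash of the bytes is the identifier.
            if hashlib.sha256(data).hexdigest() == wanted:
                return data
    return None
```

Scanning every entry is slow for large archives; it is used here only because it works regardless of how entries are named.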