datalad / datalad-metalad

Next generation metadata handling

(Optionally?) .xz ds- files as well #43

Open yarikoptic opened 4 years ago

yarikoptic commented 4 years ago

Use case: HCP (https://github.com/datalad-datasets/human-connectome-project-openaccess/issues/7), where the majority of metadata comes from ds- files, which the current datalad metadata handling does not compress. Maybe we could also use .gitattributes to assign per-file-pattern configuration on what to compress or not. I am not sure whether metalad's approach to them is different, so maybe this issue is not pertinent, but I wanted to ask.
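To make the per-pattern idea concrete, here is a minimal sketch. It only imitates the "last matching rule wins" behaviour of .gitattributes with Python's `fnmatch`; the rule list, the file names, and the function `should_compress` are all hypothetical and are not part of git, datalad, or metalad.

```python
from fnmatch import fnmatch

# Hypothetical rules, loosely modelled on .gitattributes: (pattern, compress?).
# As in .gitattributes, a later matching rule overrides an earlier one.
RULES = [
    ("*", True),       # compress everything by default
    ("ds-*", False),   # ...but leave dataset-level files uncompressed
]


def should_compress(filename, rules=RULES):
    """Return the setting of the last rule whose pattern matches filename."""
    decision = False  # assumption: no matching rule means "do not compress"
    for pattern, compress in rules:
        if fnmatch(filename, pattern):
            decision = compress
    return decision
```

With these example rules, a name such as `ds-1234` would stay uncompressed, while any other name would be compressed. Dropping the second rule would give the "compress everything" behaviour.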

mih commented 4 years ago

Will need to check. But compression should be the default. Moreover, I'd like to make the format difference between dataset and file metadata as small as possible. Both should be JSON lines. At the moment there can only be a single record at the dataset level, but that is an artificial limitation without a real gain, AFAICS.
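As an illustration of what "JSON lines, compressed by default" could look like, the following sketch writes and reads xz-compressed JSON lines with the Python standard library. The helper names `write_jsonl_xz` and `read_jsonl_xz` are made up for this example and do not reflect how datalad or metalad actually store metadata.

```python
import json
import lzma


def write_jsonl_xz(path, records):
    """Write one JSON document per line into an xz-compressed file."""
    with lzma.open(path, "wt", encoding="utf-8") as f:
        for record in records:
            # sort_keys keeps the serialization stable across runs
            f.write(json.dumps(record, sort_keys=True) + "\n")


def read_jsonl_xz(path):
    """Read all records back from an xz-compressed JSON lines file."""
    with lzma.open(path, "rt", encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]
```

Because the format is one record per line, the same two helpers would serve both a dataset-level file holding several records and a file-level one, which is the kind of uniformity described above.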

yarikoptic commented 4 years ago

> But compression should be the default.

The idea behind ds- files not being compressed was that they should typically be smallish, available with a straight clone of the dataset, and compressed by git. Changes are also monitored directly by git, and that is where

> Both should be JSON lines

might get in the way, since I am not sure how well diff would handle a difference of a few words in long lines. That said, all metadata goes under annex by default now (datalad.metadata.create-aggregate-annex-limit config variable) as of 78e32a4517befc72d7a2743e016ca944c264c95e (0.11.2~10^2), although that is configurable. Just something to keep in mind: a user could tune it, leading to a blown-up .git/objects for those diffs.
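The concern about long lines can be illustrated with a small sketch. It uses Python's `difflib` as a stand-in for a line-based diff; the record contents and the helper `changed_lines` are invented for the example, and it says nothing about how git stores objects internally.

```python
import difflib
import json

# Hypothetical metadata record with one field changed between versions.
old = {f"field{i}": f"value{i}" for i in range(20)}
new = dict(old, field7="changed")


def changed_lines(a, b):
    """Return only the added/removed lines of a unified diff, without headers."""
    diff = difflib.unified_diff(a.splitlines(), b.splitlines(), lineterm="", n=0)
    return [
        line for line in diff
        if line.startswith(("+", "-")) and not line.startswith(("+++ ", "--- "))
    ]


# JSON lines style: the whole record sits on one line, so the diff
# reports the entire record as removed and re-added.
compact = changed_lines(json.dumps(old), json.dumps(new))

# Indented style: one field per line, so only the changed field shows up.
pretty = changed_lines(json.dumps(old, indent=1), json.dumps(new, indent=1))

compact_chars = sum(len(line) for line in compact)
pretty_chars = sum(len(line) for line in pretty)
```

Both diffs report one removed and one added line, but in the single-line form each of those lines carries all twenty fields, so `compact_chars` is far larger than `pretty_chars`.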

mih commented 4 years ago

> The idea behind ds- files not being compressed was that they should typically be smallish, available with a straight clone of the dataset, and compressed by git. Changes are also monitored directly by git, and that is where

Yes, that was the idea. However, it rarely works out, because extra care then has to be exercised that no sensitive information ever leaks into this file, something that is impossible to guarantee.