datalad / datalad-metalad

Next generation metadata handling

Migration tool from metalad <= 0.2.1 metadata to metalad >= 0.3.0 metadata #118

Open christian-monch opened 3 years ago

christian-monch commented 3 years ago

The new version of metalad is not capable of working with metadata from the old version. To continue to use old metadata, it must be converted into the new metalad metadata format, i.e. into a metadata model instance.

This mapping is relatively simple, i.e. all necessary data is available in the old format.

The demand for such a tool is not clear to me.

mih commented 3 years ago

I think the demand clearly exists (crunching through 250TB of data again is not a small task).

The utility for existing datasets that need to be freed from metadata blobs in their main branches is also clear.

Those two aspects might be implemented as two individual pieces, though.

Having a conversion can also disentangle the work on extractors from the work on the storage/logistics.

How useful it would be to keep old metadata in terms of what it can tell is a separate question that can only be answered in the light of a specific metadata source.

Bottom line: If it can be done with reasonable effort, it is a good thing to have.

yarikoptic commented 3 years ago

FWIW: I agree that having a way to migrate metadata would be useful! But as far as datasets.datalad.org goes, I am afraid that I would need to re-extract metadata anyway: only a small portion of that data has metadata extracted, and none of it is really kept up to date ATM. For that, to avoid re-downloading the entire beast, I hope to get back to https://github.com/datalad/datalad-fuse/tree/master/datalad_fuse and make it possible to sparsely fetch only the data blocks actually needed to get metadata extracted. I have yet to check whether that would be feasible for all filetypes (maybe some .gz files would need the entire thing fetched to get to a metadata record in the tail...)

mslw commented 1 year ago

For someone looking (like me) for an example of "old" (<=0.2.1) metadata, @christian-monch suggested the longnow-podcasts dataset that is used as an example in the DataLad handbook. Its .datalad/metadata folder contains the "old" metadata (and, as is evident, they are part of the worktree).
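
In case it helps anyone checking their own datasets, here is a minimal sketch for spotting such "old" metadata in a checked-out dataset. It only relies on the folder location mentioned above (.datalad/metadata in the worktree); the function name `find_old_metadata` is made up for illustration, it does not use any metalad API, and it does not inspect or validate the file contents.

```python
from pathlib import Path


def find_old_metadata(dataset_root):
    """List entries under <dataset_root>/.datalad/metadata.

    Hypothetical helper: it only checks for the folder that holds
    "old" (<= 0.2.1) metadata in the worktree. It does not parse
    the records or confirm which metalad version wrote them.
    """
    metadata_dir = Path(dataset_root) / ".datalad" / "metadata"
    if not metadata_dir.is_dir():
        return []
    # Include symlinks as well, since entries may point to content
    # that is not present locally.
    return sorted(
        p for p in metadata_dir.rglob("*")
        if p.is_file() or p.is_symlink()
    )


if __name__ == "__main__":
    import sys

    root = sys.argv[1] if len(sys.argv) > 1 else "."
    found = find_old_metadata(root)
    if found:
        print(f"{len(found)} entries under .datalad/metadata:")
        for path in found:
            print(f"  {path}")
    else:
        print("No .datalad/metadata folder with content found.")
```

An empty result just means there is no such folder in the worktree at `root`; it says nothing about metadata stored elsewhere.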