dandi / dandi-cli

DANDI command line client to facilitate common operations
https://dandi.readthedocs.io/
Apache License 2.0

organize: maintain mapping to original filenames and checksums #70

Open satra opened 4 years ago

satra commented 4 years ago

a request by labs via @bendichter to know which original file was converted to which file in the dandiset. perhaps organize can upload this with the file/item metadata and dandi cli can look this up from the server.

however in general perhaps we can add a simple provenance file:

<filename1> wasDerivedFrom <old filename> .
<filename1> <sha512_or_some_such> <sha> .

this may be useful to create a manifest file later or for checking against amazon store on download.

at present organize doesn't change checksums. so that notion is simply a flag for now.
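a minimal sketch of how such a provenance file could be generated, assuming organize hands us a mapping from each new path to its original path. the function names, the `sha512` predicate, and the mapping argument are hypothetical and not part of dandi-cli:

```python
import hashlib


def sha512_of(path, chunk_size=1 << 20):
    """Return the hex SHA-512 of a file, reading it in chunks."""
    h = hashlib.sha512()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def provenance_lines(mapping):
    """Build provenance statements from {new_path: old_path}.

    Hypothetical helper: every new_path must exist on disk so that its
    checksum can be computed. Two lines are emitted per file, in the
    "<subject> predicate <object> ." layout proposed above.
    """
    lines = []
    for new, old in sorted(mapping.items()):
        lines.append(f"<{new}> wasDerivedFrom <{old}> .")
        lines.append(f"<{new}> sha512 <{sha512_of(new)}> .")
    return lines
```

writing the result out is then just `"\n".join(provenance_lines(mapping))`. since organize only renames files, the checksum of the new file equals that of the original, so the same lines could later be checked against the store on download.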

yarikoptic commented 4 years ago

ATM organize indeed doesn't change files, and I hope it will never have to. We already upload object id from nwb which could be used to track which file was already uploaded. I guess there could be a dedicated function/command to check for each file if it is already known to an uploaded dandiset. Actually might just need to be an option to dandi ls.

Tracking provenance of changes is a much bigger issue, but since our tools don't change files yet, we aren't even in that realm yet, besides tracking uploads - #43

satra commented 4 years ago

but a name change is still a change, and i think that's what the labs were looking for. till we get labs to use dandiset organization locally, it may be nice to provide such an option to look up the old name.

yarikoptic commented 4 years ago

thanks for the clarification. Indeed, knowing the association would be helpful. I am thinking that adding a dandi diff command would make the most sense here. Filed a dedicated issue: https://github.com/dandi/dandi-cli/issues/72

satra commented 4 years ago

i think some of this is coming from the fact that @bendichter did the conversion and uploaded, so only he has the symlinks to the original dataset. but in general this would be an issue with any member of the lab

yarikoptic commented 4 years ago

well, people in the lab should have access to "organized" version. As long as we do not introduce options which we do not save along with organized dandiset, organize should be idempotent, so anyone in the lab could (re)organize (with symlinks or not), and rerun dandi upload which should not reupload already present files.

Getting back to diff, I think we could provide output such as

$ dandi diff . http://dandi.../000XXX/drafts
path1         path2
./blah1.nwb   sub-0X/sub-0X.....nwb
./blah2.nwb   sub-0Y/sub-0Y.....nwb
./blah3.nwb
              sub-0Z/sub-0Z....nwb

so you could see which files are the same but renamed in the 2nd path, and which are missing here or there.
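A rough sketch of the matching logic behind such a diff, assuming both sides are local directory trees and that files are paired by content checksum. Nothing here is implemented in dandi-cli; `file_digest`, `index_tree` and `diff_trees` are hypothetical names, and comparing against a remote dandiset would need checksums from the server instead of a second local tree:

```python
import hashlib
from pathlib import Path


def file_digest(path, chunk_size=1 << 20):
    """Return the hex SHA-256 of a file, reading it in chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def index_tree(root):
    """Map content digest -> list of relative paths found under root.

    Walks subdirectories recursively; symlinks to files are followed,
    so a symlink-based organized tree is indexed by target content.
    """
    root = Path(root)
    index = {}
    for p in sorted(root.rglob("*")):
        if p.is_file():
            rel = p.relative_to(root).as_posix()
            index.setdefault(file_digest(p), []).append(rel)
    return index


def diff_trees(root1, root2):
    """Return sorted (path1, path2) rows; '' marks a file missing on that side."""
    idx1, idx2 = index_tree(root1), index_tree(root2)
    rows = []
    for digest in set(idx1) | set(idx2):
        left = ", ".join(idx1.get(digest, []))
        right = ", ".join(idx2.get(digest, []))
        rows.append((left, right))
    return sorted(rows)
```

Each row pairs the paths that share the same content, so a renamed file shows up with both columns filled, and a file present on only one side has an empty string in the other column.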

satra commented 4 years ago

that's a good starting point for now. will the local path support a tree, or does it have to be flat?

yarikoptic commented 4 years ago

the sky is the limit! since nothing is actually implemented ;) but it should indeed operate across subdirectories as well

yarikoptic commented 4 years ago

It seems that users are reporting issues which could be mitigated by having this issue addressed. I will slate it for 0.7.0 for now

yarikoptic commented 2 years ago

FWIW