dandi / dandi-cli

DANDI command line client to facilitate common operations
https://dandi.readthedocs.io/
Apache License 2.0

Enhance threaded_walk with exclusion of vcs and other subfolders #1086

Open yarikoptic opened 2 years ago

yarikoptic commented 2 years ago

We have https://github.com/dandi/dandi-cli/blob/master/dandi/support/threaded_walk.py, which is used during zarr uploads. Now that we have DataLad datasets for (some/most) of the zarr folders, we should make it exclude .git, .datalad, and other dot-directories, as we already do in find_files. Ideally the refactoring (RF) would also rework find_files to use threaded_walk with those exclusion features, so we gain a speed-up wherever find_files is used.
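As a rough illustration of the kind of exclusion being requested, here is a minimal single-threaded sketch built on the standard library's `os.walk`. It is not the code of `threaded_walk` or `find_files`; the names `walk_files`, `EXCLUDED_DIRS`, and `exclude_dotdirs` are hypothetical and chosen only for this example.

```python
import os
from pathlib import Path
from typing import Iterator, Union

# Hypothetical exclusion list for this sketch; not taken from dandi-cli.
EXCLUDED_DIRS = {".git", ".datalad"}


def walk_files(
    root: Union[str, Path], exclude_dotdirs: bool = True
) -> Iterator[Path]:
    """Yield files under `root`, skipping VCS and (optionally) dot-directories."""
    for dirpath, dirnames, filenames in os.walk(root):
        # Pruning `dirnames` in place stops os.walk from descending
        # into the excluded directories at all.
        dirnames[:] = [
            d
            for d in dirnames
            if d not in EXCLUDED_DIRS
            and not (exclude_dotdirs and d.startswith("."))
        ]
        for name in filenames:
            # Only directories are filtered; dot-files such as Zarr's
            # `.zarray` metadata are still yielded.
            yield Path(dirpath) / name
```

Pruning at the directory level means the contents of `.git` are never listed, which matters more than filtering results afterwards when the excluded trees are large.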

jwodder commented 2 years ago

@yarikoptic The threaded walk is not used during uploads; it's only used to verify downloads and by dandi digest. Also, the results in https://github.com/con/fscacher/pull/67 indicate that the threaded walk is not an improvement when you're just listing files; its value in Zarr digesting comes from being able to calculate MD5 hashes of files concurrently.
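To make the point about concurrent hashing concrete, here is a minimal sketch of computing MD5 digests for many files in a thread pool. It only illustrates the general technique with the standard library; it is not how dandi-cli or fscacher implement it, and `md5_file`, `md5_many`, and the `threads` parameter are hypothetical names.

```python
import hashlib
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path
from typing import Dict, Iterable, Union

PathLike = Union[str, Path]


def md5_file(path: PathLike, chunk_size: int = 1 << 20) -> str:
    """Return the hex MD5 digest of one file, read in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as fp:
        for chunk in iter(lambda: fp.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def md5_many(paths: Iterable[PathLike], threads: int = 4) -> Dict[PathLike, str]:
    """Hash several files concurrently and map each path to its digest."""
    path_list = list(paths)
    with ThreadPoolExecutor(max_workers=threads) as pool:
        # pool.map preserves input order, so zip pairs each path
        # with its own digest.
        return dict(zip(path_list, pool.map(md5_file, path_list)))
```

Threads can help here because file reads and `hashlib` updates on large buffers release the GIL, whereas a plain directory listing has little per-file work to overlap.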

yarikoptic commented 2 years ago

Hm, interesting... maybe I was under the wrong impression that it does provide some benefits; I will need to recall the details better.