Closed yarikoptic closed 2 years ago
@yarikoptic What was the fscacher-using code you were running? I believe this happens when acting on a directory whose entire hierarchy contains too many files for the fingerprints to be persisted quickly.
It was that dandi digest -d zarr-checksum call.
I was benchmarking on the DANDI Hub (the original description has the path, etc.). That zarr does have some number of files at the lowest level. For a directory, are we adding the entire list of files into the "fingerprint" rather than some "hash value" of it?
@yarikoptic Yes, the fingerprint for a directory consists of a sorted list of fingerprints of all files within it.
We might avoid such a warning and gain some overall speedup if the fingerprint were a hash of that list right away. We could even never store the full list of entries (do we use the actual entries anywhere?) and instead hash incrementally, e.g. hash = md5(f'{hash}{file_hash}'), starting with a file_hash on the first one. It could be some faster hash than md5. But this might be slower than just hashing the full list of entries once, so please check.
@yarikoptic While we don't use the actual entries, they'll need to be sorted when hashing in order for the hash to be deterministic (especially if we go through with threading), so incremental hashing isn't an option.
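To illustrate the ordering concern, here is a minimal sketch (not fscacher's actual code; the function names and the plain-string fingerprints are made up for the example). It compares the chained md5 proposed above with hashing the sorted list once:

```python
import hashlib


def chained_digest(file_hashes):
    """Chain hashes as hash = md5(f'{hash}{file_hash}'), starting from the first entry."""
    it = iter(file_hashes)
    digest = next(it, "")  # empty input yields an empty digest in this sketch
    for file_hash in it:
        digest = hashlib.md5(f"{digest}{file_hash}".encode()).hexdigest()
    return digest


def sorted_digest(file_hashes):
    """Sort the entries first, then feed them to a single md5 object."""
    h = hashlib.md5()
    for file_hash in sorted(file_hashes):
        h.update(file_hash.encode())
        h.update(b"\0")  # separator so "ab" + "c" differs from "a" + "bc"
    return h.hexdigest()


# Hypothetical per-file fingerprints, visited in two different orders
order_a = ["aaa", "bbb", "ccc"]
order_b = ["ccc", "aaa", "bbb"]

print(chained_digest(order_a) == chained_digest(order_b))
print(sorted_digest(order_a) == sorted_digest(order_b))
```

The chained variant depends on the order in which files are visited, which is what makes it unsuitable if the traversal order is not fixed. The sorted variant is deterministic but needs all entries collected before hashing.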
So I decided to look for an "order independent hash" and got to https://stackoverflow.com/questions/30734848/order-independent-hash-algorithm (Java-oriented), which pretty much boils down to this: we can use the XOR operation (should be quick!) to incrementally grow the hash so that it is independent of the order!
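A rough sketch of the XOR idea in Python, under the assumption that each entry is first hashed on its own (md5 here only because it was mentioned above) and the digests are then XORed together. The function name and the string entries are hypothetical:

```python
import hashlib


def xor_digest(entries):
    """Combine per-entry md5 digests with XOR, so the visiting order does not matter."""
    acc = 0
    for entry in entries:
        # Hash each entry individually, then fold it into the accumulator
        acc ^= int.from_bytes(hashlib.md5(entry.encode()).digest(), "big")
    return f"{acc:032x}"  # 128-bit value as 32 hex characters


# Same hypothetical entries, visited in different orders
print(xor_digest(["aaa", "bbb", "ccc"]) == xor_digest(["ccc", "aaa", "bbb"]))
```

One caveat with plain XOR: two identical entries cancel each other out, so each entry would need to include something unique (such as the file path) for the combined value to be meaningful. Whether that holds for the fingerprints here would need checking.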
Trying on the #67 state of things,
and I'm not sure why it took so long, since all args here should be quite concise AFAIK (path). This needs to be investigated so that it can be either addressed or sped up.