con / fscacher

Caching results of operations on heavy file trees

from joblib: UserWarning: Persisting input arguments took XXX to run. #68

Closed yarikoptic closed 2 years ago

yarikoptic commented 2 years ago
(dandi-devel) jovyan@jupyter-yarikoptic:/shared/ngffdata$ time dandi digest -d zarr-checksum test64.ngff/0/0/0/

/home/jovyan/fscacher/src/fscacher/cache.py:138: UserWarning: Persisting input arguments took 0.53s to run.
If this happens often in your code, it can cause performance problems 
(results will be correct in all cases). 
The reason for this is probably some large input arguments for a wrapped
 function (e.g. large strings).
THIS IS A JOBLIB ISSUE. If you can, kindly provide the joblib's team with an
 example so that they can fix the problem.
  ret = fingerprinted(*args, **kwargs_)
/home/jovyan/fscacher/src/fscacher/cache.py:138: UserWarning: Persisting input arguments took 1.30s to run.
If this happens often in your code, it can cause performance problems 
(results will be correct in all cases). 
The reason for this is probably some large input arguments for a wrapped
 function (e.g. large strings).
THIS IS A JOBLIB ISSUE. If you can, kindly provide the joblib's team with an
 example so that they can fix the problem.
  ret = fingerprinted(*args, **kwargs_)
/home/jovyan/fscacher/src/fscacher/cache.py:138: UserWarning: Persisting input arguments took 0.60s to run.
If this happens often in your code, it can cause performance problems 
(results will be correct in all cases). 
The reason for this is probably some large input arguments for a wrapped
 function (e.g. large strings).
THIS IS A JOBLIB ISSUE. If you can, kindly provide the joblib's team with an
 example so that they can fix the problem.
  ret = fingerprinted(*args, **kwargs_)
... still running

Trying this on the state of things as of #67:

jovyan@jupyter-yarikoptic:~/fscacher/src/fscacher$ nl -ba cache.py | grep -3 138
   135                      + fprint.to_tuple()
   136                      + (tuple(self._tokens) if self._tokens else ())
   137                  )
   138                  ret = fingerprinted(*args, **kwargs_)
   139              lgr.log(1, "Returning value %r", ret)
   140              return ret
   141

jovyan@jupyter-yarikoptic:~/fscacher/src/fscacher$ git describe
0.1.6-17-g7ec27c1

jovyan@jupyter-yarikoptic:~/fscacher/src/fscacher$ git branch
* gh-66
  master

I am not sure why it took so long, since all the arguments here should be quite concise AFAIK (a path). This needs to be investigated and then either addressed or sped up.

jwodder commented 2 years ago

@yarikoptic What was the fscacher-using code you were running? I believe this happens when acting on a directory whose entire hierarchy contains too many files for the fingerprints to be persisted quickly.

yarikoptic commented 2 years ago

It was that dandi digest -d zarr-checksum invocation I was benchmarking on DANDI Hub (the original description has the path etc.). That Zarr does have a fair number of files at the lowest level. For a directory, are we adding the entire list of files into the "fingerprint" rather than some "hash value" of it?

jwodder commented 2 years ago

@yarikoptic Yes, the fingerprint for a directory consists of a sorted list of fingerprints of all files within it.
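Schematically, the shape of the problem is something like this (a hypothetical sketch, not fscacher's actual code; fingerprint_file and fingerprint_dir are made-up names, and the contents of the per-file tuple are an assumption):

import os

def fingerprint_file(path):
    # Hypothetical per-file fingerprint: (path, mtime, size)
    st = os.stat(path)
    return (path, st.st_mtime, st.st_size)

def fingerprint_dir(dirpath):
    # One tuple per file in the whole hierarchy, sorted for determinism.
    # For a tree with N files this is an O(N)-sized object that joblib
    # has to serialize every time the cached function is called.
    fprints = []
    for root, _dirs, files in os.walk(dirpath):
        for name in files:
            fprints.append(fingerprint_file(os.path.join(root, name)))
    return tuple(sorted(fprints))

With many thousands of files, joblib ends up pickling that whole tuple just to build the cache key, which is where the "Persisting input arguments" time goes.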

yarikoptic commented 2 years ago

We might avoid such a warning and gain some overall speed-up if the fingerprint were some hash of that list right away, or even if we never stored the full list of entries (do we use the actual entries anywhere?) and instead hashed it incrementally, e.g. hash = md5(f'{hash}{file_hash}'), starting with the file_hash of the first entry. It could be some hash faster than MD5. But this might be slower than just hashing the full list of entries once -- please check.
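Roughly along these lines (just a sketch of the idea as stated above, not a tested patch; file_hashes stands for whatever per-entry digests we would feed in):

from hashlib import md5

def incremental_digest(file_hashes):
    # Fold the per-file hashes into one running digest, seeded with
    # the first file's hash, instead of keeping the whole list around.
    digest = None
    for file_hash in file_hashes:
        if digest is None:
            digest = str(file_hash)
        else:
            digest = md5(f"{digest}{file_hash}".encode()).hexdigest()
    return digest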

jwodder commented 2 years ago

@yarikoptic While we don't use the actual entries, they'll need to be sorted when hashing in order for the hash to be deterministic (especially if we go through with threading), so incremental hashing isn't an option.
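The order-dependence of such chaining is easy to see:

from hashlib import md5

# Same two entries, different order -> different digest, so an
# unsorted (e.g. threaded) traversal would not be deterministic.
assert md5(b"a" + b"b").hexdigest() != md5(b"b" + b"a").hexdigest()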

yarikoptic commented 2 years ago

So I decided to look for an "order-independent hash" and got to https://stackoverflow.com/questions/30734848/order-independent-hash-algorithm (Java-oriented), which pretty much boils down to this: we can use the XOR operation (should be quick!) to incrementally grow the hash so that it would be independent of the order!
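A quick sketch of what that could look like here (assuming each entry's fingerprint is serialized and hashed individually; the function name is made up):

from hashlib import md5

def order_independent_digest(fprints):
    # XOR the per-entry digests together; XOR is commutative and
    # associative, so the traversal order does not matter.
    acc = 0
    for fp in fprints:
        acc ^= int.from_bytes(md5(str(fp).encode()).digest(), "big")
    return f"{acc:032x}"

# Same digest regardless of the order entries are visited in:
assert order_independent_digest(["a", "b"]) == order_independent_digest(["b", "a"])

One caveat of plain XOR combining is that identical entries cancel each other out in pairs, though per-file fingerprints that include the path should all be distinct.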