At the moment, CachingFileSystem has a single boolean option, same_names, that switches the file layout from /hash to /url-filename, which leaves no room for "improvement":
Under heavy use of the cache, a flat tree of files (/hash or /url-filename based) could produce a very large directory, and the file system could become inefficient at listing that directory, etc.
A common workaround (look under .git/objects; git-annex, girder, etc. use the same approach) is to establish leading directories: e.g. for a /hash, the path to the file could be /hash[:2]/hash[2:4]/hash[4:], which reduces the number of entries per directory and thus the impact on the file system.
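To make the hashtree idea concrete, here is a minimal sketch. The function name hashtree_path is hypothetical, and it assumes the cache key is the SHA-256 hex digest of the URL; the actual hashing used by CachingFileSystem may differ.

```python
import hashlib
import posixpath


def hashtree_path(url: str) -> str:
    """Return a nested relative cache path for ``url`` (hypothetical helper).

    Assumption: the cache key is the SHA-256 hex digest of the URL.
    """
    digest = hashlib.sha256(url.encode()).hexdigest()
    # Split the digest as hash[:2]/hash[2:4]/hash[4:], so no single
    # directory has to hold every cached file.
    return posixpath.join(digest[:2], digest[2:4], digest[4:])


if __name__ == "__main__":
    print(hashtree_path("http://domain/p1/p2/filename"))
```

With two hex characters per level, each of the two leading levels has at most 256 subdirectories, so the files are spread over up to 65536 leaf directories.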
For a URL-based path, it could simply be a path constructed from the URI components: e.g. the URL http://domain/p1/p2/filename could become the path http/domain/p1/p2/filename. That would disambiguate between file systems etc., and also avoid conflicts between files sharing a common filename (as I guess happens now with same_names=True).
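A rough sketch of how such a URL-derived path could be built with the standard library follows. The function name url_fullpath is hypothetical, and the handling of query strings and ports is an assumption (the query is simply ignored here, and a port stays part of the host component).

```python
import posixpath
from urllib.parse import urlsplit


def url_fullpath(url: str) -> str:
    """Map a URL to a relative cache path (hypothetical helper).

    Assumption: query string and fragment are ignored.
    """
    parts = urlsplit(url)
    # Drop empty, "." and ".." segments so the result cannot escape
    # the cache directory.
    segments = [s for s in parts.path.split("/") if s and s not in (".", "..")]
    return posixpath.join(parts.scheme, parts.netloc, *segments)


if __name__ == "__main__":
    print(url_fullpath("http://domain/p1/p2/filename"))
```

Two URLs that end in the same filename but differ in scheme, host, or directory land in different places, which is the conflict that a filename-only layout cannot avoid.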
With the above in mind, I think it would be nice if, instead of same_names, there were a layout={hash,hashtree,url_filename,url_fullpath} option or similar, allowing users to switch to the most appropriate layout for their use case.