KeijiBranshi opened 1 week ago
> `Cache` class design: Is this the best way to encapsulate the caching logic? Are there alternative approaches?

Makes sense to me, and it saves us from passing `output_dir` to each method called.
> Flat directory structure: Is this the most user-friendly approach for direct downloads? Should we consider other directory structures?

I would follow the established pattern: torchtune integrates with Hugging Face, so I would follow what they do.
> Error handling: How should errors related to `output_dir` (e.g., permissions issues) be handled?

We should bubble up permission errors (read-only directory, non-existent folder) to the caller so they can troubleshoot (create the missing directory, chmod it, etc.).
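A minimal sketch of that behaviour, assuming a hypothetical `_prepare_output_dir` helper called before the download starts; the point is simply that permission problems are surfaced to the caller rather than swallowed:

```python
import os


def _prepare_output_dir(output_dir: str) -> str:
    """Hypothetical helper: create output_dir if needed and check it is writable.

    PermissionError / OSError are deliberately allowed to propagate so the caller
    can fix the problem (create the missing directory, chmod it, pick another path).
    """
    os.makedirs(output_dir, exist_ok=True)  # raises OSError/PermissionError on failure
    if not os.access(output_dir, os.W_OK):
        raise PermissionError(f"output_dir {output_dir!r} exists but is not writable")
    return output_dir
```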
> Interaction with existing cache: If `output_dir` is specified, should we completely bypass the cache or should we consider using it as a secondary storage location? How might we handle cache invalidation in such a case?

If the user passes `output_dir`, kagglehub should not read from the default cache; we should perform a fresh download.
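In user-facing terms, the intended behaviour could look like the following sketch (the model handle is only an example, and the `output_dir` parameter is the one proposed here, not an existing kagglehub argument):

```python
import kagglehub

# Current behaviour: download into the kagglehub cache (or reuse a cached copy).
cached_path = kagglehub.model_download("google/gemma/pyTorch/2b")

# Proposed behaviour: skip the default cache entirely and perform a fresh download
# straight into the given directory, using a flat layout.
local_path = kagglehub.model_download("google/gemma/pyTorch/2b", output_dir="./models/gemma-2b")
```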
> Naming: I chose `output_dir` to align with torchtune. But `download_path` was suggested in https://github.com/Kaggle/kagglehub/issues/175. This may or may not make the intention clearer.

I like `output_dir`. If we ever decide to add support for it in the Kaggle/Colab cache (we would mount the cached files at the `output_dir` location), then `output_dir` makes more sense than `download_path`.
Proposal: Adding `output_dir` to `ModelHttpResolver` for Direct Downloading

This proposal outlines a plan to add an `output_dir` argument to the `__call__` method of the `ModelHttpResolver` class. This new argument will allow users to download models directly to a specified directory, bypassing the default caching mechanism and using a flat directory structure.

Motivation:
This feature is primarily driven by user feedback (https://github.com/Kaggle/kagglehub/issues/175) and the need for better integration between kagglehub and other libraries, specifically torchtune (see https://github.com/pytorch/torchtune/issues/1852).

Proposed Changes:
1. Introduce a `Cache` class: This class will encapsulate the caching logic from `cache.py`, providing a cleaner interface and enabling the flexible directory structure based on the `override_dir` argument (a rough sketch of all four changes follows this list).
2. Modify `ModelHttpResolver.__call__`: Add the `output_dir` argument and instantiate the `Cache` class based on its value.
3. Pass `output_dir` through `model_download`.
4. Handle `output_dir` in other Resolvers: Add warning logs to other resolvers to inform users that the `output_dir` argument is not supported and will be ignored.
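To make the pieces above concrete, here is a rough sketch of how they could fit together. It is only an illustration of the proposal, not the actual kagglehub internals: the default cache path, the resolver signatures, the `SomeOtherResolver` stand-in, and the `load_from_cache` placeholder are assumptions; only `Cache`, `override_dir`, `output_dir`, `ModelHttpResolver.__call__`, and `model_download` are names taken from the proposal itself.

```python
import logging
import os
from typing import Optional

logger = logging.getLogger(__name__)

# Assumed default; the real cache root is resolved by the existing logic in cache.py.
DEFAULT_CACHE_DIR = os.path.expanduser("~/.cache/kagglehub")


class Cache:
    """Sketch of the proposed Cache class wrapping the logic currently in cache.py."""

    def __init__(self, override_dir: Optional[str] = None) -> None:
        self.override_dir = override_dir

    def get_path(self, handle: str) -> str:
        if self.override_dir is not None:
            # Flat structure: files go straight into the user-supplied directory.
            return self.override_dir
        # Placeholder for the existing nested cache layout.
        return os.path.join(DEFAULT_CACHE_DIR, "models", handle)

    def load_from_cache(self, handle: str) -> Optional[str]:
        if self.override_dir is not None:
            # Direct downloads bypass the default cache entirely.
            return None
        return None  # placeholder for the existing cache-lookup logic


class ModelHttpResolver:
    def __call__(self, handle: str, output_dir: Optional[str] = None) -> str:
        cache = Cache(override_dir=output_dir)
        cached = cache.load_from_cache(handle)
        if cached is not None:
            return cached
        out_path = cache.get_path(handle)
        os.makedirs(out_path, exist_ok=True)  # permission errors propagate to the caller
        # ... existing HTTP download logic would write the model files under out_path ...
        return out_path


class SomeOtherResolver:
    """Stand-in for resolvers (e.g. the Kaggle/Colab cache resolvers) that cannot honor output_dir."""

    def __call__(self, handle: str, output_dir: Optional[str] = None) -> str:
        if output_dir is not None:
            logger.warning("output_dir is not supported by this resolver and will be ignored.")
        # ... existing resolution logic ...
        return ""  # placeholder


def model_download(handle: str, output_dir: Optional[str] = None) -> str:
    """Illustration of passing output_dir straight through to the resolver."""
    return ModelHttpResolver()(handle, output_dir=output_dir)
```

Whether the `Cache` object is constructed inside the resolver or injected from `model_download` is one of the open design questions listed under Request for Feedback below.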
Advantages:

- The `Cache` class discourages use of global variables, promoting better maintainability and extensibility.

Request for Feedback:
This proposal outlines one possible implementation. Any feedback on alternative implementations is appreciated. Some other passing thoughts include:
- `Cache` class design: Is this the best way to encapsulate the caching logic? Are there alternative approaches?
- Flat directory structure: Is this the most user-friendly approach for direct downloads? Should we consider other directory structures?
- Error handling: How should errors related to `output_dir` (e.g., permissions issues) be handled?
- Interaction with existing cache: If `output_dir` is specified, should we completely bypass the cache or should we consider using it as a secondary storage location? How might we handle cache invalidation in such a case?
- Naming: I chose `output_dir` to align with torchtune. But `download_path` was suggested in https://github.com/Kaggle/kagglehub/issues/175. This may or may not make the intention clearer.