KeijiBranshi opened 1 week ago
> `Cache` class design: Is this the best way to encapsulate the caching logic? Are there alternative approaches?

Makes sense to me, and it saves us from passing `output_dir` to each method called.
> Flat directory structure: Is this the most user-friendly approach for direct downloads? Should we consider other directory structures?

I would follow the established pattern: torchtune integrates with Hugging Face, so I would follow what they do.
> Error handling: How should errors related to `output_dir` (e.g., permissions issues) be handled?

We should bubble up permission errors (read-only directory, non-existent folder) to the caller so they can troubleshoot (create the missing directory, chmod it, etc.).
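A minimal sketch of that behaviour, assuming a hypothetical `_prepare_output_dir` helper called before the download starts; the point is simply that permission problems are surfaced to the caller rather than swallowed:

```python
import os


def _prepare_output_dir(output_dir: str) -> str:
    """Hypothetical helper: create output_dir if needed and check it is writable.

    PermissionError / OSError are deliberately allowed to propagate so the caller
    can fix the problem (create the missing directory, chmod it, pick another path).
    """
    os.makedirs(output_dir, exist_ok=True)  # raises OSError/PermissionError on failure
    if not os.access(output_dir, os.W_OK):
        raise PermissionError(f"output_dir {output_dir!r} exists but is not writable")
    return output_dir
```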
> Interaction with existing cache: If `output_dir` is specified, should we completely bypass the cache or should we consider using it as a secondary storage location? How might we handle cache invalidation in such a case?

If the user passes `output_dir`, kagglehub should not read from the default cache; we should perform a fresh download.
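In user-facing terms, the intended behaviour could look like the following sketch (the model handle is only an example, and the `output_dir` parameter is the one proposed here, not an existing kagglehub argument):

```python
import kagglehub

# Current behaviour: download into the kagglehub cache (or reuse a cached copy).
cached_path = kagglehub.model_download("google/gemma/pyTorch/2b")

# Proposed behaviour: skip the default cache entirely and perform a fresh download
# straight into the given directory, using a flat layout.
local_path = kagglehub.model_download("google/gemma/pyTorch/2b", output_dir="./models/gemma-2b")
```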
> Naming: I chose `output_dir` to align with torchtune. But `download_path` was suggested in https://github.com/Kaggle/kagglehub/issues/175. This may or may not make the intention clearer.

I like `output_dir`. If we ever decide to add support for it in the Kaggle/Colab cache (we would mount the cached files at the `output_dir` location), then `output_dir` makes more sense than `download_path`.
Proposal: Adding `output_dir` to `ModelHttpResolver` for Direct Downloading

This proposal outlines a plan to add an `output_dir` argument to the `__call__` method of the `ModelHttpResolver` class. This new argument will allow users to download models directly to a specified directory, bypassing the default caching mechanism and using a flat directory structure.

Motivation:
This feature is primarily driven by user feedback (https://github.com/Kaggle/kagglehub/issues/175) and the need for better integration between kagglehub and other libraries, specifically torchtune (see https://github.com/pytorch/torchtune/issues/1852).

Proposed Changes:
1. Introduce a `Cache` class: This class will encapsulate the caching logic from `cache.py`, providing a cleaner interface and enabling the flexible directory structure based on the `override_dir` argument (a rough sketch of all four changes follows this list).
2. Modify `ModelHttpResolver.__call__`: Add the `output_dir` argument and instantiate the `Cache` class based on its value.
3. Pass `output_dir` through `model_download`.
4. Handle `output_dir` in other Resolvers: Add warning logs to other resolvers to inform users that the `output_dir` argument is not supported and will be ignored.
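To make the pieces above concrete, here is a rough sketch of how they could fit together. It is only an illustration of the proposal, not the actual kagglehub internals: the default cache path, the resolver signatures, the `SomeOtherResolver` stand-in, and the `load_from_cache` placeholder are assumptions; only `Cache`, `override_dir`, `output_dir`, `ModelHttpResolver.__call__`, and `model_download` are names taken from the proposal itself.

```python
import logging
import os
from typing import Optional

logger = logging.getLogger(__name__)

# Assumed default; the real cache root is resolved by the existing logic in cache.py.
DEFAULT_CACHE_DIR = os.path.expanduser("~/.cache/kagglehub")


class Cache:
    """Sketch of the proposed Cache class wrapping the logic currently in cache.py."""

    def __init__(self, override_dir: Optional[str] = None) -> None:
        self.override_dir = override_dir

    def get_path(self, handle: str) -> str:
        if self.override_dir is not None:
            # Flat structure: files go straight into the user-supplied directory.
            return self.override_dir
        # Placeholder for the existing nested cache layout.
        return os.path.join(DEFAULT_CACHE_DIR, "models", handle)

    def load_from_cache(self, handle: str) -> Optional[str]:
        if self.override_dir is not None:
            # Direct downloads bypass the default cache entirely.
            return None
        return None  # placeholder for the existing cache-lookup logic


class ModelHttpResolver:
    def __call__(self, handle: str, output_dir: Optional[str] = None) -> str:
        cache = Cache(override_dir=output_dir)
        cached = cache.load_from_cache(handle)
        if cached is not None:
            return cached
        out_path = cache.get_path(handle)
        os.makedirs(out_path, exist_ok=True)  # permission errors propagate to the caller
        # ... existing HTTP download logic would write the model files under out_path ...
        return out_path


class SomeOtherResolver:
    """Stand-in for resolvers (e.g. the Kaggle/Colab cache resolvers) that cannot honor output_dir."""

    def __call__(self, handle: str, output_dir: Optional[str] = None) -> str:
        if output_dir is not None:
            logger.warning("output_dir is not supported by this resolver and will be ignored.")
        # ... existing resolution logic ...
        return ""  # placeholder


def model_download(handle: str, output_dir: Optional[str] = None) -> str:
    """Illustration of passing output_dir straight through to the resolver."""
    return ModelHttpResolver()(handle, output_dir=output_dir)
```

Whether the `Cache` object is constructed inside the resolver or injected from `model_download` is one of the open design questions listed under Request for Feedback below.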
Advantages:

- The `Cache` class discourages use of global variables, promoting better maintainability and extensibility.

Request for Feedback:
This proposal outlines one possible implementation. Any feedback on alternative implementations is appreciated. Some other passing thoughts include:
- `Cache` class design: Is this the best way to encapsulate the caching logic? Are there alternative approaches?
- Flat directory structure: Is this the most user-friendly approach for direct downloads? Should we consider other directory structures?
- Error handling: How should errors related to `output_dir` (e.g., permissions issues) be handled?
- Interaction with existing cache: If `output_dir` is specified, should we completely bypass the cache or should we consider using it as a secondary storage location? How might we handle cache invalidation in such a case?
- Naming: I chose `output_dir` to align with torchtune. But `download_path` was suggested in https://github.com/Kaggle/kagglehub/issues/175. This may or may not make the intention clearer.