Kaggle / kagglehub

Python library to access Kaggle resources
Apache License 2.0
93 stars 18 forks source link

[RFC] Design proposal for `model_download` custom download path #179

Open KeijiBranshi opened 1 week ago

KeijiBranshi commented 1 week ago

Proposal: Adding output_dir to ModelHttpResolver for Direct Downloading

This proposal outlines a plan to add an output_dir argument to the __call__ method of the ModelHttpResolver class. This new argument will allow users to download models directly to a specified directory, bypassing the default caching mechanism and using a flat directory structure.

Motivation:

This feature is primarily driven by user-feedback (https://github.com/Kaggle/kagglehub/issues/175) and the need for better integration between kagglehub and other libraries, specifically torchtune (see https://github.com/pytorch/torchtune/issues/1852).

NOTE: torchtune currently uses huggingface_hub for model downloading, leveraging its snapshot_download function and its local_dir argument for specifying custom download locations.

Proposed Changes:

  1. Introduce a Cache class:

    This class will encapsulate the caching logic from cache.py, providing a cleaner interface and enabling the flexible directory structure based on the override_dir argument.

    class Cache:
       def __init__(self, override_dir: Optional[str] = None):
           self.override_dir = override_dir
    
       def get_path(self, handle: ResourceHandle, path: Optional[str] = None) -> str:
           if self.override_dir:
               return os.path.join(self.override_dir, path or "")  # Flat structure
           else:
               return get_cached_path(handle, path)  # Existing nested structure
    
       def get_archive_path(self, handle: ResourceHandle) -> str:
           if self.override_dir:
               return os.path.join(self.override_dir, f"{handle.version!s}.archive")
           else:
               return get_cached_archive_path(handle)
    
       def get_completion_marker_filepath(self, handle: ResourceHandle, path: Optional[str] = None) -> str:
           if self.override_dir:
                return os.path.join(self.override_dir, f"{path or handle.version!s}.complete")
           else:
               return _get_completion_marker_filepath(handle, path)
    
       def load_from_cache(self, handle: ResourceHandle, path: Optional[str] = None) -> Optional[str]:
           """Return path for the requested resource from the cache or output_dir."""
           marker_path = self.get_completion_marker_filepath(handle, path)
           full_path = self.get_path(handle, path)
           return full_path if os.path.exists(marker_path) and os.path.exists(full_path) else None
    
       # Migrate other relevant methods from cache.py (e.g., mark_as_complete, delete_from_cache)  to the Cache class, adapting them similarly.
  2. Modify ModelHttpResolver.__call__:

    Add the output_dir argument and instantiate the Cache class based on its value.

    def __call__(
       self, h: ModelHandle, path: Optional[str] = None, *, force_download: Optional[bool] = False, output_dir: Optional[str] = None
    ) -> str:
       cache = Cache(override_dir=output_dir)
       # ...use cache.get_path(), cache.get_archive_path(), cache.load_from_cache(), and cache.get_completion_marker_filepath()...
  3. Pass output_dir through model_download:

    def model_download(
       # ...existing arguments
       output_dir: Optional[str] = None,
    
    ):
       # Existing code...
       return registry.model_resolver(h, path, force_download=force_download, output_dir=output_dir)
  4. Handle output_dir in other Resolvers:

    Add warning logs to other resolvers to inform users that the output_dir argument is not supported and will be ignored.

Advantages:

Request for Feedback:

This proposal outlines one possible implementation. Any feedback on alternative implementations is appreciated. Some other passing thoughts include:

rosbo commented 1 week ago

Cache class design: Is this the best way to encapsulate the caching logic? Are there alternative approaches?

Makes sense to me and saves us from passing the output_dir to each method called.

Flat directory structure: Is this the most user-friendly approach for direct downloads? Should we consider other directory structures?

I would follow established pattern. torchtune integrates with HF. I would follow what they do.

Error handling: How should errors related to output_dir (e.g., permissions issues) be handled?

We should bubble up permission errors (readonly directory, non-existing folder) to the caller so they can troubleshoot (create missing dir, chmod, etc.).

Interaction with existing cache: If output_dir is specified, should we completely bypass the cache or should we consider using it as a secondary storage location? How might we handle cache invalidation in such a case?

If the user passes output_dir, it should not read from the default cache. We should perform a fresh download.

Naming: I chose output_dir to align with torchtune. But download_path was suggested in https://github.com/Kaggle/kagglehub/issues/175. This may or may not make the intention clearer.

I like output_dir, if we ever decide to add support for it for the Kaggle / Colab cache (we would mount the cached file at the output_dir location), then "output_dir" makes more sense to "download_path"