Use custom cache dir for tokenizer download, too

erickpeirson commented 2 weeks ago

Presently, passing cache_dir: Path to WordLlama.load() has no impact on the cache directory where tokenizer assets are stored. This makes it impossible to use WordLlama in an environment where the default cache path (the user's home directory) is not writable, which is often the case in production scenarios.

This PR does two things:

Changes the meaning of the cache_dir parameter of WordLlama.load() so that it is the cache root directory, within which the tokenizers and weights subdirectories are created.
Ensures that cache_dir is passed to check_and_download_tokenizer and actually used there, so that all writes occur within a configurable cache directory.
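
For illustration, here is a minimal sketch of the directory layout described above. It is not the code from this PR: the helper name resolve_cache_dirs and the default root are hypothetical, and only the tokenizers and weights subdirectory names come from the description.

```python
import tempfile
from pathlib import Path
from typing import Optional, Tuple

# Hypothetical default root, for illustration only; the library's real default may differ.
DEFAULT_CACHE_ROOT = Path.home() / ".cache" / "wordllama"


def resolve_cache_dirs(cache_dir: Optional[Path] = None) -> Tuple[Path, Path]:
    """Return (weights_dir, tokenizers_dir), both located under one cache root.

    Hypothetical helper: it treats cache_dir as the root directory and
    creates the two subdirectories beneath it if they do not exist yet.
    """
    root = Path(cache_dir).expanduser() if cache_dir is not None else DEFAULT_CACHE_ROOT
    weights_dir = root / "weights"
    tokenizers_dir = root / "tokenizers"
    for directory in (weights_dir, tokenizers_dir):
        directory.mkdir(parents=True, exist_ok=True)
    return weights_dir, tokenizers_dir


if __name__ == "__main__":
    # Use a temporary directory to stand in for a writable location
    # when the home directory is read-only.
    with tempfile.TemporaryDirectory() as tmp:
        weights, tokenizers = resolve_cache_dirs(Path(tmp) / "wordllama")
        print(weights)
        print(tokenizers)
```

With a layout like this, a caller only has to point the single root at a writable location, and both the weights and the tokenizer downloads land under it.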

Note that this will effectively bust the cache on upgrade. But I'm hoping that's a small price to pay for the fix.

dleemiller commented 2 weeks ago

Nice - definitely a necessary change for deploying to places like lambda functions. Thanks!

dleemiller commented 2 weeks ago

https://github.com/dleemiller/WordLlama/pull/42

I have decided to clean everything up and simplify the API by removing weights_dir as well. Having both keyword arguments now feels like over-complicated legacy to me.

Use custom cache dir for tokenizer download, too #41