This only affects caching of preprocessed data with HuggingFace datasets when caching is disabled and keep_in_memory is set to False. Note that disabling caching does not mean nothing is cached: the data is written to a temporary directory instead of the normal HuggingFace cache. In practice, this only matters when the dataset/model combination makes the preprocessed training dataset huge (e.g. >500 GB local scratch space in Baskerville jobs).
tempfile.gettempdir() checks several environment variables (TMPDIR, TEMP and TMP, per the Python documentation) and falls back to a default OS-dependent location for temporary files if none of them are set. However, its value is cached and seems to be set early after launching Python, so setting the environment variable with os.environ in a script doesn't change it. The possible workarounds are:
Setting the dir argument in tempfile.mkdtemp(). But HuggingFace doesn't expose this argument in datasets.arrow_dataset._get_cache_file_path / datasets.fingerprint._TempDirWithCustomCleanup where it would need to be set.
Setting the environment variable outside the script (i.e. doing TMPDIR=/my/dir python my_script.py). This works but would require modifying the config generation/launch script code.
Patching the use of tempfile.gettempdir() in HuggingFace datasets to set a dir argument that we specify
Making a PR to datasets to make the temp dir configurable
Setting/overwriting the value of tempfile.tempdir (which tempfile.gettempdir() uses) directly. The documentation discourages this, but it does seem to work for our purposes. This is what I've done for now.
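For reference, here is a minimal sketch of the behaviour described above and of the last workaround. It only uses the standard library. The function names and the scratch directory are hypothetical placeholders for illustration, not the code in this change.

```python
import os
import tempfile


def env_change_is_ignored(new_dir: str) -> bool:
    """Return True if changing TMPDIR after the first lookup has no effect."""
    cached = tempfile.gettempdir()  # first call resolves and caches the location
    old_env = os.environ.get("TMPDIR")
    os.environ["TMPDIR"] = new_dir  # too late: the cached value is reused
    try:
        return tempfile.gettempdir() == cached
    finally:
        # Put the environment back the way we found it.
        if old_env is None:
            del os.environ["TMPDIR"]
        else:
            os.environ["TMPDIR"] = old_env


def redirect_tempdir(new_dir: str) -> str:
    """Point the tempfile module at new_dir and return the previous location."""
    previous = tempfile.gettempdir()
    os.makedirs(new_dir, exist_ok=True)
    # Overwrites the module-level cache that gettempdir() reads.
    # The Python docs discourage this; it is used here as a workaround only.
    tempfile.tempdir = new_dir
    return previous


if __name__ == "__main__":
    # Hypothetical scratch location; substitute the job's real scratch path.
    scratch = os.path.join(tempfile.gettempdir(), "hypothetical_scratch")
    print(env_change_is_ignored(scratch))
    previous = redirect_tempdir(scratch)
    print(tempfile.mkdtemp())  # now created under the scratch directory
    tempfile.tempdir = previous  # restore the original location
```

Because the override is a module-level global, it needs to happen before datasets creates its temporary cache directory, and it affects every other user of tempfile in the same process.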