LoicGrobol / zeldarose

Train transformer-based models.
https://zeldarose.readthedocs.io

[Bug] Failing to cache a pre-tokenized corpus #31

Closed: pjox closed this issue 2 years ago

pjox commented 2 years ago

When using a relatively big corpus (around 20 GB), it is useful to launch zeldarose a first time with fewer resources in order to save GPU hours. Moreover, on infrastructures like Jean Zay, a job that does not ping all of its nodes for more than 30 minutes fails, which is exactly what happens when one has to tokenize a relatively big corpus.

The workaround of pre-tokenizing the corpus first used to work, but with the latest version of zeldarose I get the following warning when I try to pre-tokenize:

Dataset text downloaded and prepared to /gpfswork/rech/rcy/uok84lv/.cache/hf-datasets/text/default-1850886023af0077/0.0.0/acc32f2f2ef863c93c2f30c52f7df6cc9053a1c2230b8d7da0d210404683ca08. Subsequent calls will reuse this data.
Parameter 'function'=<function encode_dataset.<locals>.<lambda> at 0x14a92157b280> of the transform datasets.arrow_dataset.Dataset.filter@2.0.1 couldn't be hashed properly, a random hash was used instead. Make sure your transforms and parameters are serializable with pickle or dill for the dataset fingerprinting and caching to work. If you reuse this transform, the caching mechanism will consider it to be different from the previous calls and recompute everything. This warning is only showed once. Subsequent hashing failures won't be showed.

And when I try to launch the pre-training of a RoBERTa model, the cached corpus is not found, so it is re-tokenized and the job always fails because the nodes get disconnected while waiting.
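For context, the warning comes from the way 🤗 datasets fingerprints a transform: it hashes the function passed to map/filter with dill to decide whether a cached result can be reused, and when the hash fails (as with that lambda) it falls back to a random fingerprint, so the cache is never hit again. Below is a minimal sketch (not zeldarose's own code; the tokenizer name, file paths and the encode function are placeholders of my own) showing a pre-tokenization pass whose result is saved to disk explicitly and reloaded in the training job, which does not depend on the transform's fingerprint at all:

from datasets import load_dataset, load_from_disk
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("roberta-base")

def encode(batch):
    # Named, module-level function: it can be pickled, so the cache
    # fingerprint stays stable when dill behaves correctly.
    return tokenizer(batch["text"], truncation=True, max_length=512)

raw = load_dataset("text", data_files={"train": "corpus.txt"})
encoded = raw.map(encode, batched=True, remove_columns=["text"])

# Saving and reloading explicitly bypasses fingerprint-based cache lookup
# entirely: the training job just reads the Arrow files from disk.
encoded.save_to_disk("pretokenized-corpus")

# Later, in the training job:
# encoded = load_from_disk("pretokenized-corpus")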

I have found another workaround for this by downgrading the versions of dill and multiprocess to:

dill             0.3.4
multiprocess     0.70.12.2             
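(Assuming a pip-managed environment, the pin is simply:

pip install dill==0.3.4 multiprocess==0.70.12.2

before running zeldarose.)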

There might also be a related issue already reported to hf/datasets: https://github.com/huggingface/datasets/issues/3178

Thanks in advance for the help 😄