When using a relatively big corpus (around 20GB), it is useful to launch zeldarose a first time with fewer resources in order to save GPU hours; moreover, on infrastructures like Jean Zay, a job that does not ping all of its nodes for more than 30 minutes fails, which is exactly what happens while tokenizing a relatively big corpus.
The workaround of pre-tokenizing the corpus first used to work, but with the latest version of zeldarose I get the following warning when I try to pre-tokenize:
Dataset text downloaded and prepared to /gpfswork/rech/rcy/uok84lv/.cache/hf-datasets/text/default-1850886023af0077/0.0.0/acc32f2f2ef863c93c2f30c52f7df6cc9053a1c2230b8d7da0d210404683ca08. Subsequent calls will reuse this data.
Parameter 'function'=<function encode_dataset.<locals>.<lambda> at 0x14a92157b280> of the transform datasets.arrow_dataset.Dataset.filter@2.0.1 couldn't be hashed properly, a random hash was used instead. Make sure your transforms and parameters are serializable with pickle or dill for the dataset fingerprinting and caching to work. If you reuse this transform, the caching mechanism will consider it to be different from the previous calls and recompute everything. This warning is only showed once. Subsequent hashing failures won't be showed.
And when I try to launch the pre-training of a RoBERTa model, the corpus is not found, so it gets re-tokenized and the job always fails because the nodes get disconnected while waiting.
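For what it is worth, the hashing failure can be checked outside of zeldarose by asking datasets to fingerprint a transform directly; in the sketch below, `encode_batch` is only a hypothetical stand-in for the encoding closure that zeldarose passes to `Dataset.filter`/`map`:

```python
# Minimal sketch: check whether a transform can be fingerprinted by datasets
# with the installed dill/multiprocess versions. When Hasher.hash raises,
# Dataset.map/filter falls back to a random fingerprint and the cached
# (pre-tokenized) dataset is recomputed on the next run.
from datasets.fingerprint import Hasher

def encode_batch(batch):
    # hypothetical stand-in for the real tokenization closure in zeldarose
    return {"n_tokens": [len(text.split()) for text in batch["text"]]}

for fn in (encode_batch, lambda example: len(example["text"]) > 0):
    try:
        print(fn, "->", Hasher.hash(fn))
    except Exception as err:  # hashing failure means a cache miss on every run
        print(fn, "-> could not be hashed:", err)
```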
I have found another workaround for this by downgrading the versions of dill and multiprocess to:
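As a sanity check before launching the job, something like the sketch below can confirm that the downgraded versions are the ones actually installed in the environment (the version strings in it are placeholders, not necessarily the exact pins):

```python
# Sketch: verify that dill/multiprocess were actually downgraded in the
# current environment before submitting the job. The pinned versions below
# are placeholders for whatever combination turns out to work.
from importlib.metadata import version

EXPECTED = {"dill": "0.3.4", "multiprocess": "0.70.12.2"}  # placeholder pins

for package, wanted in EXPECTED.items():
    installed = version(package)
    status = "OK" if installed == wanted else f"expected {wanted}"
    print(f"{package}: {installed} ({status})")
```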
There might also be a related issue already reported to hf/datasets: https://github.com/huggingface/datasets/issues/3178

Thanks in advance for the help 😄