After a series of attempts, I seem to have gotten this working. I've listed the steps below for others to reference.
First, create a new file `pile-readymade-local.yaml` with the following content:
```yaml
# Draw a preprocessed dataset directly from my HF profile.
# This dataset is already tokenized; you have to load the correct tokenizer
# (which happens automatically with data.load_pretraining_corpus).
name: the_pile_WordPiecex32768
name_proc: the_pile_WordPiecex32768_2efdb9d060d1ae95faf952ec1a50f020
sources:
  hub:
    provider: local

streaming: True

vocab_size: 32768 # cannot be changed!
seq_length: 128 # cannot be changed!
```
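Since the `data=pile-readymade-local` override below suggests this file belongs in the repo's Hydra data config group, a quick way to sanity-check it outside of Hydra is to load it directly with OmegaConf. This is a minimal sketch, not part of the repo; it only assumes the file sits in the working directory:

```python
# Minimal sanity check: load the new YAML with OmegaConf (the config library
# Hydra builds on) and confirm the fields resolve as written above.
from omegaconf import OmegaConf

cfg_data = OmegaConf.load("pile-readymade-local.yaml")
assert cfg_data.vocab_size == 32768 and cfg_data.seq_length == 128
print(cfg_data.name_proc)  # -> the_pile_WordPiecex32768_2efdb9d060d1ae95faf952ec1a50f020
```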
Then, modify line 35 of `load_pretraining_corpus` in `cramming/cramming/data/pretraining_preparation.py` to:
```python
try:
    # Prefer an explicitly named preprocessed dataset directory, if the config provides one
    processed_dataset_dir = cfg_data.name_proc
except Exception:  # no name_proc key: fall back to the default name + checksum scheme
    processed_dataset_dir = f"{cfg_data.name}_{checksum}"
```
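If you would rather avoid the broad `except`, OmegaConf's `DictConfig.get` can provide an equivalent fallback. This is a hypothetical alternative, not code from the repo, and it assumes `cfg_data` is an ordinary DictConfig; in strict/struct-mode configs the try/except above is the safer route:

```python
# Returns None instead of raising when name_proc is missing,
# then falls back to the checksum-based directory name.
processed_dataset_dir = cfg_data.get("name_proc", None)
if processed_dataset_dir is None:
    processed_dataset_dir = f"{cfg_data.name}_{checksum}"
```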
Change the original line 47, `tokenized_dataset = datasets.load_from_disk(data_path)`, to:
```python
if cfg_data is not None:
    # Hub-style download: open with load_dataset and take the train split
    tokenized_dataset = datasets.load_dataset(data_path)["train"].with_format("torch")
else:
    # Original behavior: a dataset that was written locally with save_to_disk
    tokenized_dataset = datasets.load_from_disk(data_path)
```
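For context on why the branch is needed: `datasets.load_from_disk` only reopens directories written by `Dataset.save_to_disk`, while a dataset downloaded from the HF Hub as a repository of data files must go through `datasets.load_dataset`. A minimal illustration of the two entry points; the paths are placeholders, not paths from this issue:

```python
import datasets

# A dataset repository downloaded from the HF Hub (a folder of data files):
ds_hub = datasets.load_dataset("path/to/downloaded_repo")["train"].with_format("torch")

# A directory created locally via Dataset.save_to_disk (the repo's original assumption):
ds_disk = datasets.load_from_disk("path/to/saved_dataset")
```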
Finally, use the following command to train:

```bash
python pretrain.py \
    name=amp_b8192_cb_o4_final arch=crammed-bert \
    train=bert-o4 data=pile-readymade-local
```
Ok, I'm glad you got it working!
This was never a use case I had before, given that I have the originals. I'll close this issue for now, but people will be able to find it through the search.
Hi Jonas, I would like to ask how to load local data. Specifically, I first downloaded the data here, and then hoped to run the following experiments:

but it seems that the downloaded data cannot be loaded (I also tried modifying the yaml, but every attempt failed).