Closed by Jason3900 1 year ago
BTW, the input to preprocess_dataset.py follows the format you provided: each line is a JSON object with "task" and "text" as keys.
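For reference, a minimal sketch of how such an input file could be produced. The helper name `to_jsonl_lines`, the task names, and the sample texts are hypothetical; only the two keys "task" and "text" come from the description above.

```python
import json


def to_jsonl_lines(records):
    """Serialize each record as one JSON object per line (JSON Lines)."""
    lines = []
    for rec in records:
        # Keep only the two keys the preprocessing input is described as using.
        obj = {"task": rec["task"], "text": rec["text"]}
        lines.append(json.dumps(obj, ensure_ascii=False))
    return lines


# Hypothetical sample records; real task names depend on your data.
sample = [
    {"task": "lm", "text": "The first training document."},
    {"task": "lm", "text": "The second training document."},
]

jsonl_lines = to_jsonl_lines(sample)
for line in jsonl_lines:
    print(line)
```

Writing `"\n".join(jsonl_lines)` to a file gives one JSON object per line, which is the shape described above.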
This is probably because the amount of data is small. You can use a smaller block_size (here) when initializing the DistributedDataset.
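To see why a small dataset can look empty to some ranks, here is a toy model, not the actual CPM-Live code. It assumes the data is cut into fixed-size blocks that are handed out round-robin to ranks; the real partitioning in DistributedDataset may differ, and the function name `blocks_per_rank` and the sizes used are made up for illustration.

```python
def blocks_per_rank(total_bytes, block_size, world_size):
    """Count how many blocks each rank gets under round-robin assignment.

    Assumption: this is an illustrative model only, not the library's logic.
    """
    # Ceiling division: a partially filled last block still counts as a block.
    num_blocks = -(-total_bytes // block_size)
    counts = [0] * world_size
    for block_id in range(num_blocks):
        counts[block_id % world_size] += 1
    return counts


MiB = 1 << 20

# A 20 MiB dataset with large blocks: only 2 blocks exist, so with 4 ranks
# two of them receive nothing.
large_blocks = blocks_per_rank(20 * MiB, 16 * MiB, world_size=4)

# The same data with smaller blocks: every rank gets some.
small_blocks = blocks_per_rank(20 * MiB, 1 * MiB, world_size=4)

# A single process always gets all blocks, which matches the single-node case.
single_rank = blocks_per_rank(20 * MiB, 16 * MiB, world_size=1)

print(large_blocks, small_blocks, single_rank)
```

Under this model, a rank with a count of zero has nothing to read, which would match the "Empty Dataset" symptom appearing only when world_size > 1.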
Thanks, it works. I hope this will be mentioned in the README.
Hi, I'm trying to do full fine-tuning of CPM-ANT+. I followed the instructions in the README and used preprocess_dataset.py to generate the binary data file. However, when world_size > 1 (distributed mode), the read() method of DistributedDataset raises an "Empty Dataset" error, whereas the data is read successfully in single-node mode. Could you help me fix it? Thanks. https://github.com/OpenBMB/CPM-Live/blob/e0cee47bed002c0cd2cfdb816ceebb7fdef2edc3/cpm-live/pretrain_cpm_ant_plus.py#L427