OpenBMB / CPM-Live

Live Training for Open-source Big Models

Empty Dataset in distributed mode #331

Closed: Jason3900 closed this issue 1 year ago

Jason3900 commented 1 year ago

Hi, I'm trying to fully fine-tune CPM-ANT+. I followed the instructions in the README and used preprocess_dataset.py to generate the binary data file. However, when world_size > 1 (distributed mode), the read() method of DistributedDataset raises an "Empty Dataset" error, whereas the same data is read successfully in single-node mode. Could you help me fix it? Thanks. https://github.com/OpenBMB/CPM-Live/blob/e0cee47bed002c0cd2cfdb816ceebb7fdef2edc3/cpm-live/pretrain_cpm_ant_plus.py#L427

Jason3900 commented 1 year ago

BTW, the input to preprocess_dataset.py follows the format you provided. Each line is a JSON object with "task" and "text" as keys.

zh-zheng commented 1 year ago

This is probably because the amount of data is small. You can pass a smaller block_size when initializing the DistributedDataset.
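To illustrate why a small dataset can come up empty only when world_size > 1, here is a minimal sketch. It is a toy model, not the actual CPM-Live code: it assumes the data is cut into fixed-size blocks that are dealt round-robin to ranks, and the function name `blocks_per_rank`, the 20 MiB dataset size, and the block sizes are all hypothetical values chosen for the example.

```python
import math
from typing import List


def blocks_per_rank(total_bytes: int, block_size: int, world_size: int) -> List[int]:
    """Toy model (assumption): split the data into fixed-size blocks and
    deal them round-robin to ranks. Returns the block count for each rank."""
    n_blocks = math.ceil(total_bytes / block_size)
    counts = [0] * world_size
    for block_id in range(n_blocks):
        counts[block_id % world_size] += 1
    return counts


# Hypothetical 20 MiB dataset on 4 ranks.
small_dataset = 20 << 20

# Large blocks: only 2 blocks exist, so ranks 2 and 3 receive nothing.
print(blocks_per_rank(small_dataset, 16 << 20, 4))  # [1, 1, 0, 0]

# Smaller blocks: 20 blocks exist, so every rank receives data.
print(blocks_per_rank(small_dataset, 1 << 20, 4))   # [5, 5, 5, 5]

# A single process always sees all the blocks.
print(blocks_per_rank(small_dataset, 16 << 20, 1))  # [2]
```

Under this assumption, any rank whose count is 0 has nothing to read, which would match an "Empty Dataset" error that appears only in distributed mode. Reducing block_size until the number of blocks is at least world_size gives every rank some data.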

Jason3900 commented 1 year ago

Thanks, it works. I hope this gets mentioned in the README.