RUC-GSAI / Yulan-GARDEN

Official Repository for SIGIR2024 Demo Paper "An Integrated Data Processing Framework for Pretraining Foundation Models"

PPL Error #1

Closed Rookie-Kai closed 4 months ago

Rookie-Kai commented 4 months ago

Hello, and thank you very much for open-sourcing this project; it has helped me a lot. However, I ran into a problem while using it.

When PPL is turned on, a bug always crashes the program, complaining that my input path is a directory. Example errors are as follows:

wc: /mnt/afs/Data_preprocess/data: Is a directory
2024-05-11 10:57:07,378 - Global Logger - INFO - begin to sample randomly 500 / 0 lines from /mnt/afs/Data_preprocess/data..
Exception for Bad File at /mnt/afs/Data_preprocess/Yulan-GARDEN/output/data/.tmp/99.jsonl for [Errno 21] Is a directory: '/mnt/afs/Data_preprocess/data'
At this point, a large number of empty tmp.json files are generated.

Finally, the following errors are reported:

ValueError: Instruction "train" corresponds to no data!
FileNotFoundError: Directory /mnt/afs/Data_preprocess/Yulan-GARDEN/output/data/.dedup is neither a `Dataset` directory nor a `DatasetDict` directory.

This happens only when PPL is turned on; the program runs normally when PPL is turned off. I have also already updated to the latest code.

PhealenWang commented 4 months ago

Dear Rookie-Kai,

Thank you for your interest in our work. We have reproduced and fixed the reported bug.

This bug was caused by inconsistent read and write implementations in the Sampler during the pre-computation step for perplexity statistics. We have fixed the bug and pushed the updated code, which has been tested on a small set of cases.
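
For illustration only, a mismatch of this kind can be avoided by routing both line counting and sampling through one helper that accepts either a single file or a directory. The sketch below is not the actual Yulan-GARDEN fix: the names `iter_input_files`, `count_lines`, and `sample_lines`, the suffix list, and the choice of reservoir sampling are all assumptions made for the example.

```python
import os
import random
from typing import Iterator, List, Tuple


def iter_input_files(
    path: str, suffixes: Tuple[str, ...] = (".jsonl", ".json", ".txt")
) -> Iterator[str]:
    """Yield data files whether `path` is a single file or a directory.

    Hypothetical helper: the suffix list is an assumption, not the repo's.
    """
    if os.path.isfile(path):
        yield path
        return
    for root, dirs, files in os.walk(path):
        dirs.sort()  # make the traversal order deterministic
        for name in sorted(files):
            if name.endswith(suffixes):
                yield os.path.join(root, name)


def count_lines(path: str) -> int:
    """Count lines across all input files (works for a file or a directory)."""
    total = 0
    for file_path in iter_input_files(path):
        with open(file_path, "r", encoding="utf-8") as f:
            total += sum(1 for _ in f)
    return total


def sample_lines(path: str, k: int, seed: int = 0) -> List[str]:
    """Reservoir-sample up to `k` lines using the same file discovery as count_lines."""
    rng = random.Random(seed)
    reservoir: List[str] = []
    seen = 0
    for file_path in iter_input_files(path):
        with open(file_path, "r", encoding="utf-8") as f:
            for line in f:
                seen += 1
                if len(reservoir) < k:
                    reservoir.append(line)
                else:
                    # Replace a random slot with probability k / seen.
                    j = rng.randrange(seen)
                    if j < k:
                        reservoir[j] = line
    return reservoir
```

Because counting and sampling share `iter_input_files`, a directory input cannot be handled by one step and rejected by the other.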

Please pull the latest code and check whether the problem is solved. We apologize for the trouble this bug caused, and please feel free to contact us about any other issues.

PhealenWang, Yulan-GARDEN Team

Emanual20 commented 4 months ago

As no further comments have been given, this issue will be marked as completed. Please feel free to reopen it if you have any other questions.