RUC-GSAI / Yulan-GARDEN

Official Repository for SIGIR2024 Demo Paper "An Integrated Data Processing Framework for Pretraining Foundation Models"

Some questions about running the code #2

Closed qylen closed 5 months ago

qylen commented 5 months ago

Hello, your work is great! I'm a newcomer to this field. I followed your guide to run the code and downloaded the mini openwebtext2 dataset as input. However, when I run the code, the output file is empty, and I don't know how to solve it. [image]

qylen commented 5 months ago

Also, how should I handle an input dataset file like openwebtext2? You say you provide data processing recipes, but I can't find them.

Emanual20 commented 5 months ago

Dear qylen,

Thanks for your interest in our work.

Regarding your first question: since no further details about the errors from running Yulan-GARDEN were provided, I suspect the mini openwebtext2 you downloaded is not in JSONL format. When {{input_ext}} is set in the config file, the input must hold one data point per line, and each line must be a JSON dict containing the {{input_text_key}} field.
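To check whether an input file meets this one-record-per-line requirement, a quick validation script can help. The following is a minimal sketch and is not part of Yulan-GARDEN. The function names are hypothetical, the key name `"text"` is only an assumed value for {{input_text_key}}, and treating blank lines as errors is also an assumption, so adjust both to match your config.

```python
import json
from typing import Iterable, List, Tuple


def check_jsonl_lines(lines: Iterable[str], text_key: str = "text") -> List[Tuple[int, str]]:
    """Return (line_number, problem) pairs for lines that are not usable records.

    `text_key` is an assumed placeholder for the {{input_text_key}} value
    set in your own config file.
    """
    problems = []
    for number, line in enumerate(lines, start=1):
        stripped = line.strip()
        if not stripped:
            # Assumption: a blank line is reported rather than silently skipped.
            problems.append((number, "blank line"))
            continue
        try:
            record = json.loads(stripped)
        except json.JSONDecodeError as exc:
            problems.append((number, f"invalid JSON: {exc.msg}"))
            continue
        if not isinstance(record, dict):
            problems.append((number, "not a JSON object"))
        elif text_key not in record:
            problems.append((number, f"missing key {text_key!r}"))
    return problems


def check_jsonl_file(path: str, text_key: str = "text") -> List[Tuple[int, str]]:
    """Run the same check on a file on disk, reading it line by line."""
    with open(path, encoding="utf-8") as handle:
        return check_jsonl_lines(handle, text_key)
```

Calling `check_jsonl_file("your_input.jsonl", text_key="text")` (with a hypothetical path) returns an empty list when every line is a JSON dict containing the key. Any entries in the result point at lines worth inspecting before running the pipeline.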

As for your second question, the data processing recipe for openwebtext2 that we used to reproduce the experimental results reported in our paper can be found here.

I hope this response helps you resolve your issues.

Emanual20, Yulan-GARDEN Team

Emanual20 commented 5 months ago

As there have been no further comments, this issue will be closed. Please feel free to reopen it if you have any other questions. We look forward to your feedback.