allenai / open-instruct


./scripts/prepare_train_data.sh throws an error #171

Closed: dumeixiang closed this issue 2 days ago

dumeixiang commented 1 month ago

```
2024-06-07 12:44:35 (3.69 MB/s) - ‘data/raw_train/hard_coded/hard_coded_examples.xlsx.2’ saved [53835/53835]

Processing datasets...
Processing super_ni data with default configurations...
Processing cot data with default configurations...
Processing flan_v2 data with default configurations...
Processing dolly data with default configurations...
Processing self_instruct data with default configurations...
Processing unnatural_instructions data with default configurations...
Processing stanford_alpaca data with default configurations...
Processing code_alpaca data with default configurations...
Processing gpt4_alpaca data with default configurations...
Processing sharegpt data with default configurations...
Traceback (most recent call last):
  File "/home/md480/open-instruct/open_instruct/reformat_datasets.py", line 789, in <module>
    globals()[f"convert_{dataset}_data"](os.path.join(args.raw_data_dir, dataset), os.path.join(args.output_dir, dataset))
  File "/home/md480/open-instruct/open_instruct/reformat_datasets.py", line 316, in convert_sharegpt_data
    with open(os.path.join(data_dir, data_file), "r") as fin:
FileNotFoundError: [Errno 2] No such file or directory: 'data/raw_train/sharegpt/sharegpt_html_cleaned_and_split_2048.json'
```

natolambert commented 1 month ago

What platform are you running on @dumeixiang ? I recently tested this but was fixing errors with data that no longer exists? Did the data get pulled from the cloud? Maybe the HF auth wasn't right.

See #156
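If it helps, a quick way to check the auth side of things (a sketch, assuming the huggingface_hub CLI is installed and that the download step reads HF_TOKEN):

```bash
# Confirm the token is actually visible to the shell that runs the script
echo ${HF_TOKEN:+HF_TOKEN is set}

# Confirm the token is valid and which account it maps to
huggingface-cli whoami
```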

dumeixiang commented 1 month ago

> What platform are you running on @dumeixiang ? I recently tested this but was fixing errors with data that no longer exists? Did the data get pulled from the cloud? Maybe the HF auth wasn't right.
>
> See #156

I'm running the script on Linux. Regarding the data source, the script is supposed to fetch the dataset from the cloud; however, I'm encountering issues with missing data files. As for the Hugging Face authentication, I've made sure HF_TOKEN is correctly set up with the required permissions. I am getting the error `FileNotFoundError: [Errno 2] No such file or directory: 'data/raw_train/sharegpt/sharegpt_html_cleaned_and_split_2048.json'`, and I've checked the files listed below:

```
ls -l data/raw_train/sharegpt/
total 3221532
-rw-r--r-- 1 md480 ldapusers 551420693 Apr  7  2023 sg_90k_part1_html_cleaned.json
-rw-r--r-- 1 md480 ldapusers 551420693 Apr  7  2023 sg_90k_part1_html_cleaned.json.1
-rw-r--r-- 1 md480 ldapusers 551420693 Apr  7  2023 sg_90k_part1_html_cleaned.json.2
-rw-r--r-- 1 md480 ldapusers 548183942 Apr  7  2023 sg_90k_part2_html_cleaned.json
-rw-r--r-- 1 md480 ldapusers 548183942 Apr  7  2023 sg_90k_part2_html_cleaned.json.1
-rw-r--r-- 1 md480 ldapusers 548183942 Apr  7  2023 sg_90k_part2_html_cleaned.json.2
```

natolambert commented 1 month ago

@dumeixiang in what directory? It may be that you need to be in the top level directory for the local paths to work.

I checked and the source data is still live here https://huggingface.co/datasets/anon8231489123/ShareGPT_Vicuna_unfiltered (no commits in the last year).
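A minimal way to confirm the working directory matches what the relative paths expect (a sketch; the paths come from the script name in the issue title and the traceback above):

```bash
# Run from the repository root so relative paths like data/raw_train/... resolve
cd ~/open-instruct
pwd   # should end in /open-instruct
ls scripts/prepare_train_data.sh open_instruct/reformat_datasets.py
./scripts/prepare_train_data.sh
```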

dumeixiang commented 1 month ago

> @dumeixiang in what directory? It may be that you need to be in the top level directory for the local paths to work.
>
> I checked and the source data is still live here https://huggingface.co/datasets/anon8231489123/ShareGPT_Vicuna_unfiltered (no commits in the last year).

I am running it from the directory below. Would that be considered the top-level directory?

```
~/open-instruct$ ./scripts/prepare_train_data.sh
```

hamishivi commented 1 month ago

Hi, can you give the full output of when you run the script in a fresh environment? Sometimes the splitting script can error and not produce the file that the output you have pasted is complaining about missing. Thanks!
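In case it helps narrow that down, a quick sanity check along those lines (a sketch; the filenames are the ones from the error and the listing earlier in this thread):

```bash
# Did the splitting step ever produce its output?
ls -l data/raw_train/sharegpt/sharegpt_html_cleaned_and_split_2048.json

# Are the downloaded raw files valid JSON? A truncated or failed download
# would make the splitting step die before writing its output.
python -m json.tool data/raw_train/sharegpt/sg_90k_part1_html_cleaned.json > /dev/null && echo part1 ok
python -m json.tool data/raw_train/sharegpt/sg_90k_part2_html_cleaned.json > /dev/null && echo part2 ok
```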

dumeixiang commented 3 weeks ago

> Hi, can you give the full output of when you run the script in a fresh environment? Sometimes the splitting script can error and not produce the file that the output you have pasted is complaining about missing. Thanks!

Hi hamishivi, would you mind being a little more specific about "fresh environment"? I started from the requirements and tried `./scripts/prepare_train_data.sh` several times, yet I still get the same error messages as mentioned earlier.

hamishivi commented 3 weeks ago

I just mean a new conda environment/virtual environment, with all the packages installed following the README. If you've done that, could you post the full output you get when you run the script? thanks!
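For reference, a minimal sketch of that setup (assuming the README's pip-based install and Python 3.10, which is the interpreter shown in the traceback; requirements.txt is the file mentioned later in this thread):

```bash
conda create -n open-instruct python=3.10 -y
conda activate open-instruct
pip install -r requirements.txt
./scripts/prepare_train_data.sh
```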

dumeixiang commented 3 weeks ago

> I just mean a new conda environment/virtual environment, with all the packages installed following the README. If you've done that, could you post the full output you get when you run the script? thanks!

Hi hamishivi, I created a new conda environment, installed requirements.txt, and re-ran prepare_train_data.sh. The original error disappeared, but I encountered a new one, shown below:

```
~/open-instruct$ python open_instruct/reformat_datasets.py --raw_data_dir data/raw_train/ --output_dir data/processed/
Processing super_ni data with default configurations...
Processing cot data with default configurations...
Processing flan_v2 data with default configurations...
Processing dolly data with default configurations...
Processing self_instruct data with default configurations...
Processing unnatural_instructions data with default configurations...
Processing stanford_alpaca data with default configurations...
Processing code_alpaca data with default configurations...
Processing gpt4_alpaca data with default configurations...
Processing sharegpt data with default configurations...
Traceback (most recent call last):
  File "/home/md480/open-instruct/open_instruct/reformat_datasets.py", line 789, in <module>
    globals()[f"convert_{dataset}_data"](os.path.join(args.raw_data_dir, dataset), os.path.join(args.output_dir, dataset))
  File "/home/md480/open-instruct/open_instruct/reformat_datasets.py", line 317, in convert_sharegpt_data
    examples.extend(json.load(fin))
  File "/usr/lib/python3.10/json/__init__.py", line 293, in load
    return loads(fp.read(),
  File "/usr/lib/python3.10/json/__init__.py", line 346, in loads
    return _default_decoder.decode(s)
  File "/usr/lib/python3.10/json/decoder.py", line 337, in decode
    obj, end = self.raw_decode(s, idx=_w(s, 0).end())
  File "/usr/lib/python3.10/json/decoder.py", line 355, in raw_decode
    raise JSONDecodeError("Expecting value", s, err.value) from None
json.decoder.JSONDecodeError: Expecting value: line 1 column 1 (char 0)
```
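"Expecting value: line 1 column 1 (char 0)" usually means the file being loaded is empty or is not JSON at all (for example an HTML error page saved by a failed download). A quick way to spot the bad file (a sketch; the filenames are the ones from the listing earlier in this thread):

```bash
for f in data/raw_train/sharegpt/*.json; do
  # Print the size and first bytes of each file; a 0-byte file or one
  # starting with "<html" points to a bad download rather than a code bug.
  printf '%s: ' "$f"
  wc -c < "$f"
  head -c 80 "$f"; echo
done
```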

hamishivi commented 2 days ago

Hi, we just merged some changes into main cleaning up this script - I would try running it now! For reference, I was able to run this without any issues on our machines:

```
Processing datasets...
Processing super_ni data with default configurations...
Processing cot data with default configurations...
Processing flan_v2 data with default configurations...
Processing dolly data with default configurations...
Processing self_instruct data with default configurations...
Processing unnatural_instructions data with default configurations...
Processing stanford_alpaca data with default configurations...
Processing code_alpaca data with default configurations...
Processing gpt4_alpaca data with default configurations...
Processing sharegpt data with default configurations...
Processing baize data with default configurations...
Processing oasst1 data with default configurations...
Processing lima data with default configurations...
Waring: example 1021 in LIMA has odd number of messages. Cutting off the last message.
Processing wizardlm data with default configurations...
Processing open_orca data with default configurations...
Processing hard_coded data with default configurations...
Processing science data with default configurations...
Processing tulu_v1 subsets...
Merging all the subsets to create tulu v1...
Processing tulu_v2 subsets...
Waring: example 1021 in LIMA has odd number of messages. Cutting off the last message.
Merging all the subsets to create tulu v2...
```