Modify preprocessing_utils.py to handle both JSONL and JSON files based on file extension. The main change renames load_hf_dataset_from_jsonl_file to load_hf_dataset_from_file, which now checks the file extension to decide whether to load the file as JSON or JSONL.
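To illustrate the extension check, here is a minimal standard-library sketch. It is not the code in this PR: load_records_from_file is a hypothetical name, it returns plain Python dicts rather than a Hugging Face dataset, and it assumes a .json file holds a single array of objects.

```python
import json
import os


def load_records_from_file(data_path):
    """Hypothetical helper: load a list of records from a .json or .jsonl file."""
    # Choose the parser from the file extension, ignoring case.
    extension = os.path.splitext(data_path)[1].lower()
    if extension not in (".json", ".jsonl"):
        raise ValueError(f"unsupported file extension: {extension!r}")

    with open(data_path, "r", encoding="utf-8") as f:
        if extension == ".jsonl":
            # JSONL: one JSON object per non-empty line.
            return [json.loads(line) for line in f if line.strip()]
        # JSON: assumed here to be one top-level array of objects.
        records = json.load(f)

    if not isinstance(records, list):
        raise ValueError(f"expected a JSON array in {data_path}")
    return records
```

With this shape, a caller can pass either train.json or train.jsonl and get the same list of records back, while any other extension fails early with a clear error.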
Description of the change
While JSON documentation exists, there have been issues with loading JSON files, as most testing has been focused on JSONL files. To address this:
- Update __init__.py and imports.
- Modify preprocessing_utils.py to handle both JSONL and JSON files based on file extension. The main change is renaming load_hf_dataset_from_jsonl_file to load_hf_dataset_from_file. format_dataset calls this function when the dataset is neither pretokenized nor single-sequence, and it previously supported only JSONL files. The function now checks the file extension to decide whether to load the file as JSON or JSONL.
Related issue
How to verify the PR
Run the unit tests added in test_sft_trainer.py and test_preprocessing_utils.py.
If any new tests need to be added or removed, let me know!
Was the PR tested