Modified requirements.txt : Added the transformers library as a dependency.
Modified data_process.py : Specified 'utf-8' encoding when opening files to avoid possible 'codec can't decode' errors.
Modified data_process.py : Refactored the sft_to_pretrain() method. It now requires .json files, which can be downloaded directly from the Hugging Face repo, instead of .csv files.
Modified data_process.py : Modified the process_baidu() method. To avoid a memory error when processing the large .json file, the data is now split into five smaller batches, and each batch is saved as a separate binary file.
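
For reviewers, here is a minimal sketch of the batching idea behind the process_baidu() change. It is not the code in data_process.py. It assumes the input is a JSON Lines file with one record per line and a "text" field. The names process_in_batches, encode, and _flush are hypothetical, and encode is a stand-in for the real tokenizer. The file is streamed with explicit 'utf-8' encoding, so only one batch of token ids is held in memory at a time.

```python
import json
from array import array
from pathlib import Path


def encode(text):
    # Placeholder tokenizer (assumption): UTF-8 byte values stand in for token ids.
    return list(text.encode("utf-8"))


def _flush(buffer, out_prefix, index):
    # Write one batch of ids to its own binary file and return the path.
    path = Path(f"{out_prefix}_{index}.bin")
    with path.open("wb") as f:
        buffer.tofile(f)
    return path


def process_in_batches(json_path, out_prefix, num_batches=5, text_key="text"):
    """Stream a JSON Lines file and write token ids to at most num_batches .bin files."""
    json_path = Path(json_path)

    # First pass: count records so the batch size can be computed
    # without loading the whole file into memory.
    with json_path.open("r", encoding="utf-8") as f:
        total = sum(1 for line in f if line.strip())
    batch_size = max(1, -(-total // num_batches))  # ceiling division

    written = []
    buffer = array("H")  # unsigned 16-bit ids
    count = 0
    # Second pass: tokenize record by record and flush whenever a batch is full.
    with json_path.open("r", encoding="utf-8") as f:
        for line in f:
            if not line.strip():
                continue
            record = json.loads(line)
            buffer.extend(encode(record[text_key]))
            count += 1
            if count % batch_size == 0:
                written.append(_flush(buffer, out_prefix, len(written)))
                buffer = array("H")
    if count % batch_size != 0:
        # Flush the final, partially filled batch.
        written.append(_flush(buffer, out_prefix, len(written)))
    return written
```

Calling process_in_batches("data.json", "out") would write out_0.bin, out_1.bin, and so on. Both file names are placeholders.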
Several modifications to the data process module: