feat: Enable JSON dataset compatibility

Description of the change

Although JSON datasets are documented, loading JSON files has been unreliable because most testing has focused on JSONL files. To address this:

Create JSON versions of the JSONL fixtures.
Add the JSON fixtures to __init__.py and to the imports.
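As a rough illustration of how a JSON fixture could be derived from an existing JSONL one, here is a minimal sketch using only the standard library. The helper name `jsonl_to_json` and its arguments are hypothetical and are not part of this PR; the real fixtures may have been created differently.

```python
import json


def jsonl_to_json(jsonl_path, json_path):
    """Hypothetical helper: rewrite a JSONL file as a single JSON array.

    Returns the number of records written.
    """
    records = []
    with open(jsonl_path, encoding="utf-8") as src:
        for line in src:
            line = line.strip()
            if line:  # ignore blank lines between records
                records.append(json.loads(line))
    with open(json_path, "w", encoding="utf-8") as dst:
        json.dump(records, dst, indent=2)
    return len(records)
```

Each line of the JSONL file becomes one element of the top-level JSON array, so both fixtures describe the same records.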

Add JSON tests for the affected functions, which surfaced the following failures (31 tests added in total, from 167 to 198):

tests/acceleration/test_acceleration_dataclasses.py ....                                                                                                              [  2%]
tests/acceleration/test_acceleration_framework.py s.ssssssss                                                                                                          [  7%]
tests/build/test_launch_script.py .........                                                                                                                           [ 11%]
tests/build/test_utils.py .........                                                                                                                                   [ 16%]
tests/test_sft_trainer.py ..................................ss......F......                                                                                           [ 41%]
tests/trackers/test_aim_tracker.py sss                                                                                                                                [ 43%]
tests/trackers/test_file_logging_tracker.py ..                                                                                                                        [ 44%]
tests/trackers/test_tracker_api.py ..                                                                                                                                 [ 45%]
tests/trainercontroller/test_tuning_trainercontroller.py ......................                                                                                       [ 56%]
tests/utils/test_config_utils.py ...............                                                                                                                      [ 64%]
tests/utils/test_data_type_utils.py ...                                                                                                                               [ 65%]
tests/utils/test_data_utils.py ...                                                                                                                                    [ 67%]
tests/utils/test_embedding_resize.py .                                                                                                                                [ 68%]
tests/utils/test_evaluator.py .                                                                                                                                       [ 68%]
tests/utils/test_preprocessing_utils.py .........F.......FFFF..................................F.....                                                                 [100%]

Modify preprocessing_utils.py to handle both JSONL and JSON files based on the file extension. The main change is to replace the load_hf_dataset_from_jsonl_file function with load_hf_dataset_from_file. When format_dataset is called, if the dataset is not pretokenized or single sequence, it calls this function. Previously the function supported only JSONL files. It now checks the file extension to decide whether to load the file as JSON or as JSONL.
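The extension-based dispatch described above can be sketched as follows. This is a simplified, standard-library illustration under stated assumptions, not the code in preprocessing_utils.py: the name `load_records_from_file` is hypothetical, and it returns a plain list of dicts rather than a Hugging Face dataset.

```python
import json
import os


def load_records_from_file(path):
    """Hypothetical sketch: load records from a .json or .jsonl file.

    The format is chosen purely from the file extension.
    """
    ext = os.path.splitext(path)[1].lower()
    if ext not in (".json", ".jsonl"):
        raise ValueError(f"Unsupported dataset file extension: {ext!r}")

    with open(path, encoding="utf-8") as f:
        if ext == ".jsonl":
            # JSONL: one JSON object per non-empty line.
            return [json.loads(line) for line in f if line.strip()]
        # JSON: a single document, assumed here to be an array of records.
        data = json.load(f)
        return data if isinstance(data, list) else [data]
```

Rejecting unknown extensions up front keeps a mislabeled file from being parsed with the wrong format.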

Related issue

While the JSONL file format works, the JSON file format is currently broken for at least a few of the data formats we support: https://github.com/foundation-model-stack/fms-hf-tuning/tree/main?tab=readme-ov-file#1-json-formats-with-a-single-s[…]-use-for-masking-on-completion We have documented JSON, but it doesn't actually work. Investigate why this is broken and fix it.

How to verify the PR

Run the unit tests added in test_sft_trainer.py and test_preprocessing_utils.py. If any tests need to be added or removed, let me know!

Was the PR tested

[x] I have added >=1 unit test(s) for every new method I have added.
[x] I have ensured all unit tests pass

foundation-model-stack / fms-hf-tuning

feat: Enable JSON dataset compatibility #297
