OptimalScale / LMFlow

An Extensible Toolkit for Finetuning and Inference of Large Foundation Models. Large Models for All.
https://optimalscale.github.io/LMFlow/
Apache License 2.0

json load dataset takes a long time #856

Closed biaoliu-kiritsugu closed 2 weeks ago

biaoliu-kiritsugu commented 2 weeks ago

In the following code from src/lmflow/datasets, loading the dataset from json takes a very long time for me. Is this normal? How can I reduce the dataset loading time?

# Iterate over the raw json data files, validate the required "type" field,
# and make sure all files share the same dataset type.
for single_file in data_files:
    with open(single_file) as fin:
        json_data = json.load(fin)
        if KEY_TYPE not in json_data.keys():
            raise ValueError(
                f'"{KEY_TYPE}" field must be specified for data, e.g.'
                '{\n'
                f'   "{KEY_TYPE}": "text_only",\n'
                f'   "{KEY_INSTANCES}": [\n'
                '       { "text": "Sentence 1: This is a sentence." },\n'
                '       { "text": "Sentence 2: This is another sentence." }\n'
                f'   ]\n'
                '}'
            )
        if self.type is None:
            self.type = json_data[KEY_TYPE]
        elif self.type != json_data[KEY_TYPE]:
            raise ValueError(
                'All task files must have same data types. Previous'
                f' files have type "{self.type}", but in file'
                f' {single_file}, it has type "{json_data[KEY_TYPE]}".'
            )
research4pan commented 2 weeks ago

Thanks for your interest in LMFlow! When loading the dataset for the first time, LMFlow needs to tokenize it. After that, the tokenized dataset is cached, so later runs should be much faster. You can also pass --preprocessing_num_workers 20 to speed up the preprocessing with parallelism.
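
Roughly speaking, that flag controls how many worker processes are used for the tokenization pass (the num_proc of the underlying HuggingFace datasets map). Here is a minimal sketch of the idea; the tokenizer name and the "text" column are placeholders rather than LMFlow's exact internals:

# Sketch of parallel preprocessing (not LMFlow's exact code).
# Assumes HuggingFace `datasets` and `transformers`; "gpt2" is a placeholder.
from datasets import Dataset
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")

raw = Dataset.from_dict({"text": [
    "Sentence 1: This is a sentence.",
    "Sentence 2: This is another sentence.",
]})

def tokenize(batch):
    return tokenizer(batch["text"])

# num_proc plays the role of --preprocessing_num_workers: the dataset is
# sharded and tokenized in parallel worker processes, which is where the
# speedup comes from.
tokenized = raw.map(tokenize, batched=True, num_proc=2)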

The slowness could also be because the json file is too large. In that case, we recommend splitting the json file into smaller files, each no larger than a few megabytes.
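
If you go the splitting route, a small script along these lines works for the json format shown above (the paths and chunk size below are placeholders; adjust them to your data):

# Split a large LMFlow-format json file ({"type": ..., "instances": [...]})
# into smaller files with at most chunk_size instances each.
import json
import os

src_path = "data/train/big.json"   # placeholder: your large json file
out_dir = "data/train_split"       # placeholder: output directory
chunk_size = 5000                  # placeholder: instances per output file

with open(src_path) as fin:
    data = json.load(fin)

os.makedirs(out_dir, exist_ok=True)
instances = data["instances"]
for i in range(0, len(instances), chunk_size):
    part = {"type": data["type"], "instances": instances[i:i + chunk_size]}
    out_path = os.path.join(out_dir, f"part_{i // chunk_size:05d}.json")
    with open(out_path, "w") as fout:
        json.dump(part, fout, ensure_ascii=False)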

Hope this information can be helpful 😄