biaoliu-kiritsugu closed this issue 2 weeks ago
Thanks for your interest in LMFlow! When loading a dataset for the first time, LMFlow needs to tokenize it. After that, the tokenized dataset is cached, so later runs should be much faster. You can also pass --preprocessing_num_workers 20
to accelerate the process through parallelism.
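As an illustration of what worker-based parallelism buys you here, the sketch below tokenizes a list of texts across multiple processes. This is a toy stand-in (a whitespace tokenizer instead of the model's real Hugging Face tokenizer), not LMFlow's actual implementation; `--preprocessing_num_workers` presumably achieves a similar effect internally.

```python
from multiprocessing import Pool

def tokenize(text):
    # Stand-in for a real tokenizer; LMFlow would use the
    # model's Hugging Face tokenizer here.
    return text.split()

def tokenize_corpus(texts, num_workers=20):
    """Tokenize a list of texts in parallel across num_workers processes.

    Illustrative sketch only: shows the speedup pattern that
    --preprocessing_num_workers exploits, not LMFlow's own code.
    """
    with Pool(num_workers) as pool:
        return pool.map(tokenize, texts)
```

With 20 workers, each process tokenizes a shard of the corpus concurrently, so preprocessing time scales down roughly with the number of cores available.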
The slowness could also be caused by an overly large JSON file. In that case, we recommend splitting the JSON file into smaller files of no more than a few megabytes each.
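A minimal sketch of such a split is below. It assumes the common LMFlow dataset layout (a top-level dict with a `"type"` field and an `"instances"` list); the function name and chunking threshold are my own, so adjust the key and sizes to your actual format.

```python
import json

def split_json(path, max_records=10000):
    """Split a large LMFlow-style JSON dataset into smaller chunk files.

    Assumes the file holds a dict with an "instances" list (the usual
    LMFlow dataset layout); top-level fields such as "type" are copied
    into every chunk. Returns the paths of the files written.
    """
    with open(path) as f:
        data = json.load(f)
    instances = data["instances"]
    chunk_paths = []
    for i in range(0, len(instances), max_records):
        chunk = dict(data)  # keep top-level metadata like "type"
        chunk["instances"] = instances[i:i + max_records]
        out_path = f"{path.rsplit('.json', 1)[0]}_{i // max_records}.json"
        with open(out_path, "w") as out:
            json.dump(chunk, out)
        chunk_paths.append(out_path)
    return chunk_paths
```

Splitting by record count rather than byte size keeps the code simple; if your instances vary wildly in length, you may want to track the serialized size per chunk instead.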
Hope this information can be helpful 😄
In the following code in
src/lmflow/datasets
, loading the dataset from JSON takes a very long time for me. Is this normal, and how can I reduce the dataset loading time?