epfLLM / Megatron-LLM

distributed trainer for LLMs

error: preprocess.py file error while working on custom data #94

Open toqeer618 opened 8 months ago

toqeer618 commented 8 months ago

Traceback (most recent call last):
  File "/usr/lib/python3.10/multiprocessing/pool.py", line 125, in worker
    result = (True, func(*args, **kwds))
  File "/usr/lib/python3.10/multiprocessing/pool.py", line 48, in mapstar
    return list(map(*args))
  File "/mpt/Megatron-LLM/Megatron-LLM/tools/preprocess_data.py", line 71, in encode
    text = data[key]
KeyError: 'text'
"""

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/mpt/Megatron-LLM/Megatron-LLM/tools/preprocess_data.py", line 201, in <module>
    main()
Special tokens: {'': 32000, '': 32001, '': 32002, '': 32003, '': 32004, '': 1, '': 2}

padded vocab (size: 32005) with 123 dummy tokens (new size: 32128)
  File "/mpt/Megatron-LLM/Megatron-LLM/tools/preprocess_data.py", line 179, in main
    for i, (doc, bytes_processed) in enumerate(encoded_docs, start=1):
  File "/usr/lib/python3.10/multiprocessing/pool.py", line 423, in <genexpr>
    return (item for chunk in result for item in chunk)
  File "/usr/lib/python3.10/multiprocessing/pool.py", line 873, in next
    raise value
KeyError: 'text'
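
The worker fails at `text = data[key]` with `KeyError: 'text'`, which suggests that at least one record in the custom data has no `text` field. Below is a minimal sketch for locating such records before running the preprocessing script. It assumes the input is a JSON Lines file (one JSON object per line); the helper name `find_records_missing_key` and the sample records are hypothetical and are not part of Megatron-LLM.

```python
import json
from typing import Iterable, List, Tuple


def find_records_missing_key(lines: Iterable[str], key: str = "text") -> List[Tuple[int, str]]:
    """Return (line_number, reason) for every record that lacks `key`.

    Hypothetical helper: assumes one JSON object per line (JSON Lines).
    """
    problems = []
    for lineno, line in enumerate(lines, start=1):
        if not line.strip():
            continue  # skip blank lines
        try:
            record = json.loads(line)
        except json.JSONDecodeError as exc:
            problems.append((lineno, f"invalid JSON: {exc.msg}"))
            continue
        if not isinstance(record, dict):
            problems.append((lineno, "record is not a JSON object"))
        elif key not in record:
            problems.append((lineno, f"missing key {key!r}; found {sorted(record)}"))
    return problems


if __name__ == "__main__":
    # Illustrative records only; replace with the lines of the real data file.
    sample = [
        '{"text": "a document with the expected field"}',
        '{"content": "a document stored under another field"}',
    ]
    for lineno, reason in find_records_missing_key(sample):
        print(f"line {lineno}: {reason}")
```

To check a real file, pass an open file handle instead of the sample list, for example `find_records_missing_key(open(path, encoding="utf-8"))`, where `path` is your own data file. If every record stores its text under a different field name, renaming that field to `text` would match the key shown in the traceback.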