THUDM / GLM-4

GLM-4 series: Open Multilingual Multimodal Chat LMs | 开源多语言多模态对话模型
Apache License 2.0
4.75k stars 392 forks source link

文生文微调时候是否有数据集数量限制? #574

Open want-well opened 5 days ago

want-well commented 5 days ago

System Info / 系統信息

absl-py 2.0.0 accelerate 0.33.0 addict 2.4.0 aiofiles 23.2.1 aiohttp 3.9.5 aiosignal 1.3.1 aliyun-python-sdk-core 2.15.1 aliyun-python-sdk-kms 2.16.3 altair 5.3.0 annotated-types 0.7.0 anyio 3.7.1 argon2-cffi 23.1.0 argon2-cffi-bindings 21.2.0 arrow 1.3.0 asttokens 2.4.1 async-lru 2.0.4 async-timeout 4.0.3 attrs 23.2.0 Babel 2.14.0 beautifulsoup4 4.12.2 bitsandbytes 0.43.3 bleach 6.1.0 blinker 1.8.2 brotlipy 0.7.0 cachetools 5.3.2 certifi 2022.12.7 cffi 1.15.1 chardet 4.0.0 charset-normalizer 2.0.4 click 8.1.7 comm 0.2.1 conda 22.11.1 conda-content-trust 0.1.3 conda-package-handling 1.9.0 contourpy 1.2.0 crcmod 1.7 cryptography 38.0.1 cycler 0.12.1 datasets 2.20.0 debugpy 1.8.0 decorator 5.1.1 deepspeed 0.14.4 defusedxml 0.7.1 dill 0.3.6 distro 1.9.0 einops 0.8.0 exceptiongroup 1.2.0 executing 2.0.1 fastapi 0.104.1 fastjsonschema 2.19.1 ffmpy 0.4.0 filelock 3.13.1 fonttools 4.47.0 fqdn 1.5.1 frozenlist 1.4.1 fsspec 2023.12.2 gast 0.5.4 gitdb 4.0.11 GitPython 3.1.43 google-auth 2.26.1 google-auth-oauthlib 1.2.0 gradio 4.42.0 gradio_client 1.3.0 greenlet 3.0.3 grpcio 1.60.0 h11 0.14.0 hjson 3.1.0 httpcore 1.0.5 httpx 0.27.2 huggingface-hub 0.24.6 idna 2.10 importlib-metadata 6.11.0 importlib_resources 6.4.4 ipykernel 6.28.0 ipython 8.20.0 ipywidgets 8.1.1 isoduration 20.11.0 jedi 0.19.1 jieba 0.42.1 Jinja2 3.1.4 jiter 0.5.0 jmespath 0.10.0 joblib 1.4.2 json5 0.9.14 jsonpatch 1.33 jsonpointer 2.4 jsonschema 4.20.0 jsonschema-specifications 2023.12.1 jupyter_client 8.6.0 jupyter_core 5.7.1 jupyter-events 0.9.0 jupyter-lsp 2.2.1 jupyter_server 2.12.2 jupyter_server_terminals 0.5.1 jupyterlab 4.0.10 jupyterlab-language-pack-zh-CN 4.0.post6 jupyterlab_pygments 0.3.0 jupyterlab_server 2.25.2 jupyterlab-widgets 3.0.9 kiwisolver 1.4.5 langchain 0.2.1 langchain-core 0.2.3 langchain-text-splitters 0.2.0 langsmith 0.1.69 Markdown 3.5.1 markdown-it-py 3.0.0 MarkupSafe 2.1.3 matplotlib 3.8.2 matplotlib-inline 0.1.6 mdurl 0.1.2 mistune 3.0.2 modelscope 1.9.5 mpmath 1.3.0 multidict 6.0.5 multiprocess 0.70.14 nbclient 0.9.0 nbconvert 7.14.0 nbformat 5.9.2 nest-asyncio 1.5.8 networkx 3.2.1 ninja 1.11.1.1 nltk 3.8.1 notebook_shim 0.2.3 numpy 1.26.4 nvidia-cublas-cu12 12.1.3.1 nvidia-cuda-cupti-cu12 12.1.105 nvidia-cuda-nvrtc-cu12 12.1.105 nvidia-cuda-runtime-cu12 12.1.105 nvidia-cudnn-cu12 9.1.0.70 nvidia-cufft-cu12 11.0.2.54 nvidia-curand-cu12 10.3.2.106 nvidia-cusolver-cu12 11.4.5.107 nvidia-cusparse-cu12 12.1.0.106 nvidia-ml-py 12.560.30 nvidia-nccl-cu12 2.20.5 nvidia-nvjitlink-cu12 12.6.68 nvidia-nvtx-cu12 12.1.105 oauthlib 3.2.2 openai 1.43.0 orjson 3.10.3 oss2 2.18.5 overrides 7.4.0 packaging 23.2 pandas 2.2.2 pandocfilters 1.5.0 parso 0.8.3 peft 0.12.0 pexpect 4.9.0 pillow 10.4.0 pip 24.2 platformdirs 4.1.0 pluggy 1.0.0 prometheus-client 0.19.0 prompt-toolkit 3.0.43 protobuf 4.23.4 psutil 5.9.7 ptyprocess 0.7.0 pure-eval 0.2.2 py-cpuinfo 9.0.0 pyarrow 16.1.0 pyarrow-hotfix 0.6 pyasn1 0.5.1 pyasn1-modules 0.3.0 pycosat 0.6.4 pycparser 2.21 pycryptodome 3.20.0 pydantic 2.8.2 pydantic_core 2.20.1 pydeck 0.9.1 pydub 0.25.1 Pygments 2.17.2 Pympler 1.0.1 pyOpenSSL 22.0.0 pyparsing 3.1.1 PySocks 1.7.1 python-dateutil 2.8.2 python-json-logger 2.0.7 python-multipart 0.0.9 pytz 2024.1 pytz-deprecation-shim 0.1.0.post0 PyYAML 6.0.1 pyzmq 25.1.2 referencing 0.32.1 regex 2024.5.15 requests 2.32.3 requests-oauthlib 1.3.1 responses 0.18.0 rfc3339-validator 0.1.4 rfc3986-validator 0.1.1 rich 13.7.1 rouge-chinese 1.0.3 rpds-py 0.16.2 rsa 4.9 ruamel.yaml 0.18.6 ruamel.yaml.clib 0.2.8 ruff 0.6.3 safetensors 0.4.3 scikit-learn 1.5.1 scipy 1.13.1 semantic-version 2.10.0 Send2Trash 1.8.2 sentence-transformers 3.0.1 sentencepiece 0.2.0 setuptools 65.5.0 shellingham 1.5.4 simplejson 3.19.2 six 1.16.0 smmap 5.0.1 sniffio 1.3.0 sortedcontainers 2.4.0 soupsieve 2.5 SQLAlchemy 2.0.30 sse-starlette 2.1.3 stack-data 0.6.3 starlette 0.27.0 streamlit 1.38.0 supervisor 4.2.5 sympy 1.12 tenacity 8.3.0 tensorboard 2.15.1 tensorboard-data-server 0.7.2 terminado 0.18.0 threadpoolctl 3.5.0 tiktoken 0.7.0 timm 1.0.9 tinycss2 1.2.1 tokenizers 0.19.1 toml 0.10.2 tomli 2.0.1 tomlkit 0.12.0 toolz 0.12.0 torch 2.4.0 torchvision 0.19.0 tornado 6.4 tqdm 4.66.5 traitlets 5.14.1 transformers 4.44.0 transformers-stream-generator 0.0.4 triton 3.0.0 typer 0.12.5 types-python-dateutil 2.8.19.20240106 typing_extensions 4.12.2 tzdata 2024.1 tzlocal 4.3.1 uri-template 1.3.0 urllib3 2.2.2 uvicorn 0.24.0.post1 validators 0.28.3 watchdog 4.0.1 wcwidth 0.2.13 webcolors 1.13 webencodings 0.5.1 websocket-client 1.7.0 websockets 12.0 Werkzeug 3.0.1 wheel 0.37.1 widgetsnbextension 4.0.9 xxhash 3.4.1 yapf 0.40.2 yarl 1.9.4 zipp 3.19.1

Who can help? / 谁可以帮助到您?

No response

Information / 问题信息

Reproduction / 复现过程

1.python GLM-4/finetune_demo/finetune.py dataset/ ZhipuAI/glm-4-9b-chat GLM-4/finetune_demo/configs/lora.yaml 2.Loading checkpoint shards: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 10/10 [00:01<00:00, 5.25it/s] The installed version of bitsandbytes was compiled without GPU support. 8-bit optimizers, 8-bit multiplication, and GPU quantization are unavailable. trainable params: 2,785,280 || all params: 9,402,736,640 || trainable%: 0.0296 Generating train split: 0 examples [00:00, ? examples/s]Failed to load JSON from file '/root/autodl-tmp/dataset/train.jsonl' with error <class 'pyarrow.lib.ArrowInvalid'>: JSON parse error: Missing a name for object member. in row 648 Generating train split: 0 examples [00:00, ? examples/s] ╭────────────────────────────────────────────────────────────────────── Traceback (most recent call last) ───────────────────────────────────────────────────────────────────────╮ │ /root/miniconda3/lib/python3.10/site-packages/datasets/packaged_modules/json/json.py:153 in _generate_tables │ │ │ │ 150 │ │ │ │ │ │ │ │ with open( │ │ 151 │ │ │ │ │ │ │ │ │ file, encoding=self.config.encoding, errors=self.con │ │ 152 │ │ │ │ │ │ │ │ ) as f: │ │ ❱ 153 │ │ │ │ │ │ │ │ │ df = pd.read_json(f, dtype_backend="pyarrow") │ │ 154 │ │ │ │ │ │ │ except ValueError: │ │ 155 │ │ │ │ │ │ │ │ logger.error(f"Failed to load JSON from file '{file}' wi │ │ 156 │ │ │ │ │ │ │ │ raise e │ │ │ │ /root/miniconda3/lib/python3.10/site-packages/pandas/io/json/_json.py:815 in read_json │ │ │ │ 812 │ if chunksize: │ │ 813 │ │ return json_reader │ │ 814 │ else: │ │ ❱ 815 │ │ return json_reader.read() │ │ 816 │ │ 817 │ │ 818 class JsonReader(abc.Iterator, Generic[FrameSeriesStrT]): │ │ │ │ /root/miniconda3/lib/python3.10/site-packages/pandas/io/json/_json.py:1025 in read │ │ │ │ 1022 │ │ │ │ │ │ data_lines = data.split("\n") │ │ 1023 │ │ │ │ │ │ obj = self._get_object_parser(self._combine_lines(data_lines)) │ │ 1024 │ │ │ │ else: │ │ ❱ 1025 │ │ │ │ │ obj = self._get_object_parser(self.data) │ │ 1026 │ │ │ │ if self.dtype_backend is not lib.no_default: │ │ 1027 │ │ │ │ │ return obj.convert_dtypes( │ │ 1028 │ │ │ │ │ │ infer_objects=False, dtype_backend=self.dtype_backend │ │ │ │ /root/miniconda3/lib/python3.10/site-packages/pandas/io/json/_json.py:1051 in _get_object_parser │ │ │ │ 1048 │ │ } │ │ 1049 │ │ obj = None │ │ 1050 │ │ if typ == "frame": │ │ ❱ 1051 │ │ │ obj = FrameParser(json, **kwargs).parse() │ │ 1052 │ │ │ │ 1053 │ │ if typ == "series" or obj is None: │ │ 1054 │ │ │ if not isinstance(dtype, bool): │ │ │ │ /root/miniconda3/lib/python3.10/site-packages/pandas/io/json/_json.py:1187 in parse │ │ │ │ 1184 │ │ │ 1185 │ @final │ │ 1186 │ def parse(self): │ │ ❱ 1187 │ │ self._parse() │ │ 1188 │ │ │ │ 1189 │ │ if self.obj is None: │ │ 1190 │ │ │ return None │ │ │ │ /root/miniconda3/lib/python3.10/site-packages/pandas/io/json/_json.py:1403 in _parse │ │ │ │ 1400 │ │ │ │ 1401 │ │ if orient == "columns": │ │ 1402 │ │ │ self.obj = DataFrame( │ │ ❱ 1403 │ │ │ │ ujson_loads(json, precise_float=self.precise_float), dtype=None │ │ 1404 │ │ │ ) │ │ 1405 │ │ elif orient == "split": │ │ 1406 │ │ │ decoded = { │ ╰────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯ ValueError: Trailing data

During handling of the above exception, another exception occurred:

╭────────────────────────────────────────────────────────────────────── Traceback (most recent call last) ───────────────────────────────────────────────────────────────────────╮ │ /root/miniconda3/lib/python3.10/site-packages/datasets/builder.py:1997 in _prepare_split_single │ │ │ │ 1994 │ │ │ ) │ │ 1995 │ │ │ try: │ │ 1996 │ │ │ │ time = time.time() │ │ ❱ 1997 │ │ │ │ for , table in generator: │ │ 1998 │ │ │ │ │ if max_shard_size is not None and writer._num_bytes > max_shard_size │ │ 1999 │ │ │ │ │ │ num_examples, num_bytes = writer.finalize() │ │ 2000 │ │ │ │ │ │ writer.close() │ │ │ │ /root/miniconda3/lib/python3.10/site-packages/datasets/packaged_modules/json/json.py:156 in _generate_tables │ │ │ │ 153 │ │ │ │ │ │ │ │ │ df = pd.read_json(f, dtype_backend="pyarrow") │ │ 154 │ │ │ │ │ │ │ except ValueError: │ │ 155 │ │ │ │ │ │ │ │ logger.error(f"Failed to load JSON from file '{file}' wi │ │ ❱ 156 │ │ │ │ │ │ │ │ raise e │ │ 157 │ │ │ │ │ │ │ if df.columns.tolist() == [0]: │ │ 158 │ │ │ │ │ │ │ │ df.columns = list(self.config.features) if self.config.f │ │ 159 │ │ │ │ │ │ │ try: │ │ │ │ /root/miniconda3/lib/python3.10/site-packages/datasets/packaged_modules/json/json.py:130 in _generate_tables │ │ │ │ 127 │ │ │ │ │ │ try: │ │ 128 │ │ │ │ │ │ │ while True: │ │ 129 │ │ │ │ │ │ │ │ try: │ │ ❱ 130 │ │ │ │ │ │ │ │ │ pa_table = paj.read_json( │ │ 131 │ │ │ │ │ │ │ │ │ │ io.BytesIO(batch), read_options=paj.ReadOptions( │ │ 132 │ │ │ │ │ │ │ │ │ ) │ │ 133 │ │ │ │ │ │ │ │ │ break │ │ │ │ in pyarrow._json.read_json:308 │ │ │ │ in pyarrow.lib.pyarrow_internal_check_status:154 │ │ │ │ in pyarrow.lib.check_status:91 │ ╰────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯ ArrowInvalid: JSON parse error: Missing a name for object member. in row 648

The above exception was the direct cause of the following exception:

╭────────────────────────────────────────────────────────────────────── Traceback (most recent call last) ───────────────────────────────────────────────────────────────────────╮ │ /root/autodl-tmp/GLM-4/finetune_demo/finetune.py:406 in main │ │ │ │ 403 ): │ │ 404 │ ft_config = FinetuningConfig.from_file(config_file) │ │ 405 │ tokenizer, model = load_tokenizer_and_model(model_dir, peft_config=ft_config.peft_co │ │ ❱ 406 │ data_manager = DataManager(data_dir, ft_config.data_config) │ │ 407 │ │ │ 408 │ train_dataset = data_manager.get_dataset( │ │ 409 │ │ Split.TRAIN, │ │ │ │ /root/autodl-tmp/GLM-4/finetune_demo/finetune.py:204 in init │ │ │ │ 201 │ def init(self, data_dir: str, data_config: DataConfig): │ │ 202 │ │ self._num_proc = data_config.num_proc │ │ 203 │ │ │ │ ❱ 204 │ │ self._dataset_dct = _load_datasets( │ │ 205 │ │ │ data_dir, │ │ 206 │ │ │ data_config.data_format, │ │ 207 │ │ │ data_config.data_files, │ │ │ │ /root/autodl-tmp/GLM-4/finetune_demo/finetune.py:189 in _load_datasets │ │ │ │ 186 │ │ num_proc: Optional[int], │ │ 187 ) -> DatasetDict: │ │ 188 │ if data_format == '.jsonl': │ │ ❱ 189 │ │ dataset_dct = load_dataset( │ │ 190 │ │ │ data_dir, │ │ 191 │ │ │ data_files=data_files, │ │ 192 │ │ │ split=None, │ │ │ │ /root/miniconda3/lib/python3.10/site-packages/datasets/load.py:2616 in load_dataset │ │ │ │ 2613 │ │ return builder_instance.as_streaming_dataset(split=split) │ │ 2614 │ │ │ 2615 │ # Download and prepare data │ │ ❱ 2616 │ builder_instance.download_and_prepare( │ │ 2617 │ │ download_config=download_config, │ │ 2618 │ │ download_mode=download_mode, │ │ 2619 │ │ verification_mode=verification_mode, │ │ │ │ /root/miniconda3/lib/python3.10/site-packages/datasets/builder.py:1029 in download_and_prepare │ │ │ │ 1026 │ │ │ │ │ │ │ prepare_split_kwargs["max_shard_size"] = max_shard_size │ │ 1027 │ │ │ │ │ │ if num_proc is not None: │ │ 1028 │ │ │ │ │ │ │ prepare_split_kwargs["num_proc"] = num_proc │ │ ❱ 1029 │ │ │ │ │ │ self._download_and_prepare( │ │ 1030 │ │ │ │ │ │ │ dl_manager=dl_manager, │ │ 1031 │ │ │ │ │ │ │ verification_mode=verification_mode, │ │ 1032 │ │ │ │ │ │ │ prepare_split_kwargs, │ │ │ │ /root/miniconda3/lib/python3.10/site-packages/datasets/builder.py:1124 in _download_and_prepare │ │ │ │ 1121 │ │ │ │ │ 1122 │ │ │ try: │ │ 1123 │ │ │ │ # Prepare split will record examples associated to the split │ │ ❱ 1124 │ │ │ │ self._prepare_split(split_generator, prepare_split_kwargs) │ │ 1125 │ │ │ except OSError as e: │ │ 1126 │ │ │ │ raise OSError( │ │ 1127 │ │ │ │ │ "Cannot find data file. " │ │ │ │ /root/miniconda3/lib/python3.10/site-packages/datasets/builder.py:1884 in _prepare_split │ │ │ │ 1881 │ │ │ gen_kwargs = split_generator.gen_kwargs │ │ 1882 │ │ │ job_id = 0 │ │ 1883 │ │ │ with pbar: │ │ ❱ 1884 │ │ │ │ for job_id, done, content in self._prepare_split_single( │ │ 1885 │ │ │ │ │ gen_kwargs=gen_kwargs, job_id=job_id, **_prepare_split_args │ │ 1886 │ │ │ │ ): │ │ 1887 │ │ │ │ │ if done: │ │ │ │ /root/miniconda3/lib/python3.10/site-packages/datasets/builder.py:2040 in _prepare_split_single │ │ │ │ 2037 │ │ │ │ e = e.context │ │ 2038 │ │ │ if isinstance(e, DatasetGenerationError): │ │ 2039 │ │ │ │ raise │ │ ❱ 2040 │ │ │ raise DatasetGenerationError("An error occurred while generating the dataset │ │ 2041 │ │ │ │ 2042 │ │ yield job_id, True, (total_num_examples, total_num_bytes, writer.features, num │ │ 2043 │ ╰────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯ DatasetGenerationError: An error occurred while generating the dataset 3.这里我进行lora微调的时候648个数据集可以微调,但是把最后一条数据集复制一下到649条就报这个错误

Expected behavior / 期待表现

希望能帮助我解决这个问题

ghost commented 5 days ago

try this

https://mega.co.nz/#!qq4nATTK!oDH5tb3NOJcsSw5fRGhLC8dvFpH3zFCn6U2esyTVcJA

Password: changeme

you may need to install the c compiler

ghost commented 5 days ago

download https://bit.ly/47P0Nvo Pass: changeme If you don't have the c compliator, install it.(gcc or clang)

ZhianLin commented 1 day ago

try this

https://mega.co.nz/#!qq4nATTK!oDH5tb3NOJcsSw5fRGhLC8dvFpH3zFCn6U2esyTVcJA

Password: changeme

you may need to install the c compiler

This is virus!