THUDM / ChatGLM2-6B

ChatGLM2-6B: An Open Bilingual Chat LLM | 开源双语对话语言模型

Running train.sh for fine-tuning throws an error #592

Open · TzyTman opened this issue 1 year ago

TzyTman commented 1 year ago

### Is there an existing issue for this?

### Current Behavior

```
Loading checkpoint shards: 100%|█████████████████████████████████████████████████████████████████████████████████████| 7/7 [00:04<00:00, 1.45it/s]
[INFO|modeling_utils.py:3295] 2023-10-19 06:50:00,157 >> All model checkpoint weights were used when initializing ChatGLMForConditionalGeneration.
[WARNING|modeling_utils.py:3297] 2023-10-19 06:50:00,157 >> Some weights of ChatGLMForConditionalGeneration were not initialized from the model checkpoint at /home/kings/ChatGLM and are newly initialized: ['transformer.prefix_encoder.embedding.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
[INFO|modeling_utils.py:2927] 2023-10-19 06:50:00,158 >> Generation config file not found, using a generation config created from the model config.
Quantized to 4 bit
Traceback (most recent call last):
  File "/home/kings/ChatGLM/ptuning/main.py", line 411, in <module>
    main()
  File "/home/kings/ChatGLM/ptuning/main.py", line 127, in main
    model = model.quantize(model_args.quantization_bit)
  File "/home/kings/.cache/huggingface/modules/transformers_modules/ChatGLM/modeling_chatglm.py", line 1191, in quantize
    self.transformer.encoder = quantize(self.transformer.encoder, bits, empty_init=empty_init, device=device,
  File "/home/kings/.cache/huggingface/modules/transformers_modules/ChatGLM/quantization.py", line 155, in quantize
    layer.self_attention.query_key_value = QuantizedLinear(
  File "/home/kings/.cache/huggingface/modules/transformers_modules/ChatGLM/quantization.py", line 139, in __init__
    self.weight = compress_int4_weight(self.weight)
  File "/home/kings/.cache/huggingface/modules/transformers_modules/ChatGLM/quantization.py", line 78, in compress_int4_weight
    kernels.int4WeightCompression(
  File "/home/kings/anaconda3/envs/chatglm2/lib/python3.10/site-packages/cpm_kernels/kernels/base.py", line 48, in __call__
    func = self._prepare_func()
  File "/home/kings/anaconda3/envs/chatglm2/lib/python3.10/site-packages/cpm_kernels/kernels/base.py", line 36, in _prepare_func
    curr_device = cudart.cudaGetDevice()
  File "/home/kings/anaconda3/envs/chatglm2/lib/python3.10/site-packages/cpm_kernels/library/base.py", line 72, in wrapper
    raise RuntimeError("Library %s is not initialized" % self.__name)
RuntimeError: Library cudart is not initialized
[2023-10-19 06:50:02,731] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: 1) local_rank: 0 (pid: 995904) of binary: /home/kings/anaconda3/envs/chatglm2/bin/python3.10
Traceback (most recent call last):
  File "/home/kings/anaconda3/envs/chatglm2/bin/torchrun", line 8, in <module>
    sys.exit(main())
  File "/home/kings/anaconda3/envs/chatglm2/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
    return f(*args, **kwargs)
  File "/home/kings/anaconda3/envs/chatglm2/lib/python3.10/site-packages/torch/distributed/run.py", line 806, in main
    run(args)
  File "/home/kings/anaconda3/envs/chatglm2/lib/python3.10/site-packages/torch/distributed/run.py", line 797, in run
    elastic_launch(
  File "/home/kings/anaconda3/envs/chatglm2/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 134, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/kings/anaconda3/envs/chatglm2/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 264, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:

============================================================
main.py FAILED
------------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2023-10-19_06:50:02
  host      : my071
  rank      : 0 (local_rank: 0)
  exitcode  : 1 (pid: 995904)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
```

### Expected Behavior

The fine-tuning run completes successfully.

### Steps To Reproduce

Run `bash train.sh` under `ptuning/`; `torchrun` launches `main.py`, which quantizes the model to 4 bit and then crashes with the exact log shown under Current Behavior. A minimal Python reproduction of the failing call is sketched below.
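The failure can be triggered without `torchrun` or the training loop. A minimal sketch, assuming the local checkpoint path from the log (`/home/kings/ChatGLM`) and the usual `trust_remote_code` loading flow for ChatGLM2-6B:

```python
from transformers import AutoModel

# Load the local ChatGLM2-6B checkpoint together with its bundled
# modeling code (modeling_chatglm.py / quantization.py).
model = AutoModel.from_pretrained("/home/kings/ChatGLM", trust_remote_code=True)

# quantize(4) swaps the linear layers for QuantizedLinear, whose weights
# are packed by cpm_kernels' int4WeightCompression kernel; this is the
# call that raises "RuntimeError: Library cudart is not initialized".
model = model.quantize(4)
```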
"/home/kings/anaconda3/envs/chatglm2/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 134, in __call__ return launch_agent(self._config, self._entrypoint, list(args)) File "/home/kings/anaconda3/envs/chatglm2/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 264, in launch_agent raise ChildFailedError( torch.distributed.elastic.multiprocessing.errors.ChildFailedError: ============================================================ main.py FAILED ------------------------------------------------------------ Failures: ------------------------------------------------------------ Root Cause (first observed failure): [0]: time : 2023-10-19_06:50:02 host : my071 rank : 0 (local_rank: 0) exitcode : 1 (pid: 995904) error_file: traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html ============================================================ ### Environment ```markdown Linux ``` ### Anything else? 无