THUDM / GLM-130B

GLM-130B: An Open Bilingual Pre-Trained Model (ICLR 2023)
Apache License 2.0

Problems with int4 quantization #160

Open chensiyao12 opened 1 year ago

chensiyao12 commented 1 year ago

CPU memory: 256 GB; GPUs: 6× RTX 3090

WARNING:torch.distributed.run:
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.

Please install apex to use fused_layer_norm, fall back to torch.nn.LayerNorm   (×4, once per rank)
Please install apex to use FusedScaleMaskSoftmax, otherwise the inference efficiency will be greatly reduced   (×4, once per rank)
WARNING: No training data specified   (×4, once per rank)
using world size: 4 and model-parallel size: 4
padded vocab (size: 150528) with 0 dummy tokens (new size: 150528)
initializing model parallel with size 4
Set tokenizer as a icetk-glm-130B tokenizer! Now you can get_tokenizer() everywhere.
global rank 3 is loading checkpoint glm130b_t4/49300/mp_rank_03_model_states.pt
global rank 2 is loading checkpoint glm130b_t4/49300/mp_rank_02_model_states.pt
global rank 0 is loading checkpoint glm130b_t4/49300/mp_rank_00_model_states.pt
global rank 1 is loading checkpoint glm130b_model/glm130b_t4/49300/mp_rank_01_model_states.pt

WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 18293 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 18294 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 18295 closing signal SIGTERM
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: -9) local_rank: 0 (pid: 18292) of binary: /usr/local/bin/python3.10
Fatal Python error: Segmentation fault

Current thread 0x00007f7a9e4d5280 (most recent call first):
  File "/usr/local/lib/python3.10/site-packages/torch/distributed/elastic/utils/store.py", line 31 in get_all
  File "/usr/local/lib/python3.10/site-packages/torch/distributed/elastic/utils/store.py", line 53 in synchronize
  File "/usr/local/lib/python3.10/site-packages/torch/distributed/elastic/utils/store.py", line 67 in barrier
  File "/usr/local/lib/python3.10/site-packages/torch/distributed/elastic/agent/server/api.py", line 906 in _exit_barrier
  File "/usr/local/lib/python3.10/site-packages/torch/distributed/elastic/agent/server/api.py", line 877 in _invoke_run
  File "/usr/local/lib/python3.10/site-packages/torch/distributed/elastic/agent/server/api.py", line 709 in run
  File "/usr/local/lib/python3.10/site-packages/torch/distributed/elastic/metrics/api.py", line 125 in wrapper
  File "/usr/local/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 236 in launch_agent
  File "/usr/local/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 131 in __call__
  File "/usr/local/lib/python3.10/site-packages/torch/distributed/run.py", line 715 in run
  File "/usr/local/lib/python3.10/site-packages/torch/distributed/run.py", line 724 in main
  File "/usr/local/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345 in wrapper
  File "/usr/local/bin/torchrun", line 8 in <module>

Extension modules: backports.lzma._lzma, torch._C, torch._C._fft, torch._C._linalg, torch._C._nn, torch._C._sparse, torch._C._special, numpy.core._multiarray_umath, numpy.core._multiarray_tests, numpy.linalg._umath_linalg, numpy.fft._pocketfft_internal, numpy.random._common, numpy.random.bit_generator, numpy.random._bounded_integers, numpy.random._mt19937, numpy.random.mtrand, numpy.random._philox, numpy.random._pcg64, numpy.random._sfc64, numpy.random._generator (total: 20)

./scripts/generate.sh: line 38: 18226 Segmentation fault      (core dumped)
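Note: `exitcode: -9` means local rank 0 was killed by SIGKILL, which on Linux is most often the kernel OOM killer firing while all four ranks deserialize their checkpoint shards into CPU RAM at the same time; the trailing "Segmentation fault" from generate.sh is just the launcher dying afterwards. A rough pre-flight check along the following lines can show whether 256 GB of host RAM actually leaves enough headroom. This is only a sketch: the checkpoint directory is copied from the log above, `psutil` is a third-party package (not part of this repo), and the 1.5× headroom factor is a guessed rule of thumb.

```python
import glob
import os

import psutil  # third-party: pip install psutil

# Directory taken from the log above; adjust to wherever your shards live.
CHECKPOINT_DIR = "glm130b_t4/49300"

# Total on-disk size of the model-parallel shards the ranks will load.
shard_paths = glob.glob(os.path.join(CHECKPOINT_DIR, "mp_rank_*_model_states.pt"))
shard_bytes = sum(os.path.getsize(p) for p in shard_paths)

# Host RAM currently available to new allocations.
free_bytes = psutil.virtual_memory().available

print(f"checkpoint shards on disk: {shard_bytes / 2**30:.1f} GiB ({len(shard_paths)} files)")
print(f"available host RAM:        {free_bytes / 2**30:.1f} GiB")

# Assumed rule of thumb: loading plus on-the-fly int4 quantization needs
# noticeably more than the on-disk size, so demand generous headroom.
if shard_bytes * 1.5 > free_bytes:
    print("Loading all ranks at once will likely be OOM-killed (exitcode -9).")
```

If the headroom looks tight, freeing host memory or quantizing the checkpoint offline before serving (so a full fp16 shard and its quantized copy never coexist in RAM) is worth trying.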

wenshuop commented 1 year ago

Did you solve it?

chensiyao12 commented 1 year ago

No. Did you run into this problem too?

wenshuop commented 1 year ago

Yes, it's very frustrating. Increasing the virtual machine's memory didn't help either.
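If adding memory does not change the `exitcode: -9`, it may be worth confirming whether the kernel OOM killer is still the one terminating the ranks before looking elsewhere. A minimal check, assuming a Linux host where the kernel log is readable (running `dmesg` may require root, and the exact wording of the OOM messages varies by kernel):

```python
import subprocess

# Dump the kernel ring buffer and look for traces of the OOM killer.
# (May need elevated privileges on some systems.)
log = subprocess.run(["dmesg"], capture_output=True, text=True, check=False).stdout

for line in log.splitlines():
    lowered = line.lower()
    if "oom" in lowered or "out of memory" in lowered:
        print(line)
```

If nothing OOM-related shows up around the time of the crash, the SIGKILL is coming from somewhere else (for example a container or cgroup memory limit), which narrows things down.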

GXKIM commented 1 year ago

(quotes the hardware details and the full log output from the original report above)

If it's convenient, could you share the model? The files provided for the 130B download only let me get part of the weights.
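On the partial-download question: before launching, it may also help to confirm that every model-parallel shard is present and not truncated. Note, too, that in the log above rank 1 loads from glm130b_model/glm130b_t4/... while the other ranks load from glm130b_t4/..., which is worth double-checking. A small consistency check (a sketch only; the directory and the expected shard count of 4 come from the log, adjust both to your own setup):

```python
import glob
import os

CHECKPOINT_DIR = "glm130b_t4/49300"  # example path from the log above
EXPECTED_SHARDS = 4                  # matches "model-parallel size: 4" in the log

shard_paths = sorted(glob.glob(os.path.join(CHECKPOINT_DIR, "mp_rank_*_model_states.pt")))
sizes = {p: os.path.getsize(p) for p in shard_paths}

print(f"found {len(shard_paths)} shard(s), expected {EXPECTED_SHARDS}:")
for path, size in sizes.items():
    print(f"  {path}: {size / 2**30:.2f} GiB")

# Shards of one tensor-parallel checkpoint are normally close in size,
# so a clear outlier usually points at a truncated or incomplete download.
if sizes and max(sizes.values()) > 1.05 * min(sizes.values()):
    print("Shard sizes differ noticeably; re-download the smallest one(s).")
```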

wei-potato commented 1 year ago

I ran into this too. Has anyone solved it?

GXKIM commented 1 year ago

I ran into this too. Has anyone solved it?

I solved it.

wei-potato commented 1 year ago

Could you explain how you solved it?

wenshuop commented 1 year ago

Please share how you did it.

rchanggogogo commented 1 year ago

I ran into this too. Has anyone solved it?

I solved it.

Could you share the quantized program?

chensiyao12 commented 1 year ago

How did you solve it?