xxuyyuan commented 1 year ago

(base) root@6633711ec9b0:/home/data/VisualGLM-6B# bash finetune/finetune_visualglm_qlora.sh NCCL_DEBUG=info NCCL_IB_DISABLE=0 NCCL_NET_GDR_LEVEL=2 deepspeed --master_port 16666 --include localhost:0 --hostfile hostfile_single finetune_visualglm.py --experiment-name finetune-visualglm-6b --model-parallel-size 1 --mode finetune --train-iters 300 --resume-dataloader --max_source_length 64 --max_target_length 256 --lora_rank 10 --layer_range 0 14 --pre_seq_len 4 --train-data ./fewshot-data/dataset.json --valid-data ./fewshot-data/dataset.json --distributed-backend nccl --lr-decay-style cosine --warmup .02 --checkpoint-activations --save-interval 300 --eval-interval 10000 --save ./checkpoints --split 1 --eval-iters 10 --eval-batch-size 8 --zero-stage 1 --lr 0.0001 --batch-size 8 --gradient-accumulation-steps 4 --skip-init --fp16 --use_qlora Setting ds_accelerator to cuda (auto detect) [2023-06-12 06:33:33,961] [WARNING] [runner.py:196:fetch_hostfile] Unable to find hostfile, will proceed with training with local resources only. [2023-06-12 06:33:34,032] [INFO] [runner.py:555:main] cmd = /opt/conda/bin/python -u -m deepspeed.launcher.launch --world_info=eyJsb2NhbGhvc3QiOiBbMF19 --master_addr=127.0.0.1 --master_port=16666 --enable_each_rank_log=None finetune_visualglm.py --experiment-name finetune-visualglm-6b --model-parallel-size 1 --mode finetune --train-iters 300 --resume-dataloader --max_source_length 64 --max_target_length 256 --lora_rank 10 --layer_range 0 14 --pre_seq_len 4 --train-data ./fewshot-data/dataset.json --valid-data ./fewshot-data/dataset.json --distributed-backend nccl --lr-decay-style cosine --warmup .02 --checkpoint-activations --save-interval 300 --eval-interval 10000 --save ./checkpoints --split 1 --eval-iters 10 --eval-batch-size 8 --zero-stage 1 --lr 0.0001 --batch-size 8 --gradient-accumulation-steps 4 --skip-init --fp16 --use_qlora Setting ds_accelerator to cuda (auto detect) [2023-06-12 06:33:35,958] [INFO] [launch.py:138:main] 0 NCCL_DEBUG=info [2023-06-12 06:33:35,958] [INFO] [launch.py:138:main] 0 NCCL_NET_GDR_LEVEL=2 [2023-06-12 06:33:35,958] [INFO] [launch.py:138:main] 0 NCCL_IB_DISABLE=0 [2023-06-12 06:33:35,958] [INFO] [launch.py:138:main] 0 NV_LIBNCCL_DEV_PACKAGE=libnccl-dev=2.12.10-1+cuda11.6 [2023-06-12 06:33:35,959] [INFO] [launch.py:138:main] 0 NCCL_VERSION=2.12.10-1 [2023-06-12 06:33:35,959] [INFO] [launch.py:138:main] 0 NV_LIBNCCL_PACKAGE_VERSION=2.12.10-1 [2023-06-12 06:33:35,959] [INFO] [launch.py:138:main] 0 NV_LIBNCCL_PACKAGE=libnccl2=2.12.10-1+cuda11.6 [2023-06-12 06:33:35,959] [INFO] [launch.py:138:main] 0 NV_LIBNCCL_DEV_PACKAGE_NAME=libnccl-dev [2023-06-12 06:33:35,959] [INFO] [launch.py:138:main] 0 NV_LIBNCCL_PACKAGE_NAME=libnccl2 [2023-06-12 06:33:35,959] [INFO] [launch.py:138:main] 0 NV_LIBNCCL_DEV_PACKAGE_VERSION=2.12.10-1 [2023-06-12 06:33:35,959] [INFO] [launch.py:145:main] WORLD INFO DICT: {'localhost': [0]} [2023-06-12 06:33:35,959] [INFO] [launch.py:151:main] nnodes=1, num_local_procs=1, node_rank=0 [2023-06-12 06:33:35,959] [INFO] [launch.py:162:main] global_rank_mapping=defaultdict(<class 'list'>, {'localhost': [0]}) [2023-06-12 06:33:35,959] [INFO] [launch.py:163:main] dist_world_size=1 [2023-06-12 06:33:35,959] [INFO] [launch.py:165:main] Setting CUDA_VISIBLE_DEVICES=0 Setting ds_accelerator to cuda (auto detect)

===================================BUG REPORT=================================== Welcome to bitsandbytes. For bug reports, please run

python -m bitsandbytes

issues

bin /opt/conda/lib/python3.10/site-packages/bitsandbytes/libbitsandbytes_cuda116.so /opt/conda/lib/python3.10/site-packages/bitsandbytes/cuda_setup/main.py:149: UserWarning: Found duplicate ['libcudart.so', 'libcudart.so.11.0', 'libcudart.so.12.0'] files: {PosixPath('/opt/conda/lib/libcudart.so'), PosixPath('/opt/conda/lib/libcudart.so.11.0')}.. We'll flip a coin and try one of these, in order to fail forward. Either way, this might cause trouble in the future: If you get CUDA error: invalid device function errors, the above might be the cause and the solution is to make sure only one ['libcudart.so', 'libcudart.so.11.0', 'libcudart.so.12.0'] in the paths that we search based on your env. warn(msg) CUDA SETUP: CUDA runtime path found: /opt/conda/lib/libcudart.so CUDA SETUP: Highest compute capability among GPUs detected: 8.0 CUDA SETUP: Detected CUDA version 116 CUDA SETUP: Loading binary /opt/conda/lib/python3.10/site-packages/bitsandbytes/libbitsandbytes_cuda116.so... [2023-06-12 06:33:39,415] [INFO] using world size: 1 and model-parallel size: 1 [2023-06-12 06:33:39,415] [INFO] > padded vocab (size: 100) with 28 dummy tokens (new size: 128) 16666 [2023-06-12 06:33:39,417] [INFO] [RANK 0] > initializing model parallel with size 1 [2023-06-12 06:33:39,418] [WARNING] [comm.py:152:init_deepspeed_backend] NCCL backend in DeepSpeed not yet implemented [2023-06-12 06:33:39,418] [INFO] [comm.py:594:init_distributed] cdb=None [2023-06-12 06:33:39,418] [WARNING] [config_utils.py:69:_process_deprecated_field] Config parameter cpu_offload is deprecated use offload_optimizer instead [2023-06-12 06:33:39,418] [INFO] [checkpointing.py:764:_configure_using_config_file] {'partition_activations': False, 'contiguous_memory_optimization': False, 'cpu_checkpointing': False, 'number_checkpoints': None, 'synchronize_checkpoint_boundary': False, 'profile': False} [2023-06-12 06:33:39,419] [INFO] [checkpointing.py:226:model_parallel_cuda_manual_seed] > initializing model parallel cuda seeds on global rank 0, model parallel rank 0, and data parallel rank 0 with model parallel seed: 3952 and data parallel seed: 1234 [2023-06-12 06:33:39,419] [INFO] [RANK 0] building FineTuneVisualGLMModel model ... /opt/conda/lib/python3.10/site-packages/torch/nn/init.py:405: UserWarning: Initializing zero-element tensors is a no-op warnings.warn("Initializing zero-element tensors is a no-op") replacing layer 0 attention with lora replacing layer 14 attention with lora replacing chatglm linear layer with 4bit [2023-06-12 06:34:26,500] [INFO] [RANK 0] > number of parameters on model parallel rank 0: 7811237376 [2023-06-12 06:34:30,185] [INFO] [RANK 0] global rank 0 is loading checkpoint /root/.sat_models/visualglm-6b/1/mp_rank_00_model_states.pt Traceback (most recent call last): File "/home/data/VisualGLM-6B/finetune_visualglm.py", line 180, in model, args = FineTuneVisualGLMModel.from_pretrained(model_type, args) File "/opt/conda/lib/python3.10/site-packages/sat/model/base_model.py", line 216, in from_pretrained load_checkpoint(model, args, load_path=model_path, prefix=prefix) File "/opt/conda/lib/python3.10/site-packages/sat/training/model_io.py", line 208, in load_checkpoint missing_keys, unexpected_keys = module.load_state_dict(sd['module'], strict=False) File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1657, in load_state_dict load(self, state_dict) File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1645, in load load(child, child_state_dict, child_prefix) File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1645, in load load(child, child_state_dict, child_prefix) File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1645, in load load(child, child_state_dict, child_prefix) [Previous line repeated 2 more times] File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1639, in load module._load_from_state_dict( File "/home/data/VisualGLM-6B/lora_mixin.py", line 109, in _load_from_state_dict self.original._load_from_state_dict(state_dict, prefix, local_metadata, strict, missing_keys, unexpected_keys, error_msgs) File "/home/data/VisualGLM-6B/lora_mixin.py", line 47, in _load_from_statedict self.weight.data.copy(state_dict[prefix+'weight']) RuntimeError: The size of tensor a (25165824) must match the size of tensor b (12288) at non-singleton dimension 0 [2023-06-12 06:34:36,019] [INFO] [launch.py:314:sigkill_handler] Killing subprocess 8346 [2023-06-12 06:34:36,019] [ERROR] [launch.py:320:sigkill_handler] ['/opt/conda/bin/python', '-u', 'finetune_visualglm.py', '--local_rank=0', '--experiment-name', 'finetune-visualglm-6b', '--model-parallel-size', '1', '--mode', 'finetune', '--train-iters', '300', '--resume-dataloader', '--max_source_length', '64', '--max_target_length', '256', '--lora_rank', '10', '--layer_range', '0', '14', '--pre_seq_len', '4', '--train-data', './fewshot-data/dataset.json', '--valid-data', './fewshot-data/dataset.json', '--distributed-backend', 'nccl', '--lr-decay-style', 'cosine', '--warmup', '.02', '--checkpoint-activations', '--save-interval', '300', '--eval-interval', '10000', '--save', './checkpoints', '--split', '1', '--eval-iters', '10', '--eval-batch-size', '8', '--zero-stage', '1', '--lr', '0.0001', '--batch-size', '8', '--gradient-accumulation-steps', '4', '--skip-init', '--fp16', '--use_qlora'] exits with return code = 1

1049451037 commented 1 year ago

感觉是显卡没配置好，你试一下这段代码你能跑通吗：

from bitsandbytes.nn import LinearNF4
model = LinearNF4(10, 20).cuda()

import torch
x = torch.randn(2, 10).cuda()
out = model(x)

xxuyyuan commented 1 year ago

感觉是显卡没配置好，你试一下这段代码你能跑通吗：
from bitsandbytes.nn import LinearNF4
model = LinearNF4(10, 20).cuda()

import torch
x = torch.randn(2, 10).cuda()
out = model(x)

在脚本里面正常运行； (base) root@6633711ec9b0:/home/data/VisualGLM-6B# python3 Python 3.10.8 (main, Nov 4 2022, 13:48:29) [GCC 11.2.0] on linux Type "help", "copyright", "credits" or "license" for more information.

from bitsandbytes.nn import LinearNF4

===================================BUG REPORT=================================== Welcome to bitsandbytes. For bug reports, please run

python -m bitsandbytes

and submit this information together with your error trace to: https://github.com/TimDettmers/bitsandbytes/issues

bin /opt/conda/lib/python3.10/site-packages/bitsandbytes/libbitsandbytes_cuda116.so /opt/conda/lib/python3.10/site-packages/bitsandbytes/cuda_setup/main.py:149: UserWarning: Found duplicate ['libcudart.so', 'libcudart.so.11.0', 'libcudart.so.12.0'] files: {PosixPath('/opt/conda/lib/libcudart.so.11.0'), PosixPath('/opt/conda/lib/libcudart.so')}.. We'll flip a coin and try one of these, in order to fail forward. Either way, this might cause trouble in the future: If you get CUDA error: invalid device function errors, the above might be the cause and the solution is to make sure only one ['libcudart.so', 'libcudart.so.11.0', 'libcudart.so.12.0'] in the paths that we search based on your env. warn(msg) CUDA SETUP: CUDA runtime path found: /opt/conda/lib/libcudart.so.11.0 CUDA SETUP: Highest compute capability among GPUs detected: 8.0 CUDA SETUP: Detected CUDA version 116 CUDA SETUP: Loading binary /opt/conda/lib/python3.10/site-packages/bitsandbytes/libbitsandbytes_cuda116.so...

model = LinearNF4(10, 20).cuda() import torch x = torch.randn(2, 10).cuda() out = model(x)

1049451037 commented 1 year ago

我这边是可以跑起来的，你看一下你的代码和VisualGLM-6B的main分支一致吗？是不是没有更新到最新版或者你本地改了什么？以及bitsandbytes是不是0.39.0版本。

xxuyyuan commented 1 year ago

我这边是可以跑起来的，你看一下你的代码和VisualGLM-6B的main分支一致吗？是不是没有更新到最新版或者你本地改了什么？以及bitsandbytes是不是0.39.0版本。

bitsandbytes版本是0.39.0，然后重新更新了代码，跑了一下； 第一次运行出现： AttributeError: 'FakeTokenizer' object has no attribute 'encode' 详情： [2023-06-12 08:15:11,713] [INFO] [RANK 0] building FineTuneVisualGLMModel model ... /opt/conda/lib/python3.10/site-packages/torch/nn/init.py:405: UserWarning: Initializing zero-element tensors is a no-op warnings.warn("Initializing zero-element tensors is a no-op") replacing layer 0 attention with lora replacing layer 14 attention with lora replacing chatglm linear layer with 4bit [2023-06-12 08:15:58,973] [INFO] [RANK 0] > number of parameters on model parallel rank 0: 7811237376 [2023-06-12 08:15:59,738] [INFO] [RANK 0] global rank 0 is loading checkpoint /root/.sat_models/visualglm-6b/1/mp_rank_00_model_states.pt [2023-06-12 08:16:04,555] [INFO] [RANK 0] > successfully loaded /root/.sat_models/visualglm-6b/1/mp_rank_00_model_states.pt [2023-06-12 08:16:07,585] [INFO] [RANK 0] Try to load tokenizer from Huggingface transformers... Explicitly passing a revision is encouraged when loading a model with custom code to ensure no malicious code has been contributed in a newer revision. [2023-06-12 08:32:23,056] [INFO] [RANK 0] Cannot find THUDM/chatglm-6b from Huggingface or sat. Creating a fake tokenizer... Traceback (most recent call last): File "/home/data/VisualGLM-6B/finetune_visualglm.py", line 195, in training_main(args, model_cls=model, forward_step_function=forward_step, create_dataset_function=create_dataset_function, collate_fn=data_collator) File "/opt/conda/lib/python3.10/site-packages/sat/training/deepspeed_training.py", line 67, in training_main train_data, val_data, test_data = make_loaders(args, hooks['create_dataset_function'], collate_fn=collate_fn) File "/opt/conda/lib/python3.10/site-packages/sat/data_utils/configure_data.py", line 197, in make_loaders train = make_dataset(**data_set_args, args=args, dataset_weights=args.train_data_weights, is_train_data=True) File "/opt/conda/lib/python3.10/site-packages/sat/data_utils/configure_data.py", line 124, in make_dataset_full d = create_dataset_function(p, args) File "/home/data/VisualGLM-6B/finetune_visualglm.py", line 161, in create_dataset_function dataset = FewShotDataset(path, image_processor, tokenizer, args) File "/home/data/VisualGLM-6B/finetune_visualglm.py", line 119, in init input0 = tokenizer.encode("", add_special_tokens=False) AttributeError: 'FakeTokenizer' object has no attribute 'encode' [2023-06-12 08:32:24,375] [INFO] [launch.py:314:sigkill_handler] Killing subprocess 11313 [2023-06-12 08:32:24,375] [ERROR] [launch.py:320:sigkill_handler] ['/opt/conda/bin/python', '-u', 'finetune_visualglm.py', '--local_rank=0', '--experiment-name', 'finetune-visualglm-6b', '--model-parallel-size', '1', '--mode', 'finetune', '--train-iters', '300', '--resume-dataloader', '--max_source_length', '64', '--max_target_length', '256', '--lora_rank', '10', '--layer_range', '0', '14', '--pre_seq_len', '4', '--train-data', './fewshot-data/dataset.json', '--valid-data', './fewshot-data/dataset.json', '--distributed-backend', 'nccl', '--lr-decay-style', 'cosine', '--warmup', '.02', '--checkpoint-activations', '--save-interval', '300', '--eval-interval', '10000', '--save', './checkpoints', '--split', '1', '--eval-iters', '10', '--eval-batch-size', '8', '--zero-stage', '1', '--lr', '0.0001', '--batch-size', '1', '--gradient-accumulation-steps', '4', '--skip-init', '--fp16', '--use_qlora'] exits with return code = 1

再次运行： RuntimeError: Error building extension 'fused_adam' 详情： 6633711ec9b0:11642:11786 [0] NCCL INFO Connected all rings 6633711ec9b0:11642:11786 [0] NCCL INFO Connected all trees 6633711ec9b0:11642:11786 [0] NCCL INFO 32 coll channels, 32 p2p channels, 32 p2p channels per peer 6633711ec9b0:11642:11786 [0] NCCL INFO comm 0xb1efe50 rank 0 nranks 1 cudaDev 0 busId 54000 - Init COMPLETE [2023-06-12 08:35:47,241] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed Flops Profiler Enabled: False Using /root/.cache/torch_extensions/py310_cu116 as PyTorch extensions root... Detected CUDA files, patching ldflags Emitting ninja build file /root/.cache/torch_extensions/py310_cu116/fused_adam/build.ninja... Building extension module fused_adam... Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N) [1/2] /opt/conda/bin/nvcc -DTORCH_EXTENSION_NAME=fused_adam -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1011\" -I/opt/conda/lib/python3.10/site-packages/deepspeed/ops/csrc/includes -I/opt/conda/lib/python3.10/site-packages/deepspeed/ops/csrc/adam -isystem /opt/conda/lib/python3.10/site-packages/torch/include -isystem /opt/conda/lib/python3.10/site-packages/torch/include/torch/csrc/api/include -isystem /opt/conda/lib/python3.10/site-packages/torch/include/TH -isystem /opt/conda/lib/python3.10/site-packages/torch/include/THC -isystem /opt/conda/include -isystem /opt/conda/include/python3.10 -D_GLIBCXX_USE_CXX11_ABI=0 -DCUDA_NO_HALF_OPERATORS -DCUDA_NO_HALF_CONVERSIONS -DCUDA_NO_BFLOAT16_CONVERSIONS -DCUDA_NO_HALF2_OPERATORS --expt-relaxed-constexpr -gencode=arch=compute_80,code=compute_80 -gencode=arch=compute_80,code=sm_80 --compiler-options '-fPIC' -O3 -DVERSION_GE_1_1 -DVERSION_GE_1_3 -DVERSION_GE_1_5 -lineinfo --use_fast_math -gencode=arch=compute_80,code=sm_80 -gencode=arch=compute_80,code=compute_80 -DBF16_AVAILABLE -std=c++14 -c /opt/conda/lib/python3.10/site-packages/deepspeed/ops/csrc/adam/multi_tensor_adam.cu -o multi_tensor_adam.cuda.o FAILED: multi_tensor_adam.cuda.o /opt/conda/bin/nvcc -DTORCH_EXTENSION_NAME=fused_adam -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1011\" -I/opt/conda/lib/python3.10/site-packages/deepspeed/ops/csrc/includes -I/opt/conda/lib/python3.10/site-packages/deepspeed/ops/csrc/adam -isystem /opt/conda/lib/python3.10/site-packages/torch/include -isystem /opt/conda/lib/python3.10/site-packages/torch/include/torch/csrc/api/include -isystem /opt/conda/lib/python3.10/site-packages/torch/include/TH -isystem /opt/conda/lib/python3.10/site-packages/torch/include/THC -isystem /opt/conda/include -isystem /opt/conda/include/python3.10 -D_GLIBCXX_USE_CXX11_ABI=0 -DCUDA_NO_HALF_OPERATORS -DCUDA_NO_HALF_CONVERSIONS -DCUDA_NO_BFLOAT16_CONVERSIONS -DCUDA_NO_HALF2_OPERATORS --expt-relaxed-constexpr -gencode=arch=compute_80,code=compute_80 -gencode=arch=compute_80,code=sm_80 --compiler-options '-fPIC' -O3 -DVERSION_GE_1_1 -DVERSION_GE_1_3 -DVERSION_GE_1_5 -lineinfo --use_fast_math -gencode=arch=compute_80,code=sm_80 -gencode=arch=compute_80,code=compute_80 -DBF16_AVAILABLE -std=c++14 -c /opt/conda/lib/python3.10/site-packages/deepspeed/ops/csrc/adam/multi_tensor_adam.cu -o multi_tensor_adam.cuda.o In file included from /opt/conda/lib/python3.10/site-packages/deepspeed/ops/csrc/adam/multi_tensor_adam.cu:13:0: /opt/conda/lib/python3.10/site-packages/torch/include/ATen/cuda/CUDAContext.h:10:10: fatal error: cusolverDn.h: No such file or directory

include

      ^~~~~~~~~~~~~~

compilation terminated. ninja: build stopped: subcommand failed. Traceback (most recent call last): File "/opt/conda/lib/python3.10/site-packages/torch/utils/cpp_extension.py", line 1900, in _run_ninja_build subprocess.run( File "/opt/conda/lib/python3.10/subprocess.py", line 526, in run raise CalledProcessError(retcode, process.args, subprocess.CalledProcessError: Command '['ninja', '-v']' returned non-zero exit status 1.

The above exception was the direct cause of the following exception:

Traceback (most recent call last): File "/home/data/VisualGLM-6B/finetune_visualglm.py", line 195, in training_main(args, model_cls=model, forward_step_function=forward_step, create_dataset_function=create_dataset_function, collate_fn=data_collator) File "/opt/conda/lib/python3.10/site-packages/sat/training/deepspeed_training.py", line 98, in training_main model, optimizer = setup_model_untrainable_params_and_optimizer(args, model) File "/opt/conda/lib/python3.10/site-packages/sat/training/deepspeed_training.py", line 161, in setup_model_untrainable_params_andoptimizer model, optimizer, , _ = deepspeed.initialize( File "/opt/conda/lib/python3.10/site-packages/deepspeed/init.py", line 165, in initialize engine = DeepSpeedEngine(args=args, File "/opt/conda/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 309, in init self._configure_optimizer(optimizer, model_parameters) File "/opt/conda/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 1174, in _configure_optimizer basic_optimizer = self._configure_basic_optimizer(model_parameters) File "/opt/conda/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 1236, in _configure_basic_optimizer optimizer = FusedAdam( File "/opt/conda/lib/python3.10/site-packages/deepspeed/ops/adam/fused_adam.py", line 94, in init fused_adam_cuda = FusedAdamBuilder().load() File "/opt/conda/lib/python3.10/site-packages/deepspeed/ops/op_builder/builder.py", line 454, in load return self.jit_load(verbose) File "/opt/conda/lib/python3.10/site-packages/deepspeed/ops/op_builder/builder.py", line 497, in jit_load op_module = load(name=self.name, File "/opt/conda/lib/python3.10/site-packages/torch/utils/cpp_extension.py", line 1284, in load return _jit_compile( File "/opt/conda/lib/python3.10/site-packages/torch/utils/cpp_extension.py", line 1508, in _jit_compile _write_ninja_file_and_build_library( File "/opt/conda/lib/python3.10/site-packages/torch/utils/cpp_extension.py", line 1623, in _write_ninja_file_and_build_library _run_ninja_build( File "/opt/conda/lib/python3.10/site-packages/torch/utils/cpp_extension.py", line 1916, in _run_ninja_build raise RuntimeError(message) from e RuntimeError: Error building extension 'fused_adam' 6633711ec9b0:11642:11783 [0] NCCL INFO [Service thread] Connection closed by localRank 0 6633711ec9b0:11642:11642 [0] NCCL INFO comm 0xb1cab50 rank 0 nranks 1 cudaDev 0 busId 54000 - Abort COMPLETE 6633711ec9b0:11642:11787 [0] NCCL INFO [Service thread] Connection closed by localRank 0 6633711ec9b0:11642:11642 [0] NCCL INFO comm 0xb1efe50 rank 0 nranks 1 cudaDev 0 busId 54000 - Abort COMPLETE [2023-06-12 08:35:49,912] [INFO] [launch.py:314:sigkill_handler] Killing subprocess 11642 [2023-06-12 08:35:49,912] [ERROR] [launch.py:320:sigkill_handler] ['/opt/conda/bin/python', '-u', 'finetune_visualglm.py', '--local_rank=0', '--experiment-name', 'finetune-visualglm-6b', '--model-parallel-size', '1', '--mode', 'finetune', '--train-iters', '300', '--resume-dataloader', '--max_source_length', '64', '--max_target_length', '256', '--lora_rank', '10', '--layer_range', '0', '14', '--pre_seq_len', '4', '--train-data', './fewshot-data/dataset.json', '--valid-data', './fewshot-data/dataset.json', '--distributed-backend', 'nccl', '--lr-decay-style', 'cosine', '--warmup', '.02', '--checkpoint-activations', '--save-interval', '300', '--eval-interval', '10000', '--save', './checkpoints', '--split', '1', '--eval-iters', '10', '--eval-batch-size', '8', '--zero-stage', '1', '--lr', '0.0001', '--batch-size', '1', '--gradient-accumulation-steps', '4', '--skip-init', '--fp16', '--use_qlora'] exits with return code = 1 (base) root@6633711ec9b0:/home/data/VisualGLM-6B# AttributeError: 'FakeTokenizer' object has no attribute 'encode'RuntimeError: Error building extension 'fused_adam'

cuda环境的问题嘛？来来回回陷入死循环；

1049451037 commented 1 year ago

tokenizer的问题可以参考这里：https://github.com/THUDM/VisualGLM-6B/issues/111#issuecomment-1579019781

xxuyyuan commented 1 year ago

tokenizer的问题可以参考这里：#111 (comment)

tokenzier重新运行是正常；

主要是后面的问题： RuntimeError: Error building extension 'fused_adam'，详情见上面；

1049451037 commented 1 year ago

这个应该是deepspeed配置的问题，有一个类似的issue：https://github.com/THUDM/VisualGLM-6B/issues/43

查了一下可能的解决方案：

apt-get update; apt-get install ninja-build
把cuda版本从10.1升级到10.2（https://github.com/microsoft/DeepSpeed/issues/694）

xxuyyuan commented 1 year ago

我这边是可以跑起来的，你看一下你的代码和VisualGLM-6B的main分支一致吗？是不是没有更新到最新版或者你本地改了什么？以及bitsandbytes是不是0.39.0版本。

修改了 finetune_visualglm.py #176行 args.device = 'cpu' 修改成 args.device = 'cuda'

错误就从RuntimeError: Error building extension 'fused_adam'，变成了维度问题RuntimeError: The size of tensor a (25165824) must match the size of tensor b (12288) at non-singleton dimension 0；

这里的args.device 指的是什么，为啥会这样呢？

xxuyyuan commented 1 year ago

/opt/conda/lib/python3.10/site-packages/torch/include/ATen/cuda/CUDAContext.h:10:10: fatal error: cusolverDn.h: No such file or directory

include

问题解决了，可以训练啦！！主要是cusolverDn.h: No such file or directory 找不到导致；添加环境变量，export PATH=/usr/local/cuda/bin:$PATH

JumpingRain commented 1 year ago

请问这个问题解决了吗，就是维度不一致的问题

JumpingRain commented 1 year ago

我这边是可以跑起来的，你看一下你的代码和VisualGLM-6B的main分支一致吗？是不是没有更新到最新版或者你本地改了什么？以及bitsandbytes是不是0.39.0版本。

修改了 finetune_visualglm.py #176行 args.device = 'cpu' 修改成 args.device = 'cuda'

错误就从RuntimeError: Error building extension 'fused_adam'，变成了维度问题RuntimeError: The size of tensor a (25165824) must match the size of tensor b (12288) at non-singleton dimension 0；

这里的args.device 指的是什么，为啥会这样呢？

请问这个问题解决了吗，维度不一致的问题

xxuyyuan commented 1 year ago

我这边是可以跑起来的，你看一下你的代码和VisualGLM-6B的main分支一致吗？是不是没有更新到最新版或者你本地改了什么？以及bitsandbytes是不是0.39.0版本。

修改了 finetune_visualglm.py #176行 args.device = 'cpu' 修改成 args.device = 'cuda' 错误就从RuntimeError: Error building extension 'fused_adam'，变成了维度问题RuntimeError: The size of tensor a (25165824) must match the size of tensor b (12288) at non-singleton dimension 0；这里的args.device 指的是什么，为啥会这样呢？

请问这个问题解决了吗，维度不一致的问题

这个没有解决，还是改为原来的代码#176行 args.device = 'cpu'这里设置成cpu，然后就是cuda环境问题，RuntimeError: Error building extension 'fused_adam'；通过上面配置环境变量解决了；

1049451037 commented 1 year ago

因为bitsandbytes实现模型量化的时候是通过重载.cuda()函数实现的，也就是说模型在放到显卡的时候会发生量化（改变tensor维度）。在微调的时候，加载的预训练权重是fp16的，所以需要设置args.device='cpu'，把权重加载进来再调用.cuda()。因为这个是bitsandbytes的实现，我们也没办法控制，只能适配。

所以维度不一致是显卡配置的问题，.cuda()调用失败了。