WangRongsheng / XrayGLM

🩺 The first Chinese multimodal medical large model that can read chest X-rays | The first Chinese medical multimodal model for chest radiograph summarization.
Other
912 stars 130 forks

The following error occurs when retraining; please help! #35

Closed yunfei920406 closed 1 year ago

yunfei920406 commented 1 year ago

bash finetune_XrayGLM.sh
NCCL_DEBUG=info NCCL_IB_DISABLE=0 NCCL_NET_GDR_LEVEL=2 deepspeed --master_port 16666 --hostfile hostfile_single finetune_XrayGLM.py --experiment-name finetune-XrayGLM --model-parallel-size 1 --mode finetune --train-iters 300 --resume-dataloader --max_source_length 64 --max_target_length 256 --lora_rank 10 --pre_seq_len 4 --train-data ./data/Xray/openi-zh.json --valid-data ./data/Xray/openi-zh.json --distributed-backend nccl --lr-decay-style cosine --warmup .02 --checkpoint-activations --save-interval 3000 --eval-interval 10000 --save ./checkpoints --split 1 --eval-iters 10 --eval-batch-size 8 --zero-stage 1 --lr 0.0001 --batch-size 2 --skip-init --fp16 --use_lora
[2023-06-17 05:52:24,428] [INFO] [real_accelerator.py:110:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2023-06-17 05:52:24,793] [WARNING] [runner.py:196:fetch_hostfile] Unable to find hostfile, will proceed with training with local resources only.
Detected CUDA_VISIBLE_DEVICES=0: setting --include=localhost:0
[2023-06-17 05:52:24,806] [INFO] [runner.py:555:main] cmd = /home/yunfei/XrayGLM-main/venv2/bin/python -u -m deepspeed.launcher.launch --world_info=eyJsb2NhbGhvc3QiOiBbMF19 --master_addr=127.0.0.1 --master_port=16666 --enable_each_rank_log=None finetune_XrayGLM.py --experiment-name finetune-XrayGLM --model-parallel-size 1 --mode finetune --train-iters 300 --resume-dataloader --max_source_length 64 --max_target_length 256 --lora_rank 10 --pre_seq_len 4 --train-data ./data/Xray/openi-zh.json --valid-data ./data/Xray/openi-zh.json --distributed-backend nccl --lr-decay-style cosine --warmup .02 --checkpoint-activations --save-interval 3000 --eval-interval 10000 --save ./checkpoints --split 1 --eval-iters 10 --eval-batch-size 8 --zero-stage 1 --lr 0.0001 --batch-size 2 --skip-init --fp16 --use_lora
[2023-06-17 05:52:25,364] [INFO] [real_accelerator.py:110:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2023-06-17 05:52:25,599] [INFO] [launch.py:138:main] 0 NCCL_IB_DISABLE=0
[2023-06-17 05:52:25,599] [INFO] [launch.py:138:main] 0 NCCL_DEBUG=info
[2023-06-17 05:52:25,599] [INFO] [launch.py:138:main] 0 NCCL_NET_GDR_LEVEL=2
[2023-06-17 05:52:25,599] [INFO] [launch.py:145:main] WORLD INFO DICT: {'localhost': [0]}
[2023-06-17 05:52:25,599] [INFO] [launch.py:151:main] nnodes=1, num_local_procs=1, node_rank=0
[2023-06-17 05:52:25,599] [INFO] [launch.py:162:main] global_rank_mapping=defaultdict(<class 'list'>, {'localhost': [0]})
[2023-06-17 05:52:25,599] [INFO] [launch.py:163:main] dist_world_size=1
[2023-06-17 05:52:25,599] [INFO] [launch.py:165:main] Setting CUDA_VISIBLE_DEVICES=0
[2023-06-17 05:52:26,180] [INFO] [real_accelerator.py:110:get_accelerator] Setting ds_accelerator to cuda (auto detect)

===================================BUG REPORT===================================
Welcome to bitsandbytes. For bug reports, please run

python -m bitsandbytes

and submit this information together with your error trace to: https://github.com/TimDettmers/bitsandbytes/issues

bin /home/yunfei/XrayGLM-main/venv2/lib/python3.9/site-packages/bitsandbytes/libbitsandbytes_cpu.so
/home/yunfei/XrayGLM-main/venv2/lib/python3.9/site-packages/bitsandbytes/cextension.py:34: UserWarning: The installed version of bitsandbytes was compiled without GPU support. 8-bit optimizers, 8-bit multiplication, and GPU quantization are unavailable.
  warn("The installed version of bitsandbytes was compiled without GPU support. "
/home/yunfei/XrayGLM-main/venv2/lib/python3.9/site-packages/bitsandbytes/libbitsandbytes_cpu.so: undefined symbol: cadam32bit_grad_fp32
/home/yunfei/XrayGLM-main/venv2/lib/python3.9/site-packages/bitsandbytes/cuda_setup/main.py:149: UserWarning: /home/yunfei/anaconda3 did not contain ['libcudart.so', 'libcudart.so.11.0', 'libcudart.so.12.0'] as expected! Searching further paths...
  warn(msg)
/home/yunfei/XrayGLM-main/venv2/lib/python3.9/site-packages/bitsandbytes/cuda_setup/main.py:149: UserWarning: /usr/local/cuda-11.3/lib64 did not contain ['libcudart.so', 'libcudart.so.11.0', 'libcudart.so.12.0'] as expected! Searching further paths...
  warn(msg)
/home/yunfei/XrayGLM-main/venv2/lib/python3.9/site-packages/bitsandbytes/cuda_setup/main.py:149: UserWarning: WARNING: The following directories listed in your path were found to be non-existent: {PosixPath('@/tmp/.ICE-unix/2258,unix/yunfei'), PosixPath('local/yunfei')}
  warn(msg)
/home/yunfei/XrayGLM-main/venv2/lib/python3.9/site-packages/bitsandbytes/cuda_setup/main.py:149: UserWarning: WARNING: The following directories listed in your path were found to be non-existent: {PosixPath('/etc/xdg/xdg-ubuntu')}
  warn(msg)
/home/yunfei/XrayGLM-main/venv2/lib/python3.9/site-packages/bitsandbytes/cuda_setup/main.py:149: UserWarning: WARNING: The following directories listed in your path were found to be non-existent: {PosixPath('0'), PosixPath('1')}
  warn(msg)
CUDA_SETUP: WARNING! libcudart.so not found in any environmental path. Searching in backup paths...
/home/yunfei/XrayGLM-main/venv2/lib/python3.9/site-packages/bitsandbytes/cuda_setup/main.py:149: UserWarning: WARNING: No libcudart.so found! Install CUDA or the cudatoolkit package (anaconda)!
  warn(msg)
CUDA SETUP: Highest compute capability among GPUs detected: 8.9
CUDA SETUP: Detected CUDA version 117
CUDA SETUP: Loading binary /home/yunfei/XrayGLM-main/venv2/lib/python3.9/site-packages/bitsandbytes/libbitsandbytes_cpu.so...
[2023-06-17 05:52:26,857] [WARNING] Failed to load bitsandbytes:No module named 'scipy'
[2023-06-17 05:52:26,866] [INFO] using world size: 1 and model-parallel size: 1
[2023-06-17 05:52:26,867] [INFO] > padded vocab (size: 100) with 28 dummy tokens (new size: 128)
[2023-06-17 05:52:26,869] [INFO] [RANK 0] > initializing model parallel with size 1
[2023-06-17 05:52:26,869] [WARNING] [comm.py:152:init_deepspeed_backend] NCCL backend in DeepSpeed not yet implemented
[2023-06-17 05:52:26,869] [INFO] [comm.py:594:init_distributed] cdb=None
[2023-06-17 05:52:26,869] [WARNING] [config_utils.py:69:_process_deprecated_field] Config parameter cpu_offload is deprecated use offload_optimizer instead
[2023-06-17 05:52:26,870] [INFO] [checkpointing.py:764:_configure_using_config_file] {'partition_activations': False, 'contiguous_memory_optimization': False, 'cpu_checkpointing': False, 'number_checkpoints': None, 'synchronize_checkpoint_boundary': False, 'profile': False}
[2023-06-17 05:52:26,870] [INFO] [checkpointing.py:226:model_parallel_cuda_manual_seed] > initializing model parallel cuda seeds on global rank 0, model parallel rank 0, and data parallel rank 0 with model parallel seed: 3952 and data parallel seed: 1234
[2023-06-17 05:52:26,871] [INFO] [RANK 0] building FineTuneVisualGLMModel model ...
/home/yunfei/XrayGLM-main/venv2/lib/python3.9/site-packages/torch/nn/init.py:405: UserWarning: Initializing zero-element tensors is a no-op
  warnings.warn("Initializing zero-element tensors is a no-op")
replacing layer 0 attention with lora
replacing layer 14 attention with lora
[2023-06-17 05:52:32,096] [INFO] [RANK 0] > number of parameters on model parallel rank 0: 7811237376
[2023-06-17 05:52:32,398] [INFO] [RANK 0] global rank 0 is loading checkpoint /home/yunfei/.sat_models/visualglm-6b/1/mp_rank_00_model_states.pt
[2023-06-17 05:52:43,645] [INFO] [launch.py:315:sigkill_handler] Killing subprocess 35510
[2023-06-17 05:52:43,647] [ERROR] [launch.py:321:sigkill_handler] ['/home/yunfei/XrayGLM-main/venv2/bin/python', '-u', 'finetune_XrayGLM.py', '--local_rank=0', '--experiment-name', 'finetune-XrayGLM', '--model-parallel-size', '1', '--mode', 'finetune', '--train-iters', '300', '--resume-dataloader', '--max_source_length', '64', '--max_target_length', '256', '--lora_rank', '10', '--pre_seq_len', '4', '--train-data', './data/Xray/openi-zh.json', '--valid-data', './data/Xray/openi-zh.json', '--distributed-backend', 'nccl', '--lr-decay-style', 'cosine', '--warmup', '.02', '--checkpoint-activations', '--save-interval', '3000', '--eval-interval', '10000', '--save', './checkpoints', '--split', '1', '--eval-iters', '10', '--eval-batch-size', '8', '--zero-stage', '1', '--lr', '0.0001', '--batch-size', '2', '--skip-init', '--fp16', '--use_lora'] exits with return code = -9
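
A note on reading the last log line: the run ends with `exits with return code = -9` a few seconds after the checkpoint starts loading. Assuming the launcher reports the code the way Python's `subprocess` module does, a negative value means the child process was terminated by the signal with that number, so -9 would correspond to signal 9 (SIGKILL on Linux). The sketch below only decodes such a code; `describe_return_code` is a hypothetical helper, not part of DeepSpeed or XrayGLM, and it does not determine why the signal was sent.

```python
import signal


def describe_return_code(code: int) -> str:
    """Turn a subprocess-style return code into a short description.

    Hypothetical helper for illustration: a negative code -N is read as
    "terminated by signal N", following Python's subprocess convention.
    """
    if code < 0:
        try:
            # Look up the symbolic name of the signal on this platform.
            name = signal.Signals(-code).name
        except ValueError:
            # The platform does not define a name for this signal number.
            name = f"signal {-code}"
        return f"killed by {name}"
    if code == 0:
        return "exited normally"
    return f"exited with error status {code}"


# The code reported at the end of the log above.
print(describe_return_code(-9))
```

SIGKILL cannot be caught by the process, which would explain why no Python traceback appears in the log. On Linux, one common source of such a kill during checkpoint loading is the kernel's out-of-memory killer when host RAM runs out, but the log alone does not confirm that; checking `dmesg` or the system journal for an out-of-memory entry naming the killed process would be one way to find out.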

yunfei920406 commented 1 year ago

My environment is Linux. The images have been downloaded and the PNG files copied directly into data/Xray. I am using a single 4090 GPU.