Closed xtanitfy closed 7 months ago
It looks like it is running a remote evaluation. How do I turn evaluation off? How should the two parameters below be set to disable it? --per_device_eval_batch_size 1 \ --evaluation_strategy "no" \
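I also ran a single-GPU test; the error output is below. It seems to fail inside get_hf_file_metadata because it cannot download anything. Does saving the model also need to download this file? How can I download everything ahead of time so that training works even with the network disconnected? Any help would be greatly appreciated!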
(Qwen) shawn@compute2:~/diska/samba/Train/BIgmode/Qwen-main$ bash finetune/finetune_qlora_single_gpu.sh
[2024-02-21 16:01:03,344] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect)
You passed quantization_config to from_pretrained but the model you're loading already has a quantization_config attribute and has already quantized weights. However, loading attributes (e.g. disable_exllama, use_cuda_fp16) will be overwritten with the one you passed to from_pretrained. The rest will be ignored.
Warning: please make sure that you are using the latest codes and checkpoints, especially if you used Qwen-7B before 09.25.2023.请使用最新模型和代码,尤其如果你在9月25日前已经开始使用Qwen-7B,千万注意不要使用错误代码和模型。
Try importing flash-attention for faster inference...
Warning: import flash_attn rotary fail, please install FlashAttention rotary to get higher efficiency https://github.com/Dao-AILab/flash-attention/tree/main/csrc/rotary
Warning: import flash_attn rms_norm fail, please install FlashAttention layer_norm to get higher efficiency https://github.com/Dao-AILab/flash-attention/tree/main/csrc/layer_norm
Warning: import flash_attn fail, please install FlashAttention to get higher efficiency https://github.com/Dao-AILab/flash-attention
Loading checkpoint shards: 100%|███████████████████████████████████| 3/3 [00:02<00:00, 1.42it/s]
trainable params: 143,130,624 || all params: 1,388,056,576 || trainable%: 10.311584302454254
Loading data...
Formatting inputs...Skip in lazy mode
Detected kernel version 5.4.0, which is below the recommended minimum of 5.5.0; this can cause the process to hang. It is recommended to upgrade the kernel to the minimum version or higher.
Using /home/shawn/.cache/torch_extensions/py310_cu118 as PyTorch extensions root...
Detected CUDA files, patching ldflags
Emitting ninja build file /home/shawn/.cache/torch_extensions/py310_cu118/fused_adam/build.ninja...
Building extension module fused_adam...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
ninja: no work to do.
Loading extension module fused_adam...
Time to load fused_adam op: 0.04670214653015137 seconds
{'loss': 1.4348, 'learning_rate': 0.0, 'epoch': 0.01}
{'loss': 1.6305, 'learning_rate': 0.0003, 'epoch': 0.02}
{'loss': 1.7119, 'learning_rate': 0.0003, 'epoch': 0.03}
{'loss': 1.3699, 'learning_rate': 0.0003, 'epoch': 0.04}
{'loss': 1.1099, 'learning_rate': 0.0003, 'epoch': 0.05}
{'loss': 1.2408, 'learning_rate': 0.0003, 'epoch': 0.06}
{'loss': 0.8442, 'learning_rate': 0.0003, 'epoch': 0.07}
{'loss': 0.902, 'learning_rate': 0.0003, 'epoch': 0.08}
{'loss': 0.6552, 'learning_rate': 0.0003, 'epoch': 0.1}
{'loss': 0.7207, 'learning_rate': 0.0003, 'epoch': 0.11}
{'loss': 0.7115, 'learning_rate': 0.0003, 'epoch': 0.12}
{'loss': 0.7682, 'learning_rate': 0.0003, 'epoch': 0.13}
{'loss': 0.7237, 'learning_rate': 0.0003, 'epoch': 0.14}
{'loss': 0.8801, 'learning_rate': 0.0003, 'epoch': 0.15}
{'loss': 0.8988, 'learning_rate': 0.0003, 'epoch': 0.16}
{'loss': 0.8621, 'learning_rate': 0.0003, 'epoch': 0.17}
{'loss': 0.7351, 'learning_rate': 0.0003, 'epoch': 0.18}
{'loss': 0.8254, 'learning_rate': 0.0003, 'epoch': 0.19}
{'loss': 0.777, 'learning_rate': 0.0003, 'epoch': 0.2}
{'train_runtime': 172.7308, 'train_samples_per_second': 0.877, 'train_steps_per_second': 0.11, 'train_loss': 0.9895743319862768, 'epoch': 0.2}
100%|████████████████████████████████████████████████████████████| 19/19 [02:52<00:00, 9.09s/it]
Traceback (most recent call last):
File "/home/shawn/diska/anaconda/envs/Qwen/lib/python3.10/site-packages/urllib3/connection.py", line 198, in _new_conn
sock = connection.create_connection(
File "/home/shawn/diska/anaconda/envs/Qwen/lib/python3.10/site-packages/urllib3/util/connection.py", line 85, in create_connection
raise err
File "/home/shawn/diska/anaconda/envs/Qwen/lib/python3.10/site-packages/urllib3/util/connection.py", line 73, in create_connection
sock.connect(sa)
OSError: [Errno 101] Network is unreachable
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/home/shawn/diska/anaconda/envs/Qwen/lib/python3.10/site-packages/urllib3/connectionpool.py", line 793, in urlopen
response = self._make_request(
File "/home/shawn/diska/anaconda/envs/Qwen/lib/python3.10/site-packages/urllib3/connectionpool.py", line 491, in _make_request
raise new_e
File "/home/shawn/diska/anaconda/envs/Qwen/lib/python3.10/site-packages/urllib3/connectionpool.py", line 467, in _make_request
self._validate_conn(conn)
File "/home/shawn/diska/anaconda/envs/Qwen/lib/python3.10/site-packages/urllib3/connectionpool.py", line 1099, in _validate_conn
conn.connect()
File "/home/shawn/diska/anaconda/envs/Qwen/lib/python3.10/site-packages/urllib3/connection.py", line 616, in connect
self.sock = sock = self._new_conn()
File "/home/shawn/diska/anaconda/envs/Qwen/lib/python3.10/site-packages/urllib3/connection.py", line 213, in _new_conn
raise NewConnectionError(
urllib3.exceptions.NewConnectionError: <urllib3.connection.HTTPSConnection object at 0x7fc4e8111ba0>: Failed to establish a new connection: [Errno 101] Network is unreachable
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/home/shawn/diska/anaconda/envs/Qwen/lib/python3.10/site-packages/requests/adapters.py", line 486, in send
resp = conn.urlopen(
File "/home/shawn/diska/anaconda/envs/Qwen/lib/python3.10/site-packages/urllib3/connectionpool.py", line 847, in urlopen
retries = retries.increment(
File "/home/shawn/diska/anaconda/envs/Qwen/lib/python3.10/site-packages/urllib3/util/retry.py", line 515, in increment
raise MaxRetryError(_pool, url, reason) from reason # type: ignore[arg-type]
urllib3.exceptions.MaxRetryError: HTTPSConnectionPool(host='huggingface.co', port=443): Max retries exceeded with url: /Qwen/Qwen-7B-Chat-Int4/resolve/main/config.json (Caused by NewConnectionError('<urllib3.connection.HTTPSConnection object at 0x7fc4e8111ba0>: Failed to establish a new connection: [Errno 101] Network is unreachable'))
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/home/shawn/diska/samba/Train/BIgmode/Qwen-main/finetune.py", line 374, in <module>
Adding export HF_ENDPOINT=https://hf-mirror.com before running the script solved the problem.
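For anyone who would rather set this from Python than from the shell, here is a minimal sketch. It assumes that HF_ENDPOINT (and, for runs with no network at all, HF_HUB_OFFLINE and TRANSFORMERS_OFFLINE) are read when huggingface_hub and transformers are imported, so they have to be set before those imports. The helper name hub_env is hypothetical and not part of any library.

```python
import os


def hub_env(endpoint=None, offline=False):
    """Build the environment variables that control Hugging Face Hub access.

    hub_env is an illustrative helper, not a library function.
    """
    env = {}
    if endpoint:
        # HF_ENDPOINT points Hub requests at another host, e.g. a mirror.
        env["HF_ENDPOINT"] = endpoint
    if offline:
        # These flags ask huggingface_hub / transformers to rely on the local cache only.
        env["HF_HUB_OFFLINE"] = "1"
        env["TRANSFORMERS_OFFLINE"] = "1"
    return env


# Apply this at the very top of the training script,
# before transformers or peft are imported.
os.environ.update(hub_env(endpoint="https://hf-mirror.com"))
```

Passing offline=True instead of an endpoint is the variant to try when the machine has no network and the model is already in the local cache. Whether the save-time file_exists check in this peft version tolerates offline mode is not confirmed here, so treat it as something to test rather than a guaranteed fix.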
Is there an existing issue / discussion for this?
Is there an existing answer for this in FAQ?
Current Behavior
(Qwen) shawn@compute2:~/diska/samba/Train/BIgmode/Qwen-main$ sh scripts/train_finetune.sh
WARNING:torch.distributed.run:
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
[2024-02-21 10:08:26,506] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-02-21 10:08:26,513] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-02-21 10:08:31,106] [INFO] [comm.py:637:init_distributed] cdb=None
[2024-02-21 10:08:31,201] [INFO] [comm.py:637:init_distributed] cdb=None
[2024-02-21 10:08:31,202] [INFO] [comm.py:668:init_distributed] Initializing TorchBackend in DeepSpeed with backend nccl
You passed quantization_config to from_pretrained but the model you're loading already has a quantization_config attribute and has already quantized weights. However, loading attributes (e.g. disable_exllama, use_cuda_fp16) will be overwritten with the one you passed to from_pretrained. The rest will be ignored.
You passed quantization_config to from_pretrained but the model you're loading already has a quantization_config attribute and has already quantized weights. However, loading attributes (e.g. disable_exllama, use_cuda_fp16) will be overwritten with the one you passed to from_pretrained. The rest will be ignored.
Warning: please make sure that you are using the latest codes and checkpoints, especially if you used Qwen-7B before 09.25.2023.请使用最新模型和代码,尤其如果你在9月25日前已经开始使用Qwen-7B,千万注意不要使用错误代码和模型。
Try importing flash-attention for faster inference...
Warning: import flash_attn rotary fail, please install FlashAttention rotary to get higher efficiency https://github.com/Dao-AILab/flash-attention/tree/main/csrc/rotary
Warning: import flash_attn rms_norm fail, please install FlashAttention layer_norm to get higher efficiency https://github.com/Dao-AILab/flash-attention/tree/main/csrc/layer_norm
Warning: import flash_attn fail, please install FlashAttention to get higher efficiency https://github.com/Dao-AILab/flash-attention
Warning: please make sure that you are using the latest codes and checkpoints, especially if you used Qwen-7B before 09.25.2023.请使用最新模型和代码,尤其如果你在9月25日前已经开始使用Qwen-7B,千万注意不要使用错误代码和模型。
Try importing flash-attention for faster inference...
Warning: import flash_attn rotary fail, please install FlashAttention rotary to get higher efficiency https://github.com/Dao-AILab/flash-attention/tree/main/csrc/rotary
Warning: import flash_attn rms_norm fail, please install FlashAttention layer_norm to get higher efficiency https://github.com/Dao-AILab/flash-attention/tree/main/csrc/layer_norm
Warning: import flash_attn fail, please install FlashAttention to get higher efficiency https://github.com/Dao-AILab/flash-attention
Loading checkpoint shards: 100%|████████████████████████████████████████████████| 3/3 [00:03<00:00, 1.09s/it]
Loading checkpoint shards: 100%|████████████████████████████████████████████████| 3/3 [00:03<00:00, 1.33s/it]
trainable params: 143,130,624 || all params: 1,388,056,576 || trainable%: 10.311584302454254
Loading data...
Formatting inputs...Skip in lazy mode
Detected kernel version 5.4.0, which is below the recommended minimum of 5.5.0; this can cause the process to hang. It is recommended to upgrade the kernel to the minimum version or higher.
trainable params: 143,130,624 || all params: 1,388,056,576 || trainable%: 10.311584302454254
Using /home/shawn/.cache/torch_extensions/py310_cu118 as PyTorch extensions root...
Using /home/shawn/.cache/torch_extensions/py310_cu118 as PyTorch extensions root...
Detected CUDA files, patching ldflags
Emitting ninja build file /home/shawn/.cache/torch_extensions/py310_cu118/fused_adam/build.ninja...
Building extension module fused_adam...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
ninja: no work to do.
Loading extension module fused_adam...
Time to load fused_adam op: 0.12494802474975586 seconds
Loading extension module fused_adam...
Time to load fused_adam op: 0.20195293426513672 seconds
{'loss': 1.6189, 'learning_rate': 0.0, 'epoch': 0.04}
{'loss': 1.42, 'learning_rate': 0.0003, 'epoch': 0.08}
{'loss': 1.4318, 'learning_rate': 0.0003, 'epoch': 0.13}
{'loss': 1.3541, 'learning_rate': 0.0003, 'epoch': 0.17}
{'loss': 1.1495, 'learning_rate': 0.0003, 'epoch': 0.21}
{'loss': 1.1028, 'learning_rate': 0.0003, 'epoch': 0.25}
{'loss': 0.9797, 'learning_rate': 0.0003, 'epoch': 0.29}
{'train_runtime': 97.7663, 'train_samples_per_second': 2.323, 'train_steps_per_second': 0.072, 'train_loss': 1.2938276018415178, 'epoch': 0.29}
100%|███████████████████████████████████████████████████████████████████████████| 7/7 [01:37<00:00, 13.97s/it]
Traceback (most recent call last):
File "/home/shawn/diska/anaconda/envs/Qwen/lib/python3.10/site-packages/urllib3/connection.py", line 198, in _new_conn
sock = connection.create_connection(
File "/home/shawn/diska/anaconda/envs/Qwen/lib/python3.10/site-packages/urllib3/util/connection.py", line 85, in create_connection
raise err
File "/home/shawn/diska/anaconda/envs/Qwen/lib/python3.10/site-packages/urllib3/util/connection.py", line 73, in create_connection
sock.connect(sa)
OSError: [Errno 101] Network is unreachable
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/home/shawn/diska/anaconda/envs/Qwen/lib/python3.10/site-packages/urllib3/connectionpool.py", line 793, in urlopen
response = self._make_request(
File "/home/shawn/diska/anaconda/envs/Qwen/lib/python3.10/site-packages/urllib3/connectionpool.py", line 491, in _make_request
raise new_e
File "/home/shawn/diska/anaconda/envs/Qwen/lib/python3.10/site-packages/urllib3/connectionpool.py", line 467, in _make_request
self._validate_conn(conn)
File "/home/shawn/diska/anaconda/envs/Qwen/lib/python3.10/site-packages/urllib3/connectionpool.py", line 1099, in _validate_conn
conn.connect()
File "/home/shawn/diska/anaconda/envs/Qwen/lib/python3.10/site-packages/urllib3/connection.py", line 616, in connect
self.sock = sock = self._new_conn()
File "/home/shawn/diska/anaconda/envs/Qwen/lib/python3.10/site-packages/urllib3/connection.py", line 213, in _new_conn
raise NewConnectionError(
urllib3.exceptions.NewConnectionError: <urllib3.connection.HTTPSConnection object at 0x7faa660a8ee0>: Failed to establish a new connection: [Errno 101] Network is unreachable
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/home/shawn/diska/anaconda/envs/Qwen/lib/python3.10/site-packages/requests/adapters.py", line 486, in send
resp = conn.urlopen(
File "/home/shawn/diska/anaconda/envs/Qwen/lib/python3.10/site-packages/urllib3/connectionpool.py", line 847, in urlopen
retries = retries.increment(
File "/home/shawn/diska/anaconda/envs/Qwen/lib/python3.10/site-packages/urllib3/util/retry.py", line 515, in increment
raise MaxRetryError(_pool, url, reason) from reason # type: ignore[arg-type]
urllib3.exceptions.MaxRetryError: HTTPSConnectionPool(host='huggingface.co', port=443): Max retries exceeded with url: /Qwen/Qwen-7B-Chat-Int4/resolve/main/config.json (Caused by NewConnectionError('<urllib3.connection.HTTPSConnection object at 0x7faa660a8ee0>: Failed to establish a new connection: [Errno 101] Network is unreachable'))
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/home/shawn/diska/samba/Train/BIgmode/Qwen-main/finetune.py", line 374, in <module>
train()
File "/home/shawn/diska/samba/Train/BIgmode/Qwen-main/finetune.py", line 370, in train
safe_save_model_for_hf_trainer(trainer=trainer, output_dir=training_args.output_dir, bias=lora_args.lora_bias)
File "/home/shawn/diska/samba/Train/BIgmode/Qwen-main/finetune.py", line 122, in safe_save_model_for_hf_trainer
trainer._save(output_dir, state_dict=state_dict)
File "/home/shawn/diska/anaconda/envs/Qwen/lib/python3.10/site-packages/transformers/trainer.py", line 2865, in _save
self.model.save_pretrained(
File "/home/shawn/diska/anaconda/envs/Qwen/lib/python3.10/site-packages/peft/peft_model.py", line 216, in save_pretrained
output_state_dict = get_peft_model_state_dict(
File "/home/shawn/diska/anaconda/envs/Qwen/lib/python3.10/site-packages/peft/utils/save_and_load.py", line 146, in get_peft_model_state_dict
has_remote_config = file_exists(model_id, "config.json")
File "/home/shawn/diska/anaconda/envs/Qwen/lib/python3.10/site-packages/huggingface_hub/utils/_validators.py", line 118, in _inner_fn
return fn(*args, **kwargs)
File "/home/shawn/diska/anaconda/envs/Qwen/lib/python3.10/site-packages/huggingface_hub/hf_api.py", line 2386, in file_exists
get_hf_file_metadata(url, token=token)
File "/home/shawn/diska/anaconda/envs/Qwen/lib/python3.10/site-packages/huggingface_hub/utils/_validators.py", line 118, in _inner_fn
return fn(*args, **kwargs)
File "/home/shawn/diska/anaconda/envs/Qwen/lib/python3.10/site-packages/huggingface_hub/file_download.py", line 1631, in get_hf_file_metadata
r = _request_wrapper(
File "/home/shawn/diska/anaconda/envs/Qwen/lib/python3.10/site-packages/huggingface_hub/file_download.py", line 385, in _request_wrapper
response = _request_wrapper(
File "/home/shawn/diska/anaconda/envs/Qwen/lib/python3.10/site-packages/huggingface_hub/file_download.py", line 408, in _request_wrapper
response = get_session().request(method=method, url=url, **params)
File "/home/shawn/diska/anaconda/envs/Qwen/lib/python3.10/site-packages/requests/sessions.py", line 589, in request
resp = self.send(prep, **send_kwargs)
File "/home/shawn/diska/anaconda/envs/Qwen/lib/python3.10/site-packages/requests/sessions.py", line 703, in send
r = adapter.send(request, **kwargs)
File "/home/shawn/diska/anaconda/envs/Qwen/lib/python3.10/site-packages/huggingface_hub/utils/_http.py", line 67, in send
return super().send(request, *args, **kwargs)
File "/home/shawn/diska/anaconda/envs/Qwen/lib/python3.10/site-packages/requests/adapters.py", line 519, in send
raise ConnectionError(e, request=request)
requests.exceptions.ConnectionError: (MaxRetryError("HTTPSConnectionPool(host='huggingface.co', port=443): Max retries exceeded with url: /Qwen/Qwen-7B-Chat-Int4/resolve/main/config.json (Caused by NewConnectionError('<urllib3.connection.HTTPSConnection object at 0x7faa660a8ee0>: Failed to establish a new connection: [Errno 101] Network is unreachable'))"), '(Request ID: 3813df15-ee1e-4ce7-ac2b-75b387e41eb8)')
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 881) of binary: /home/shawn/diska/anaconda/envs/Qwen/bin/python
Traceback (most recent call last):
File "/home/shawn/diska/anaconda/envs/Qwen/bin/torchrun", line 8, in <module>
sys.exit(main())
File "/home/shawn/diska/anaconda/envs/Qwen/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
return f(*args, **kwargs)
File "/home/shawn/diska/anaconda/envs/Qwen/lib/python3.10/site-packages/torch/distributed/run.py", line 794, in main
run(args)
File "/home/shawn/diska/anaconda/envs/Qwen/lib/python3.10/site-packages/torch/distributed/run.py", line 785, in run
elastic_launch(
File "/home/shawn/diska/anaconda/envs/Qwen/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 134, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
File "/home/shawn/diska/anaconda/envs/Qwen/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 250, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
finetune.py FAILED
Failures: