OpenMOSS / MOSS

An open-source tool-augmented conversational language model from Fudan University
https://txsun1997.github.io/blogs/moss.html
Apache License 2.0

Question: how do I fix this fine-tuning error? #316

Open zhonglin516 opened 1 year ago

zhonglin516 commented 1 year ago

Hardware: a single machine with 8 3090 GPUs.

Config:

command_file: null
commands: null
compute_environment: LOCAL_MACHINE
deepspeed_config:
  gradient_accumulation_steps: 1
  gradient_clipping: 1.0
  offload_optimizer_device: none
  offload_param_device: none
  zero3_init_flag: true
  zero3_save_16bit_model: true
  zero_stage: 3
distributed_type: DEEPSPEED
downcast_bf16: 'no'
dynamo_backend: 'yes'
fsdp_config: {}
gpu_ids: null
machine_rank: 0
main_process_ip: 192.168.33.201
main_process_port: 21889
main_training_function: main
megatron_lm_config: {}
mixed_precision: fp16
num_machines: 1
num_processes: 8
rdzv_backend: static
same_network: true
tpu_name: null
tpu_zone: null
use_cpu: false

Launch script:

num_machines=1
num_processes=$((num_machines * 8))
machine_rank=0

accelerate launch \
    --config_file ./configs/sft.yaml \
    --num_processes $num_processes \
    --num_machines $num_machines \
    --machine_rank $machine_rank \
    --deepspeed_multinode_launcher standard finetune_moss.py \
    --model_name_or_path fnlp/moss-moon-003-sft-int4 \
    --data_dir ./sft_data \
    --output_dir ./ckpts/moss-moon-003-sft-int4 \
    --log_dir ./train_logs/moss-moon-003-sft-int4 \
    --n_epochs 2 \
    --train_bsz_per_gpu 4 \
    --eval_bsz_per_gpu 4 \
    --learning_rate 0.000015 \
    --eval_step 15 \
    --save_step 35
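Note that `--model_name_or_path fnlp/moss-moon-003-sft-int4` is a Hub repository id, so each of the 8 launched processes resolves the tokenizer and model files over the network when it starts. Below is a minimal sketch, assuming `huggingface_hub` is installed and the machine can reach the Hub at least once, of downloading the files a single time and launching from a local directory instead. The helper names (`prefetch_model`, `offline_env`, `launch_args`) and the `./models/...` path are hypothetical. `snapshot_download`, `HF_HUB_OFFLINE`, and `TRANSFORMERS_OFFLINE` are existing Hugging Face names, but their behavior should be checked against the installed versions.

```python
import os
from pathlib import Path
from typing import Dict, List, Optional


def prefetch_model(repo_id: str, local_dir: Path) -> Path:
    """Download a Hub repository once and return the local directory."""
    # Imported lazily so the pure helpers below work without the package.
    from huggingface_hub import snapshot_download

    local_dir.mkdir(parents=True, exist_ok=True)
    # Assumption: the installed huggingface_hub is recent enough to accept local_dir.
    snapshot_download(repo_id=repo_id, local_dir=str(local_dir))
    return local_dir


def offline_env(base: Optional[Dict[str, str]] = None) -> Dict[str, str]:
    """Return a copy of the environment with Hub lookups switched off."""
    env = dict(os.environ if base is None else base)
    env["HF_HUB_OFFLINE"] = "1"
    env["TRANSFORMERS_OFFLINE"] = "1"
    return env


def launch_args(model_dir: Path, num_processes: int = 8) -> List[str]:
    """Build the accelerate command with a local model path.

    Only the flags needed to show the idea are listed; the remaining
    training flags from the original command would be appended unchanged.
    """
    return [
        "accelerate", "launch",
        "--config_file", "./configs/sft.yaml",
        "--num_processes", str(num_processes),
        "--num_machines", "1",
        "--machine_rank", "0",
        "--deepspeed_multinode_launcher", "standard",
        "finetune_moss.py",
        "--model_name_or_path", str(model_dir),
        "--data_dir", "./sft_data",
    ]
```

A possible use is to call `prefetch_model("fnlp/moss-moon-003-sft-int4", Path("./models/moss-moon-003-sft-int4"))` once in a single process, then run `subprocess.run(launch_args(model_dir), env=offline_env(), check=True)`. With a local directory, the 8 ranks read from disk rather than each opening its own connection to the Hub.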

Error:

Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
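The output that follows comes from 8 processes writing to one terminal, so tracebacks from different ranks interleave and repeat. Below is a minimal sketch, assuming the log has been saved as text with its original line breaks, that counts the final exception line of each traceback so the distinct failures stand out. The function name `summarize_exceptions` and the regular expression are illustrative, not part of MOSS or Accelerate.

```python
import re
from collections import Counter

# Matches a traceback's closing line such as
# "urllib3.exceptions.ProtocolError: (...)" or "TypeError: ...".
# Only names ending in Error, Exception, or Disconnected are recognized.
_EXC_LINE = re.compile(r"^([A-Za-z_][\w.]*?(?:Error|Exception|Disconnected)):\s")


def summarize_exceptions(log_text: str) -> Counter:
    """Count exception types that appear as traceback summary lines."""
    counts: Counter = Counter()
    for line in log_text.splitlines():
        match = _EXC_LINE.match(line.strip())
        if match:
            counts[match.group(1)] += 1
    return counts
```

Calling `summarize_exceptions(open("train.log").read()).most_common()` (the file name is a placeholder) lists each exception type with the number of times it closed a traceback, which separates the network errors from the `TypeError` raised by the tokenizer.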


INFO:torch.distributed.distributed_c10d:Added key: store_based_barrier_key:1 to store for rank: 3
INFO:torch.distributed.distributed_c10d:Added key: store_based_barrier_key:1 to store for rank: 5
INFO:torch.distributed.distributed_c10d:Added key: store_based_barrier_key:1 to store for rank: 7
INFO:torch.distributed.distributed_c10d:Added key: store_based_barrier_key:1 to store for rank: 0
INFO:torch.distributed.distributed_c10d:Added key: store_based_barrier_key:1 to store for rank: 4
INFO:torch.distributed.distributed_c10d:Added key: store_based_barrier_key:1 to store for rank: 2
INFO:torch.distributed.distributed_c10d:Added key: store_based_barrier_key:1 to store for rank: 1
INFO:torch.distributed.distributed_c10d:Added key: store_based_barrier_key:1 to store for rank: 6
INFO:torch.distributed.distributed_c10d:Rank 6: Completed store-based barrier for key:store_based_barrier_key:1 with 8 nodes.
INFO:torch.distributed.distributed_c10d:Rank 0: Completed store-based barrier for key:store_based_barrier_key:1 with 8 nodes.
INFO:torch.distributed.distributed_c10d:Rank 4: Completed store-based barrier for key:store_based_barrier_key:1 with 8 nodes.
INFO:torch.distributed.distributed_c10d:Rank 5: Completed store-based barrier for key:store_based_barrier_key:1 with 8 nodes.
INFO:torch.distributed.distributed_c10d:Rank 2: Completed store-based barrier for key:store_based_barrier_key:1 with 8 nodes.
INFO:torch.distributed.distributed_c10d:Rank 3: Completed store-based barrier for key:store_based_barrier_key:1 with 8 nodes.
INFO:torch.distributed.distributed_c10d:Rank 7: Completed store-based barrier for key:store_based_barrier_key:1 with 8 nodes.
INFO:torch.distributed.distributed_c10d:Rank 1: Completed store-based barrier for key:store_based_barrier_key:1 with 8 nodes.
Explicitly passing a revision is encouraged when loading a model with custom code to ensure no malicious code has been contributed in a newer revision.
Explicitly passing a revision is encouraged when loading a model with custom code to ensure no malicious code has been contributed in a newer revision.
Explicitly passing a revision is encouraged when loading a model with custom code to ensure no malicious code has been contributed in a newer revision.
Explicitly passing a revision is encouraged when loading a model with custom code to ensure no malicious code has been contributed in a newer revision.
Explicitly passing a revision is encouraged when loading a model with custom code to ensure no malicious code has been contributed in a newer revision.
Explicitly passing a revision is encouraged when loading a model with custom code to ensure no malicious code has been contributed in a newer revision.
Traceback (most recent call last):
  File "/home/zhangzhong/.local/lib/python3.10/site-packages/urllib3/connectionpool.py", line 703, in urlopen
    httplib_response = self._make_request(
  File "/home/zhangzhong/.local/lib/python3.10/site-packages/urllib3/connectionpool.py", line 449, in _make_request
    six.raise_from(e, None)
  File "<string>", line 3, in raise_from
  File "/home/zhangzhong/.local/lib/python3.10/site-packages/urllib3/connectionpool.py", line 444, in _make_request
    httplib_response = conn.getresponse()
  File "/opt/anaconda3/lib/python3.10/http/client.py", line 1374, in getresponse
    response.begin()
  File "/opt/anaconda3/lib/python3.10/http/client.py", line 318, in begin
    version, status, reason = self._read_status()
  File "/opt/anaconda3/lib/python3.10/http/client.py", line 287, in _read_status
    raise RemoteDisconnected("Remote end closed connection without"
http.client.RemoteDisconnected: Remote end closed connection without response

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/zhangzhong/.local/lib/python3.10/site-packages/requests/adapters.py", line 486, in send
    resp = conn.urlopen(
  File "/home/zhangzhong/.local/lib/python3.10/site-packages/urllib3/connectionpool.py", line 787, in urlopen
    retries = retries.increment(
  File "/home/zhangzhong/.local/lib/python3.10/site-packages/urllib3/util/retry.py", line 550, in increment
    raise six.reraise(type(error), error, _stacktrace)
  File "/home/zhangzhong/.local/lib/python3.10/site-packages/urllib3/packages/six.py", line 769, in reraise
    raise value.with_traceback(tb)
  File "/home/zhangzhong/.local/lib/python3.10/site-packages/urllib3/connectionpool.py", line 703, in urlopen
    httplib_response = self._make_request(
  File "/home/zhangzhong/.local/lib/python3.10/site-packages/urllib3/connectionpool.py", line 449, in _make_request
    six.raise_from(e, None)
  File "<string>", line 3, in raise_from
  File "/home/zhangzhong/.local/lib/python3.10/site-packages/urllib3/connectionpool.py", line 444, in _make_request
    httplib_response = conn.getresponse()
  File "/opt/anaconda3/lib/python3.10/http/client.py", line 1374, in getresponse
    response.begin()
  File "/opt/anaconda3/lib/python3.10/http/client.py", line 318, in begin
    version, status, reason = self._read_status()
  File "/opt/anaconda3/lib/python3.10/http/client.py", line 287, in _read_status
    raise RemoteDisconnected("Remote end closed connection without"
urllib3.exceptions.ProtocolError: ('Connection aborted.', RemoteDisconnected('Remote end closed connection without response'))

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/zhangzhong/MOSS/finetune_moss.py", line 305, in <module>
    train(args)
  File "/home/zhangzhong/MOSS/finetune_moss.py", line 177, in train
    tokenizer = AutoTokenizer.from_pretrained(args.model_name_or_path, trust_remote_code=True)
  File "/opt/anaconda3/lib/python3.10/site-packages/transformers/models/auto/tokenization_auto.py", line 626, in from_pretrained
    tokenizer_class = get_class_from_dynamic_module(
  File "/opt/anaconda3/lib/python3.10/site-packages/transformers/dynamic_module_utils.py", line 363, in get_class_from_dynamic_module
    final_module = get_cached_module_file(
  File "/opt/anaconda3/lib/python3.10/site-packages/transformers/dynamic_module_utils.py", line 261, in get_cached_module_file
    commit_hash = model_info(pretrained_model_name_or_path, revision=revision, token=token).sha
  File "/opt/anaconda3/lib/python3.10/site-packages/huggingface_hub/utils/_validators.py", line 94, in _inner_fn
    return fn(*args, **kwargs)
  File "/opt/anaconda3/lib/python3.10/site-packages/huggingface_hub/utils/_deprecation.py", line 98, in inner_f
    return f(*args, **kwargs)
  File "/opt/anaconda3/lib/python3.10/site-packages/huggingface_hub/hf_api.py", line 1283, in model_info
    r = requests.get(
  File "/home/zhangzhong/.local/lib/python3.10/site-packages/requests/api.py", line 73, in get
    return request("get", url, params=params, **kwargs)
  File "/home/zhangzhong/.local/lib/python3.10/site-packages/requests/api.py", line 59, in request
    return session.request(method=method, url=url, **kwargs)
  File "/home/zhangzhong/.local/lib/python3.10/site-packages/requests/sessions.py", line 587, in request
    resp = self.send(prep, **send_kwargs)
  File "/home/zhangzhong/.local/lib/python3.10/site-packages/requests/sessions.py", line 701, in send
    r = adapter.send(request, **kwargs)
  File "/home/zhangzhong/.local/lib/python3.10/site-packages/requests/adapters.py", line 501, in send
    raise ConnectionError(err, request=request)
requests.exceptions.ConnectionError: ('Connection aborted.', RemoteDisconnected('Remote end closed connection without response'))
Traceback (most recent call last):
  File "/home/zhangzhong/.local/lib/python3.10/site-packages/urllib3/connectionpool.py", line 703, in urlopen
    httplib_response = self._make_request(
  File "/home/zhangzhong/.local/lib/python3.10/site-packages/urllib3/connectionpool.py", line 449, in _make_request
    six.raise_from(e, None)
  File "<string>", line 3, in raise_from
  File "/home/zhangzhong/.local/lib/python3.10/site-packages/urllib3/connectionpool.py", line 444, in _make_request
    httplib_response = conn.getresponse()
  File "/opt/anaconda3/lib/python3.10/http/client.py", line 1374, in getresponse
    response.begin()
  File "/opt/anaconda3/lib/python3.10/http/client.py", line 318, in begin
    version, status, reason = self._read_status()
  File "/opt/anaconda3/lib/python3.10/http/client.py", line 287, in _read_status
    raise RemoteDisconnected("Remote end closed connection without"
http.client.RemoteDisconnected: Remote end closed connection without response

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/zhangzhong/.local/lib/python3.10/site-packages/requests/adapters.py", line 486, in send
    resp = conn.urlopen(
  File "/home/zhangzhong/.local/lib/python3.10/site-packages/urllib3/connectionpool.py", line 787, in urlopen
    retries = retries.increment(
  File "/home/zhangzhong/.local/lib/python3.10/site-packages/urllib3/util/retry.py", line 550, in increment
    raise six.reraise(type(error), error, _stacktrace)
  File "/home/zhangzhong/.local/lib/python3.10/site-packages/urllib3/packages/six.py", line 769, in reraise
    raise value.with_traceback(tb)
  File "/home/zhangzhong/.local/lib/python3.10/site-packages/urllib3/connectionpool.py", line 703, in urlopen
    httplib_response = self._make_request(
  File "/home/zhangzhong/.local/lib/python3.10/site-packages/urllib3/connectionpool.py", line 449, in _make_request
    six.raise_from(e, None)
  File "<string>", line 3, in raise_from
  File "/home/zhangzhong/.local/lib/python3.10/site-packages/urllib3/connectionpool.py", line 444, in _make_request
    httplib_response = conn.getresponse()
  File "/opt/anaconda3/lib/python3.10/http/client.py", line 1374, in getresponse
    response.begin()
  File "/opt/anaconda3/lib/python3.10/http/client.py", line 318, in begin
    version, status, reason = self._read_status()
  File "/opt/anaconda3/lib/python3.10/http/client.py", line 287, in _read_status
    raise RemoteDisconnected("Remote end closed connection without"
urllib3.exceptions.ProtocolError: ('Connection aborted.', RemoteDisconnected('Remote end closed connection without response'))

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/zhangzhong/MOSS/finetune_moss.py", line 305, in <module>
    train(args)
  File "/home/zhangzhong/MOSS/finetune_moss.py", line 177, in train
    tokenizer = AutoTokenizer.from_pretrained(args.model_name_or_path, trust_remote_code=True)
  File "/opt/anaconda3/lib/python3.10/site-packages/transformers/models/auto/tokenization_auto.py", line 626, in from_pretrained
    tokenizer_class = get_class_from_dynamic_module(
  File "/opt/anaconda3/lib/python3.10/site-packages/transformers/dynamic_module_utils.py", line 363, in get_class_from_dynamic_module
    final_module = get_cached_module_file(
  File "/opt/anaconda3/lib/python3.10/site-packages/transformers/dynamic_module_utils.py", line 261, in get_cached_module_file
    commit_hash = model_info(pretrained_model_name_or_path, revision=revision, token=token).sha
  File "/opt/anaconda3/lib/python3.10/site-packages/huggingface_hub/utils/_validators.py", line 94, in _inner_fn
    return fn(*args, **kwargs)
  File "/opt/anaconda3/lib/python3.10/site-packages/huggingface_hub/utils/_deprecation.py", line 98, in inner_f
    return f(*args, **kwargs)
  File "/opt/anaconda3/lib/python3.10/site-packages/huggingface_hub/hf_api.py", line 1283, in model_info
    r = requests.get(
  File "/home/zhangzhong/.local/lib/python3.10/site-packages/requests/api.py", line 73, in get
    return request("get", url, params=params, **kwargs)
  File "/home/zhangzhong/.local/lib/python3.10/site-packages/requests/api.py", line 59, in request
    return session.request(method=method, url=url, **kwargs)
  File "/home/zhangzhong/.local/lib/python3.10/site-packages/requests/sessions.py", line 587, in request
    resp = self.send(prep, **send_kwargs)
  File "/home/zhangzhong/.local/lib/python3.10/site-packages/requests/sessions.py", line 701, in send
    r = adapter.send(request, **kwargs)
  File "/home/zhangzhong/.local/lib/python3.10/site-packages/requests/adapters.py", line 501, in send
    raise ConnectionError(err, request=request)
requests.exceptions.ConnectionError: ('Connection aborted.', RemoteDisconnected('Remote end closed connection without response'))
Explicitly passing a revision is encouraged when loading a model with custom code to ensure no malicious code has been contributed in a newer revision.
Explicitly passing a revision is encouraged when loading a model with custom code to ensure no malicious code has been contributed in a newer revision.
Traceback (most recent call last):
  File "/home/zhangzhong/.local/lib/python3.10/site-packages/urllib3/connectionpool.py", line 703, in urlopen
    httplib_response = self._make_request(
  File "/home/zhangzhong/.local/lib/python3.10/site-packages/urllib3/connectionpool.py", line 386, in _make_request
    self._validate_conn(conn)
  File "/home/zhangzhong/.local/lib/python3.10/site-packages/urllib3/connectionpool.py", line 1042, in _validate_conn
    conn.connect()
  File "/home/zhangzhong/.local/lib/python3.10/site-packages/urllib3/connection.py", line 419, in connect
    self.sock = ssl_wrap_socket(
  File "/home/zhangzhong/.local/lib/python3.10/site-packages/urllib3/util/ssl_.py", line 449, in ssl_wrap_socket
    ssl_sock = _ssl_wrap_socket_impl(
  File "/home/zhangzhong/.local/lib/python3.10/site-packages/urllib3/util/ssl_.py", line 493, in _ssl_wrap_socket_impl
    return ssl_context.wrap_socket(sock, server_hostname=server_hostname)
  File "/opt/anaconda3/lib/python3.10/ssl.py", line 513, in wrap_socket
    return self.sslsocket_class._create(
  File "/opt/anaconda3/lib/python3.10/ssl.py", line 1071, in _create
    self.do_handshake()
  File "/opt/anaconda3/lib/python3.10/ssl.py", line 1342, in do_handshake
    self._sslobj.do_handshake()
ConnectionResetError: [Errno 104] Connection reset by peer

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/zhangzhong/.local/lib/python3.10/site-packages/requests/adapters.py", line 486, in send
    resp = conn.urlopen(
  File "/home/zhangzhong/.local/lib/python3.10/site-packages/urllib3/connectionpool.py", line 787, in urlopen
    retries = retries.increment(
  File "/home/zhangzhong/.local/lib/python3.10/site-packages/urllib3/util/retry.py", line 550, in increment
    raise six.reraise(type(error), error, _stacktrace)
  File "/home/zhangzhong/.local/lib/python3.10/site-packages/urllib3/packages/six.py", line 769, in reraise
    raise value.with_traceback(tb)
  File "/home/zhangzhong/.local/lib/python3.10/site-packages/urllib3/connectionpool.py", line 703, in urlopen
    httplib_response = self._make_request(
  File "/home/zhangzhong/.local/lib/python3.10/site-packages/urllib3/connectionpool.py", line 386, in _make_request
    self._validate_conn(conn)
  File "/home/zhangzhong/.local/lib/python3.10/site-packages/urllib3/connectionpool.py", line 1042, in _validate_conn
    conn.connect()
  File "/home/zhangzhong/.local/lib/python3.10/site-packages/urllib3/connection.py", line 419, in connect
    self.sock = ssl_wrap_socket(
  File "/home/zhangzhong/.local/lib/python3.10/site-packages/urllib3/util/ssl_.py", line 449, in ssl_wrap_socket
    ssl_sock = _ssl_wrap_socket_impl(
  File "/home/zhangzhong/.local/lib/python3.10/site-packages/urllib3/util/ssl_.py", line 493, in _ssl_wrap_socket_impl
    return ssl_context.wrap_socket(sock, server_hostname=server_hostname)
  File "/opt/anaconda3/lib/python3.10/ssl.py", line 513, in wrap_socket
    return self.sslsocket_class._create(
  File "/opt/anaconda3/lib/python3.10/ssl.py", line 1071, in _create
    self.do_handshake()
  File "/opt/anaconda3/lib/python3.10/ssl.py", line 1342, in do_handshake
    self._sslobj.do_handshake()
urllib3.exceptions.ProtocolError: ('Connection aborted.', ConnectionResetError(104, 'Connection reset by peer'))

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/zhangzhong/MOSS/finetune_moss.py", line 305, in <module>
    train(args)
  File "/home/zhangzhong/MOSS/finetune_moss.py", line 177, in train
    tokenizer = AutoTokenizer.from_pretrained(args.model_name_or_path, trust_remote_code=True)
  File "/opt/anaconda3/lib/python3.10/site-packages/transformers/models/auto/tokenization_auto.py", line 626, in from_pretrained
    tokenizer_class = get_class_from_dynamic_module(
  File "/opt/anaconda3/lib/python3.10/site-packages/transformers/dynamic_module_utils.py", line 363, in get_class_from_dynamic_module
    final_module = get_cached_module_file(
  File "/opt/anaconda3/lib/python3.10/site-packages/transformers/dynamic_module_utils.py", line 261, in get_cached_module_file
    commit_hash = model_info(pretrained_model_name_or_path, revision=revision, token=token).sha
  File "/opt/anaconda3/lib/python3.10/site-packages/huggingface_hub/utils/_validators.py", line 94, in _inner_fn
    return fn(*args, **kwargs)
  File "/opt/anaconda3/lib/python3.10/site-packages/huggingface_hub/utils/_deprecation.py", line 98, in inner_f
    return f(*args, **kwargs)
  File "/opt/anaconda3/lib/python3.10/site-packages/huggingface_hub/hf_api.py", line 1283, in model_info
    r = requests.get(
  File "/home/zhangzhong/.local/lib/python3.10/site-packages/requests/api.py", line 73, in get
    return request("get", url, params=params, **kwargs)
  File "/home/zhangzhong/.local/lib/python3.10/site-packages/requests/api.py", line 59, in request
    return session.request(method=method, url=url, **kwargs)
  File "/home/zhangzhong/.local/lib/python3.10/site-packages/requests/sessions.py", line 587, in request
    resp = self.send(prep, **send_kwargs)
  File "/home/zhangzhong/.local/lib/python3.10/site-packages/requests/sessions.py", line 701, in send
    r = adapter.send(request, **kwargs)
  File "/home/zhangzhong/.local/lib/python3.10/site-packages/requests/adapters.py", line 501, in send
    raise ConnectionError(err, request=request)
requests.exceptions.ConnectionError: ('Connection aborted.', ConnectionResetError(104, 'Connection reset by peer'))
Traceback (most recent call last):
  File "/home/zhangzhong/.local/lib/python3.10/site-packages/urllib3/connectionpool.py", line 703, in urlopen
    httplib_response = self._make_request(
  File "/home/zhangzhong/.local/lib/python3.10/site-packages/urllib3/connectionpool.py", line 386, in _make_request
    self._validate_conn(conn)
  File "/home/zhangzhong/.local/lib/python3.10/site-packages/urllib3/connectionpool.py", line 1042, in _validate_conn
    conn.connect()
  File "/home/zhangzhong/.local/lib/python3.10/site-packages/urllib3/connection.py", line 419, in connect
    self.sock = ssl_wrap_socket(
  File "/home/zhangzhong/.local/lib/python3.10/site-packages/urllib3/util/ssl_.py", line 449, in ssl_wrap_socket
    ssl_sock = _ssl_wrap_socket_impl(
  File "/home/zhangzhong/.local/lib/python3.10/site-packages/urllib3/util/ssl_.py", line 493, in _ssl_wrap_socket_impl
    return ssl_context.wrap_socket(sock, server_hostname=server_hostname)
  File "/opt/anaconda3/lib/python3.10/ssl.py", line 513, in wrap_socket
    return self.sslsocket_class._create(
  File "/opt/anaconda3/lib/python3.10/ssl.py", line 1071, in _create
    self.do_handshake()
  File "/opt/anaconda3/lib/python3.10/ssl.py", line 1342, in do_handshake
    self._sslobj.do_handshake()
ConnectionResetError: [Errno 104] Connection reset by peer

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/zhangzhong/.local/lib/python3.10/site-packages/requests/adapters.py", line 486, in send
    resp = conn.urlopen(
  File "/home/zhangzhong/.local/lib/python3.10/site-packages/urllib3/connectionpool.py", line 787, in urlopen
    retries = retries.increment(
  File "/home/zhangzhong/.local/lib/python3.10/site-packages/urllib3/util/retry.py", line 550, in increment
    raise six.reraise(type(error), error, _stacktrace)
  File "/home/zhangzhong/.local/lib/python3.10/site-packages/urllib3/packages/six.py", line 769, in reraise
    raise value.with_traceback(tb)
  File "/home/zhangzhong/.local/lib/python3.10/site-packages/urllib3/connectionpool.py", line 703, in urlopen
    httplib_response = self._make_request(
  File "/home/zhangzhong/.local/lib/python3.10/site-packages/urllib3/connectionpool.py", line 386, in _make_request
    self._validate_conn(conn)
  File "/home/zhangzhong/.local/lib/python3.10/site-packages/urllib3/connectionpool.py", line 1042, in _validate_conn
    conn.connect()
  File "/home/zhangzhong/.local/lib/python3.10/site-packages/urllib3/connection.py", line 419, in connect
    self.sock = ssl_wrap_socket(
  File "/home/zhangzhong/.local/lib/python3.10/site-packages/urllib3/util/ssl_.py", line 449, in ssl_wrap_socket
    ssl_sock = _ssl_wrap_socket_impl(
  File "/home/zhangzhong/.local/lib/python3.10/site-packages/urllib3/util/ssl_.py", line 493, in _ssl_wrap_socket_impl
    return ssl_context.wrap_socket(sock, server_hostname=server_hostname)
  File "/opt/anaconda3/lib/python3.10/ssl.py", line 513, in wrap_socket
    return self.sslsocket_class._create(
  File "/opt/anaconda3/lib/python3.10/ssl.py", line 1071, in _create
    self.do_handshake()
  File "/opt/anaconda3/lib/python3.10/ssl.py", line 1342, in do_handshake
    self._sslobj.do_handshake()
urllib3.exceptions.ProtocolError: ('Connection aborted.', ConnectionResetError(104, 'Connection reset by peer'))

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/zhangzhong/MOSS/finetune_moss.py", line 305, in <module>
    train(args)
  File "/home/zhangzhong/MOSS/finetune_moss.py", line 177, in train
    tokenizer = AutoTokenizer.from_pretrained(args.model_name_or_path, trust_remote_code=True)
  File "/opt/anaconda3/lib/python3.10/site-packages/transformers/models/auto/tokenization_auto.py", line 626, in from_pretrained
    tokenizer_class = get_class_from_dynamic_module(
  File "/opt/anaconda3/lib/python3.10/site-packages/transformers/dynamic_module_utils.py", line 363, in get_class_from_dynamic_module
    final_module = get_cached_module_file(
  File "/opt/anaconda3/lib/python3.10/site-packages/transformers/dynamic_module_utils.py", line 261, in get_cached_module_file
    commit_hash = model_info(pretrained_model_name_or_path, revision=revision, token=token).sha
  File "/opt/anaconda3/lib/python3.10/site-packages/huggingface_hub/utils/_validators.py", line 94, in _inner_fn
    return fn(*args, **kwargs)
  File "/opt/anaconda3/lib/python3.10/site-packages/huggingface_hub/utils/_deprecation.py", line 98, in inner_f
    return f(*args, **kwargs)
  File "/opt/anaconda3/lib/python3.10/site-packages/huggingface_hub/hf_api.py", line 1283, in model_info
    r = requests.get(
  File "/home/zhangzhong/.local/lib/python3.10/site-packages/requests/api.py", line 73, in get
    return request("get", url, params=params, **kwargs)
  File "/home/zhangzhong/.local/lib/python3.10/site-packages/requests/api.py", line 59, in request
    return session.request(method=method, url=url, **kwargs)
  File "/home/zhangzhong/.local/lib/python3.10/site-packages/requests/sessions.py", line 587, in request
    resp = self.send(prep, **send_kwargs)
  File "/home/zhangzhong/.local/lib/python3.10/site-packages/requests/sessions.py", line 701, in send
    r = adapter.send(request, **kwargs)
  File "/home/zhangzhong/.local/lib/python3.10/site-packages/requests/adapters.py", line 501, in send
    raise ConnectionError(err, request=request)
requests.exceptions.ConnectionError: ('Connection aborted.', ConnectionResetError(104, 'Connection reset by peer'))
Traceback (most recent call last):
  File "/home/zhangzhong/MOSS/finetune_moss.py", line 305, in <module>
    train(args)
  File "/home/zhangzhong/MOSS/finetune_moss.py", line 177, in train
    tokenizer = AutoTokenizer.from_pretrained(args.model_name_or_path, trust_remote_code=True)
  File "/opt/anaconda3/lib/python3.10/site-packages/transformers/models/auto/tokenization_auto.py", line 641, in from_pretrained
    return tokenizer_class.from_pretrained(pretrained_model_name_or_path, *inputs, **kwargs)
  File "/opt/anaconda3/lib/python3.10/site-packages/transformers/tokenization_utils_base.py", line 1801, in from_pretrained
    return cls._from_pretrained(
  File "/opt/anaconda3/lib/python3.10/site-packages/transformers/tokenization_utils_base.py", line 1956, in _from_pretrained
    tokenizer = cls(*init_inputs, **init_kwargs)
  File "/home/zhangzhong/.cache/huggingface/modules/transformers_modules/fnlp/moss-moon-003-sft-int4/e3f0d7e7fba3944d5932ca2608b816678220ed25/tokenization_moss.py", line 173, in __init__
    with open(vocab_file, encoding="utf-8") as vocab_handle:
TypeError: expected str, bytes or os.PathLike object, not NoneType
Traceback (most recent call last):
  File "/home/zhangzhong/.local/lib/python3.10/site-packages/urllib3/connectionpool.py", line 703, in urlopen
    httplib_response = self._make_request(
  File "/home/zhangzhong/.local/lib/python3.10/site-packages/urllib3/connectionpool.py", line 386, in _make_request
    self._validate_conn(conn)
  File "/home/zhangzhong/.local/lib/python3.10/site-packages/urllib3/connectionpool.py", line 1042, in _validate_conn
    conn.connect()
  File "/home/zhangzhong/.local/lib/python3.10/site-packages/urllib3/connection.py", line 419, in connect
    self.sock = ssl_wrap_socket(
  File "/home/zhangzhong/.local/lib/python3.10/site-packages/urllib3/util/ssl_.py", line 449, in ssl_wrap_socket
    ssl_sock = _ssl_wrap_socket_impl(
  File "/home/zhangzhong/.local/lib/python3.10/site-packages/urllib3/util/ssl_.py", line 493, in _ssl_wrap_socket_impl
    return ssl_context.wrap_socket(sock, server_hostname=server_hostname)
  File "/opt/anaconda3/lib/python3.10/ssl.py", line 513, in wrap_socket
    return self.sslsocket_class._create(
  File "/opt/anaconda3/lib/python3.10/ssl.py", line 1071, in _create
    self.do_handshake()
  File "/opt/anaconda3/lib/python3.10/ssl.py", line 1342, in do_handshake
    self._sslobj.do_handshake()
ConnectionResetError: [Errno 104] Connection reset by peer

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/zhangzhong/.local/lib/python3.10/site-packages/requests/adapters.py", line 486, in send
    resp = conn.urlopen(
  File "/home/zhangzhong/.local/lib/python3.10/site-packages/urllib3/connectionpool.py", line 787, in urlopen
    retries = retries.increment(
  File "/home/zhangzhong/.local/lib/python3.10/site-packages/urllib3/util/retry.py", line 550, in increment
    raise six.reraise(type(error), error, _stacktrace)
  File "/home/zhangzhong/.local/lib/python3.10/site-packages/urllib3/packages/six.py", line 769, in reraise
    raise value.with_traceback(tb)
  File "/home/zhangzhong/.local/lib/python3.10/site-packages/urllib3/connectionpool.py", line 703, in urlopen
    httplib_response = self._make_request(
  File "/home/zhangzhong/.local/lib/python3.10/site-packages/urllib3/connectionpool.py", line 386, in _make_request
    self._validate_conn(conn)
  File "/home/zhangzhong/.local/lib/python3.10/site-packages/urllib3/connectionpool.py", line 1042, in _validate_conn
    conn.connect()
  File "/home/zhangzhong/.local/lib/python3.10/site-packages/urllib3/connection.py", line 419, in connect
    self.sock = ssl_wrap_socket(
  File "/home/zhangzhong/.local/lib/python3.10/site-packages/urllib3/util/ssl_.py", line 449, in ssl_wrap_socket
    ssl_sock = _ssl_wrap_socket_impl(
  File "/home/zhangzhong/.local/lib/python3.10/site-packages/urllib3/util/ssl_.py", line 493, in _ssl_wrap_socket_impl
    return ssl_context.wrap_socket(sock, server_hostname=server_hostname)
  File "/opt/anaconda3/lib/python3.10/ssl.py", line 513, in wrap_socket
    return self.sslsocket_class._create(
  File "/opt/anaconda3/lib/python3.10/ssl.py", line 1071, in _create
    self.do_handshake()
  File "/opt/anaconda3/lib/python3.10/ssl.py", line 1342, in do_handshake
    self._sslobj.do_handshake()
urllib3.exceptions.ProtocolError: ('Connection aborted.', ConnectionResetError(104, 'Connection reset by peer'))

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/zhangzhong/MOSS/finetune_moss.py", line 305, in <module>
    train(args)
  File "/home/zhangzhong/MOSS/finetune_moss.py", line 177, in train
    tokenizer = AutoTokenizer.from_pretrained(args.model_name_or_path, trust_remote_code=True)
  File "/opt/anaconda3/lib/python3.10/site-packages/transformers/models/auto/tokenization_auto.py", line 641, in from_pretrained
    return tokenizer_class.from_pretrained(pretrained_model_name_or_path, *inputs, **kwargs)
  File "/opt/anaconda3/lib/python3.10/site-packages/transformers/tokenization_utils_base.py", line 1760, in from_pretrained
    resolved_vocab_files[file_id] = cached_file(
  File "/opt/anaconda3/lib/python3.10/site-packages/transformers/utils/hub.py", line 409, in cached_file
    resolved_file = hf_hub_download(
  File "/opt/anaconda3/lib/python3.10/site-packages/huggingface_hub/file_download.py", line 1226, in hf_hub_download
    http_get(
  File "/opt/anaconda3/lib/python3.10/site-packages/huggingface_hub/file_download.py", line 470, in http_get
    r = _request_wrapper(
  File "/opt/anaconda3/lib/python3.10/site-packages/huggingface_hub/file_download.py", line 433, in _request_wrapper
    return http_backoff(
  File "/opt/anaconda3/lib/python3.10/site-packages/huggingface_hub/utils/_http.py", line 105, in http_backoff
    response = requests.request(method=method, url=url, **kwargs)
  File "/home/zhangzhong/.local/lib/python3.10/site-packages/requests/api.py", line 59, in request
    return session.request(method=method, url=url, **kwargs)
  File "/home/zhangzhong/.local/lib/python3.10/site-packages/requests/sessions.py", line 587, in request
    resp = self.send(prep, **send_kwargs)
  File "/home/zhangzhong/.local/lib/python3.10/site-packages/requests/sessions.py", line 701, in send
    r = adapter.send(request, **kwargs)
  File "/home/zhangzhong/.local/lib/python3.10/site-packages/requests/adapters.py", line 501, in send
    raise ConnectionError(err, request=request)
requests.exceptions.ConnectionError: ('Connection aborted.', ConnectionResetError(104, 'Connection reset by peer'))
Traceback (most recent call last):
  File "/home/zhangzhong/.local/lib/python3.10/site-packages/urllib3/connectionpool.py", line 703, in urlopen
    httplib_response = self._make_request(
  File "/home/zhangzhong/.local/lib/python3.10/site-packages/urllib3/connectionpool.py", line 386, in _make_request
    self._validate_conn(conn)
  File "/home/zhangzhong/.local/lib/python3.10/site-packages/urllib3/connectionpool.py", line 1042, in _validate_conn
    conn.connect()
  File "/home/zhangzhong/.local/lib/python3.10/site-packages/urllib3/connection.py", line 461, in connect
    cert = self.sock.getpeercert()
  File "/opt/anaconda3/lib/python3.10/ssl.py", line 1154, in getpeercert
    self._check_connected()
  File "/opt/anaconda3/lib/python3.10/ssl.py", line 1119, in _check_connected
    self.getpeername()
OSError: [Errno 107] Transport endpoint is not connected

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/zhangzhong/.local/lib/python3.10/site-packages/requests/adapters.py", line 486, in send
    resp = conn.urlopen(
  File "/home/zhangzhong/.local/lib/python3.10/site-packages/urllib3/connectionpool.py", line 787, in urlopen
    retries = retries.increment(
  File "/home/zhangzhong/.local/lib/python3.10/site-packages/urllib3/util/retry.py", line 550, in increment
    raise six.reraise(type(error), error, _stacktrace)
  File "/home/zhangzhong/.local/lib/python3.10/site-packages/urllib3/packages/six.py", line 769, in reraise
    raise value.with_traceback(tb)
  File "/home/zhangzhong/.local/lib/python3.10/site-packages/urllib3/connectionpool.py", line 703, in urlopen
    httplib_response = self._make_request(
  File "/home/zhangzhong/.local/lib/python3.10/site-packages/urllib3/connectionpool.py", line 386, in _make_request
    self._validate_conn(conn)
  File "/home/zhangzhong/.local/lib/python3.10/site-packages/urllib3/connectionpool.py", line 1042, in _validate_conn
    conn.connect()
  File "/home/zhangzhong/.local/lib/python3.10/site-packages/urllib3/connection.py", line 461, in connect
    cert = self.sock.getpeercert()
  File "/opt/anaconda3/lib/python3.10/ssl.py", line 1154, in getpeercert
    self._check_connected()
  File "/opt/anaconda3/lib/python3.10/ssl.py", line 1119, in _check_connected
    self.getpeername()
urllib3.exceptions.ProtocolError: ('Connection aborted.', OSError(107, 'Transport endpoint is not connected'))

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/zhangzhong/MOSS/finetune_moss.py", line 305, in <module>
    train(args)
  File "/home/zhangzhong/MOSS/finetune_moss.py", line 177, in train
    tokenizer = AutoTokenizer.from_pretrained(args.model_name_or_path, trust_remote_code=True)
  File "/opt/anaconda3/lib/python3.10/site-packages/transformers/models/auto/tokenization_auto.py", line 641, in from_pretrained
    return tokenizer_class.from_pretrained(pretrained_model_name_or_path, *inputs, **kwargs)
  File "/opt/anaconda3/lib/python3.10/site-packages/transformers/tokenization_utils_base.py", line 1760, in from_pretrained
    resolved_vocab_files[file_id] = cached_file(
  File "/opt/anaconda3/lib/python3.10/site-packages/transformers/utils/hub.py", line 409, in cached_file
    resolved_file = hf_hub_download(
  File "/opt/anaconda3/lib/python3.10/site-packages/huggingface_hub/file_download.py", line 1226, in hf_hub_download
    http_get(
  File "/opt/anaconda3/lib/python3.10/site-packages/huggingface_hub/file_download.py", line 470, in http_get
    r = _request_wrapper(
  File "/opt/anaconda3/lib/python3.10/site-packages/huggingface_hub/file_download.py", line 433, in _request_wrapper
    return http_backoff(
  File "/opt/anaconda3/lib/python3.10/site-packages/huggingface_hub/utils/_http.py", line 105, in http_backoff
    response = requests.request(method=method, url=url, **kwargs)
  File "/home/zhangzhong/.local/lib/python3.10/site-packages/requests/api.py", line 59, in request
    return session.request(method=method, url=url, **kwargs)
  File "/home/zhangzhong/.local/lib/python3.10/site-packages/requests/sessions.py", line 587, in request
    resp = self.send(prep, **send_kwargs)
  File "/home/zhangzhong/.local/lib/python3.10/site-packages/requests/sessions.py", line 701, in send
    r = adapter.send(request, **kwargs)
  File "/home/zhangzhong/.local/lib/python3.10/site-packages/requests/adapters.py", line 501, in send
    raise ConnectionError(err, request=request)
requests.exceptions.ConnectionError: ('Connection aborted.', OSError(107, 'Transport endpoint is not connected'))

Downloading: 44%|██████████████████████████▊ | 1.10M/2.50M [00:01<00:00, 1.50MB/s]
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 2556277 closing signal SIGTERM
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 1 (pid: 2556278) of binary: /opt/anaconda3/bin/python
Traceback (most recent call last):
  File "/opt/anaconda3/bin/accelerate", line 8, in <module>
    sys.exit(main())
  File "/opt/anaconda3/lib/python3.10/site-packages/accelerate/commands/accelerate_cli.py", line 45, in main
    args.func(args)
  File "/opt/anaconda3/lib/python3.10/site-packages/accelerate/commands/launch.py", line 900, in launch_command
    deepspeed_launcher(args)
  File "/opt/anaconda3/lib/python3.10/site-packages/accelerate/commands/launch.py", line 643, in deepspeed_launcher
    distrib_run.run(args)
  File "/home/zhangzhong/.local/lib/python3.10/site-packages/torch/distributed/run.py", line 785, in run
    elastic_launch(
  File "/home/zhangzhong/.local/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 134, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/zhangzhong/.local/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 250, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:

finetune_moss.py FAILED

Failures:
[1]:
  time : 2023-05-26_09:56:29
  host : s012.ai.ldap
  rank : 2 (local_rank: 2)
  exitcode : 1 (pid: 2556279)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[2]:
  time : 2023-05-26_09:56:29
  host : s012.ai.ldap
  rank : 3 (local_rank: 3)
  exitcode : 1 (pid: 2556280)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[3]:
  time : 2023-05-26_09:56:29
  host : s012.ai.ldap
  rank : 4 (local_rank: 4)
  exitcode : 1 (pid: 2556281)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[4]:
  time : 2023-05-26_09:56:29
  host : s012.ai.ldap
  rank : 5 (local_rank: 5)
  exitcode : 1 (pid: 2556282)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[5]:
  time : 2023-05-26_09:56:29
  host : s012.ai.ldap
  rank : 6 (local_rank: 6)
  exitcode : 1 (pid: 2556283)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[6]:
  time : 2023-05-26_09:56:29
  host : s012.ai.ldap
  rank : 7 (local_rank: 7)
  exitcode : 1 (pid: 2556284)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html

Root Cause (first observed failure):
[0]:
  time : 2023-05-26_09:56:29
  host : s012.ai.ldap
  rank : 1 (local_rank: 1)
  exitcode : 1 (pid: 2556278)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
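
For readers hitting the same trace: the failing ranks all die inside `AutoTokenizer.from_pretrained` while talking to the Hugging Face Hub, and the exceptions (`Connection reset by peer`, `Transport endpoint is not connected`) are network errors rather than training errors. The log does not say why the connection drops, so the following is only a minimal sketch of one possible mitigation, assuming the failures are transient. `retry_with_backoff` is a hypothetical helper, not part of MOSS, transformers, or accelerate. It catches `OSError` because both the built-in `ConnectionResetError` and `requests.exceptions.ConnectionError` derive from it.

```python
import time


def retry_with_backoff(fn, attempts=5, base_delay=1.0, sleep=time.sleep):
    """Call fn() and retry on OSError with exponentially growing delays.

    Hypothetical helper for illustration only. `sleep` is injectable so the
    waiting behaviour can be observed without actually pausing.
    """
    if attempts < 1:
        raise ValueError("attempts must be >= 1")
    for attempt in range(attempts):
        try:
            return fn()
        except OSError:
            if attempt == attempts - 1:
                raise  # out of retries: surface the last network error
            # Wait base_delay, 2*base_delay, 4*base_delay, ... between tries.
            sleep(base_delay * 2 ** attempt)


# Possible use around the call at finetune_moss.py line 177 (assumption):
# tokenizer = retry_with_backoff(
#     lambda: AutoTokenizer.from_pretrained(
#         args.model_name_or_path, trust_remote_code=True
#     )
# )
```

If the Hub is simply unreachable from the training machine, retrying will not help. In that case another option is to fetch the model files once on a machine with access and pass the resulting local directory as `--model_name_or_path`, since `from_pretrained` also accepts a local path.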

LMXKO commented 1 year ago

Hardware: single machine, 8× 3090 GPUs
Config:
command_file: null
commands: null
compute_environment: LOCAL_MACHINE
deepspeed_config:
  gradient_accumulation_steps: 1
  gradient_clipping: 1.0
  offload_optimizer_device: none
  offload_param_device: none
  zero3_init_flag: true
  zero3_save_16bit_model: true
  zero_stage: 3
distributed_type: DEEPSPEED
downcast_bf16: 'no'
dynamo_backend: 'yes'
fsdp_config: {}
gpu_ids: null
machine_rank: 0
main_process_ip: 192.168.33.201
main_process_port: 21889
main_training_function: main
megatron_lm_config: {}
mixed_precision: fp16
num_machines: 1
num_processes: 8
rdzv_backend: static
same_network: true
tpu_name: null
tpu_zone: null
use_cpu: false

and

num_machines=1
num_processes=$((num_machines * 8))
machine_rank=0

accelerate launch \
    --config_file ./configs/sft.yaml \
    --num_processes $num_processes \
    --num_machines $num_machines \
    --machine_rank $machine_rank \
    --deepspeed_multinode_launcher standard finetune_moss.py \
    --model_name_or_path fnlp/moss-moon-003-sft-int4 \
    --data_dir ./sft_data \
    --output_dir ./ckpts/moss-moon-003-sft-int4 \
    --log_dir ./train_logs/moss-moon-003-sft-int4 \
    --n_epochs 2 \
    --train_bsz_per_gpu 4 \
    --eval_bsz_per_gpu 4 \
    --learning_rate 0.000015 \
    --eval_step 15 \
    --save_step 35

Error: Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.

INFO:torch.distributed.distributed_c10d:Added key: store_based_barrier_key:1 to store for rank: 3
INFO:torch.distributed.distributed_c10d:Added key: store_based_barrier_key:1 to store for rank: 5
INFO:torch.distributed.distributed_c10d:Added key: store_based_barrier_key:1 to store for rank: 7
INFO:torch.distributed.distributed_c10d:Added key: store_based_barrier_key:1 to store for rank: 0
INFO:torch.distributed.distributed_c10d:Added key: store_based_barrier_key:1 to store for rank: 4
INFO:torch.distributed.distributed_c10d:Added key: store_based_barrier_key:1 to store for rank: 2
INFO:torch.distributed.distributed_c10d:Added key: store_based_barrier_key:1 to store for rank: 1
INFO:torch.distributed.distributed_c10d:Added key: store_based_barrier_key:1 to store for rank: 6
INFO:torch.distributed.distributed_c10d:Rank 6: Completed store-based barrier for key:store_based_barrier_key:1 with 8 nodes.
INFO:torch.distributed.distributed_c10d:Rank 0: Completed store-based barrier for key:store_based_barrier_key:1 with 8 nodes.
INFO:torch.distributed.distributed_c10d:Rank 4: Completed store-based barrier for key:store_based_barrier_key:1 with 8 nodes.
INFO:torch.distributed.distributed_c10d:Rank 5: Completed store-based barrier for key:store_based_barrier_key:1 with 8 nodes.
INFO:torch.distributed.distributed_c10d:Rank 2: Completed store-based barrier for key:store_based_barrier_key:1 with 8 nodes.
INFO:torch.distributed.distributed_c10d:Rank 3: Completed store-based barrier for key:store_based_barrier_key:1 with 8 nodes.
INFO:torch.distributed.distributed_c10d:Rank 7: Completed store-based barrier for key:store_based_barrier_key:1 with 8 nodes.
INFO:torch.distributed.distributed_c10d:Rank 1: Completed store-based barrier for key:store_based_barrier_key:1 with 8 nodes.
Explicitly passing a revision is encouraged when loading a model with custom code to ensure no malicious code has been contributed in a newer revision.
Explicitly passing a revision is encouraged when loading a model with custom code to ensure no malicious code has been contributed in a newer revision.
Explicitly passing a revision is encouraged when loading a model with custom code to ensure no malicious code has been contributed in a newer revision.
Explicitly passing a revision is encouraged when loading a model with custom code to ensure no malicious code has been contributed in a newer revision.
Explicitly passing a revision is encouraged when loading a model with custom code to ensure no malicious code has been contributed in a newer revision.
Explicitly passing a revision is encouraged when loading a model with custom code to ensure no malicious code has been contributed in a newer revision.
Traceback (most recent call last):
  File "/home/zhangzhong/.local/lib/python3.10/site-packages/urllib3/connectionpool.py", line 703, in urlopen
    httplib_response = self._make_request(
  File "/home/zhangzhong/.local/lib/python3.10/site-packages/urllib3/connectionpool.py", line 449, in _make_request
    six.raise_from(e, None)
  File "<string>", line 3, in raise_from
  File "/home/zhangzhong/.local/lib/python3.10/site-packages/urllib3/connectionpool.py", line 444, in _make_request
    httplib_response = conn.getresponse()
  File "/opt/anaconda3/lib/python3.10/http/client.py", line 1374, in getresponse
    response.begin()
  File "/opt/anaconda3/lib/python3.10/http/client.py", line 318, in begin
    version, status, reason = self._read_status()
  File "/opt/anaconda3/lib/python3.10/http/client.py", line 287, in _read_status
    raise RemoteDisconnected("Remote end closed connection without"
http.client.RemoteDisconnected: Remote end closed connection without response

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/zhangzhong/.local/lib/python3.10/site-packages/requests/adapters.py", line 486, in send
    resp = conn.urlopen(
  File "/home/zhangzhong/.local/lib/python3.10/site-packages/urllib3/connectionpool.py", line 787, in urlopen
    retries = retries.increment(
  File "/home/zhangzhong/.local/lib/python3.10/site-packages/urllib3/util/retry.py", line 550, in increment
    raise six.reraise(type(error), error, _stacktrace)
  File "/home/zhangzhong/.local/lib/python3.10/site-packages/urllib3/packages/six.py", line 769, in reraise
    raise value.with_traceback(tb)
  File "/home/zhangzhong/.local/lib/python3.10/site-packages/urllib3/connectionpool.py", line 703, in urlopen
    httplib_response = self._make_request(
  File "/home/zhangzhong/.local/lib/python3.10/site-packages/urllib3/connectionpool.py", line 449, in _make_request
    six.raise_from(e, None)
  File "<string>", line 3, in raise_from
  File "/home/zhangzhong/.local/lib/python3.10/site-packages/urllib3/connectionpool.py", line 444, in _make_request
    httplib_response = conn.getresponse()
  File "/opt/anaconda3/lib/python3.10/http/client.py", line 1374, in getresponse
    response.begin()
  File "/opt/anaconda3/lib/python3.10/http/client.py", line 318, in begin
    version, status, reason = self._read_status()
  File "/opt/anaconda3/lib/python3.10/http/client.py", line 287, in _read_status
    raise RemoteDisconnected("Remote end closed connection without"
urllib3.exceptions.ProtocolError: ('Connection aborted.', RemoteDisconnected('Remote end closed connection without response'))

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/zhangzhong/MOSS/finetune_moss.py", line 305, in <module>
    train(args)
  File "/home/zhangzhong/MOSS/finetune_moss.py", line 177, in train
    tokenizer = AutoTokenizer.from_pretrained(args.model_name_or_path, trust_remote_code=True)
  File "/opt/anaconda3/lib/python3.10/site-packages/transformers/models/auto/tokenization_auto.py", line 626, in from_pretrained
    tokenizer_class = get_class_from_dynamic_module(
  File "/opt/anaconda3/lib/python3.10/site-packages/transformers/dynamic_module_utils.py", line 363, in get_class_from_dynamic_module
    final_module = get_cached_module_file(
  File "/opt/anaconda3/lib/python3.10/site-packages/transformers/dynamic_module_utils.py", line 261, in get_cached_module_file
    commit_hash = model_info(pretrained_model_name_or_path, revision=revision, token=token).sha
  File "/opt/anaconda3/lib/python3.10/site-packages/huggingface_hub/utils/_validators.py", line 94, in _inner_fn
    return fn(*args, **kwargs)
  File "/opt/anaconda3/lib/python3.10/site-packages/huggingface_hub/utils/_deprecation.py", line 98, in inner_f
    return f(*args, **kwargs)
  File "/opt/anaconda3/lib/python3.10/site-packages/huggingface_hub/hf_api.py", line 1283, in model_info
    r = requests.get(
  File "/home/zhangzhong/.local/lib/python3.10/site-packages/requests/api.py", line 73, in get
    return request("get", url, params=params, **kwargs)
  File "/home/zhangzhong/.local/lib/python3.10/site-packages/requests/api.py", line 59, in request
    return session.request(method=method, url=url, **kwargs)
  File "/home/zhangzhong/.local/lib/python3.10/site-packages/requests/sessions.py", line 587, in request
    resp = self.send(prep, **send_kwargs)
  File "/home/zhangzhong/.local/lib/python3.10/site-packages/requests/sessions.py", line 701, in send
    r = adapter.send(request, **kwargs)
  File "/home/zhangzhong/.local/lib/python3.10/site-packages/requests/adapters.py", line 501, in send
    raise ConnectionError(err, request=request)
requests.exceptions.ConnectionError: ('Connection aborted.', RemoteDisconnected('Remote end closed connection without response'))

Traceback (most recent call last):
  File "/home/zhangzhong/.local/lib/python3.10/site-packages/urllib3/connectionpool.py", line 703, in urlopen
    httplib_response = self._make_request(
  File "/home/zhangzhong/.local/lib/python3.10/site-packages/urllib3/connectionpool.py", line 449, in _make_request
    six.raise_from(e, None)
  File "<string>", line 3, in raise_from
  File "/home/zhangzhong/.local/lib/python3.10/site-packages/urllib3/connectionpool.py", line 444, in _make_request
    httplib_response = conn.getresponse()
  File "/opt/anaconda3/lib/python3.10/http/client.py", line 1374, in getresponse
    response.begin()
  File "/opt/anaconda3/lib/python3.10/http/client.py", line 318, in begin
    version, status, reason = self._read_status()
  File "/opt/anaconda3/lib/python3.10/http/client.py", line 287, in _read_status
    raise RemoteDisconnected("Remote end closed connection without"
http.client.RemoteDisconnected: Remote end closed connection without response

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/zhangzhong/.local/lib/python3.10/site-packages/requests/adapters.py", line 486, in send
    resp = conn.urlopen(
  File "/home/zhangzhong/.local/lib/python3.10/site-packages/urllib3/connectionpool.py", line 787, in urlopen
    retries = retries.increment(
  File "/home/zhangzhong/.local/lib/python3.10/site-packages/urllib3/util/retry.py", line 550, in increment
    raise six.reraise(type(error), error, _stacktrace)
  File "/home/zhangzhong/.local/lib/python3.10/site-packages/urllib3/packages/six.py", line 769, in reraise
    raise value.with_traceback(tb)
  File "/home/zhangzhong/.local/lib/python3.10/site-packages/urllib3/connectionpool.py", line 703, in urlopen
    httplib_response = self._make_request(
  File "/home/zhangzhong/.local/lib/python3.10/site-packages/urllib3/connectionpool.py", line 449, in _make_request
    six.raise_from(e, None)
  File "<string>", line 3, in raise_from
  File "/home/zhangzhong/.local/lib/python3.10/site-packages/urllib3/connectionpool.py", line 444, in _make_request
    httplib_response = conn.getresponse()
  File "/opt/anaconda3/lib/python3.10/http/client.py", line 1374, in getresponse
    response.begin()
  File "/opt/anaconda3/lib/python3.10/http/client.py", line 318, in begin
    version, status, reason = self._read_status()
  File "/opt/anaconda3/lib/python3.10/http/client.py", line 287, in _read_status
    raise RemoteDisconnected("Remote end closed connection without"
urllib3.exceptions.ProtocolError: ('Connection aborted.', RemoteDisconnected('Remote end closed connection without response'))

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/zhangzhong/MOSS/finetune_moss.py", line 305, in <module>
    train(args)
  File "/home/zhangzhong/MOSS/finetune_moss.py", line 177, in train
    tokenizer = AutoTokenizer.from_pretrained(args.model_name_or_path, trust_remote_code=True)
  File "/opt/anaconda3/lib/python3.10/site-packages/transformers/models/auto/tokenization_auto.py", line 626, in from_pretrained
    tokenizer_class = get_class_from_dynamic_module(
  File "/opt/anaconda3/lib/python3.10/site-packages/transformers/dynamic_module_utils.py", line 363, in get_class_from_dynamic_module
    final_module = get_cached_module_file(
  File "/opt/anaconda3/lib/python3.10/site-packages/transformers/dynamic_module_utils.py", line 261, in get_cached_module_file
    commit_hash = model_info(pretrained_model_name_or_path, revision=revision, token=token).sha
  File "/opt/anaconda3/lib/python3.10/site-packages/huggingface_hub/utils/_validators.py", line 94, in _inner_fn
    return fn(*args, **kwargs)
  File "/opt/anaconda3/lib/python3.10/site-packages/huggingface_hub/utils/_deprecation.py", line 98, in inner_f
    return f(*args, **kwargs)
  File "/opt/anaconda3/lib/python3.10/site-packages/huggingface_hub/hf_api.py", line 1283, in model_info
    r = requests.get(
  File "/home/zhangzhong/.local/lib/python3.10/site-packages/requests/api.py", line 73, in get
    return request("get", url, params=params, **kwargs)
  File "/home/zhangzhong/.local/lib/python3.10/site-packages/requests/api.py", line 59, in request
    return session.request(method=method, url=url, **kwargs)
  File "/home/zhangzhong/.local/lib/python3.10/site-packages/requests/sessions.py", line 587, in request
    resp = self.send(prep, **send_kwargs)
  File "/home/zhangzhong/.local/lib/python3.10/site-packages/requests/sessions.py", line 701, in send
    r = adapter.send(request, **kwargs)
  File "/home/zhangzhong/.local/lib/python3.10/site-packages/requests/adapters.py", line 501, in send
    raise ConnectionError(err, request=request)
requests.exceptions.ConnectionError: ('Connection aborted.', RemoteDisconnected('Remote end closed connection without response'))

Explicitly passing a revision is encouraged when loading a model with custom code to ensure no malicious code has been contributed in a newer revision.
Explicitly passing a revision is encouraged when loading a model with custom code to ensure no malicious code has been contributed in a newer revision.
Traceback (most recent call last):
  File "/home/zhangzhong/.local/lib/python3.10/site-packages/urllib3/connectionpool.py", line 703, in urlopen
    httplib_response = self._make_request(
  File "/home/zhangzhong/.local/lib/python3.10/site-packages/urllib3/connectionpool.py", line 386, in _make_request
    self._validate_conn(conn)
  File "/home/zhangzhong/.local/lib/python3.10/site-packages/urllib3/connectionpool.py", line 1042, in _validate_conn
    conn.connect()
  File "/home/zhangzhong/.local/lib/python3.10/site-packages/urllib3/connection.py", line 419, in connect
    self.sock = ssl_wrap_socket(
  File "/home/zhangzhong/.local/lib/python3.10/site-packages/urllib3/util/ssl_.py", line 449, in ssl_wrap_socket
    ssl_sock = _ssl_wrap_socket_impl(
  File "/home/zhangzhong/.local/lib/python3.10/site-packages/urllib3/util/ssl_.py", line 493, in _ssl_wrap_socket_impl
    return ssl_context.wrap_socket(sock, server_hostname=server_hostname)
  File "/opt/anaconda3/lib/python3.10/ssl.py", line 513, in wrap_socket
    return self.sslsocket_class._create(
  File "/opt/anaconda3/lib/python3.10/ssl.py", line 1071, in _create
    self.do_handshake()
  File "/opt/anaconda3/lib/python3.10/ssl.py", line 1342, in do_handshake
    self._sslobj.do_handshake()
ConnectionResetError: [Errno 104] Connection reset by peer

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/zhangzhong/.local/lib/python3.10/site-packages/requests/adapters.py", line 486, in send
    resp = conn.urlopen(
  File "/home/zhangzhong/.local/lib/python3.10/site-packages/urllib3/connectionpool.py", line 787, in urlopen
    retries = retries.increment(
  File "/home/zhangzhong/.local/lib/python3.10/site-packages/urllib3/util/retry.py", line 550, in increment
    raise six.reraise(type(error), error, _stacktrace)
  File "/home/zhangzhong/.local/lib/python3.10/site-packages/urllib3/packages/six.py", line 769, in reraise
    raise value.with_traceback(tb)
  File "/home/zhangzhong/.local/lib/python3.10/site-packages/urllib3/connectionpool.py", line 703, in urlopen
    httplib_response = self._make_request(
  File "/home/zhangzhong/.local/lib/python3.10/site-packages/urllib3/connectionpool.py", line 386, in _make_request
    self._validate_conn(conn)
  File "/home/zhangzhong/.local/lib/python3.10/site-packages/urllib3/connectionpool.py", line 1042, in _validate_conn
    conn.connect()
  File "/home/zhangzhong/.local/lib/python3.10/site-packages/urllib3/connection.py", line 419, in connect
    self.sock = ssl_wrap_socket(
  File "/home/zhangzhong/.local/lib/python3.10/site-packages/urllib3/util/ssl_.py", line 449, in ssl_wrap_socket
    ssl_sock = _ssl_wrap_socket_impl(
  File "/home/zhangzhong/.local/lib/python3.10/site-packages/urllib3/util/ssl_.py", line 493, in _ssl_wrap_socket_impl
    return ssl_context.wrap_socket(sock, server_hostname=server_hostname)
  File "/opt/anaconda3/lib/python3.10/ssl.py", line 513, in wrap_socket
    return self.sslsocket_class._create(
  File "/opt/anaconda3/lib/python3.10/ssl.py", line 1071, in _create
    self.do_handshake()
  File "/opt/anaconda3/lib/python3.10/ssl.py", line 1342, in do_handshake
    self._sslobj.do_handshake()
urllib3.exceptions.ProtocolError: ('Connection aborted.', ConnectionResetError(104, 'Connection reset by peer'))

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/zhangzhong/MOSS/finetune_moss.py", line 305, in <module>
    train(args)
  File "/home/zhangzhong/MOSS/finetune_moss.py", line 177, in train
    tokenizer = AutoTokenizer.from_pretrained(args.model_name_or_path, trust_remote_code=True)
  File "/opt/anaconda3/lib/python3.10/site-packages/transformers/models/auto/tokenization_auto.py", line 626, in from_pretrained
    tokenizer_class = get_class_from_dynamic_module(
  File "/opt/anaconda3/lib/python3.10/site-packages/transformers/dynamic_module_utils.py", line 363, in get_class_from_dynamic_module
    final_module = get_cached_module_file(
  File "/opt/anaconda3/lib/python3.10/site-packages/transformers/dynamic_module_utils.py", line 261, in get_cached_module_file
    commit_hash = model_info(pretrained_model_name_or_path, revision=revision, token=token).sha
  File "/opt/anaconda3/lib/python3.10/site-packages/huggingface_hub/utils/_validators.py", line 94, in _inner_fn
    return fn(*args, **kwargs)
  File "/opt/anaconda3/lib/python3.10/site-packages/huggingface_hub/utils/_deprecation.py", line 98, in inner_f
    return f(*args, **kwargs)
  File "/opt/anaconda3/lib/python3.10/site-packages/huggingface_hub/hf_api.py", line 1283, in model_info
    r = requests.get(
  File "/home/zhangzhong/.local/lib/python3.10/site-packages/requests/api.py", line 73, in get
    return request("get", url, params=params, **kwargs)
  File "/home/zhangzhong/.local/lib/python3.10/site-packages/requests/api.py", line 59, in request
    return session.request(method=method, url=url, **kwargs)
  File "/home/zhangzhong/.local/lib/python3.10/site-packages/requests/sessions.py", line 587, in request
    resp = self.send(prep, **send_kwargs)
  File "/home/zhangzhong/.local/lib/python3.10/site-packages/requests/sessions.py", line 701, in send
    r = adapter.send(request, **kwargs)
  File "/home/zhangzhong/.local/lib/python3.10/site-packages/requests/adapters.py", line 501, in send
    raise ConnectionError(err, request=request)
requests.exceptions.ConnectionError: ('Connection aborted.', ConnectionResetError(104, 'Connection reset by peer'))

Traceback (most recent call last):
  File "/home/zhangzhong/.local/lib/python3.10/site-packages/urllib3/connectionpool.py", line 703, in urlopen
    httplib_response = self._make_request(
  File "/home/zhangzhong/.local/lib/python3.10/site-packages/urllib3/connectionpool.py", line 386, in _make_request
    self._validate_conn(conn)
  File "/home/zhangzhong/.local/lib/python3.10/site-packages/urllib3/connectionpool.py", line 1042, in _validate_conn
    conn.connect()
  File "/home/zhangzhong/.local/lib/python3.10/site-packages/urllib3/connection.py", line 419, in connect
    self.sock = ssl_wrap_socket(
  File "/home/zhangzhong/.local/lib/python3.10/site-packages/urllib3/util/ssl_.py", line 449, in ssl_wrap_socket
    ssl_sock = _ssl_wrap_socket_impl(
  File "/home/zhangzhong/.local/lib/python3.10/site-packages/urllib3/util/ssl_.py", line 493, in _ssl_wrap_socket_impl
    return ssl_context.wrap_socket(sock, server_hostname=server_hostname)
  File "/opt/anaconda3/lib/python3.10/ssl.py", line 513, in wrap_socket
    return self.sslsocket_class._create(
  File "/opt/anaconda3/lib/python3.10/ssl.py", line 1071, in _create
    self.do_handshake()
  File "/opt/anaconda3/lib/python3.10/ssl.py", line 1342, in do_handshake
    self._sslobj.do_handshake()
ConnectionResetError: [Errno 104] Connection reset by peer

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/zhangzhong/.local/lib/python3.10/site-packages/requests/adapters.py", line 486, in send
    resp = conn.urlopen(
  File "/home/zhangzhong/.local/lib/python3.10/site-packages/urllib3/connectionpool.py", line 787, in urlopen
    retries = retries.increment(
  File "/home/zhangzhong/.local/lib/python3.10/site-packages/urllib3/util/retry.py", line 550, in increment
    raise six.reraise(type(error), error, _stacktrace)
  File "/home/zhangzhong/.local/lib/python3.10/site-packages/urllib3/packages/six.py", line 769, in reraise
    raise value.with_traceback(tb)
  File "/home/zhangzhong/.local/lib/python3.10/site-packages/urllib3/connectionpool.py", line 703, in urlopen
    httplib_response = self._make_request(
  File "/home/zhangzhong/.local/lib/python3.10/site-packages/urllib3/connectionpool.py", line 386, in _make_request
    self._validate_conn(conn)
  File "/home/zhangzhong/.local/lib/python3.10/site-packages/urllib3/connectionpool.py", line 1042, in _validate_conn
    conn.connect()
  File "/home/zhangzhong/.local/lib/python3.10/site-packages/urllib3/connection.py", line 419, in connect
    self.sock = ssl_wrap_socket(
  File "/home/zhangzhong/.local/lib/python3.10/site-packages/urllib3/util/ssl_.py", line 449, in ssl_wrap_socket
    ssl_sock = _ssl_wrap_socket_impl(
  File "/home/zhangzhong/.local/lib/python3.10/site-packages/urllib3/util/ssl_.py", line 493, in _ssl_wrap_socket_impl
    return ssl_context.wrap_socket(sock, server_hostname=server_hostname)
  File "/opt/anaconda3/lib/python3.10/ssl.py", line 513, in wrap_socket
    return self.sslsocket_class._create(
  File "/opt/anaconda3/lib/python3.10/ssl.py", line 1071, in _create
    self.do_handshake()
  File "/opt/anaconda3/lib/python3.10/ssl.py", line 1342, in do_handshake
    self._sslobj.do_handshake()
urllib3.exceptions.ProtocolError: ('Connection aborted.', ConnectionResetError(104, 'Connection reset by peer'))

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/zhangzhong/MOSS/finetune_moss.py", line 305, in <module>
    train(args)
  File "/home/zhangzhong/MOSS/finetune_moss.py", line 177, in train
    tokenizer = AutoTokenizer.from_pretrained(args.model_name_or_path, trust_remote_code=True)
  File "/opt/anaconda3/lib/python3.10/site-packages/transformers/models/auto/tokenization_auto.py", line 626, in from_pretrained
    tokenizer_class = get_class_from_dynamic_module(
  File "/opt/anaconda3/lib/python3.10/site-packages/transformers/dynamic_module_utils.py", line 363, in get_class_from_dynamic_module
    final_module = get_cached_module_file(
  File "/opt/anaconda3/lib/python3.10/site-packages/transformers/dynamic_module_utils.py", line 261, in get_cached_module_file
    commit_hash = model_info(pretrained_model_name_or_path, revision=revision, token=token).sha
  File "/opt/anaconda3/lib/python3.10/site-packages/huggingface_hub/utils/_validators.py", line 94, in _inner_fn
    return fn(*args, **kwargs)
  File "/opt/anaconda3/lib/python3.10/site-packages/huggingface_hub/utils/_deprecation.py", line 98, in inner_f
    return f(*args, **kwargs)
  File "/opt/anaconda3/lib/python3.10/site-packages/huggingface_hub/hf_api.py", line 1283, in model_info
    r = requests.get(
  File "/home/zhangzhong/.local/lib/python3.10/site-packages/requests/api.py", line 73, in get
    return request("get", url, params=params, **kwargs)
  File "/home/zhangzhong/.local/lib/python3.10/site-packages/requests/api.py", line 59, in request
    return session.request(method=method, url=url, **kwargs)
  File "/home/zhangzhong/.local/lib/python3.10/site-packages/requests/sessions.py", line 587, in request
    resp = self.send(prep, **send_kwargs)
  File "/home/zhangzhong/.local/lib/python3.10/site-packages/requests/sessions.py", line 701, in send
    r = adapter.send(request, **kwargs)
  File "/home/zhangzhong/.local/lib/python3.10/site-packages/requests/adapters.py", line 501, in send
    raise ConnectionError(err, request=request)
requests.exceptions.ConnectionError: ('Connection aborted.', ConnectionResetError(104, 'Connection reset by peer'))

Traceback (most recent call last):
  File "/home/zhangzhong/MOSS/finetune_moss.py", line 305, in <module>
    train(args)
  File "/home/zhangzhong/MOSS/finetune_moss.py", line 177, in train
    tokenizer = AutoTokenizer.from_pretrained(args.model_name_or_path, trust_remote_code=True)
  File "/opt/anaconda3/lib/python3.10/site-packages/transformers/models/auto/tokenization_auto.py", line 641, in from_pretrained
    return tokenizer_class.from_pretrained(pretrained_model_name_or_path, *inputs, **kwargs)
  File "/opt/anaconda3/lib/python3.10/site-packages/transformers/tokenization_utils_base.py", line 1801, in from_pretrained
    return cls._from_pretrained(
  File "/opt/anaconda3/lib/python3.10/site-packages/transformers/tokenization_utils_base.py", line 1956, in _from_pretrained
    tokenizer = cls(*init_inputs, **init_kwargs)
  File "/home/zhangzhong/.cache/huggingface/modules/transformers_modules/fnlp/moss-moon-003-sft-int4/e3f0d7e7fba3944d5932ca2608b816678220ed25/tokenization_moss.py", line 173, in __init__
    with open(vocab_file, encoding="utf-8") as vocab_handle:
TypeError: expected str, bytes or os.PathLike object, not NoneType

Traceback (most recent call last):
  File "/home/zhangzhong/.local/lib/python3.10/site-packages/urllib3/connectionpool.py", line 703, in urlopen
    httplib_response = self._make_request(
  File "/home/zhangzhong/.local/lib/python3.10/site-packages/urllib3/connectionpool.py", line 386, in _make_request
    self._validate_conn(conn)
  File "/home/zhangzhong/.local/lib/python3.10/site-packages/urllib3/connectionpool.py", line 1042, in _validate_conn
    conn.connect()
  File "/home/zhangzhong/.local/lib/python3.10/site-packages/urllib3/connection.py", line 419, in connect
    self.sock = ssl_wrap_socket(
  File "/home/zhangzhong/.local/lib/python3.10/site-packages/urllib3/util/ssl_.py", line 449, in ssl_wrap_socket
    ssl_sock = _ssl_wrap_socket_impl(
  File "/home/zhangzhong/.local/lib/python3.10/site-packages/urllib3/util/ssl_.py", line 493, in _ssl_wrap_socket_impl
    return ssl_context.wrap_socket(sock, server_hostname=server_hostname)
  File "/opt/anaconda3/lib/python3.10/ssl.py", line 513, in wrap_socket
    return self.sslsocket_class._create(
  File "/opt/anaconda3/lib/python3.10/ssl.py", line 1071, in _create
    self.do_handshake()
  File "/opt/anaconda3/lib/python3.10/ssl.py", line 1342, in do_handshake
    self._sslobj.do_handshake()
ConnectionResetError: [Errno 104] Connection reset by peer

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/zhangzhong/.local/lib/python3.10/site-packages/requests/adapters.py", line 486, in send
    resp = conn.urlopen(
  File "/home/zhangzhong/.local/lib/python3.10/site-packages/urllib3/connectionpool.py", line 787, in urlopen
    retries = retries.increment(
  File "/home/zhangzhong/.local/lib/python3.10/site-packages/urllib3/util/retry.py", line 550, in increment
    raise six.reraise(type(error), error, _stacktrace)
  File "/home/zhangzhong/.local/lib/python3.10/site-packages/urllib3/packages/six.py", line 769, in reraise
    raise value.with_traceback(tb)
  File "/home/zhangzhong/.local/lib/python3.10/site-packages/urllib3/connectionpool.py", line 703, in urlopen
    httplib_response = self._make_request(
  File "/home/zhangzhong/.local/lib/python3.10/site-packages/urllib3/connectionpool.py", line 386, in _make_request
    self._validate_conn(conn)
  File "/home/zhangzhong/.local/lib/python3.10/site-packages/urllib3/connectionpool.py", line 1042, in _validate_conn
    conn.connect()
  File "/home/zhangzhong/.local/lib/python3.10/site-packages/urllib3/connection.py", line 419, in connect
    self.sock = ssl_wrap_socket(
  File "/home/zhangzhong/.local/lib/python3.10/site-packages/urllib3/util/ssl_.py", line 449, in ssl_wrap_socket
    ssl_sock = _ssl_wrap_socket_impl(
  File "/home/zhangzhong/.local/lib/python3.10/site-packages/urllib3/util/ssl_.py", line 493, in _ssl_wrap_socket_impl
    return ssl_context.wrap_socket(sock, server_hostname=server_hostname)
  File "/opt/anaconda3/lib/python3.10/ssl.py", line 513, in wrap_socket
    return self.sslsocket_class._create(
  File "/opt/anaconda3/lib/python3.10/ssl.py", line 1071, in _create
    self.do_handshake()
  File "/opt/anaconda3/lib/python3.10/ssl.py", line 1342, in do_handshake
    self._sslobj.do_handshake()
urllib3.exceptions.ProtocolError: ('Connection aborted.', ConnectionResetError(104, 'Connection reset by peer'))

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/zhangzhong/MOSS/finetune_moss.py", line 305, in <module>
    train(args)
  File "/home/zhangzhong/MOSS/finetune_moss.py", line 177, in train
    tokenizer = AutoTokenizer.from_pretrained(args.model_name_or_path, trust_remote_code=True)
  File "/opt/anaconda3/lib/python3.10/site-packages/transformers/models/auto/tokenization_auto.py", line 641, in from_pretrained
    return tokenizer_class.from_pretrained(pretrained_model_name_or_path, *inputs, **kwargs)
  File "/opt/anaconda3/lib/python3.10/site-packages/transformers/tokenization_utils_base.py", line 1760, in from_pretrained
    resolved_vocab_files[file_id] = cached_file(
  File "/opt/anaconda3/lib/python3.10/site-packages/transformers/utils/hub.py", line 409, in cached_file
    resolved_file = hf_hub_download(
  File "/opt/anaconda3/lib/python3.10/site-packages/huggingface_hub/file_download.py", line 1226, in hf_hub_download
    http_get(
  File "/opt/anaconda3/lib/python3.10/site-packages/huggingface_hub/file_download.py", line 470, in http_get
    r = _request_wrapper(
  File "/opt/anaconda3/lib/python3.10/site-packages/huggingface_hub/file_download.py", line 433, in _request_wrapper
    return http_backoff(
  File "/opt/anaconda3/lib/python3.10/site-packages/huggingface_hub/utils/_http.py", line 105, in http_backoff
    response = requests.request(method=method, url=url, **kwargs)
  File "/home/zhangzhong/.local/lib/python3.10/site-packages/requests/api.py", line 59, in request
    return session.request(method=method, url=url, **kwargs)
  File "/home/zhangzhong/.local/lib/python3.10/site-packages/requests/sessions.py", line 587, in request
    resp = self.send(prep, **send_kwargs)
  File "/home/zhangzhong/.local/lib/python3.10/site-packages/requests/sessions.py", line 701, in send
    r = adapter.send(request, **kwargs)
  File "/home/zhangzhong/.local/lib/python3.10/site-packages/requests/adapters.py", line 501, in send
    raise ConnectionError(err, request=request)
requests.exceptions.ConnectionError: ('Connection aborted.', ConnectionResetError(104, 'Connection reset by peer'))
Traceback (most recent call last):
  File "/home/zhangzhong/.local/lib/python3.10/site-packages/urllib3/connectionpool.py", line 703, in urlopen
    httplib_response = self._make_request(
  File "/home/zhangzhong/.local/lib/python3.10/site-packages/urllib3/connectionpool.py", line 386, in _make_request
    self._validate_conn(conn)
  File "/home/zhangzhong/.local/lib/python3.10/site-packages/urllib3/connectionpool.py", line 1042, in _validate_conn
    conn.connect()
  File "/home/zhangzhong/.local/lib/python3.10/site-packages/urllib3/connection.py", line 461, in connect
    cert = self.sock.getpeercert()
  File "/opt/anaconda3/lib/python3.10/ssl.py", line 1154, in getpeercert
    self._check_connected()
  File "/opt/anaconda3/lib/python3.10/ssl.py", line 1119, in _check_connected
    self.getpeername()
OSError: [Errno 107] Transport endpoint is not connected

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/zhangzhong/.local/lib/python3.10/site-packages/requests/adapters.py", line 486, in send
    resp = conn.urlopen(
  File "/home/zhangzhong/.local/lib/python3.10/site-packages/urllib3/connectionpool.py", line 787, in urlopen
    retries = retries.increment(
  File "/home/zhangzhong/.local/lib/python3.10/site-packages/urllib3/util/retry.py", line 550, in increment
    raise six.reraise(type(error), error, _stacktrace)
  File "/home/zhangzhong/.local/lib/python3.10/site-packages/urllib3/packages/six.py", line 769, in reraise
    raise value.with_traceback(tb)
  File "/home/zhangzhong/.local/lib/python3.10/site-packages/urllib3/connectionpool.py", line 703, in urlopen
    httplib_response = self._make_request(
  File "/home/zhangzhong/.local/lib/python3.10/site-packages/urllib3/connectionpool.py", line 386, in _make_request
    self._validate_conn(conn)
  File "/home/zhangzhong/.local/lib/python3.10/site-packages/urllib3/connectionpool.py", line 1042, in _validate_conn
    conn.connect()
  File "/home/zhangzhong/.local/lib/python3.10/site-packages/urllib3/connection.py", line 461, in connect
    cert = self.sock.getpeercert()
  File "/opt/anaconda3/lib/python3.10/ssl.py", line 1154, in getpeercert
    self._check_connected()
  File "/opt/anaconda3/lib/python3.10/ssl.py", line 1119, in _check_connected
    self.getpeername()
urllib3.exceptions.ProtocolError: ('Connection aborted.', OSError(107, 'Transport endpoint is not connected'))

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/zhangzhong/MOSS/finetune_moss.py", line 305, in <module>
    train(args)
  File "/home/zhangzhong/MOSS/finetune_moss.py", line 177, in train
    tokenizer = AutoTokenizer.from_pretrained(args.model_name_or_path, trust_remote_code=True)
  File "/opt/anaconda3/lib/python3.10/site-packages/transformers/models/auto/tokenization_auto.py", line 641, in from_pretrained
    return tokenizer_class.from_pretrained(pretrained_model_name_or_path, *inputs, **kwargs)
  File "/opt/anaconda3/lib/python3.10/site-packages/transformers/tokenization_utils_base.py", line 1760, in from_pretrained
    resolved_vocab_files[file_id] = cached_file(
  File "/opt/anaconda3/lib/python3.10/site-packages/transformers/utils/hub.py", line 409, in cached_file
    resolved_file = hf_hub_download(
  File "/opt/anaconda3/lib/python3.10/site-packages/huggingface_hub/file_download.py", line 1226, in hf_hub_download
    http_get(
  File "/opt/anaconda3/lib/python3.10/site-packages/huggingface_hub/file_download.py", line 470, in http_get
    r = _request_wrapper(
  File "/opt/anaconda3/lib/python3.10/site-packages/huggingface_hub/file_download.py", line 433, in _request_wrapper
    return http_backoff(
  File "/opt/anaconda3/lib/python3.10/site-packages/huggingface_hub/utils/_http.py", line 105, in http_backoff
    response = requests.request(method=method, url=url, **kwargs)
  File "/home/zhangzhong/.local/lib/python3.10/site-packages/requests/api.py", line 59, in request
    return session.request(method=method, url=url, **kwargs)
  File "/home/zhangzhong/.local/lib/python3.10/site-packages/requests/sessions.py", line 587, in request
    resp = self.send(prep, **send_kwargs)
  File "/home/zhangzhong/.local/lib/python3.10/site-packages/requests/sessions.py", line 701, in send
    r = adapter.send(request, **kwargs)
  File "/home/zhangzhong/.local/lib/python3.10/site-packages/requests/adapters.py", line 501, in send
    raise ConnectionError(err, request=request)
requests.exceptions.ConnectionError: ('Connection aborted.', OSError(107, 'Transport endpoint is not connected'))
Downloading: 44%|██████████████████████████▊ | 1.10M/2.50M [00:01<00:00, 1.50MB/s]
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 2556277 closing signal SIGTERM
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 1 (pid: 2556278) of binary: /opt/anaconda3/bin/python
Traceback (most recent call last):
  File "/opt/anaconda3/bin/accelerate", line 8, in <module>
    sys.exit(main())
  File "/opt/anaconda3/lib/python3.10/site-packages/accelerate/commands/accelerate_cli.py", line 45, in main
    args.func(args)
  File "/opt/anaconda3/lib/python3.10/site-packages/accelerate/commands/launch.py", line 900, in launch_command
    deepspeed_launcher(args)
  File "/opt/anaconda3/lib/python3.10/site-packages/accelerate/commands/launch.py", line 643, in deepspeed_launcher
    distrib_run.run(args)
  File "/home/zhangzhong/.local/lib/python3.10/site-packages/torch/distributed/run.py", line 785, in run
    elastic_launch(
  File "/home/zhangzhong/.local/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 134, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/zhangzhong/.local/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 250, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:

finetune_moss.py FAILED

Failures:

[1]:
  time      : 2023-05-26_09:56:29
  host      : s012.ai.ldap
  rank      : 2 (local_rank: 2)
  exitcode  : 1 (pid: 2556279)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[2]:
  time      : 2023-05-26_09:56:29
  host      : s012.ai.ldap
  rank      : 3 (local_rank: 3)
  exitcode  : 1 (pid: 2556280)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[3]:
  time      : 2023-05-26_09:56:29
  host      : s012.ai.ldap
  rank      : 4 (local_rank: 4)
  exitcode  : 1 (pid: 2556281)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[4]:
  time      : 2023-05-26_09:56:29
  host      : s012.ai.ldap
  rank      : 5 (local_rank: 5)
  exitcode  : 1 (pid: 2556282)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[5]:
  time      : 2023-05-26_09:56:29
  host      : s012.ai.ldap
  rank      : 6 (local_rank: 6)
  exitcode  : 1 (pid: 2556283)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[6]:
  time      : 2023-05-26_09:56:29
  host      : s012.ai.ldap
  rank      : 7 (local_rank: 7)
  exitcode  : 1 (pid: 2556284)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html

Root Cause (first observed failure):

[0]:
  time      : 2023-05-26_09:56:29
  host      : s012.ai.ldap
  rank      : 1 (local_rank: 1)
  exitcode  : 1 (pid: 2556278)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
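
The tracebacks above show the ranks failing inside `AutoTokenizer.from_pretrained` while files are still being fetched from the Hub (`Connection reset by peer`, `Transport endpoint is not connected`). Nothing in this thread confirms the cause. As a minimal sketch, assuming the failures are transient network errors, a load call could be wrapped in a retry with exponential backoff. The helper name `retry_with_backoff` and its parameters are hypothetical and are not part of `finetune_moss.py`.

```python
import time
from typing import Callable, Tuple, Type, TypeVar

T = TypeVar("T")


def retry_with_backoff(
    load: Callable[[], T],
    attempts: int = 5,
    base_delay: float = 1.0,
    retry_on: Tuple[Type[BaseException], ...] = (OSError,),
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call load() until it succeeds or the attempts run out.

    After the i-th failure (i starting at 0) it waits base_delay * 2**i seconds.
    OSError covers ConnectionResetError and requests' ConnectionError,
    because requests' exceptions derive from IOError.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for i in range(attempts):
        try:
            return load()
        except retry_on:
            if i == attempts - 1:
                raise  # out of attempts: surface the last error unchanged
            sleep(base_delay * (2 ** i))
    raise AssertionError("unreachable")


# Hypothetical usage inside train(), mirroring the call from the traceback:
# tokenizer = retry_with_backoff(
#     lambda: AutoTokenizer.from_pretrained(
#         args.model_name_or_path, trust_remote_code=True
#     )
# )
```

The `TypeError` about `vocab_file` being `None` is not an `OSError`, so it is not retried unless `TypeError` is added to `retry_on`. If that error comes from a missing or incomplete vocabulary file (an assumption), retrying may not help until the file is actually present in the local cache.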

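My configuration is the same as yours and I got the same error, but it went away when I used the unquantized model on 8 A100s, which is strange.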

zhonglin516 commented 1 year ago

I switched to the base model and still get an error: RuntimeError: Socket Timeout

LMXKO commented 1 year ago

I switched to the base model and still get an error: RuntimeError: Socket Timeout

I'm on a single machine with 8 GPUs, so socket communication isn't involved.

zhonglin516 commented 1 year ago

I'm also on a single machine with eight 3090s. Which setting needs to be changed?
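
The thread does not establish what causes `RuntimeError: Socket Timeout`. The posted config does set `main_process_ip: 192.168.33.201` and `main_process_port: 21889`, and the log shows the ranks completing a store-based barrier, so the processes coordinate through a store even on one machine. One thing that can be checked, assuming the timeout is related to that address or port, is whether this machine can bind to them at all. The helper name `can_bind` is hypothetical.

```python
import socket


def can_bind(host: str, port: int) -> bool:
    """Return True if a TCP socket on this machine can bind to (host, port).

    A False result means the address is not local to this machine or the
    port is already taken by another socket.
    """
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        try:
            sock.bind((host, port))
        except OSError:
            return False
        return True


if __name__ == "__main__":
    # Values copied from the accelerate config posted in this issue.
    print(can_bind("192.168.33.201", 21889))
```

If this prints `False` on the training machine, trying a different `main_process_port`, or an IP that belongs to the machine, may be worth a test. This is a diagnostic sketch, not a confirmed fix.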

lhtpluto commented 1 year ago

On a single GPU I haven't run into the problem you're describing.