EleutherAI / gpt-neo

An implementation of model parallel GPT-2 and GPT-3-style models using the mesh-tensorflow library.
https://www.eleuther.ai
MIT License
8.2k stars 945 forks source link

connect timeout #155

Closed wingdi closed 3 years ago

wingdi commented 3 years ago

when i run : !python3 main.py --predict --prompt prompt1.txt --tpu=[0,1,2,3,4,5,6,7] --model /content/the-eye.eu/eleuther_staging/gptneo-release/GPT3_XL/config.json

I got :

`2021-03-22 16:14:03.058026: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.11.0

WARNING:tensorflow:From /usr/local/lib/python3.7/dist-packages/tensorflow/python/compat/v2_compat.py:96: disable_resource_variables (from tensorflow.python.ops.variable_scope) is deprecated and will be removed in a future version. Instructions for updating: non-resource variables are not supported in the long term Current step 362000 Saving config to /content/the-eye.eu/eleuther_staging/gptneo-release/GPT3_XL 2021-03-22 16:14:07.810984: I tensorflow/core/platform/profile_utils/cpu_utils.cc:112] CPU Frequency: 2300000000 Hz 2021-03-22 16:14:07.811573: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x55d6e95b8f40 initialized for platform Host (this does not guarantee that XLA will be used). Devices: 2021-03-22 16:14:07.811629: I tensorflow/compiler/xla/service/service.cc:176] StreamExecutor device (0): Host, Default Version 2021-03-22 16:14:07.814273: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcuda.so.1 2021-03-22 16:14:07.824584: E tensorflow/stream_executor/cuda/cuda_driver.cc:328] failed call to cuInit: CUDA_ERROR_NO_DEVICE: no CUDA-capable device is detected 2021-03-22 16:14:07.824644: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:156] kernel driver does not appear to be running on this host (b30276c531da): /proc/driver/nvidia/version does not exist 2021-03-22 16:14:07.831001: I tensorflow/compiler/mlir/mlir_graph_optimization_pass.cc:196] None of the MLIR optimization passes are enabled (registered 0 passes) Done! params = defaultdict(<function fetch_model_params.. at 0x7f8fed3e0b00>, {'n_head': 16, 'n_vocab': 50257, 'embed_dropout': 0, 'lr': 0.0002, 'lr_decay': 'cosine', 'warmup_steps': 3000, 'beta1': 0.9, 'beta2': 0.95, 'epsilon': 1e-08, 'opt_name': 'adam', 'weight_decay': 0, 'train_batch_size': 512, 'attn_dropout': 0, 'train_steps': 400000, 'lr_decay_end': 300000, 'eval_steps': 10, 'predict_steps': 0, 'res_dropout': 0, 'eval_batch_size': 128, 'predict_batch_size': 128, 'iterations': 500, 'n_embd': 2048, 'datasets': [['pile', None, None, None]], 'model_path': '/content/the-eye.eu/eleuther_staging/gptneo-release/GPT3_XL', 'n_ctx': 2048, 'n_layer': 24, 'scale_by_depth': True, 'scale_by_in': False, 'attention_types': ['global', 'local', 'global', 'local', 'global', 'local', 'global', 'local', 'global', 'local', 'global', 'local', 'global', 'local', 'global', 'local', 'global', 'local', 'global', 'local', 'global', 'local', 'global', 'local'], 'mesh_shape': 'x:128,y:2', 'layout': 'batch:x,memory_length:y,embd:y', 'activation_function': 'gelu', 'recompute_grad': True, 'gradient_clipping': 1.0, 'tokens_per_mb_per_replica': 4096, 'precision': 'bfloat16', 'padding_id': 50257, 'eos_id': 50256, 'dataset_configs': {'pile': {'nvocab': 50257, 'path': 'gs://neo-datasets/pile/pile.tfrecords', 'eval_path': 'gs://neo-datasets/pile_val.tfrecords', 'tokenizer_is_pretrained': True, 'tokenizer_path': 'gpt2', 'eos_id': 50256, 'padding_id': 50257}}, 'mlm_training': False, 'causal': True, 'num_cores': 256, 'auto_layout': False, 'auto_layout_and_mesh_shape': False, 'use_tpu': True, 'gpu_ids': ['device:GPU:0'], 'steps_per_checkpoint': 5000, 'predict': True, 'model': 'GPT', 'export': False, 'sampling_use_entmax': False, 'moe_layers': None, 'slow_sampling': False}) Traceback (most recent call last): File "/usr/lib/python3.7/urllib/request.py", line 1350, in do_open encode_chunked=req.has_header('Transfer-encoding')) File "/usr/lib/python3.7/http/client.py", line 1277, in request self._send_request(method, url, body, headers, encode_chunked) File "/usr/lib/python3.7/http/client.py", line 1323, in _send_request self.endheaders(body, encode_chunked=encode_chunked) File "/usr/lib/python3.7/http/client.py", line 1272, in endheaders self._send_output(message_body, encode_chunked=encode_chunked) File "/usr/lib/python3.7/http/client.py", line 1032, in _send_output self.send(msg) File "/usr/lib/python3.7/http/client.py", line 972, in send self.connect() File "/usr/lib/python3.7/http/client.py", line 944, in connect (self.host,self.port), self.timeout, self.source_address) File "/usr/lib/python3.7/socket.py", line 728, in create_connection raise err _File "/usr/lib/python3.7/socket.py", line 716, in createconnection sock.connect(sa) TimeoutError: [Errno 110] Connection timed out During handling of the above exception, another exception occurred: Traceback (most recent call last): File "main.py", line 262, in main(args) File "main.py", line 139, in main tpu_cluster_resolver = tf.distribute.cluster_resolver.TPUClusterResolver(args.tpu) if params["use_tpu"] else None File "/usr/local/lib/python3.7/dist-packages/tensorflow/python/distribute/cluster_resolver/tpu/tpu_cluster_resolver.py", line 207, in init discovery_url=discovery_url) File "/usr/local/lib/python3.7/dist-packages/tensorflow/python/tpu/client/client.py", line 164, in init self._project = _request_compute_metadata('project/project-id') File "/usr/local/lib/python3.7/dist-packages/tensorflow/python/tpu/client/client.py", line 82, in _request_compute_metadata resp = request.urlopen(req) File "/usr/lib/python3.7/urllib/request.py", line 222, in urlopen return opener.open(url, data, timeout) File "/usr/lib/python3.7/urllib/request.py", line 525, in open response = self._open(req, data) File "/usr/lib/python3.7/urllib/request.py", line 543, in _open '_open', req) File "/usr/lib/python3.7/urllib/request.py", line 503, in _call_chain result = func(args) File "/usr/lib/python3.7/urllib/request.py", line 1378, in http_open return self.do_open(http.client.HTTPConnection, req) _File "/usr/lib/python3.7/urllib/request.py", line 1352, in doopen raise URLError(err) urllib.error.URLError: <urlopen error [Errno 110] Connection timed out>`

env: colab TPU; code is : git clone https://github.com/EleutherAI/gpt-neo.git

StellaAthena commented 3 years ago

I'm not sure what to say, this looks like a server-side issue to me. Try it again, and if it persists contact Google.