EleutherAI / gpt-neo

An implementation of model parallel GPT-2 and GPT-3-style models using the mesh-tensorflow library.
https://www.eleuther.ai
MIT License

Using gpt neo checkpoint #227

Closed. MK096 closed this issue 3 years ago

MK096 commented 3 years ago

Hi, I downloaded GPT-Neo from the-eye.eu on my PC. It downloaded various checkpoints. How do I use them? ... In order to load and use the model I'd need encoder.json, pytorch_model.bin, etc.

StellaAthena commented 3 years ago

Have you tried the Colab notebook that we provided as a demonstration?

MK096 commented 3 years ago

> Have you tried the Colab notebook that we provided as a demonstration?

I went through it, but I couldn't figure out how to make it work on my local machine. How do I use those checkpoints locally?

StellaAthena commented 3 years ago

@MK096 for your local machine, I recommend using the transformers library.
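
For reference, this is roughly what that route looks like; a minimal sketch assuming a transformers version with GPT-Neo support (4.5 or later) and using the 1.3B checkpoint as an example:

```python
# Minimal local-inference sketch with Hugging Face transformers.
# Assumes transformers >= 4.5 (the first release with GPT-Neo) is installed.
from transformers import pipeline

# The first call downloads the config, tokenizer files and weights from the
# Hugging Face Hub, then reuses the local cache on subsequent runs.
generator = pipeline("text-generation", model="EleutherAI/gpt-neo-1.3B")

print(generator("EleutherAI has", do_sample=True, max_length=50)[0]["generated_text"])
```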

MK096 commented 3 years ago

> @MK096 for your local machine, I recommend using the transformers library.

I did, but it didn't work. I guess it's because the files I downloaded from the-eye.eu only gave me checkpoint files, and to run the model I'd need encoder.json, pytorch_model.bin, etc. So how do I generate those files from the ones I've downloaded? https://github.com/EleutherAI/gpt-neo/issues/226#issue-919573517

StellaAthena commented 3 years ago

> @MK096 for your local machine, I recommend using the transformers library.

> I did, but it didn't work. I guess it's because the files I downloaded from the-eye.eu only gave me checkpoint files, and to run the model I'd need encoder.json, pytorch_model.bin, etc. So how do I generate those files from the ones I've downloaded? #226 (comment)

You do not need to use the files from the eye to use GPT-Neo via HuggingFace. You can directly download everything you need from the transformers package, as shown in the example code I linked to.

If you want to use the copy of the model on the eye, you can use the Google Colab notebook as a guide. If you’ve gotten it working on Colab you should be able to get it working locally, as it’s fundamentally the same thing.
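
If the goal is a local copy that loads without hitting the network again, one sketch (the directory name is just an example) is to download once through transformers, save it to disk, and load from that folder afterwards:

```python
# Sketch: fetch the model once, persist it locally, then reload it offline.
# "./gpt-neo-1.3B-local" is an arbitrary example directory, not an official path.
from transformers import GPTNeoForCausalLM, GPT2Tokenizer

model = GPTNeoForCausalLM.from_pretrained("EleutherAI/gpt-neo-1.3B")
tokenizer = GPT2Tokenizer.from_pretrained("EleutherAI/gpt-neo-1.3B")

model.save_pretrained("./gpt-neo-1.3B-local")      # writes config.json, pytorch_model.bin, ...
tokenizer.save_pretrained("./gpt-neo-1.3B-local")  # writes vocab.json, merges.txt, ...

# Later, with no download needed:
model = GPTNeoForCausalLM.from_pretrained("./gpt-neo-1.3B-local")
tokenizer = GPT2Tokenizer.from_pretrained("./gpt-neo-1.3B-local")
```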

MK096 commented 3 years ago

> @MK096 for your local machine, I recommend using the transformers library.

> I did, but it didn't work. I guess it's because the files I downloaded from the-eye.eu only gave me checkpoint files, and to run the model I'd need encoder.json, pytorch_model.bin, etc. So how do I generate those files from the ones I've downloaded? #226 (comment)

> You do not need to use the files from the eye to use GPT-Neo via HuggingFace. You can directly download everything you need from the transformers package, as shown in the example code I linked to.

> If you want to use the copy of the model on the eye, you can use the Google Colab notebook as a guide. If you’ve gotten it working on Colab you should be able to get it working locally, as it’s fundamentally the same thing.

I tried the Hugging Face method, but the problem was that after downloading 10-20% the download speed always drops from 1 Mbps to 5 kbps. The same thing happened when I was trying other Hugging Face models like opus-mt.

https://github.com/EleutherAI/gpt-neo/issues/219#issue-896609707

StellaAthena commented 3 years ago

I’m really not sure what to say. When I do what you describe, I do not have any problems: I can run the model off the eye checkpoint, and I can download it through HuggingFace. This appears to be a problem on your end.
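
If the stalled downloads are the real blocker, one workaround worth trying (not verified in this thread) is to re-run the download and let transformers resume the partially fetched weight file instead of starting over:

```python
# Sketch: retry the download, resuming any partially downloaded files.
# resume_download is a standard from_pretrained keyword argument; whether it
# helps depends on why the transfer slows down in the first place.
from transformers import GPTNeoForCausalLM

model = GPTNeoForCausalLM.from_pretrained(
    "EleutherAI/gpt-neo-1.3B",
    resume_download=True,
)
```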

MK096 commented 3 years ago

Thanks for the help. I ran everything again from scratch. Upon running it (in cmd on my local machine), I now get the following error:

```
C:\Users\Mayank\GPTNeo>python main.py --predict --prompt try.txt --model GPT3XL

2021-06-14 21:42:02.505836: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library cudart64_110.dll
WARNING:tensorflow:From C:\Users\Mayank\AppData\Local\Programs\Python\Python37\lib\site-packages\tensorflow\python\compat\v2_compat.py:96: disable_resource_variables (from tensorflow.python.ops.variable_scope) is deprecated and will be removed in a future version.
Instructions for updating:
non-resource variables are not supported in the long term
Current step 362000
Saving config to C://Users//Mayank//GPTNeo//models//GPT3_XL
2021-06-14 21:42:15.288177: I tensorflow/core/platform/cpu_feature_guard.cc:142] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: AVX2
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2021-06-14 21:42:16.082196: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x1d5842f8670 initialized for platform Host (this does not guarantee that XLA will be used). Devices:
2021-06-14 21:42:16.120514: I tensorflow/compiler/xla/service/service.cc:176] StreamExecutor device (0): Host, Default Version
2021-06-14 21:42:16.154220: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library nvcuda.dll
2021-06-14 21:42:16.390683: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1720] Found device 0 with properties: pciBusID: 0000:02:00.0 name: GeForce MX130 computeCapability: 5.0 coreClock: 1.189GHz coreCount: 3 deviceMemorySize: 2.00GiB deviceMemoryBandwidth: 37.33GiB/s
2021-06-14 21:42:16.438659: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library cudart64_110.dll
2021-06-14 21:42:16.445569: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library cublas64_11.dll
2021-06-14 21:42:16.455962: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library cublasLt64_11.dll
2021-06-14 21:42:16.463522: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library cufft64_10.dll
2021-06-14 21:42:16.474350: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library curand64_10.dll
2021-06-14 21:42:16.519344: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library cusolver64_10.dll
2021-06-14 21:42:16.530178: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library cusparse64_11.dll
2021-06-14 21:42:16.539556: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library cudnn64_8.dll
2021-06-14 21:42:16.553314: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1862] Adding visible gpu devices: 0
2021-06-14 21:42:18.177633: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1261] Device interconnect StreamExecutor with strength 1 edge matrix:
2021-06-14 21:42:18.188617: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1267] 0
2021-06-14 21:42:18.194467: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1280] 0: N
2021-06-14 21:42:18.205021: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1406] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 1369 MB memory) -> physical GPU (device: 0, name: GeForce MX130, pci bus id: 0000:02:00.0, compute capability: 5.0)
2021-06-14 21:42:18.322554: I
tensorflow/compiler/xla/service/service.cc:168] XLA service 0x1d5a2652220 initialized for platform CUDA (this does not guarantee that XLA will be used). Devices: 2021-06-14 21:42:18.342300: I tensorflow/compiler/xla/service/service.cc:176] StreamExecutor device (0): GeForce MX130, Compute Capability 5.0 2021-06-14 21:42:19.504241: I tensorflow/compiler/mlir/mlir_graph_optimization_pass.cc:196] None of the MLIR optimization passes are enabled (registered 0 passes) Done! params = defaultdict(<function fetch_model_params.. at 0x000001D585D7FD08>, {'n_head': 32, 'n_vocab': 50257, 'embed_dropout': 0, 'lr': 0.0002, 'lr_decay': 'cosine', 'warmup_steps': 3000, 'beta1': 0.9, 'beta2': 0.95, 'epsilon': 1e-08, 'opt_name': 'adam', 'weight_decay': 0.1, 'train_batch_size': 512, 'attn_dropout': 0, 'train_steps': 286150, 'eval_steps': 10, 'predict_steps': 1, 'res_dropout': 0, 'eval_batch_size': 512, 'predict_batch_size': 1, 'iterations': 500, 'n_embd': 2048, 'datasets': [['pile', 25, 'documents_random', 1.0]], 'model_path': 'C://Users//Mayank//GPTNeo//models//GPT3_XL', 'n_ctx': 2048, 'n_layer': 24, 'scale_by_depth': True, 'scale_by_in': False, 'attention_types': ['global', 'global', 'global', 'global', 'global', 'global', 'global', 'global', 'global', 'global', 'global', 'global', 'global', 'global', 'global', 'global', 'global', 'global', 'global', 'global', 'global', 'global', 'global', 'global'], 'mesh_shape': 'x:128,y:2', 'layout': 'batch:x,memory_length:y,embd:y', 'activation_function': 'gelu', 'recompute_grad': True, 'gradient_clipping': 1.0, 'tokens_per_mb_per_replica': 2048, 'precision': 'bfloat16', 'padding_id': 50257, 'eos_id': 50256, 'dataset_configs': {'pile': {'nvocab': 50257, 'path': 'gs://neo-datasets/pile/pile.tfrecords', 'eval_path': 'gs://neo-datasets/pile_val.tfrecords', 'tokenizer_is_pretrained': True, 'tokenizer_path': 'gpt2', 'eos_id': 50256, 'padding_id': 50257}}, 'mlm_training': False, 'causal': True, 'num_cores': 256, 'auto_layout': False, 'auto_layout_and_mesh_shape': False, 'use_tpu': False, 'gpu_ids': ['device:GPU:0'], 'steps_per_checkpoint': 5000, 'predict': True, 'model': 'GPT', 'export': False, 'sampling_use_entmax': False, 'moe_layers': None, 'slow_sampling': False}) Using config: {'_model_dir': 'C://Users//Mayank//GPTNeo//models//GPT3_XL', '_tf_random_seed': None, '_save_summary_steps': 500, '_save_checkpoints_steps': None, '_save_checkpoints_secs': None, '_session_config': allow_soft_placement: true graph_options { rewrite_options { meta_optimizer_iterations: ONE } } , '_keep_checkpoint_max': 5, '_keep_checkpoint_every_n_hours': 10000, '_log_step_count_steps': None, '_train_distribute': None, '_device_fn': None, '_protocol': None, '_eval_distribute': None, '_experimental_distribute': None, '_experimental_max_worker_delay_secs': None, '_session_creation_timeout_secs': 7200, '_checkpoint_save_graph_def': True, '_service': None, '_cluster_spec': ClusterSpec({}), '_task_type': 'worker', '_task_id': 0, '_global_id_in_cluster': 0, '_master': '', '_evaluation_master': '', '_is_chief': True, '_num_ps_replicas': 0, '_num_worker_replicas': 1, '_tpu_config': TPUConfig(iterations_per_loop=500, num_shards=256, num_cores_per_replica=1, per_host_input_for_training=4, tpu_job_name=None, initial_infeed_sleep_secs=None, input_partition_dims=None, eval_training_input_configuration=2, experimental_host_call_every_n_steps=1, experimental_allow_per_host_v2_parallel_get_next=False, experimental_feed_hook=None), '_cluster': None} _TPUContext: eval_on_tpu True eval_on_tpu ignored because use_tpu 
is False. Predictions generated Calling model_fn. Running infer on CPU/GPU Defaulting to GELU activation (see here: https://arxiv.org/abs/1606.08415) Defaulting to GELU activation (see here: https://arxiv.org/abs/1606.08415) Defaulting to GELU activation (see here: https://arxiv.org/abs/1606.08415) Defaulting to GELU activation (see here: https://arxiv.org/abs/1606.08415) Defaulting to GELU activation (see here: https://arxiv.org/abs/1606.08415) Defaulting to GELU activation (see here: https://arxiv.org/abs/1606.08415) Defaulting to GELU activation (see here: https://arxiv.org/abs/1606.08415) Defaulting to GELU activation (see here: https://arxiv.org/abs/1606.08415) Defaulting to GELU activation (see here: https://arxiv.org/abs/1606.08415) Defaulting to GELU activation (see here: https://arxiv.org/abs/1606.08415) Defaulting to GELU activation (see here: https://arxiv.org/abs/1606.08415) Defaulting to GELU activation (see here: https://arxiv.org/abs/1606.08415) Defaulting to GELU activation (see here: https://arxiv.org/abs/1606.08415) Defaulting to GELU activation (see here: https://arxiv.org/abs/1606.08415) Defaulting to GELU activation (see here: https://arxiv.org/abs/1606.08415) Defaulting to GELU activation (see here: https://arxiv.org/abs/1606.08415) Defaulting to GELU activation (see here: https://arxiv.org/abs/1606.08415) Defaulting to GELU activation (see here: https://arxiv.org/abs/1606.08415) Defaulting to GELU activation (see here: https://arxiv.org/abs/1606.08415) Defaulting to GELU activation (see here: https://arxiv.org/abs/1606.08415) Defaulting to GELU activation (see here: https://arxiv.org/abs/1606.08415) Defaulting to GELU activation (see here: https://arxiv.org/abs/1606.08415) Defaulting to GELU activation (see here: https://arxiv.org/abs/1606.08415) Defaulting to GELU activation (see here: https://arxiv.org/abs/1606.08415) Defaulting to GELU activation (see here: https://arxiv.org/abs/1606.08415) Defaulting to GELU activation (see here: https://arxiv.org/abs/1606.08415) Defaulting to GELU activation (see here: https://arxiv.org/abs/1606.08415) Defaulting to GELU activation (see here: https://arxiv.org/abs/1606.08415) Defaulting to GELU activation (see here: https://arxiv.org/abs/1606.08415) Defaulting to GELU activation (see here: https://arxiv.org/abs/1606.08415) Defaulting to GELU activation (see here: https://arxiv.org/abs/1606.08415) Defaulting to GELU activation (see here: https://arxiv.org/abs/1606.08415) Defaulting to GELU activation (see here: https://arxiv.org/abs/1606.08415) Defaulting to GELU activation (see here: https://arxiv.org/abs/1606.08415) Defaulting to GELU activation (see here: https://arxiv.org/abs/1606.08415) Defaulting to GELU activation (see here: https://arxiv.org/abs/1606.08415) Defaulting to GELU activation (see here: https://arxiv.org/abs/1606.08415) Defaulting to GELU activation (see here: https://arxiv.org/abs/1606.08415) Defaulting to GELU activation (see here: https://arxiv.org/abs/1606.08415) Defaulting to GELU activation (see here: https://arxiv.org/abs/1606.08415) Defaulting to GELU activation (see here: https://arxiv.org/abs/1606.08415) Defaulting to GELU activation (see here: https://arxiv.org/abs/1606.08415) Defaulting to GELU activation (see here: https://arxiv.org/abs/1606.08415) Defaulting to GELU activation (see here: https://arxiv.org/abs/1606.08415) Defaulting to GELU activation (see here: https://arxiv.org/abs/1606.08415) Defaulting to GELU activation (see here: https://arxiv.org/abs/1606.08415) Defaulting to GELU activation 
(see here: https://arxiv.org/abs/1606.08415) Defaulting to GELU activation (see here: https://arxiv.org/abs/1606.08415)
prediction_loop marked as finished
Reraising captured error
Traceback (most recent call last):
  File "main.py", line 258, in <module>
    main(args)
  File "main.py", line 185, in main
    handle_pred_output_fn(predictions, logger, enc, params, out_name=f"predictions_{args.sacred_id}_{current_step}")
  File "C:\Users\Mayank\GPTNeo\inputs.py", line 165, in handle_pred_output
    for i, p in enumerate(predictions):
  File "C:\Users\Mayank\AppData\Local\Programs\Python\Python37\lib\site-packages\tensorflow_estimator\python\estimator\tpu\tpu_estimator.py", line 3173, in predict
    rendezvous.raise_errors()
  File "C:\Users\Mayank\AppData\Local\Programs\Python\Python37\lib\site-packages\tensorflow_estimator\python\estimator\tpu\error_handling.py", line 150, in raise_errors
    six.reraise(typ, value, traceback)
  File "C:\Users\Mayank\AppData\Local\Programs\Python\Python37\lib\site-packages\six.py", line 703, in reraise
    raise value
  File "C:\Users\Mayank\AppData\Local\Programs\Python\Python37\lib\site-packages\tensorflow_estimator\python\estimator\tpu\tpu_estimator.py", line 3167, in predict
    yield_single_examples=yield_single_examples):
  File "C:\Users\Mayank\AppData\Local\Programs\Python\Python37\lib\site-packages\tensorflow_estimator\python\estimator\estimator.py", line 613, in predict
    self.config)
  File "C:\Users\Mayank\AppData\Local\Programs\Python\Python37\lib\site-packages\tensorflow_estimator\python\estimator\tpu\tpu_estimator.py", line 2962, in _call_model_fn
    config)
  File "C:\Users\Mayank\AppData\Local\Programs\Python\Python37\lib\site-packages\tensorflow_estimator\python\estimator\estimator.py", line 1163, in _call_model_fn
    model_fn_results = self._model_fn(features=features, **kwargs)
  File "C:\Users\Mayank\AppData\Local\Programs\Python\Python37\lib\site-packages\tensorflow_estimator\python\estimator\tpu\tpu_estimator.py", line 3220, in _model_fn
    features, labels, is_export_mode=is_export_mode)
  File "C:\Users\Mayank\AppData\Local\Programs\Python\Python37\lib\site-packages\tensorflow_estimator\python\estimator\tpu\tpu_estimator.py", line 1729, in call_without_tpu
    return self._call_model_fn(features, labels, is_export_mode=is_export_mode)
  File "C:\Users\Mayank\AppData\Local\Programs\Python\Python37\lib\site-packages\tensorflow_estimator\python\estimator\tpu\tpu_estimator.py", line 2072, in _call_model_fn
    estimator_spec = self._model_fn(features=features, **kwargs)
  File "C:\Users\Mayank\GPTNeo\model_fns.py", line 112, in model_fn
    lowering = mtf.Lowering(graph, {mesh: mesh_impl}, autostack=True)
  File "C:\Users\Mayank\AppData\Local\Programs\Python\Python37\lib\site-packages\mesh_tensorflow\ops.py", line 728, in __init__
    op.lower(self)
  File "C:\Users\Mayank\AppData\Local\Programs\Python\Python37\lib\site-packages\mesh_tensorflow\ops.py", line 4541, in lower
    slices = mesh_impl.allsplit(slices, mesh_axis, tensor_axis)
  File "C:\Users\Mayank\AppData\Local\Programs\Python\Python37\lib\site-packages\mesh_tensorflow\ops.py", line 1099, in allsplit
    which = self.laid_out_pcoord(mesh_axis)
  File "C:\Users\Mayank\AppData\Local\Programs\Python\Python37\lib\site-packages\mesh_tensorflow\ops.py", line 1209, in laid_out_pcoord
    return self.slicewise(my_fn, self.laid_out_pnum())
  File "C:\Users\Mayank\AppData\Local\Programs\Python\Python37\lib\site-packages\mesh_tensorflow\placement_mesh_impl.py", line 173, in slicewise
    ret = mtf.parallel(self.devices, fn, *inputs)
  File "C:\Users\Mayank\AppData\Local\Programs\Python\Python37\lib\site-packages\mesh_tensorflow\ops.py", line 5661, in parallel
    "arg=%s devices=%s" % (x, devices))
ValueError: Argument not a list with same length as devices arg=[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, 100, 101, 102, 103, 104, 105, 106, 107, 108, 109, 110, 111, 112, 113, 114, 115, 116, 117, 118, 119, 120, 121, 122, 123, 124, 125, 126, 127, 128, 129, 130, 131, 132, 133, 134, 135, 136, 137, 138, 139, 140, 141, 142, 143, 144, 145, 146, 147, 148, 149, 150, 151, 152, 153, 154, 155, 156, 157, 158, 159, 160, 161, 162, 163, 164, 165, 166, 167, 168, 169, 170, 171, 172, 173, 174, 175, 176, 177, 178, 179, 180, 181, 182, 183, 184, 185, 186, 187, 188, 189, 190, 191, 192, 193, 194, 195, 196, 197, 198, 199, 200, 201, 202, 203, 204, 205, 206, 207, 208, 209, 210, 211, 212, 213, 214, 215, 216, 217, 218, 219, 220, 221, 222, 223, 224, 225, 226, 227, 228, 229, 230, 231, 232, 233, 234, 235, 236, 237, 238, 239, 240, 241, 242, 243, 244, 245, 246, 247, 248, 249, 250, 251, 252, 253, 254, 255] devices=['device:GPU:0']
```

StellaAthena commented 3 years ago

Great! You are now officially out of our hands... this is a GPU configuration issue on your end. I recommend checking out the multi-GPU training documentation for whatever framework you are using. This issue may also be helpful.
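
For anyone hitting the same ValueError: the dumped params show 'mesh_shape': 'x:128,y:2' and 'num_cores': 256, so mesh-tensorflow tries to lay the model out across 256 devices while TensorFlow only reports one (devices=['device:GPU:0']). Below is a hedged sketch of the kind of config edit that addresses that mismatch; the config path is an assumption (adjust it to wherever the GPT3XL config JSON used by `python main.py --model GPT3XL` actually lives), and it says nothing about whether the model then fits in memory.

```python
# Sketch: shrink the mesh so it matches the single GPU that TensorFlow reported.
# The config filename/path below is an assumption, not the repo's guaranteed layout.
import json

config_path = "configs/GPT3XL.json"  # hypothetical location of the GPT3XL config

with open(config_path) as f:
    cfg = json.load(f)

cfg["mesh_shape"] = "x:1,y:1"        # one mesh slice in total, matching one device
cfg["gpu_ids"] = ["device:GPU:0"]    # the only device visible in the log above

with open(config_path, "w") as f:
    json.dump(cfg, f, indent=2)
```

Even with a matching mesh, a 1.3B-parameter checkpoint is unlikely to fit on a 2 GB MX130, which is one more reason the transformers route above is the easier path for local experiments.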