EleutherAI / gpt-neo

An implementation of model parallel GPT-2 and GPT-3-style models using the mesh-tensorflow library.
https://www.eleuther.ai
MIT License

GPT3XL training #109

Closed loretoparisi closed 3 years ago

loretoparisi commented 3 years ago

It's not clear to me how to train the GPT3XL via GPU/Colab. Could you add more details?

Thank you.

srulikbd commented 3 years ago

There is an incompatibility between the tokenizers and transformers versions: the install pulls in the current transformers release but an old tokenizers one.

Which versions should we use?

loretoparisi commented 3 years ago

@srulikbd I asked Thomas Wolf from HF about this, and his suggestion was to use the latest version of both. Could you be more specific about the tokenizers version issue? Thank you.
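
For anyone reporting a version mismatch like this, it helps to quote the exact versions installed in the runtime. The following is a minimal sketch, not something posted in this thread: the helper name `installed_versions` is made up for illustration, and it assumes Python 3.8 or newer, where `importlib.metadata` is in the standard library (on older interpreters the `importlib_metadata` backport exposes the same functions).

```python
from importlib import metadata


def installed_versions(packages=("transformers", "tokenizers")):
    """Map each package name to its installed version, or None if it is absent."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            # The package is not installed in this environment.
            versions[name] = None
    return versions


if __name__ == "__main__":
    for name, version in installed_versions().items():
        if version is None:
            print(f"{name} is not installed")
        else:
            print(f"{name}=={version}")
```

Pasting the printed lines into a report makes it clear which pair of versions is actually in use.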

srulikbd commented 3 years ago

Hey. It seems to be working now. I collected the small changes needed to get the Colab example running:

  1. Change the tokenizers version in the requirements file to 0.9.4, or add the command !pip install tokenizers==0.9.4.
  2. In the # Tokenize Data line, rename the argument “base_dir” to “input_dir”.
  3. Delete the argument “--use_gpt2_tokenizer”, because the GPT-2 fast tokenizer is already used by default.

I might add more changes soon if needed.

StellaAthena commented 3 years ago

Great! Can you put these changes on a branch and open a PR? That way we can verify that it doesn’t break anything on the TPUs and merge it.

srulikbd commented 3 years ago

Yeah, of course. I'll do that as soon as possible. Well done on your awesome work!

srulikbd commented 3 years ago

@StellaAthena Hey. I got to the training stage, but it got stuck for some reason. Do you have any idea why? I was able to run train_enwik8 from the gpt-neox library easily... what is the difference between the two packages?

Here is the output from running the GPTNeo example on Google Colab:

2021-01-08 22:33:49.795424: W tensorflow/stream_executor/platform/default/dso_loader.cc:60] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/lib64-nvidia
2021-01-08 22:33:49.795465: I tensorflow/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine.
WARNING:tensorflow:From /usr/local/lib/python3.6/dist-packages/tensorflow/python/compat/v2_compat.py:96: disable_resource_variables (from tensorflow.python.ops.variable_scope) is deprecated and will be removed in a future version.
Instructions for updating:
non-resource variables are not supported in the long term
Current step 0
Saving config to /content/GPTNeo/model_weights
2021-01-08 22:33:53.177601: I tensorflow/compiler/jit/xla_gpu_device.cc:99] Not creating XLA devices, tf_xla_enable_xla_devices not set
2021-01-08 22:33:53.177746: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcuda.so.1
2021-01-08 22:33:53.177944: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:941] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-01-08 22:33:53.178523: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1720] Found device 0 with properties:
pciBusID: 0000:00:04.0 name: Tesla T4 computeCapability: 7.5
coreClock: 1.59GHz coreCount: 40 deviceMemorySize: 14.73GiB deviceMemoryBandwidth: 298.08GiB/s
2021-01-08 22:33:53.178667: W tensorflow/stream_executor/platform/default/dso_loader.cc:60] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/lib64-nvidia
2021-01-08 22:33:53.178760: W tensorflow/stream_executor/platform/default/dso_loader.cc:60] Could not load dynamic library 'libcublas.so.11'; dlerror: libcublas.so.11: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/lib64-nvidia
2021-01-08 22:33:53.178842: W tensorflow/stream_executor/platform/default/dso_loader.cc:60] Could not load dynamic library 'libcublasLt.so.11'; dlerror: libcublasLt.so.11: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/lib64-nvidia
2021-01-08 22:33:53.180363: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcufft.so.10
2021-01-08 22:33:53.180792: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcurand.so.10
2021-01-08 22:33:53.182284: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcusolver.so.10
2021-01-08 22:33:53.182413: W tensorflow/stream_executor/platform/default/dso_loader.cc:60] Could not load dynamic library 'libcusparse.so.11'; dlerror: libcusparse.so.11: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/lib64-nvidia
2021-01-08 22:33:53.182497: W tensorflow/stream_executor/platform/default/dso_loader.cc:60] Could not load dynamic library 'libcudnn.so.8'; dlerror: libcudnn.so.8: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/lib64-nvidia
2021-01-08 22:33:53.182519: W tensorflow/core/common_runtime/gpu/gpu_device.cc:1757] Cannot dlopen some GPU libraries. Please make sure the missing libraries mentioned above are installed properly if you would like to use GPU. Follow the guide at https://www.tensorflow.org/install/gpu for how to download and setup the required libraries for your platform.
Skipping registering GPU devices...
2021-01-08 22:33:53.285094: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1261] Device interconnect StreamExecutor with strength 1 edge matrix:
2021-01-08 22:33:53.285162: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1267] 0
2021-01-08 22:33:53.285182: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1280] 0: N
2021-01-08 22:33:53.291654: I tensorflow/compiler/mlir/mlir_graph_optimization_pass.cc:196] None of the MLIR optimization passes are enabled (registered 0 passes)
Done!
params = defaultdict(<function fetch_model_params.<locals>.<lambda> at 0x7f9ed91c0158>, {'n_head': 32, 'n_vocab': 50260, 'embed_dropout': 0, 'lr': 0.0002, 'lr_decay': 'cosine', 'warmup_steps': 3000, 'beta1': 0.9, 'beta2': 0.95, 'epsilon': 1e-08, 'opt_name': 'adam', 'weight_decay': 0.1, 'train_batch_size': 1, 'attn_dropout': 0, 'train_steps': 1, 'eval_steps': 0, 'predict_steps': 1, 'res_dropout': 0, 'eval_batch_size': 64, 'predict_batch_size': 1, 'iterations': 1, 'n_embd': 2048, 'datasets': [['openwebtexts', 21, 'documents_random', 1.0]], 'model': 'GPT', 'model_path': '/content/GPTNeo/model_weights', 'n_ctx': 2048, 'n_layer': 24, 'scale_by_depth': True, 'scale_by_in': False, 'attention_types': ['global', 'global', 'global', 'global', 'global', 'global', 'global', 'global', 'global', 'global', 'global', 'global', 'global', 'global', 'global', 'global', 'global', 'global', 'global', 'global', 'global', 'global', 'global', 'global'], 'mesh_shape': 'x:4,y:2', 'layout': 'intermediate_expanded:x,heads:x,vocab:x,memory_length:y,embd:y', 'activation_function': 'gelu', 'recompute_grad': True, 'gradient_clipping': 1.0, 'tokens_per_mb_per_replica': 2048, 'padding_id': 50257, 'eos_id': 50256, 'dataset_configs': {'openwebtexts': {'path': '/content/GPTNeo/openwebtext-small/bundestag_*.tfrecords', 'eval_path': '', 'n_vocab': 50256, 'tokenizer_is_pretrained': True, 'tokenizer_path': 'gpt2', 'eos_id': 50256, 'padding_id': 50257}}, 'mlm_training': False, 'causal': True, 'num_cores': 8, 'auto_layout': False, 'auto_layout_and_mesh_shape': False, 'use_tpu': False, 'gpu_ids': ['device:GPU:0'], 'steps_per_checkpoint': 2, 'predict': False, 'export': False, 'sampling_use_entmax': False, 'moe_layers': None, 'slow_sampling': False})
Using config: {'_model_dir': '/content/GPTNeo/model_weights', '_tf_random_seed': None, '_save_summary_steps': 1, '_save_checkpoints_steps': None, '_save_checkpoints_secs': None, '_session_config': allow_soft_placement: true graph_options { rewrite_options { meta_optimizer_iterations: ONE } } , '_keep_checkpoint_max': 5, '_keep_checkpoint_every_n_hours': 10000, '_log_step_count_steps': None, '_train_distribute': None, '_device_fn': None, '_protocol': None, '_eval_distribute': None, '_experimental_distribute': None, '_experimental_max_worker_delay_secs': None, '_session_creation_timeout_secs': 7200, '_checkpoint_save_graph_def': True, '_service': None, '_cluster_spec': ClusterSpec({}), '_task_type': 'worker', '_task_id': 0, '_global_id_in_cluster': 0, '_master': '', '_evaluation_master': '', '_is_chief': True, '_num_ps_replicas': 0, '_num_worker_replicas': 1, '_tpu_config': TPUConfig(iterations_per_loop=1, num_shards=8, num_cores_per_replica=1, per_host_input_for_training=4, tpu_job_name=None, initial_infeed_sleep_secs=None, input_partition_dims=None, eval_training_input_configuration=2, experimental_host_call_every_n_steps=1, experimental_allow_per_host_v2_parallel_get_next=False, experimental_feed_hook=None), '_cluster': None}
_TPUContext: eval_on_tpu True
eval_on_tpu ignored because use_tpu is False.
From /usr/local/lib/python3.6/dist-packages/tensorflow/python/training/training_util.py:236: Variable.initialized_value (from tensorflow.python.ops.variables) is deprecated and will be removed in a future version.
Instructions for updating:
Use Variable.read_value. Variables in 2.X are initialized automatically both in eager and graph (inside tf.defun) contexts.
WARNING:root:Changing batch size with sequential_input() will result in some data being skipped or repeated. Please ensure your batch size stays constant throughout training.
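
A log like the one above is easier to act on once the libraries TensorFlow failed to load are pulled out of it. The snippet below is only a sketch for summarizing such output, not part of the GPTNeo code: the function names are hypothetical, the sample lines are abbreviated from the log above, and it makes no claim about why training stalled.

```python
import re

# Pattern for the dso_loader warning that appears in the log above.
MISSING_LIB = re.compile(r"Could not load dynamic library '([^']+)'")


def missing_libraries(log_text):
    """Return the distinct libraries TensorFlow could not load, in first-seen order."""
    seen = []
    for name in MISSING_LIB.findall(log_text):
        if name not in seen:
            seen.append(name)
    return seen


def gpu_registration_skipped(log_text):
    """True if TensorFlow reported that it skipped registering GPU devices."""
    return "Skipping registering GPU devices" in log_text


# Abbreviated excerpt of the log above, used only to exercise the helpers.
sample = "\n".join([
    "W dso_loader.cc:60] Could not load dynamic library 'libcudart.so.11.0'; dlerror: ...",
    "I dso_loader.cc:49] Successfully opened dynamic library libcufft.so.10",
    "W dso_loader.cc:60] Could not load dynamic library 'libcublas.so.11'; dlerror: ...",
    "W dso_loader.cc:60] Could not load dynamic library 'libcudart.so.11.0'; dlerror: ...",
    "W gpu_device.cc:1757] Cannot dlopen some GPU libraries.",
    "Skipping registering GPU devices...",
])

if __name__ == "__main__":
    print("Missing libraries:", missing_libraries(sample))
    print("GPU registration skipped:", gpu_registration_skipped(sample))
```

Run against the full log, this would list the CUDA 11 and cuDNN 8 libraries named in the warnings, alongside the fact that the GPU was not registered.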

StellaAthena commented 3 years ago

Where are you running this code? Are you using your own GPUs?

srulikbd commented 3 years ago

I tried both GPU and TPU on Colab. I also tried an AWS AMI instance with a V100. The bug I quoted is from the Colab GPU run.

StellaAthena commented 3 years ago

Sorry this slipped through the cracks. I assume you got everything working based on your PR?

srulikbd commented 3 years ago

Actually, it might still not work. I saw that you are focused on gpt-neox, so I switched over there :)