EleutherAI / gpt-neo

An implementation of model parallel GPT-2 and GPT-3-style models using the mesh-tensorflow library.
https://www.eleuther.ai
MIT License
8.21k stars 945 forks

Getting errors when running the command to generate text #190

Closed baljeetrathi closed 3 years ago

baljeetrathi commented 3 years ago

Hi,

I downloaded this pre-trained model on my system: https://the-eye.eu/public/AI/gptneo-release/GPT3_XL/

I modified the config.json file as shown below:

{
"n_head" : 16,
"n_vocab" : 50257,
"embed_dropout" : 0,
"lr" : 0.0002,
"lr_decay" : "cosine",
"warmup_steps" : 3000,
"beta1" : 0.9,
"beta2" : 0.95,
"epsilon" : 1e-08,
"opt_name" : "adam",
"weight_decay" : 0,
"train_batch_size" : 512,
"attn_dropout" : 0,
"train_steps" : 400000,
"lr_decay_end" : 300000,
"eval_steps" : 10,
"predict_steps" : 0,
"res_dropout" : 0,
"eval_batch_size" : 128,
"predict_batch_size" : 128,
"iterations" : 500,
"n_embd" : 2048,
"datasets" : [["pile", null, null, null]],
"model_path" : "D:\\gpt-neo\\the-eye.eu\\public\\AI\\gptneo-release\\GPT3_XL",
"n_ctx" : 2048,
"n_layer" : 24,
"scale_by_depth" : true,
"scale_by_in" : false,
"attention_types" : [[["global", "local"], 12]],
"mesh_shape" : "x:128,y:2",
"layout" : "batch:x,memory_length:y,embd:y",
"activation_function" : "gelu",
"recompute_grad" : true,
"gradient_clipping" : 1.0,
"tokens_per_mb_per_replica" : 4096,
"precision" : "bfloat16",
"padding_id" : 50257,
"eos_id" : 50256
}

After that, I ran the following command:

main.py --predict --prompt "I like Apples" --tpu "device:CPU:0" --model "D:\gpt-neo\the-eye.eu\public\AI\gptneo-release\GPT3_XL\config.json" 

Running the above command gives me the following errors:

[screenshots of the error output]

How should I proceed next?

Thanks. :)

StellaAthena commented 3 years ago

The code is unable to find a component of the CUDA library. I would start by double-checking that your CUDA installation is compatible with your TensorFlow version and your GPU.

baljeetrathi commented 3 years ago

Thanks @StellaAthena :)

I thought the CUDA library was only required when using a GPU, and not a TPU, in the command. I will install the library and try running the command again.

baljeetrathi commented 3 years ago

I installed the CUDA library, and the first error I get seems to be about rebuilding TensorFlow:

This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.

Here is the detailed log:

WARNING:tensorflow:From D:\gpt-neo\text-gen\env\lib\site-packages\tensorflow\python\compat\v2_compat.py:96: disable_resource_variables (from tensorflow.python.ops.variable_scope) is deprecated and will be removed in a future version.
Instructions for updating:
non-resource variables are not supported in the long term
Current step 362000
Saving config to D:\gpt-neo\the-eye.eu\public\AI\gptneo-release\GPT3_XL
2021-04-03 14:04:47.198757: I tensorflow/core/platform/cpu_feature_guard.cc:142] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2021-04-03 14:04:47.200719: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library nvcuda.dll
2021-04-03 14:04:47.221016: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1720] Found device 0 with properties:
pciBusID: 0000:01:00.0 name: NVIDIA GeForce GTX 1050 Ti computeCapability: 6.1
coreClock: 1.43GHz coreCount: 6 deviceMemorySize: 4.00GiB deviceMemoryBandwidth: 104.43GiB/s
2021-04-03 14:04:47.221169: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library cudart64_110.dll
2021-04-03 14:04:47.228380: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library cublas64_11.dll
2021-04-03 14:04:47.228506: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library cublasLt64_11.dll
2021-04-03 14:04:47.232167: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library cufft64_10.dll
2021-04-03 14:04:47.233395: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library curand64_10.dll
2021-04-03 14:04:47.234645: W tensorflow/stream_executor/platform/default/dso_loader.cc:60] Could not load dynamic library 'cusolver64_10.dll'; dlerror: cusolver64_10.dll not found
2021-04-03 14:04:47.237890: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library cusparse64_11.dll
2021-04-03 14:04:47.239166: W tensorflow/stream_executor/platform/default/dso_loader.cc:60] Could not load dynamic library 'cudnn64_8.dll'; dlerror: cudnn64_8.dll not found
2021-04-03 14:04:47.239275: W tensorflow/core/common_runtime/gpu/gpu_device.cc:1757] Cannot dlopen some GPU libraries. Please make sure the missing libraries mentioned above are installed properly if you would like to use GPU. Follow the guide at https://www.tensorflow.org/install/gpu for how to download and setup the required libraries for your platform.
Skipping registering GPU devices...
2021-04-03 14:04:47.295157: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1261] Device interconnect StreamExecutor with strength 1 edge matrix:
2021-04-03 14:04:47.295261: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1267]      0
2021-04-03 14:04:47.295775: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1280] 0:   N
2021-04-03 14:04:47.295861: I tensorflow/compiler/jit/xla_gpu_device.cc:99] Not creating XLA devices, tf_xla_enable_xla_devices not set
2021-04-03 14:04:47.347030: I tensorflow/compiler/mlir/mlir_graph_optimization_pass.cc:196] None of the MLIR optimization passes are enabled (registered 0 passes)
Done!

There are probably problems with other CUDA-related .dll files. I couldn't fully understand what is required.

What should I do now?

Thanks.

baljeetrathi commented 3 years ago

Update:

I also installed cuDNN and now I run the following command:

main.py --predict --prompt "..\text\apple.txt" --gpu_ids "device:GPU:0" --model "D:\gpt-neo\the-eye.eu\public\AI\gptneo-release\GPT3_XL\config.json"

The errors now look like the image below:

[screenshot of the error output]

Why am I not seeing any output if the prediction loop is marked as finished?

Thanks. :)

StellaAthena commented 3 years ago

Try changing line 20 to read

mesh_shape = [("all_processors", 1)]

The predictions are written to a file, so if it's really predicting successfully you should be able to open that file even if the function crashes.
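To check whether any predictions actually landed on disk, a small stdlib-only sketch like the following can list candidate output files. The `predictions*` filename pattern is an assumption, not something this repo guarantees; adjust it to whatever your run actually writes.

```python
import glob
import os


def find_prediction_files(model_dir):
    """Return files in model_dir whose names suggest prediction output.

    The "predictions*" pattern is a guess at the output filename;
    change it if your run writes somewhere else.
    """
    pattern = os.path.join(model_dir, "predictions*")
    return sorted(glob.glob(pattern))


# Example usage with the model directory from this thread:
# for path in find_prediction_files(r"D:\gpt-neo\the-eye.eu\public\AI\gptneo-release\GPT3_XL"):
#     print(path)
```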

baljeetrathi commented 3 years ago

Hi @StellaAthena :)

Which file should I change? The error messages don't mention any line 20, so I looked at line 20 in main.py, but it was empty.

I searched for mesh_shape in the file and found

mesh_shape = mtf.convert_to_shape(params["mesh_shape"])

So, I replaced it with

mesh_shape = [("all_processors", 1)]

However, this gave me some new errors:

[screenshot of the error output]

Thanks.

StellaAthena commented 3 years ago

Oh, I meant in the Colab notebook. I assumed that you were using that. Are you using a local GPU?

baljeetrathi commented 3 years ago

Yes, I am on a local GPU. I am using Intel Core i5 9400F and Nvidia 1050Ti if that helps. :)

StellaAthena commented 3 years ago

Are you using the HuggingFace transformers library or are you using this repo directly?

baljeetrathi commented 3 years ago

I am using the repo directly. :)

I would prefer to get it to work without using any extra services.

StellaAthena commented 3 years ago

I would strongly recommend the transformers library, tbh. It's not a "service," it's a Python package.

Anyways, that modification should go in your config file, not in the code itself. The model config file defines the mesh shape.
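Since this config encodes the mesh shape as a "dimension:size" string (e.g. "x:128,y:2"), the single-device change would presumably look something like the fragment below in config.json. This is a sketch, not a tested configuration: the empty layout string is an assumption, since the original layout refers to the x and y mesh dimensions, which no longer exist.

```json
"mesh_shape" : "all_processors:1",
"layout" : ""
```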

baljeetrathi commented 3 years ago

Thanks @StellaAthena :)

I will give transformers a try. I thought I would have to sign up for a HuggingFace account to use the package. :)

baljeetrathi commented 3 years ago

Should I delete the pre-trained model that I downloaded from https://the-eye.eu/public/AI/gptneo-release/GPT3_XL/ or is it needed by HuggingFace?

Thanks.

baljeetrathi commented 3 years ago

@StellaAthena I installed the transformers library in my virtual environment for GPTNeo. After a successful installation (https://huggingface.co/transformers/installation.html), I created a generate.py file in the GPTNeo directory with the following code from the installation guide:

from transformers import pipeline
print(pipeline('sentiment-analysis')('we love you'))

The output was:

[{'label': 'POSITIVE', 'score': 0.9998704791069031}]

Now, I tried to use some other code, listed here (https://huggingface.co/EleutherAI/gpt-neo-2.7B):

from transformers import pipeline
generator = pipeline('text-generation', model='EleutherAI/gpt-neo-2.7B')
generator("EleutherAI has", do_sample=True, min_length=50)

However, placing the above code in generate.py and running it gives me the following error:

Traceback (most recent call last):
  File "D:\gpt-neo\text-gen\GPTNeo\generate.py", line 2, in <module>
    generator = pipeline('text-generation', model='EleutherAI/gpt-neo-2.7B')
  File "D:\gpt-neo\text-gen\env\lib\site-packages\transformers\pipelines.py", line 3231, in pipeline
    framework = framework or get_framework(model)
  File "D:\gpt-neo\text-gen\env\lib\site-packages\transformers\pipelines.py", line 109, in get_framework
    model = TFAutoModel.from_pretrained(model, revision=revision)
  File "D:\gpt-neo\text-gen\env\lib\site-packages\transformers\models\auto\modeling_tf_auto.py", line 575, in from_pretrained
    pretrained_model_name_or_path, return_unused_kwargs=True, **kwargs
  File "D:\gpt-neo\text-gen\env\lib\site-packages\transformers\models\auto\configuration_auto.py", line 352, in from_pretrained
    config_class = CONFIG_MAPPING[config_dict["model_type"]]
KeyError: 'gpt_neo'

Am I missing something?

Thanks. :)

StellaAthena commented 3 years ago

Should I delete the pre-trained model that I downloaded from https://the-eye.eu/public/AI/gptneo-release/GPT3_XL/ or is it needed by HuggingFace?

Delete it.

Am I missing something?

I think HF hasn't made an official release since adding Neo. Try pip install git+https://github.com/huggingface/transformers and then re-execute the code.
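After upgrading, a quick heuristic like the one below can confirm the installed transformers version is new enough to know the `gpt_neo` model type. The 4.5 cutoff is an assumption based on when GPT-Neo support first shipped in a release; a git install reporting a dev version of 4.5 or later should also pass.

```python
def supports_gpt_neo(version_string):
    """Heuristically check whether a transformers version string is at
    least 4.5, around when the 'gpt_neo' model type was added (assumption).

    Only the first two numeric components are compared, so dev/rc
    suffixes on later components are ignored.
    """
    major, minor = (int(part) for part in version_string.split(".")[:2])
    return (major, minor) >= (4, 5)


# Usage (requires transformers to be installed):
# import transformers
# print(supports_gpt_neo(transformers.__version__))
```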

baljeetrathi commented 3 years ago

Hi @StellaAthena

Thanks for your help. It no longer gives any errors. However, as far as I can tell, it also doesn't generate predictions.

Here is the code inside my generate.py:

from transformers import pipeline
generator = pipeline('text-generation', model='EleutherAI/gpt-neo-2.7B')
generator("EleutherAI has", do_sample=True, min_length=50)

When I run it through CMD, it shows a message about loading some CUDA DLLs, and then after a minute or so my SSD and RAM usage spike to 100% and stay there. I tried it three times and waited about 10-15 minutes each time, but the RAM and SSD usage never came down. I had to restart the PC using the power button each time because it stopped responding.

Is this normal?

I have installed Python in D (HDD) and my Windows is in C (SSD) if that helps.

Thanks. :)

StellaAthena commented 3 years ago

What specs do you have on your computer? It's possible that your machine just can't handle a 2.7B model. People with consumer GPU cards have generally had more success with the 1.3B version.
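Some back-of-the-envelope arithmetic supports this: just holding the 2.7B model's weights in memory is substantial, before counting activations or framework overhead. A minimal sketch, assuming fp32 weights (the checkpoint itself is bfloat16, but loading often promotes to fp32):

```python
def weights_size_gib(n_params, bytes_per_param=4):
    """Estimate the memory needed just to hold model weights, in GiB.

    bytes_per_param defaults to 4 (fp32); use 2 for fp16/bfloat16.
    This ignores activations, optimizer state, and framework overhead.
    """
    return n_params * bytes_per_param / (1024 ** 3)


print(round(weights_size_gib(2.7e9), 1))  # 2.7B params, fp32 -> 10.1
print(round(weights_size_gib(1.3e9), 1))  # 1.3B params, fp32 -> 4.8
```

On a 16 GB machine, roughly 10 GiB of weights plus Python, TensorFlow/PyTorch, and the OS leaves very little headroom, which is consistent with heavy swapping.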

baljeetrathi commented 3 years ago

My CPU is Intel Core i5-9400F and my GPU is Nvidia 1050Ti. Are they not enough?

The thing is, I don't see any CPU (5-10%) or GPU (1-2%) usage in the Task Manager. Only the RAM (16GB) and the SSD are at 100% usage.

chris-aeviator commented 3 years ago

I can confirm being able to run inference on the 2.7B model via transformers with no CUDA (CPU only). I scaled up via Proxmox until the model loaded correctly and inference returned a result. The machine is a 64 GB, 24-core Xeon Gold server, but scaling down lets me simulate a smaller machine.

My current runnable configuration is: [screenshot]

running generator("EleutherAI has chosen a good place for the cherry tree, though", do_sample=True, min_length=50) gives me

[screenshot of the generated text]

in about 1 min.

Some queries take substantially longer, especially when setting min_length & max_length.

baljeetrathi commented 3 years ago

Thanks @chris-aeviator :)

That gives me some perspective on the needed specs. I guess my RAM was on the lower side. I still don't understand the 100% SSD usage, though. There was no significant uptick in GPU or CPU usage either.

chris-aeviator commented 3 years ago

Thanks @chris-aeviator :)

That gives me some perspective on the needed specs. I guess my RAM was on the lower side. I still don't understand the 100% SSD usage, though. There was no significant uptick in GPU or CPU usage either.

What happens, at a minimum, is that the model (10 GB) gets copied into RAM. After that, four cores run at medium utilization on the generator command for me.

I'm even running this from (15k RPM) SAS HDDs.

@brianherman I'm fairly sure your swap is filled (onto the SSD) and PyTorch/TensorFlow is busy trying to avoid an out-of-memory kill from the kernel.

baljeetrathi commented 3 years ago

Thanks @chris-aeviator for clearing up my doubts. :)