Closed baljeetrathi closed 3 years ago
The code is unable to find a component of the CUDA library. I would start by double-checking that your CUDA installation is compatible with your PyTorch version and your GPU.
Thanks @StellaAthena :)
I thought the CUDA library was only required when using a GPU, not a TPU, in the command. I will install the library and try running the command again.
I installed the CUDA library, and the first error I get seems to be about rebuilding TensorFlow:
This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: AVX2
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
Here is the detailed log:
WARNING:tensorflow:From D:\gpt-neo\text-gen\env\lib\site-packages\tensorflow\python\compat\v2_compat.py:96: disable_resource_variables (from tensorflow.python.ops.variable_scope) is deprecated and will be removed in a future version.
Instructions for updating:
non-resource variables are not supported in the long term
Current step 362000
Saving config to D:\gpt-neo\the-eye.eu\public\AI\gptneo-release\GPT3_XL
2021-04-03 14:04:47.198757: I tensorflow/core/platform/cpu_feature_guard.cc:142] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: AVX2
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2021-04-03 14:04:47.200719: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library nvcuda.dll
2021-04-03 14:04:47.221016: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1720] Found device 0 with properties:
pciBusID: 0000:01:00.0 name: NVIDIA GeForce GTX 1050 Ti computeCapability: 6.1
coreClock: 1.43GHz coreCount: 6 deviceMemorySize: 4.00GiB deviceMemoryBandwidth: 104.43GiB/s
2021-04-03 14:04:47.221169: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library cudart64_110.dll
2021-04-03 14:04:47.228380: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library cublas64_11.dll
2021-04-03 14:04:47.228506: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library cublasLt64_11.dll
2021-04-03 14:04:47.232167: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library cufft64_10.dll
2021-04-03 14:04:47.233395: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library curand64_10.dll
2021-04-03 14:04:47.234645: W tensorflow/stream_executor/platform/default/dso_loader.cc:60] Could not load dynamic library 'cusolver64_10.dll'; dlerror: cusolver64_10.dll not found
2021-04-03 14:04:47.237890: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library cusparse64_11.dll
2021-04-03 14:04:47.239166: W tensorflow/stream_executor/platform/default/dso_loader.cc:60] Could not load dynamic library 'cudnn64_8.dll'; dlerror: cudnn64_8.dll not found
2021-04-03 14:04:47.239275: W tensorflow/core/common_runtime/gpu/gpu_device.cc:1757] Cannot dlopen some GPU libraries. Please make sure the missing libraries mentioned above are installed properly if you would like to use GPU. Follow the guide at https://www.tensorflow.org/install/gpu for how to download and setup the required libraries for your platform.
Skipping registering GPU devices...
2021-04-03 14:04:47.295157: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1261] Device interconnect StreamExecutor with strength 1 edge matrix:
2021-04-03 14:04:47.295261: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1267] 0
2021-04-03 14:04:47.295775: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1280] 0: N
2021-04-03 14:04:47.295861: I tensorflow/compiler/jit/xla_gpu_device.cc:99] Not creating XLA devices, tf_xla_enable_xla_devices not set
2021-04-03 14:04:47.347030: I tensorflow/compiler/mlir/mlir_graph_optimization_pass.cc:196] None of the MLIR optimization passes are enabled (registered 0 passes)
Done!
There are probably problems with other CUDA-related .dll files as well. I couldn't fully understand what is required.
What should I do now?
Thanks.
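For anyone debugging this on Windows: a small ctypes probe can show which of the libraries from the log above the system loader can actually find. This is a sketch of my own (not part of the repo); the DLL names are copied from the TensorFlow log, so adjust them to your CUDA version.

```python
import ctypes

def probe_dlls(names, loader=None):
    """Return {name: True/False} depending on whether each library loads."""
    if loader is None:
        # WinDLL exists only on Windows; fall back to CDLL elsewhere.
        loader = getattr(ctypes, "WinDLL", ctypes.CDLL)
    results = {}
    for name in names:
        try:
            loader(name)
            results[name] = True
        except OSError:
            results[name] = False
    return results

# DLL names taken from the TensorFlow log above.
cuda_dlls = [
    "cudart64_110.dll", "cublas64_11.dll", "cublasLt64_11.dll",
    "cufft64_10.dll", "curand64_10.dll", "cusolver64_10.dll",
    "cusparse64_11.dll", "cudnn64_8.dll",
]
for name, ok in probe_dlls(cuda_dlls).items():
    print(("OK     " if ok else "MISSING"), name)
```

Anything reported MISSING needs to be installed or added to PATH before TensorFlow will register the GPU.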
Update:
I also installed cuDNN and now I run the following command:
main.py --predict --prompt "..\text\apple.txt" --gpu_ids "device:GPU:0" --model "D:\gpt-neo\the-eye.eu\public\AI\gptneo-release\GPT3_XL\config.json"
The errors now look like the image below:
Why am I not seeing any output if the prediction loop is marked as finished?
Thanks. :)
Try changing line 20 to read
mesh_shape = [("all_processors", 1)]
The predictions are written to a file so if it’s really successfully predicting you should be able to open that file even if the function crashes.
Hi @StellaAthena :)
Which file should I change? The error messages don't mention any line 20, so I went ahead and looked at line 20 in main.py, but it was empty.
I searched for mesh_shape in the file and found
mesh_shape = mtf.convert_to_shape(params["mesh_shape"])
So, I replaced it with
mesh_shape = [("all_processors", 1)]
However, this gave me some new errors:
Thanks.
Oh, I meant in the Colab notebook. I assumed that you were using that. Are you using a local GPU?
Yes, I am on a local GPU. I am using Intel Core i5 9400F and Nvidia 1050Ti if that helps. :)
Are you using the HuggingFace transformers library, or are you using this repo directly?
I am using the repo directly. :)
I would prefer to get it to work without using any extra services.
I would strongly recommend the transformers library, tbh. It's not a "service," it's a Python package.
Anyways, that modification should go in your config file, not in the code itself. The model config file defines the mesh shape.
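I can't verify the exact schema from this thread alone, but since main.py parses the value with mtf.convert_to_shape(params["mesh_shape"]) (which accepts "name:size" strings), the config-file version of that change would plausibly be something like:

```json
"mesh_shape": "all_processors:1"
```

i.e., edit the mesh_shape entry in the downloaded GPT3_XL config.json rather than changing main.py.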
Thanks @StellaAthena :)
I will give transformers a try. I thought I would have to sign up for a HuggingFace account to use the package. :)
Should I delete the pre-trained model that I downloaded from https://the-eye.eu/public/AI/gptneo-release/GPT3_XL/ or is it needed by HuggingFace?
Thanks.
@StellaAthena I installed the transformers library in my virtual environment for GPTNeo. After a successful installation (https://huggingface.co/transformers/installation.html), I created a generate.py file in the GPTNeo directory with the following code from the installation guide:
from transformers import pipeline
print(pipeline('sentiment-analysis')('we love you'))
The output was:
[{'label': 'POSITIVE', 'score': 0.9998704791069031}]
Now, I tried to use some other code, listed here (https://huggingface.co/EleutherAI/gpt-neo-2.7B):
from transformers import pipeline
generator = pipeline('text-generation', model='EleutherAI/gpt-neo-2.7B')
generator("EleutherAI has", do_sample=True, min_length=50)
However, placing the above code in generate.py and running it gives me the following error:
Traceback (most recent call last):
File "D:\gpt-neo\text-gen\GPTNeo\generate.py", line 2, in <module>
generator = pipeline('text-generation', model='EleutherAI/gpt-neo-2.7B')
File "D:\gpt-neo\text-gen\env\lib\site-packages\transformers\pipelines.py", line 3231, in pipeline
framework = framework or get_framework(model)
File "D:\gpt-neo\text-gen\env\lib\site-packages\transformers\pipelines.py", line 109, in get_framework
model = TFAutoModel.from_pretrained(model, revision=revision)
File "D:\gpt-neo\text-gen\env\lib\site-packages\transformers\models\auto\modeling_tf_auto.py", line 575, in from_pretrained
pretrained_model_name_or_path, return_unused_kwargs=True, **kwargs
File "D:\gpt-neo\text-gen\env\lib\site-packages\transformers\models\auto\configuration_auto.py", line 352, in from_pretrained
config_class = CONFIG_MAPPING[config_dict["model_type"]]
KeyError: 'gpt_neo'
Am I missing something?
Thanks. :)
Should I delete the pre-trained model that I downloaded from https://the-eye.eu/public/AI/gptneo-release/GPT3_XL/ or is it needed by HuggingFace?
Delete it.
Am I missing something?
I think HF hasn't made an official release since adding Neo. Try pip install git+https://github.com/huggingface/transformers and then re-execute the code.
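The KeyError in the traceback comes from the CONFIG_MAPPING lookup shown there, so you can check whether an installed build supports a given model type by replicating that lookup. This is a sketch; supports_model_type is my own helper, not a transformers API.

```python
def supports_model_type(config_mapping, model_type):
    """Mirror the lookup from the traceback: pipeline() raises
    KeyError when model_type is missing from CONFIG_MAPPING."""
    return model_type in config_mapping

# Usage against a real install (requires transformers):
#   from transformers.models.auto.configuration_auto import CONFIG_MAPPING
#   print(supports_model_type(CONFIG_MAPPING, "gpt_neo"))
# False would mean the installed release predates GPT-Neo support.
```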
Hi @StellaAthena
Thanks for your help. It no longer gives any errors. However, it also doesn't generate predictions as far as I know.
Here is the code inside my generate.py:
from transformers import pipeline
generator = pipeline('text-generation', model='EleutherAI/gpt-neo-2.7B')
generator("EleutherAI has", do_sample=True, min_length=50)
When I run it through CMD, it shows me a message about loading some CUDA DLL, and then after a minute or so my SSD and RAM usage spike to 100% and stay there. I tried it three times and waited about 10-15 minutes each time, but the RAM and SSD usage never came down. I had to restart the PC with the power button each time because it stopped responding.
Is this normal?
I have installed Python on D: (HDD) and my Windows is on C: (SSD), if that helps.
Thanks. :)
What specs do you have on your computer? It’s possible that you just can’t handle a 2.7B model. People with consumer GPU cards have generally had more success with the 1.3B version.
My CPU is Intel Core i5-9400F and my GPU is Nvidia 1050Ti. Are they not enough?
The thing is I don't see any CPU (5-10%) or GPU(1-2%) usage inside the task manager. Only the RAM (16GB) and the SSD are at 100% usage.
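As a rough sanity check (fp32 weights only; activations and framework overhead excluded), the weight tensors alone nearly fill 16 GB of RAM for the 2.7B model:

```python
def model_ram_gib(n_params, bytes_per_param=4):
    """Rough memory footprint of the raw fp32 weights, in GiB."""
    return n_params * bytes_per_param / 2**30

print(f"2.7B model: ~{model_ram_gib(2.7e9):.1f} GiB")  # ~10.1 GiB
print(f"1.3B model: ~{model_ram_gib(1.3e9):.1f} GiB")  # ~4.8 GiB
```

With only 16 GB of RAM, loading ~10 GiB of weights plus the Python process and OS leaves little headroom, which would push the system into heavy swapping, consistent with the 100% SSD usage.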
I can confirm being able to run inference on the transformers 2.7B model with no CUDA (CPU only). I scaled via Proxmox until the model loaded correctly and inference returned a result. The machine is a 64 GB, 24-core Xeon Gold server, but scaling down lets me simulate a smaller machine.
My current runnable configuration is
Running generator("EleutherAI has chosen a good place for the cherry tree, though", do_sample=True, min_length=50) returns a result in about 1 min.
Some queries take substantially longer, especially when setting min_length & max_length.
Thanks @chris-aeviator :)
That gives me some perspective on the needed specs. I guess my RAM was on the lower side. I still don't understand the 100% SSD usage, though. There was no significant uptick in GPU or CPU usage either.
What happens at a minimum is that the model (10GB) gets copied into RAM. After that, 4 cores run at medium utilization on the generator command for me.
I'm even running this from (15k) SAS HDDs.
@brianherman I'm fairly sure your swap has filled (spilling onto the SSD) and pytorch/tflow is busy trying to avoid an out-of-memory kill from the kernel.
Thanks @chris-aeviator for clearing up my doubts. :)
Hi,
I downloaded this pre-trained model on my system: https://the-eye.eu/public/AI/gptneo-release/GPT3_XL/
I modified the config.json file to look like below. After that, I ran the following command:
Running the above command gives me the following errors:
How should I proceed next?
Thanks. :)