aws / studio-lab-examples

Example notebooks for working with SageMaker Studio Lab. Sign up for an account at the link below!
https://studiolab.sagemaker.aws

'GPU not available error' even when starting the project in 'GPU mode' #243

Closed (ageek closed this 8 months ago)

ageek commented 9 months ago

Hi, I'm trying to run the following notebook on Studio Lab: https://colab.research.google.com/drive/1VoYNfYDKcKRQRor98Zbf2-9VQTtGJ24k?usp=sharing. After running the initial steps, when I run the code block

```python
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                      # load model in 4-bit precision
    bnb_4bit_quant_type="nf4",              # pre-trained model should be quantized in 4-bit NF format
    bnb_4bit_use_double_quant=True,         # Using double quantization as mentioned in QLoRA paper
    bnb_4bit_compute_dtype=torch.bfloat16,  # During computation, pre-trained model should be loaded in BF16 format
)

model_name = 'google/flan-t5-base'
original_model = AutoModelForSeq2SeqLM.from_pretrained(
    model_name,
    quantization_config=bnb_config,
    device_map='auto',
    trust_remote_code=True
)
tokenizer = AutoTokenizer.from_pretrained(model_name)
```

it throws the error below:

```
RuntimeError                              Traceback (most recent call last)
/tmp/ipykernel_155/4023862644.py in <cell line: 12>()
     11
---> 12 original_model = AutoModelForSeq2SeqLM.from_pretrained(
     13     model_name,
     14     quantization_config=bnb_config,

~/.conda/envs/default/lib/python3.9/site-packages/transformers/models/auto/auto_factory.py in from_pretrained(cls, pretrained_model_name_or_path, *model_args, **kwargs)
    564         elif type(config) in cls._model_mapping.keys():
    565             model_class = _get_model_class(config, cls._model_mapping)
--> 566             return model_class.from_pretrained(
    567                 pretrained_model_name_or_path, *model_args, config=config, **hub_kwargs, **kwargs
    568             )

~/.conda/envs/default/lib/python3.9/site-packages/transformers/modeling_utils.py in from_pretrained(cls, pretrained_model_name_or_path, config, cache_dir, ignore_mismatched_sizes, force_download, local_files_only, token, revision, use_safetensors, *model_args, **kwargs)
   2895         if load_in_8bit or load_in_4bit:
   2896             if not torch.cuda.is_available():
-> 2897                 raise RuntimeError("No GPU found. A GPU is needed for quantization.")
   2898             if not (is_accelerate_available() and is_bitsandbytes_available()):
   2899                 raise ImportError(

RuntimeError: No GPU found. A GPU is needed for quantization.
```

I checked GPU availability using `!nvidia-smi`:

```
Tue Dec 19 06:43:59 2023
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.57.02    Driver Version: 470.57.02    CUDA Version: 11.4     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla T4            Off  | 00000000:00:1E.0 Off |                    0 |
| N/A   26C    P8     9W /  70W |      0MiB / 15109MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
```

The GPU is not being used even though it is available. What exactly is going wrong here?

icoxfog417 commented 8 months ago

Could you please try `sagemaker-distribution`? I confirmed `torch.cuda.is_available()` is `True` in this environment.

```
(studiolab) studio-lab-user@default:~$ conda activate sagemaker-distribution
(sagemaker-distribution) studio-lab-user@default:~$ python
Python 3.8.17 | packaged by conda-forge | (default, Jun 16 2023, 07:06:00)
[GCC 11.4.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import torch
>>> torch.cuda.is_available()
True
```
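
The same check works from inside a notebook cell; a minimal sketch, assuming only that PyTorch is installed in the active kernel:

```python
# Quick sanity check for the active kernel (assumes PyTorch is installed).
# If this prints False while `!nvidia-smi` shows a GPU, the kernel's PyTorch
# build cannot reach CUDA, and loading with load_in_4bit will raise
# "No GPU found. A GPU is needed for quantization."
import sys
import torch

print(sys.executable)             # which conda environment the kernel is using
print(torch.__version__)          # installed PyTorch build
print(torch.version.cuda)         # None means a CPU-only build
print(torch.cuda.is_available())  # must be True before 4-bit quantized loading
```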
ageek commented 8 months ago

Thanks for your response. Switching to the 'sagemaker-distribution' kernel (instead of the 'default' one) actually fixed the problem.
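
For sessions where only a CPU runtime (or the 'default' kernel) is available, one possible workaround is to guard the 4-bit load. The sketch below assumes the same transformers/bitsandbytes setup as in the original notebook; the CPU fallback branch is an assumption rather than part of that notebook:

```python
# Guarded load: use 4-bit (bitsandbytes) quantization only when CUDA is
# available, otherwise fall back to a plain full-precision load on CPU.
# The fallback branch is an assumed workaround, not part of the original notebook.
import torch
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer, BitsAndBytesConfig

model_name = 'google/flan-t5-base'

if torch.cuda.is_available():
    bnb_config = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_quant_type="nf4",
        bnb_4bit_use_double_quant=True,
        bnb_4bit_compute_dtype=torch.bfloat16,
    )
    original_model = AutoModelForSeq2SeqLM.from_pretrained(
        model_name,
        quantization_config=bnb_config,
        device_map='auto',
        trust_remote_code=True,
    )
else:
    # No GPU: load the full-precision weights on CPU instead of quantizing.
    original_model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

tokenizer = AutoTokenizer.from_pretrained(model_name)
```

With the fallback, flan-t5-base loads in full precision on CPU, which is slower but avoids the RuntimeError.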

You have to be very lucky to get the 'GPU' compute type, though. In the last 3+ weeks I was able to get GPU access only 3 times. I find this platform much better than others (Google Colab, Kaggle, etc.) and would like to use it more often, and I'm hopeful that 'GPU' compute type availability will improve in the future. I'm really thankful to you for providing free GPU hours for students and AI researchers like me.

icoxfog417 commented 8 months ago

I am glad to hear you solved the problem! We are working hard to allocate our limited GPUs (getting a GPU is hard work, even for me). Thank you for your patience.