CUDA out of memory falcon-40b when using 40Gi A100 GPU

Been trying to run quantization for falcon-40b on a box with 8 40Gi A100's but I keep getting CUDA memory errors. The readme states that this should be possible, unless I'm misreading this line:

It may successfully run on GPUs with 32 - 40GB for perplexity evaluation of up to LLaMA-65B and Falcon-40B models.

Here's the command I'm running

python main.py falcon_model/models--tiiuae--falcon-40b/snapshots/c47b371b31a68349c233104050ac76680b8485db custom \
  --custom_data_path=data/refined_web_n=128.pth \
  --wbits 4 \
  --groupsize 16 \
  --perchannel \
  --qq_scale_bits 3 \
  --qq_zero_bits 3 \
  --qq_groupsize 16 \
  --outlier_threshold=0.2 \
  --permutation_order act_order \
  --percdamp 1e0 \
  --nsamples 128

Here's the full command output:

/home/ubuntu/.local/lib/python3.8/site-packages/pandas/core/computation/expressions.py:20: UserWarning: Pandas requires version '2.7.3' or newer of 'numexpr' (version '2.7.1' currently installed).
  from pandas.core.computation.check import NUMEXPR_INSTALLED
============  Loading model... ============
--------------------------------------------------------------------------
WARNING: No preset parameters were found for the device that Open MPI
detected:

  Local host:            ubuntu
  Device name:           mlx5_0
  Device vendor ID:      0x02c9
  Device vendor part ID: 4122

Default device parameters will be used, which may result in lower
performance.  You can edit any of the files specified by the
btl_openib_device_param_files MCA parameter to set values for your
device.

NOTE: You can turn off this warning by setting the MCA parameter
      btl_openib_warn_no_device_params_found to 0.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
No OpenFabrics connection schemes reported that they were able to be
used on a specific port.  As such, the openib BTL (OpenFabrics
support) will be disabled for this port.

  Local host:           ubuntu
  Local device:         mlx5_0
  Local port:           1
  CPCs attempted:       udcm
--------------------------------------------------------------------------
Loading checkpoint shards: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 9/9 [00:47<00:00,  5.23s/it]

============ Quantizing model... ============
Loading data ...

Starting SPQR quantization ...
catching inputs from data

---------------- Layer 0 of 60 ----------------
layer_dev_original=device(type='cpu')
Quantizing module self_attention.query_key_value of layer 0                                                                                                                                                                                                                              
Quantizing module self_attention.dense of layer 0                                                                                                                                                                                                                                        
Quantizing module mlp.dense_h_to_4h of layer 0                                                                                                                                                                                                                                           
Quantizing module mlp.dense_4h_to_h of layer 0                                                                                                                                                                                                                                           
Traceback (most recent call last):
  File "main.py", line 549, in <module>
    quantize_model(model, args, device)
  File "main.py", line 73, in quantize_model
    results = quantize_spqr(model, dataloader, args, device)
  File "/home/ubuntu/.local/lib/python3.8/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "main.py", line 217, in quantize_spqr
    quantized = spqr_handlers[sublayer_name].quantize(
  File "/home/ubuntu/SpQR/spqr_engine.py", line 84, in quantize
    H = H[perm][:, perm]
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 4.00 GiB (GPU 0; 39.56 GiB total capacity; 33.54 GiB already allocated; 2.80 GiB free; 35.32 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

Is there something I'm doing wrong when launching the command?

Vahe1994 / SpQR

CUDA out of memory falcon-40b when using 40Gi A100 GPU #22