HazyResearch / hyena-dna

Official implementation for HyenaDNA, a long-range genomic foundation model built with Hyena
https://arxiv.org/abs/2306.15794
Apache License 2.0
532 stars 74 forks

CUFFT-type error when running huggingface.py to generate embeddings #40

Closed: salvatoreloguercio closed this issue 6 months ago

salvatoreloguercio commented 6 months ago

Hello, I am using a slightly modified version of the huggingface.py script to generate embeddings from fasta files. I am using the largest model (1Mb window size), running on an A100 80GB.

I just added a loop at the end of huggingface.py which loads fasta files and gets embeddings:

for record in records:
    print(record.id)
    sequence = str(record.seq)[0:max_length]
    tok_seq = tokenizer(sequence)
    tok_seq = tok_seq["input_ids"]  # grab ids

    # place on device, convert to tensor
    tok_seq = torch.LongTensor(tok_seq).unsqueeze(0)  # unsqueeze for batch dim
    tok_seq = tok_seq.to(device)

    # prep model and forward
    model.to(device)
    model.eval()
    with torch.inference_mode():
        embeddings = model(tok_seq)

However, after a few hundred iterations I get the following cuFFT error, which seems related to out-of-memory issues:

Traceback (most recent call last):
  File "huggingface_1Mbp.py", line 271, in <module>
    embeddings = model(tok_seq)
  File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/hyena-dna/standalone_hyenadna.py", line 914, in forward
    hidden_states = self.backbone(input_ids, position_ids=position_ids)
  File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/hyena-dna/standalone_hyenadna.py", line 728, in forward
    hidden_states, residual = layer(hidden_states, residual)
  File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/hyena-dna/standalone_hyenadna.py", line 530, in forward
    hidden_states = self.mixer(hidden_states, **mixer_kwargs)
  File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/hyena-dna/standalone_hyenadna.py", line 288, in forward
    v = self.filter_fn(v, l_filter, k=k[o], bias=bias[o])
  File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/hyena-dna/standalone_hyenadna.py", line 222, in forward
    y = fftconv(x, k, bias)
  File "/home/hyena-dna/standalone_hyenadna.py", line 53, in fftconv
    k_f = torch.fft.rfft(k, n=fft_size) / fft_size
RuntimeError: cuFFT error: CUFFT_ALLOC_FAILED

So I was wondering: is there a way to flush the memory between iterations, to prevent this kind of error? Thanks!
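(Editorial note, not from the thread: PyTorch keeps both a caching CUDA allocator and a per-device cuFFT plan cache, and with 1M-point FFTs each cached plan can hold a sizeable workspace, so both are worth flushing between iterations. A minimal sketch of such a cleanup helper; the function name `free_cuda_memory` is made up for illustration:)

```python
import gc

try:
    import torch
    _HAS_TORCH = True
except ImportError:  # torch absent (e.g. in a CPU-only test env): degrade to plain gc
    _HAS_TORCH = False


def free_cuda_memory() -> bool:
    """Best-effort release of GPU memory between inference iterations."""
    gc.collect()  # drop unreachable Python-side tensor references first
    if _HAS_TORCH and torch.cuda.is_available():
        # return cached allocator blocks to the driver
        torch.cuda.empty_cache()
        # drop cached cuFFT plans; every distinct FFT size adds a new plan
        torch.backends.cuda.cufft_plan_cache.clear()
    return True
```

Calling this at the end of each loop iteration (after moving results off the GPU with `embeddings.cpu()` if you keep them) may avoid the `CUFFT_ALLOC_FAILED`. Capping the plan cache, e.g. `torch.backends.cuda.cufft_plan_cache.max_size = 2`, is another option when sequence lengths vary, since each distinct length creates a new cached plan.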

salvatoreloguercio commented 6 months ago

It might be related to CUDA 11.7: see https://discord.com/channels/1125706816479821874/1125706817016696926/1128087480021823518

gonzalobenegas commented 5 months ago

Were you able to solve this? I run into similar issues, randomly, while doing inference (I cannot access the Discord link, btw):

    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/tmp/.xdg_cache_gbenegas/huggingface/modules/transformers_modules/LongSafari/hyenadna-large-1m-seqlen-hf/8eb99a87c0bbaf0fec9346d72c60360c3a5b9e33/modeling_hyena.py", line 158, in forward
    y = fftconv(x, k, bias)
        ^^^^^^^^^^^^^^^^^^^
  File "/tmp/.xdg_cache_gbenegas/huggingface/modules/transformers_modules/LongSafari/hyenadna-large-1m-seqlen-hf/8eb99a87c0bbaf0fec9346d72c60360c3a5b9e33/modeling_hyena.py", line 26, in fftconv
    y = torch.fft.irfft(u_f * k_f, n=fft_size, norm='forward')[..., :seqlen]
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: cuFFT error: CUFFT_INTERNAL_ERROR

salvatoreloguercio commented 4 months ago

No. I ended up using the model for fewer iterations, then reloading the image. Just a workaround.
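(Editorial note, not from the thread: that workaround can be automated. Running each batch of records in a fresh subprocess guarantees the CUDA context, allocator cache, and any cuFFT state are torn down when the child exits. A minimal sketch of the pattern; the helper name is hypothetical:)

```python
import subprocess
import sys


def run_in_fresh_process(code: str) -> int:
    """Execute a Python snippet in a child process and return its exit code.

    Any GPU memory held by the child (allocator cache, cuFFT plans) is
    released by the driver when the process exits, so each call starts clean.
    """
    proc = subprocess.run([sys.executable, "-c", code])
    return proc.returncode
```

For example, the embedding loop could be split into chunks of a few hundred records, each chunk handled by one `run_in_fresh_process(...)` call that loads the model, embeds its slice of records, and writes the embeddings to disk. The cost is reloading the model once per chunk, but memory growth can no longer accumulate across chunks.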