aqlaboratory / openfold

Trainable, memory-efficient, and GPU-friendly PyTorch reproduction of AlphaFold 2
Apache License 2.0

RuntimeError: Error building extension 'evoformer_attn' #452

Closed by agustin-ormazabal 3 months ago

agustin-ormazabal commented 3 months ago

Hello there! I am trying to use the multimer module of OpenFold with precomputed MSAs. As a proof of concept, I am using the example protein from the tutorial, along with its corresponding MSA. At first, the MSAs appear to be detected, since the output states:

INFO:/nfs/research/agb/research/agustin/2024/Mycoplasma_pneumoniae_proteome/OpenFold/openfold/openfold/utils/script_utils.py:Successfully loaded JAX parameters at openfold/resources/params/params_model_1_multimer_v3.npz...
INFO:/nfs/research/agb/research/agustin/2024/Mycoplasma_pneumoniae_proteome/OpenFold/openfold/run_pretrained_openfold.py:Using precomputed alignments for 6KWC at /nfs/research/agb/research/agustin/2024/Mycoplasma_pneumoniae_proteome/OpenFold/All_positives/msas...
INFO:/nfs/research/agb/research/agustin/2024/Mycoplasma_pneumoniae_proteome/OpenFold/openfold/run_pretrained_openfold.py:Using precomputed alignments for 6KWC at /nfs/research/agb/research/agustin/2024/Mycoplasma_pneumoniae_proteome/OpenFold/All_positives/msas...
INFO:/nfs/research/agb/research/agustin/2024/Mycoplasma_pneumoniae_proteome/OpenFold/openfold/run_pretrained_openfold.py:Using precomputed alignments for 6KWC at /nfs/research/agb/research/agustin/2024/Mycoplasma_pneumoniae_proteome/OpenFold/All_positives/msas...
INFO:/nfs/research/agb/research/agustin/2024/Mycoplasma_pneumoniae_proteome/OpenFold/openfold/run_pretrained_openfold.py:Using precomputed alignments for 6KWC at /nfs/research/agb/research/agustin/2024/Mycoplasma_pneumoniae_proteome/OpenFold/All_positives/msas...
INFO:/nfs/research/agb/research/agustin/2024/Mycoplasma_pneumoniae_proteome/OpenFold/openfold/openfold/utils/script_utils.py:Running inference for 6KWC-6KWC-6KWC-6KWC...

I think that the issue starts here:

Using /homes/agustin/.cache/torch_extensions/py310_cu121 as PyTorch extensions root...
Detected CUDA files, patching ldflags
Emitting ninja build file /homes/agustin/.cache/torch_extensions/py310_cu121/evoformer_attn/build.ninja...
/nfs/research/agb/research/agustin/software/localcolabfold/colabfold-conda/lib/python3.10/site-packages/torch/utils/cpp_extension.py:1967: UserWarning: TORCH_CUDA_ARCH_LIST is not set, all archs for visible cards are included for compilation. 
If this is not desired, please set os.environ['TORCH_CUDA_ARCH_LIST'].
  warnings.warn(
Building extension module evoformer_attn...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
Traceback (most recent call last):
  File "/nfs/research/agb/research/agustin/software/localcolabfold/colabfold-conda/lib/python3.10/site-packages/torch/utils/cpp_extension.py", line 2107, in _run_ninja_build
    subprocess.run(
  File "/nfs/research/agb/research/agustin/software/localcolabfold/colabfold-conda/lib/python3.10/subprocess.py", line 526, in run
    raise CalledProcessError(retcode, process.args,
subprocess.CalledProcessError: Command '['ninja', '-v']' returned non-zero exit status 1.

And the final error is:

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/nfs/research/agb/research/agustin/2024/Mycoplasma_pneumoniae_proteome/OpenFold/openfold/run_pretrained_openfold.py", line 493, in <module>
    main(args)
  File "/nfs/research/agb/research/agustin/2024/Mycoplasma_pneumoniae_proteome/OpenFold/openfold/run_pretrained_openfold.py", line 334, in main
    out = run_model(model, processed_feature_dict, tag, args.output_dir)
  File "/nfs/research/agb/research/agustin/2024/Mycoplasma_pneumoniae_proteome/OpenFold/openfold/openfold/utils/script_utils.py", line 160, in run_model
    out = model(batch)
  File "/nfs/research/agb/research/agustin/software/localcolabfold/colabfold-conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/nfs/research/agb/research/agustin/software/localcolabfold/colabfold-conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
    return forward_call(*args, **kwargs)
  File "/nfs/research/agb/research/agustin/2024/Mycoplasma_pneumoniae_proteome/OpenFold/openfold/openfold/model/model.py", line 568, in forward
    outputs, m_1_prev, z_prev, x_prev, early_stop = self.iteration(
  File "/nfs/research/agb/research/agustin/2024/Mycoplasma_pneumoniae_proteome/OpenFold/openfold/openfold/model/model.py", line 325, in iteration
    template_embeds = self.embed_templates(
  File "/nfs/research/agb/research/agustin/2024/Mycoplasma_pneumoniae_proteome/OpenFold/openfold/openfold/model/model.py", line 142, in embed_templates
    template_embeds = self.template_embedder(
  File "/nfs/research/agb/research/agustin/software/localcolabfold/colabfold-conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/nfs/research/agb/research/agustin/software/localcolabfold/colabfold-conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
    return forward_call(*args, **kwargs)
  File "/nfs/research/agb/research/agustin/2024/Mycoplasma_pneumoniae_proteome/OpenFold/openfold/openfold/model/embedders.py", line 969, in forward
    t = self.template_pair_stack(
  File "/nfs/research/agb/research/agustin/software/localcolabfold/colabfold-conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/nfs/research/agb/research/agustin/software/localcolabfold/colabfold-conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
    return forward_call(*args, **kwargs)
  File "/nfs/research/agb/research/agustin/2024/Mycoplasma_pneumoniae_proteome/OpenFold/openfold/openfold/model/template.py", line 461, in forward
    t, = checkpoint_blocks(
  File "/nfs/research/agb/research/agustin/2024/Mycoplasma_pneumoniae_proteome/OpenFold/openfold/openfold/utils/checkpointing.py", line 85, in checkpoint_blocks
    return exec(blocks, args)
  File "/nfs/research/agb/research/agustin/2024/Mycoplasma_pneumoniae_proteome/OpenFold/openfold/openfold/utils/checkpointing.py", line 72, in exec
    a = wrap(block(*a))
  File "/nfs/research/agb/research/agustin/software/localcolabfold/colabfold-conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/nfs/research/agb/research/agustin/software/localcolabfold/colabfold-conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
    return forward_call(*args, **kwargs)
  File "/nfs/research/agb/research/agustin/2024/Mycoplasma_pneumoniae_proteome/OpenFold/openfold/openfold/model/template.py", line 308, in forward
    single = self.tri_att_start_end(single=self.tri_mul_out_in(single=single,
  File "/nfs/research/agb/research/agustin/2024/Mycoplasma_pneumoniae_proteome/OpenFold/openfold/openfold/model/template.py", line 223, in tri_att_start_end
    self.tri_att_start(
  File "/nfs/research/agb/research/agustin/software/localcolabfold/colabfold-conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/nfs/research/agb/research/agustin/software/localcolabfold/colabfold-conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
    return forward_call(*args, **kwargs)
  File "/nfs/research/agb/research/agustin/2024/Mycoplasma_pneumoniae_proteome/OpenFold/openfold/openfold/model/triangular_attention.py", line 131, in forward
    x = self._chunk(
  File "/nfs/research/agb/research/agustin/2024/Mycoplasma_pneumoniae_proteome/OpenFold/openfold/openfold/model/triangular_attention.py", line 77, in _chunk
    return chunk_layer(
  File "/nfs/research/agb/research/agustin/2024/Mycoplasma_pneumoniae_proteome/OpenFold/openfold/openfold/utils/chunk_utils.py", line 299, in chunk_layer
    output_chunk = layer(**chunks)
  File "/nfs/research/agb/research/agustin/software/localcolabfold/colabfold-conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/nfs/research/agb/research/agustin/software/localcolabfold/colabfold-conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
    return forward_call(*args, **kwargs)
  File "/nfs/research/agb/research/agustin/2024/Mycoplasma_pneumoniae_proteome/OpenFold/openfold/openfold/model/primitives.py", line 531, in forward
    o = _deepspeed_evo_attn(q, k, v, biases)
  File "/nfs/research/agb/research/agustin/2024/Mycoplasma_pneumoniae_proteome/OpenFold/openfold/openfold/model/primitives.py", line 692, in _deepspeed_evo_attn
    o = DS4Sci_EvoformerAttention(q.to(dtype=torch.bfloat16),
  File "/nfs/research/agb/research/agustin/software/localcolabfold/colabfold-conda/lib/python3.10/site-packages/deepspeed/ops/deepspeed4science/evoformer_attn.py", line 106, in DS4Sci_EvoformerAttention
    return EvoformerFusedAttention.apply(Q, K, V, biases[0], biases[1])
  File "/nfs/research/agb/research/agustin/software/localcolabfold/colabfold-conda/lib/python3.10/site-packages/torch/autograd/function.py", line 598, in apply
    return super().apply(*args, **kwargs)  # type: ignore[misc]
  File "/nfs/research/agb/research/agustin/software/localcolabfold/colabfold-conda/lib/python3.10/site-packages/deepspeed/ops/deepspeed4science/evoformer_attn.py", line 71, in forward
    o, lse = _attention(q, k, v, bias1_, bias2_)
  File "/nfs/research/agb/research/agustin/software/localcolabfold/colabfold-conda/lib/python3.10/site-packages/deepspeed/ops/deepspeed4science/evoformer_attn.py", line 24, in _attention
    kernel_ = EvoformerAttnBuilder().load()
  File "/nfs/research/agb/research/agustin/software/localcolabfold/colabfold-conda/lib/python3.10/site-packages/deepspeed/ops/op_builder/builder.py", line 480, in load
    return self.jit_load(verbose)
  File "/nfs/research/agb/research/agustin/software/localcolabfold/colabfold-conda/lib/python3.10/site-packages/deepspeed/ops/op_builder/builder.py", line 524, in jit_load
    op_module = load(name=self.name,
  File "/nfs/research/agb/research/agustin/software/localcolabfold/colabfold-conda/lib/python3.10/site-packages/torch/utils/cpp_extension.py", line 1309, in load
    return _jit_compile(
  File "/nfs/research/agb/research/agustin/software/localcolabfold/colabfold-conda/lib/python3.10/site-packages/torch/utils/cpp_extension.py", line 1719, in _jit_compile
    _write_ninja_file_and_build_library(
  File "/nfs/research/agb/research/agustin/software/localcolabfold/colabfold-conda/lib/python3.10/site-packages/torch/utils/cpp_extension.py", line 1832, in _write_ninja_file_and_build_library
    _run_ninja_build(
  File "/nfs/research/agb/research/agustin/software/localcolabfold/colabfold-conda/lib/python3.10/site-packages/torch/utils/cpp_extension.py", line 2123, in _run_ninja_build
    raise RuntimeError(message) from e
RuntimeError: Error building extension 'evoformer_attn'

What could be causing this? I installed OpenFold following the tutorial.
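
The traceback above only reports that `ninja -v` exited with status 1; the compiler message that explains the failure is not part of it. One way to surface that message, sketched here under the assumption that the build directory named in the log still exists and that the same environment modules are loaded, is to re-run ninja in that directory and print the tail of its output. The helper name `rerun_ninja` is hypothetical and is not part of OpenFold, DeepSpeed, or PyTorch.

```python
import shutil
import subprocess
from pathlib import Path


def rerun_ninja(build_dir, tail=40):
    """Re-run `ninja -v` in build_dir.

    Returns (exit_code, last `tail` lines of combined stdout/stderr),
    or None if ninja is not on PATH or build_dir has no build.ninja.
    """
    build_dir = Path(build_dir).expanduser()
    if shutil.which("ninja") is None or not (build_dir / "build.ninja").is_file():
        return None
    proc = subprocess.run(
        ["ninja", "-v"],
        cwd=build_dir,
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,  # compiler errors usually go to stderr
        text=True,
    )
    lines = proc.stdout.splitlines()
    return proc.returncode, "\n".join(lines[-tail:])


if __name__ == "__main__":
    # Build directory taken from the "Emitting ninja build file" log line.
    result = rerun_ninja(
        "/homes/agustin/.cache/torch_extensions/py310_cu121/evoformer_attn"
    )
    if result is None:
        print("ninja or build.ninja not found; nothing to re-run")
    else:
        code, output = result
        print(f"ninja exit status: {code}")
        print(output)
```

The last lines of the output normally contain the failing compiler command and its error text, which is usually more informative than the Python traceback.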

agustin-ormazabal commented 3 months ago

Dear all, I solved the issue. It was related to the CUDA and/or cuDNN libraries I was loading. Initially, I was loading these two:

module load cuda/11.8.0
module load cudnn/8.6.0.163-11.8

However, the issue was resolved once I also loaded these three additional modules:

module load cudnn-8.0.4.30-11.1-gcc-9.3.0-bbr3kjv
module load ant-1.10.0-gcc-9.3.0-xzxbcc6
module load gcc/11.2.0
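
Since the fix came from changing which modules were loaded, it can help to confirm which host compiler and CUDA toolkit are actually on PATH before the extension is built. The following is a minimal sketch using only the Python standard library; the function names are hypothetical, and it only reports versions. It does not determine which of the modules above was the decisive one, which the thread does not state.

```python
import re
import shutil
import subprocess


def version_output(tool):
    """Return the combined output of `<tool> --version`, or None if not on PATH."""
    path = shutil.which(tool)
    if path is None:
        return None
    proc = subprocess.run(
        [path, "--version"],
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,
        text=True,
    )
    return proc.stdout


def parse_gcc_version(text):
    """Extract (major, minor, patch) from the first line of `gcc --version` output."""
    lines = (text or "").splitlines()
    first_line = lines[0] if lines else ""
    m = re.search(r"(\d+)\.(\d+)\.(\d+)", first_line)
    return tuple(int(g) for g in m.groups()) if m else None


def parse_nvcc_release(text):
    """Extract (major, minor) from the 'release X.Y' part of `nvcc --version` output."""
    m = re.search(r"release (\d+)\.(\d+)", text or "")
    return tuple(int(g) for g in m.groups()) if m else None


if __name__ == "__main__":
    for tool, parser in (("gcc", parse_gcc_version), ("nvcc", parse_nvcc_release)):
        out = version_output(tool)
        if out is None:
            print(f"{tool}: not found on PATH")
        else:
            print(f"{tool}: {shutil.which(tool)} version {parser(out)}")
```

Running this before and after the extra `module load` commands shows whether the compiler picked up by the build has actually changed.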