RosettaCommons / RFdiffusion

Code for running RFdiffusion

NVTX missing using SE3nv.yml -- Pytorch 2.0 solution #19

Open tmsincomb opened 1 year ago

tmsincomb commented 1 year ago

Device

OS: CentOS Linux 7
GPU: GTX 1080

Issue

Hi! I get the following error running any of the example scripts:

RuntimeError: NVTX functions not installed. Are you sure you have a CUDA build?

When using the current SE3nv.yml, I get the following versions:

pytorch                   1.9.1           cpu_py39hc5866cc_3    conda-forge
torchaudio                0.9.1           py39                  pytorch
torchvision               0.14.1          cpu_py39h39206e8_1    conda-forge
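The cpu_ prefix in those conda build strings is the giveaway: that torch package was built without CUDA support, which is exactly what triggers the NVTX error. As a rough illustration (the helper is_cpu_build is hypothetical, not part of conda or PyTorch), a build string can be classified like this:

```python
def is_cpu_build(build_string: str) -> bool:
    """Heuristic: does a conda build string indicate a CPU-only PyTorch build?

    CUDA-enabled conda builds typically carry a 'cuda' marker
    (e.g. 'py39_cuda11.7_cudnn8.5.0_0'), while CPU-only builds
    contain 'cpu' (e.g. 'cpu_py39hc5866cc_3').
    """
    return "cpu" in build_string and "cuda" not in build_string


# Build string from the environment above -- CPU-only, hence the NVTX error:
print(is_cpu_build("cpu_py39hc5866cc_3"))          # True
# What a CUDA-enabled build string would look like instead:
print(is_cpu_build("py39_cuda11.7_cudnn8.5.0_0"))  # False
```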

Solution

I did a clean install by running pip3 install --force-reinstall torch torchvision torchaudio, which gave:

torch                     2.0.0                    pypi_0    pypi
torchaudio                2.0.1                    pypi_0    pypi
torchvision               0.15.1                   pypi_0    pypi

That seems to run every example without an issue. I've run into problems before with conda installs of PyTorch when not using the most recent version. Is there a known issue keeping RFdiffusion from moving to PyTorch 2.0?

jvend commented 1 year ago

I also ran across this issue, and your solution seems to work (at least on the examples I've tried). Thanks!

jkosinski commented 1 year ago

I got it too, and for me this worked:

conda (or mamba) update --all -c pytorch

broomsday commented 1 year ago

OS: Fedora36 GPU: gtx 1080Ti

I have CUDA 11.8, but your solution worked after I modified the SE3nv.yml to have:

- cudatoolkit=11.7
- dgl-cuda11.7

Note that I had to accept a toolkit one version lower than my installed CUDA because there is currently no dgl-cuda11.8 package.

Faezov commented 1 year ago

The solution pip3 install --force-reinstall torch torchvision torchaudio worked for me too, on an RTX 4090 on Ubuntu 22.04. I was getting a slightly different error, though:

Traceback (most recent call last):
  File "/big18TB/apps/RF/RFdiffusion/./scripts/run_inference.py", line 94, in main
    px0, x_t, seq_t, plddt = sampler.sample_step(
  File "/big18TB/apps/RF/RFdiffusion/rfdiffusion/inference/model_runners.py", line 664, in sample_step
    msa_prev, pair_prev, px0, state_prev, alpha, logits, plddt = self.model(msa_masked,
  File "/home/bulat/anaconda3/envs/SE3nv/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1051, in _call_impl
    return forward_call(*input, **kwargs)
  File "/big18TB/apps/RF/RFdiffusion/rfdiffusion/RoseTTAFoldModel.py", line 102, in forward
    msa, pair, R, T, alpha_s, state = self.simulator(seq, msa_latent, msa_full, pair, xyz[:,:,:3],
  File "/home/bulat/anaconda3/envs/SE3nv/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1051, in _call_impl
    return forward_call(*input, **kwargs)
  File "/big18TB/apps/RF/RFdiffusion/rfdiffusion/Track_module.py", line 420, in forward
    msa_full, pair, R_in, T_in, state, alpha = self.extra_block[i_m](msa_full,
  File "/home/bulat/anaconda3/envs/SE3nv/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1051, in _call_impl
    return forward_call(*input, **kwargs)
  File "/big18TB/apps/RF/RFdiffusion/rfdiffusion/Track_module.py", line 332, in forward
    R, T, state, alpha = self.str2str(msa, pair, R_in, T_in, xyz, state, idx, motif_mask=motif_mask, top_k=0)
  File "/home/bulat/anaconda3/envs/SE3nv/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1051, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/bulat/anaconda3/envs/SE3nv/lib/python3.9/site-packages/torch/cuda/amp/autocast_mode.py", line 141, in decorate_autocast
    return func(*args, **kwargs)
  File "/big18TB/apps/RF/RFdiffusion/rfdiffusion/Track_module.py", line 266, in forward
    shift = self.se3(G, node.reshape(BL, -1, 1), l1_feats, edge_feats)
  File "/home/bulat/anaconda3/envs/SE3nv/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1051, in _call_impl
    return forward_call(*input, **kwargs)
  File "/big18TB/apps/RF/RFdiffusion/rfdiffusion/SE3_network.py", line 83, in forward
    return self.se3(G, node_features, edge_features)
  File "/home/bulat/anaconda3/envs/SE3nv/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1051, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/bulat/anaconda3/envs/SE3nv/lib/python3.9/site-packages/se3_transformer-1.0.0-py3.9.egg/se3_transformer/model/transformer.py", line 140, in forward
    basis = basis or get_basis(graph.edata['rel_pos'], max_degree=self.max_degree, compute_gradients=False,
  File "/home/bulat/anaconda3/envs/SE3nv/lib/python3.9/site-packages/se3_transformer-1.0.0-py3.9.egg/se3_transformer/model/basis.py", line 167, in get_basis
    spherical_harmonics = get_spherical_harmonics(relative_pos, max_degree)
  File "/home/bulat/anaconda3/envs/SE3nv/lib/python3.9/site-packages/se3_transformer-1.0.0-py3.9.egg/se3_transformer/model/basis.py", line 58, in get_spherical_harmonics
    sh = o3.spherical_harmonics(all_degrees, relative_pos, normalize=True)
  File "/home/bulat/anaconda3/envs/SE3nv/lib/python3.9/site-packages/e3nn/o3/_spherical_harmonics.py", line 180, in spherical_harmonics
    return sh(x)
  File "/home/bulat/anaconda3/envs/SE3nv/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1051, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/bulat/anaconda3/envs/SE3nv/lib/python3.9/site-packages/e3nn/o3/_spherical_harmonics.py", line 82, in forward
    sh = _spherical_harmonics(self._lmax, x[..., 0], x[..., 1], x[..., 2])
RuntimeError: nvrtc: error: invalid value for --gpu-architecture (-arch)

nvrtc compilation failed:

#define NAN __int_as_float(0x7fffffff)
#define POS_INFINITY __int_as_float(0x7f800000)
#define NEG_INFINITY __int_as_float(0xff800000)

template<typename T>
__device__ T maximum(T a, T b) { return isnan(a) ? a : (a > b ? a : b); }

template<typename T>
__device__ T minimum(T a, T b) { return isnan(a) ? a : (a < b ? a : b); }

extern "C" __global__ void fused_pow_pow_pow_su_9196483836509741110(float* tz_1, float* ty_1, float* tx_1, float* aten_mul, float* aten_mul_1, float* aten_mul_2, float* aten_sub, float* aten_add, float* aten_mul_3, float* aten_pow) { {
  if (512 * blockIdx.x + threadIdx.x < 22350 ? 1 : 0) {
    float ty_1_1 = __ldg(ty_1 + 3 * (512 * blockIdx.x + threadIdx.x));
    aten_pow[512 * blockIdx.x + threadIdx.x] = ty_1_1 * ty_1_1;
    float tz_1_1 = __ldg(tz_1 + 3 * (512 * blockIdx.x + threadIdx.x));
    float tx_1_1 = __ldg(tx_1 + 3 * (512 * blockIdx.x + threadIdx.x));
    aten_mul_3[512 * blockIdx.x + threadIdx.x] = (float)((double)(tz_1_1 * tz_1_1 - tx_1_1 * tx_1_1) * 0.8660254037844386);
    aten_add[512 * blockIdx.x + threadIdx.x] = tx_1_1 * tx_1_1 + tz_1_1 * tz_1_1;
    aten_sub[512 * blockIdx.x + threadIdx.x] = ty_1_1 * ty_1_1 - (float)((double)(tx_1_1 * tx_1_1 + tz_1_1 * tz_1_1) * 0.5);
    aten_mul_2[512 * blockIdx.x + threadIdx.x] = (float)((double)(ty_1_1) * 1.732050807568877) * tz_1_1;
    aten_mul_1[512 * blockIdx.x + threadIdx.x] = (float)((double)(tx_1_1) * 1.732050807568877) * ty_1_1;
    aten_mul[512 * blockIdx.x + threadIdx.x] = (float)((double)(tx_1_1) * 1.732050807568877) * tz_1_1;
  }
} }
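The nvrtc failure above happens because the compiler bundled with the old torch build predates the RTX 4090's sm_89 (Ada) architecture; nvrtc support for sm_89 first shipped with CUDA 11.8, so an older runtime rejects the --gpu-architecture value outright. A minimal sketch of that version logic (MIN_CUDA_FOR_SM and can_compile_for are illustrative names, and the table is an assumption drawn from NVIDIA's published support matrix, not an official API):

```python
# Hypothetical lookup: earliest CUDA toolkit that can target each SM arch.
MIN_CUDA_FOR_SM = {
    61: (8, 0),   # Pascal (GTX 1080 / 1080 Ti)
    70: (9, 0),   # Volta
    75: (10, 0),  # Turing
    80: (11, 0),  # Ampere (A100)
    86: (11, 1),  # Ampere (RTX 30xx)
    89: (11, 8),  # Ada (RTX 4090)
}

def can_compile_for(sm: int, cuda_version: tuple) -> bool:
    """True if a CUDA toolkit of `cuda_version` can target arch sm_<sm>."""
    needed = MIN_CUDA_FOR_SM.get(sm)
    return needed is not None and cuda_version >= needed

# An older torch's bundled CUDA (~11.1) handles a GTX 1080 but not an RTX 4090:
print(can_compile_for(61, (11, 1)))  # True
print(can_compile_for(89, (11, 1)))  # False -- the nvrtc error above
print(can_compile_for(89, (11, 8)))  # True  -- why the reinstall fixes it
```

This is also why the same environment worked on older Pascal cards while failing only on the 4090.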

yunfeiyang-go commented 2 months ago

I also came across this problem. It happens because the installed PyTorch is the CPU version, so running conda install -c pytorch pytorch can solve it.