Zuricho / ParallelFold

Modified version of AlphaFold that separates the CPU part (MSA and template searching) from the GPU part. This can accelerate AlphaFold when predicting multiple structures.
https://parafold.sjtu.edu.cn
133 stars 45 forks

failed to alloc 2147483648 bytes on host: CUDA_ERROR_OPERATING_SYSTEM: OS call failed or operation not supported on this OS #23

Open yanchenmochen opened 2 years ago

yanchenmochen commented 2 years ago

When I run the code on T1050.fasta, which is composed of 700 residues, the command-line output shows this error. The environment is an A100 GPU on Ubuntu, but I am using newer versions of jax and jaxlib. Could that be causing the problem?

(parafold) root@node33-a100:~# pip list | grep jax
jax 0.3.15
jaxlib 0.3.15+cuda11.cudnn82
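The same version check can be done from inside the Python environment, which helps when comparing two machines. This is a minimal sketch using only the standard library; the helper names `installed_version` and `report_versions` are hypothetical and not part of ParallelFold.

```python
from importlib import metadata


def installed_version(package):
    """Return the installed version string of `package`, or None if it is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def report_versions(packages=("jax", "jaxlib")):
    """Map each package name to its installed version (None when not installed)."""
    return {name: installed_version(name) for name in packages}


if __name__ == "__main__":
    for name, version in report_versions().items():
        print(f"{name}: {version or 'not installed'}")
```

Running this on both the failing and the working machine gives a direct comparison of the jax and jaxlib builds.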

yanchenmochen commented 2 years ago

2022-08-17 11:26:20.226278: E external/org_tensorflow/tensorflow/stream_executor/cuda/cuda_driver.cc:796] failed to alloc 12524123136 bytes on host: CUDA_ERROR_OPERATING_SYSTEM: OS call failed or operation not supported on this OS
2022-08-17 11:26:20.226316: W external/org_tensorflow/tensorflow/core/common_runtime/device/device_host_allocator.h:46] could not allocate pinned host memory of size: 12524123136
2022-08-17 11:26:23.693074: E external/org_tensorflow/tensorflow/stream_executor/cuda/cuda_driver.cc:796] failed to alloc 11271710720 bytes on host: CUDA_ERROR_OPERATING_SYSTEM: OS call failed or operation not supported on this OS
2022-08-17 11:26:23.693112: W external/org_tensorflow/tensorflow/core/common_runtime/device/device_host_allocator.h:46] could not allocate pinned host memory of size: 11271710720
2022-08-17 11:26:28.900144: E external/org_tensorflow/tensorflow/stream_executor/cuda/cuda_driver.cc:796] failed to alloc 17179869184 bytes on host: CUDA_ERROR_OPERATING_SYSTEM: OS call failed or operation not supported on this OS
2022-08-17 11:26:28.900185: W external/org_tensorflow/tensorflow/core/common_runtime/device/device_host_allocator.h:46] could not allocate pinned host memory of size: 17179869184
2022-08-17 11:26:44.115027: E external/org_tensorflow/tensorflow/stream_executor/cuda/cuda_driver.cc:796] failed to alloc 17179869184 bytes on host: CUDA_ERROR_OPERATING_SYSTEM: OS call failed or operation not supported on this OS
2022-08-17 11:26:44.115072: W external/org_tensorflow/tensorflow/core/common_runtime/device/device_host_allocator.h:46] could not allocate pinned host memory of size: 17179869184
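To put the byte counts in the log into perspective, the failed request sizes can be extracted and converted to GiB. The following is a small sketch, not part of ParallelFold; the function name `failed_host_allocs_gib` is hypothetical, and the regular expression only assumes the "failed to alloc N bytes on host" wording seen in the log above.

```python
import re

# Matches the allocation-failure message format seen in the log above.
ALLOC_FAILURE = re.compile(r"failed to alloc (\d+) bytes on host")


def failed_host_allocs_gib(log_text):
    """Return the size of each failed host allocation in GiB, in log order."""
    return [int(size) / 2**30 for size in ALLOC_FAILURE.findall(log_text)]


if __name__ == "__main__":
    # Message fragments copied from the log above (timestamps and paths omitted).
    sample = "\n".join([
        "failed to alloc 12524123136 bytes on host: CUDA_ERROR_OPERATING_SYSTEM",
        "failed to alloc 11271710720 bytes on host: CUDA_ERROR_OPERATING_SYSTEM",
        "failed to alloc 17179869184 bytes on host: CUDA_ERROR_OPERATING_SYSTEM",
    ])
    for gib in failed_host_allocs_gib(sample):
        print(f"{gib:.2f} GiB")
```

The largest request, 17179869184 bytes, is exactly 16 GiB of pinned host memory, so the failures concern host RAM rather than GPU memory.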

Zuricho commented 2 years ago

I'm not sure about this. Maybe it's the jax version issue, as you said, but I haven't run into this before.

yanchenmochen commented 2 years ago

I switched to another machine to run the protein prediction, and I think it works correctly now. Maybe jaxlib was causing the problem, but the original Linux machine is also shared by many staff members.
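Since the thread did not determine the cause, one thing that could be compared between the shared machine and the working one is the host memory situation. The sketch below is an assumption-laden diagnostic, not a confirmed fix: it assumes Linux, and it assumes that the per-process locked-memory limit and the currently available RAM are relevant to pinned host allocations on that system. The helper names are hypothetical.

```python
import resource


def memlock_limits():
    """Return the (soft, hard) locked-memory limits in bytes; None means unlimited."""
    soft, hard = resource.getrlimit(resource.RLIMIT_MEMLOCK)
    to_bytes = lambda v: None if v == resource.RLIM_INFINITY else v
    return to_bytes(soft), to_bytes(hard)


def mem_available_bytes(meminfo_path="/proc/meminfo"):
    """Return MemAvailable from a Linux meminfo file in bytes, or None if unreadable."""
    try:
        with open(meminfo_path) as fh:
            for line in fh:
                if line.startswith("MemAvailable:"):
                    # The value is reported in kB, e.g. "MemAvailable:  123456 kB".
                    return int(line.split()[1]) * 1024
    except OSError:
        return None
    return None


if __name__ == "__main__":
    soft, hard = memlock_limits()
    print("memlock soft limit:", "unlimited" if soft is None else f"{soft} bytes")
    print("memlock hard limit:", "unlimited" if hard is None else f"{hard} bytes")
    available = mem_available_bytes()
    if available is not None:
        print(f"MemAvailable: {available / 2**30:.2f} GiB")
```

If the available memory on the shared machine is well below the 16 GiB requested in the log, contention from other users would be a plausible explanation alongside the jaxlib version.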