IsoNet-cryoET / IsoNet

Self-supervised learning for isotropic cryoET reconstruction
https://www.nature.com/articles/s41467-022-33957-8
MIT License

System seemed stopped during refine #35

Open ChrisLoSK opened 1 year ago

ChrisLoSK commented 1 year ago

Hi there,

I am a student new to cryo-EM. I am trying to apply IsoNet to the analysis of my data and I have encountered a problem.

I found that the refine step takes an extremely long time, without any response or error messages.

After 4 hours of running, the Slurm log was still at the Epoch 1/10 stage [see (1) below]. I repeated the run with the official tutorial HIV dataset, using exactly the same commands and parameters as in the tutorial, and got the same problem: the job still appeared to be in Epoch 1/10 even after 15 hours. I then checked the GPUs [nvidia-smi output in (2) below]. The GPUs seem to be idle (0% utilization) while memory is allocated, and no new files were written during the waiting hours.

Would anyone give me some advice? Thank you very much!

Chris

(1) Slurm log -----------------------------------------------------------------------------------

11-25 10:58:34, INFO     Isonet starts refining
11-25 10:58:38, INFO     Start Iteration1!
11-25 10:58:38, WARNING  The results folder already exists
 The old results folder will be renamed (to results~)
11-25 11:00:31, INFO     Noise Level:0.0
11-25 11:01:08, INFO     Done preparing subtomograms!
11-25 11:01:08, INFO     Start training!
11-25 11:01:10, INFO     Loaded model from disk
11-25 11:01:10, INFO     begin fitting
Epoch 1/10
slurm-37178.out (END)

(2) nvidia-smi --------------------------------------------------------

Fri Nov 25 14:38:39 2022
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.32.03    Driver Version: 460.32.03    CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  GeForce RTX 3090    Off  | 00000000:04:00.0 Off |                  N/A |
| 30%   33C    P8    19W / 350W | 17755MiB / 24268MiB  |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  GeForce RTX 3090    Off  | 00000000:43:00.0 Off |                  N/A |
| 30%   32C    P8    20W / 350W | 17755MiB / 24268MiB  |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   2  GeForce RTX 3090    Off  | 00000000:89:00.0 Off |                  N/A |
| 30%   30C    P8    32W / 350W | 17755MiB / 24268MiB  |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   3  GeForce RTX 3090    Off  | 00000000:C4:00.0 Off |                  N/A |
| 30%   30C    P8    25W / 350W | 17755MiB / 24268MiB  |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A   1212666      C   python3                         17747MiB |
|    0   N/A  N/A   3278054      G   /usr/lib/xorg/Xorg                  4MiB |
|    1   N/A  N/A   1212666      C   python3                         17747MiB |
|    1   N/A  N/A   3278054      G   /usr/lib/xorg/Xorg                  4MiB |
|    2   N/A  N/A   1212666      C   python3                         17747MiB |
|    2   N/A  N/A   3278054      G   /usr/lib/xorg/Xorg                  4MiB |
|    3   N/A  N/A   1212666      C   python3                         17747MiB |
|    3   N/A  N/A   3278054      G   /usr/lib/xorg/Xorg                  4MiB |
+-----------------------------------------------------------------------------+

LianghaoZhao commented 1 year ago

It seems like I have the same problem. I wonder if you have solved it?

procyontao commented 1 year ago

I found this similar issue: https://github.com/keras-team/keras/issues/11603, which is related to the cuDNN version. The dependencies should match what is listed here: https://www.tensorflow.org/install/source#gpu
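
For a quick check, a small script along these lines (run in the same conda environment; `tf.sysconfig.get_build_info` needs TensorFlow ≥ 2.3) reports which CUDA/cuDNN versions the installed TensorFlow build expects and whether it can see the GPUs at all:

```python
# Quick environment sanity check (not part of IsoNet): compare the CUDA/cuDNN
# versions this TensorFlow build expects with what the machine provides.
import tensorflow as tf

print("TensorFlow:", tf.__version__)

# Which CUDA/cuDNN versions was this wheel built against?
# Compare with https://www.tensorflow.org/install/source#gpu and nvidia-smi.
build = tf.sysconfig.get_build_info()
print("Built for CUDA:", build.get("cuda_version"))
print("Built for cuDNN:", build.get("cudnn_version"))

# If this list is empty, TensorFlow falls back to CPU, and training can look
# like it is hanging at "Epoch 1/10" while the GPUs sit at 0% utilization.
print("Visible GPUs:", tf.config.list_physical_devices("GPU"))
```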

ChrisLoSK commented 1 year ago

Dear Dr. Liu,

Thank you very much! I have found a slower machine in the laboratory that can run IsoNet properly. Meanwhile, I will compare its driver versions with those on the cluster (which has the problem) to see if we can fix it.

Best regards, Chris

LianghaoZhao commented 1 year ago

I finally found that `conda install cudatoolkit` worked well. It provided the essential libraries for TensorFlow. Besides, I found that changing the log level from "info" to "debug" in isonet.py provided more information.
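
For illustration, a minimal sketch of the kind of change meant here, assuming isonet.py sets up the standard `logging` module; the exact format string and location in IsoNet may differ:

```python
# Hypothetical sketch, not IsoNet's actual code: raise the log verbosity so
# that buried TensorFlow/CUDA errors show up in the Slurm output.
import logging

logging.basicConfig(
    format="%(asctime)s, %(levelname)s\t%(message)s",
    datefmt="%m-%d %H:%M:%S",
    level=logging.DEBUG,  # was logging.INFO
)
```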

abhatta2p commented 1 year ago

Hi all,

A Linux novice here. I have the same issue as OP on a standalone workstation: the "refine" job gets stuck in the first iteration at Epoch 1/10.

Following @LianghaoZhao's comment, I tried "conda install cudatoolkit", but that did not solve the problem. By changing the log level to "debug", though, I could at least identify the issue from the log:

OP_REQUIRES failed at xla_ops.cc:417 : INVALID_ARGUMENT: Trying to access resource Resource-1414-at-0x38b1eb20 located in device /job:localhost/replica:0/task:0/device:GPU:0 from device /job:localhost/replica:0/task:0/device:GPU:3 Cf. https://www.tensorflow.org/xla/known_issues#tfvariable_on_a_different_device

which I assume means GPU 3 is trying to access something on GPU 0. Using only one GPU, I was able to get the refinement to progress (it hasn't finished yet at the time of writing this post), but I am unsure what might be causing this issue with multiple GPUs.
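
For reference, a minimal sketch of the single-GPU workaround I used; the device index 0 is only an example, and the variable has to take effect before TensorFlow initializes its devices (if IsoNet exposes a gpuID option, restricting that to one GPU should have a similar effect):

```python
# Hide all but one GPU from CUDA so TensorFlow never touches the others.
# Must run (or be exported in the shell) before TensorFlow creates devices.
import os
os.environ["CUDA_VISIBLE_DEVICES"] = "0"  # example: use only GPU 0

import tensorflow as tf
print(tf.config.list_physical_devices("GPU"))  # should now list a single GPU
```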

OS: Ubuntu 20.04.5
GPUs: 4x RTX A5000
Nvidia driver version: 515.86.01
CUDA version: 11.2
cuDNN version: 8.1.1
Python version: 3.7.15
TensorFlow version: 2.11.0

I would be more than happy to provide more info/logs for debugging, if needed. I've been having issues with TensorFlow/Keras with DeePict as well, and I wonder if the two issues are somehow related.

Best, Arjun

LianghaoZhao commented 1 year ago

Oh, I finally solved this. I found exactly the same error in the end. `conda install cudatoolkit` is not the correct solution. I found this only happens in a multi-GPU environment; with a single GPU, training worked correctly. I changed the code in train.py to move model.compile into the scope of `with strategy.scope()`, and that solved it. I have forked this repo, and the last commit in my repo is the solution. My CUDA version, cuDNN version and TensorFlow version are the same as yours.
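
For context, a minimal sketch of the pattern behind that fix, assuming the multi-GPU path uses tf.distribute.MirroredStrategy; the network here is a placeholder, not IsoNet's actual model from train.py:

```python
import tensorflow as tf

def build_model():
    # Placeholder network standing in for IsoNet's U-Net in train.py.
    inputs = tf.keras.Input(shape=(64, 64, 64, 1))
    outputs = tf.keras.layers.Conv3D(1, 3, padding="same")(inputs)
    return tf.keras.Model(inputs, outputs)

strategy = tf.distribute.MirroredStrategy()
with strategy.scope():
    model = build_model()
    # The fix: compile() must also happen inside strategy.scope(), so the
    # optimizer's variables are created as mirrored variables on all GPUs.
    # Compiling outside the scope leaves them on GPU:0 only, which produces
    # the "Trying to access resource ... located in device GPU:0 from device
    # GPU:3" error shown above.
    model.compile(optimizer="adam", loss="mae")
```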

procyontao commented 1 year ago

Hi @LianghaoZhao,

Thank you for reporting your bug fix. Would you like to review the code in your fork and create a pull request so that it can be merged into the master branch?

BhattaArjun2p commented 1 year ago


Thank you very much for the bug-fix, @LianghaoZhao. I just tried out the newest commit, and it works just fine with multiple GPUs.