3dem / DynaMight

Tool for reconstruction and analysis of continuous heterogeneity of a cryo-EM dataset

NameError: name 'is_relion_abort' is not defined #7

Open heejongkim opened 10 months ago

heejongkim commented 10 months ago

Hi,

While trying to run "Estimating inverse deformations", I get the following error immediately at the "Assigning a diameter" stage of the iteration.

NameError: name 'is_relion_abort' is not defined

Primary job terminated normally, but 1 process returned a non-zero exit code. Per user-direction, the job has been aborted.

mpirun detected that one or more processes exited with non-zero status, thus causing the job to be terminated. The first process to do so was:

  Process name: [[54749,1],0]
  Exit code: 2

It seems the issue comes from the following line: https://github.com/3dem/DynaMight/blob/86870f30f79a22ebbb05a40220226911ec64fcda/dynamight/inverse_deformations/optimize_inverse_deformations.py#L236

I'm not sure whether it's simply a missing definition/import, or whether it points to a deeper problem on my side, since the call sits inside an "except:" block.
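
To illustrate what I think is happening (the snippet below is a generic, self-contained demo with made-up names, not DynaMight's actual code), the handler itself crashes before it can do anything useful with the original error:

```python
# Minimal demo of the suspected failure mode (names here are hypothetical,
# not DynaMight's actual code): an undefined name inside an `except:` block
# makes the handler itself crash with a NameError instead of performing
# whatever abort/cleanup handling was intended.

def run_epoch():
    # Stand-in for the real optimization step; pretend it hits a CUDA OOM.
    raise RuntimeError("CUDA out of memory (simulated)")

try:
    run_epoch()
except Exception:
    # `is_relion_abort` is never defined or imported in this scope, so this
    # line raises a NameError while handling the RuntimeError above.
    if is_relion_abort("output_dir"):
        raise SystemExit(2)
```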

If you need any additional logs that may help, please let me know.

Thank you so much.

best, heejong

heejongkim commented 10 months ago

In the same context, I found the following error too.

envs/relion-5.0/lib/python3.10/site-packages/torch/autograd/__init__.py", line 200, in backward
    Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 704.00 MiB (GPU 0; 10.75 GiB total capacity; 9.66 GiB already allocated; 638.44 MiB free; 9.92 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

However, I saw the same out-of-memory error during the estimation step (step 1), where the batch size was automatically reduced and the job proceeded. I wonder why that didn't happen in this step?
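
For reference, the allocator hint from the message can be set before any CUDA allocation happens (the value below is only an example, and I'm not sure fragmentation is the real problem here rather than the batch size):

```python
# The error message itself points at max_split_size_mb / PYTORCH_CUDA_ALLOC_CONF.
# Minimal sketch of setting it (128 MiB is only an example value); it has to be
# set before PyTorch initializes CUDA, i.e. before the first allocation.
import os
os.environ.setdefault("PYTORCH_CUDA_ALLOC_CONF", "max_split_size_mb:128")

import torch

if torch.cuda.is_available():
    free, total = torch.cuda.mem_get_info(0)  # free/total device memory in bytes
    print(f"GPU 0: {free / 1024**3:.2f} GiB free of {total / 1024**3:.2f} GiB")
```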

heejongkim commented 9 months ago

I recently got a chance to revisit this and was able to resolve it by lowering the batch size to fit within the available VRAM. Unlike the motion-estimation step, this step apparently does not adjust the batch size dynamically.
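
For context, the generic pattern I would expect is sketched below (made-up function names, not DynaMight's implementation): catch the CUDA OOM, halve the batch size, and retry instead of letting the exception kill the job.

```python
# Rough sketch of a dynamic OOM fallback (NOT DynaMight's actual code, just the
# generic pattern): on a CUDA out-of-memory error, halve the batch size and retry.
import torch

def run_with_oom_fallback(step_fn, batch_size, min_batch_size=1):
    """Call step_fn(batch_size); on CUDA OOM, halve the batch size and retry."""
    while batch_size >= min_batch_size:
        try:
            return step_fn(batch_size)
        except torch.cuda.OutOfMemoryError:
            torch.cuda.empty_cache()  # release cached blocks before retrying
            print(f"CUDA OOM at batch size {batch_size}, retrying with {batch_size // 2}")
            batch_size //= 2
    raise RuntimeError("Out of memory even at the minimum batch size")
```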

After finishing that, I'm encountering a new issue with the deformed backprojection, which may or may not be connected to the inverse deformations. When I resumed the job with a backprojection batch size of 2, it reached "start deformable_backprojection of half 1" and then failed at the beginning of the loop, more specifically at https://github.com/3dem/DynaMight/blob/616360b790febf56edf08aef5d4c414058194376/dynamight/deformable_backprojection/backprojection_utils.py#L352, without printing any error message.

I wonder whether this is caused by a problem with inv_chkpt.pth (it is only 485K), or whether it's a separate issue.
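
In case it helps, this is the kind of generic check I can run on that checkpoint (plain PyTorch, nothing DynaMight-specific; the path is just where my job wrote the file):

```python
# List what was saved in the checkpoint and the tensor shapes, to judge whether
# a 485K file looks complete or truncated. Adjust the path to your job directory.
import torch

ckpt = torch.load("inv_chkpt.pth", map_location="cpu")
if isinstance(ckpt, dict):
    for key, value in ckpt.items():
        if hasattr(value, "shape"):
            print(key, tuple(value.shape))
        else:
            print(key, type(value).__name__)
else:
    print(type(ckpt).__name__)
```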

If you need any additional information to narrow down the source of trouble, please let me know.

Thank you.

best, heejong