Testing DynaMight implementation in Relion - Qt error

ColdPopeye commented 1 year ago

I am trying the DynaMight job implemented in Relion 5 on a symmetric dataset and I am running into an error that does not give enough explanation for me to solve it myself. Is there any way to continue a stopped DynaMight run from epoch XX? The Qt error, which I am not sure how to solve, seems to occur at random times: once almost at the start (epoch ~3, which led me to re-install Qt with sudo apt-get install qtbase5-dev qtchooser qt5-qmake qtbase5-dev-tools just in case) and a second time at epoch 23.

Is there something I am missing with the Qt install on this machine?

Environment:

OS: Ubuntu 22.04
MPI runtime: 4.1.2
RELION version 5
Memory: 504G
GPU: 4x RTX A4000
Cuda: 12.1
Driver: 530.30.02

Dataset:

Box size: 550 - Dynamight seems to bin it automatically to 136
Pixel size: 0.73
Number of particles: 164k
Description: C13 ~900kDa

Job options:

Type of job: DynaMight
Number of MPI processes: -
Number of threads: 4
Full command (see note.txt in the job directory):

 relion_python_dynamight optimize-deformations  --refinement-star-file Refine3D/job213/run_data.star --output-directory DynaMight/job223/ --initial-model Refine3D/job213/run_class001.mrc --n-gaussians 10000 --initial-threshold 0.0025 --regularization-factor 1 --n-threads 4 --gpu-id 3  --pipeline-control DynaMight/job223/

**Error message:**

In the run.out:
Epoch: 23 Epoch time: 903.3148272037506
----------------------------------------------
new regularization parameter for half 1 is  0.19544872978197358
new regularization parameter for half 2 is  0.17584381904016366
100%|##########| 579/579 [05:38<00:00,  1.71it/s]
updated consensus model of half 1
100%|##########| 579/579 [05:37<00:00,  1.72it/s]
updated consensus model of half 2
**Cannot load backend 'QtAgg' which requires the 'qt' interactive framework, as 'headless' is currently running**

In the run.err I get two warnings: one related to how Dynamight bins the sample (I am not sure why it automatically bins differently; I will try binning the data myself before the next run), and a deprecation warning from PyTorch:

/home/relion/miniconda3/envs/relion-5.0/lib/python3.10/site-packages/dynamight/models/decoder.py:233: UserWarning: Using a target size (torch.Size([275, 275, 275])) that is different to the input size (torch.Size([274, 274, 274])). This will likely lead to incorrect results due to broadcasting. Please ensure they have the same size. loss = torch.nn.functional.mse_loss(

/home/relion/miniconda3/envs/relion-5.0/lib/python3.10/site-packages/dynamight/models/decoder.py:233: UserWarning: Using a target size (torch.Size([137, 137, 137])) that is different to the input size (torch.Size([136, 136, 136])). This will likely lead to incorrect results due to broadcasting. Please ensure they have the same size.

/home/relion/miniconda3/envs/relion-5.0/lib/python3.10/site-packages/torch/nn/functional.py:3737: UserWarning: nn.functional.upsample is deprecated. Use nn.functional.interpolate instead.

schwabjohannes commented 1 year ago

Did you run a continuation job with the visualization, when this error appeared? Dynamight only downsamples the output volumes that are saved, because of gpu memory.

At the moment continuation from epoch XX is not possible, but this will be added soon.

ColdPopeye commented 1 year ago

> Did you run a continuation job with the visualization, when this error appeared? Dynamight only downsamples the output volumes that are saved, because of gpu memory.
>
> At the moment continuation from epoch XX is not possible, but this will be added soon.

Yes, visualization on the last epoch was fine and I am now running the inverse-deformation estimation. I also tried to run a similar job outside of relion (I saw that I can add a mask, so I added one as an extra):

(from the conda environment relion-5.0)

dynamight optimize-deformations --refinement-star-file Refine3D/job213/run_data.star --output-directory DynaMight/job224/ --initial-model Refine3D/job213/run_class001.mrc --n-gaussians 10000 --mask-file mask_job213_run_class001.mrc --gpu-id 1

But the result was the same Qt error, again at different points: once at Epoch 27 and once at Epoch 1:

Cannot load backend 'QtAgg' which requires the 'qt' interactive framework, as 'headless' is currently running

ColdPopeye commented 1 year ago

Installed Xvfb and ran:

Xvfb :1 -screen 0 1280x1024x24 &
export DISPLAY=:1
DISPLAY=:1 dynamight optimize-deformations --refinement-star-file Refine3D/job213/run_data.star --output-directory DynaMight/job224/ --initial-model Refine3D/job213/run_class001.mrc --n-gaussians 10000 --mask-file mask_job213_run_class001.mrc --gpu-id 1

This seems to prevent it from crashing on my server. I still have no idea why it needs a fake display during the run; I assume it is writing some files that for some reason require a display? I am not very familiar with Qt.
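
For anyone hitting the same message: the "Cannot load backend 'QtAgg'" text is the error Matplotlib raises when a Qt backend is requested in a process with no display. A possible alternative to a virtual display is to ask Matplotlib for its non-interactive Agg backend through the MPLBACKEND environment variable. The sketch below is only an illustration under assumptions: the helper name headless_env is made up here, and this thread does not establish whether DynaMight or one of its dependencies selects QtAgg explicitly, in which case the environment variable would be overridden and Xvfb would still be needed.

```python
import os


def headless_env(base_env=None):
    """Return a copy of the environment that prefers Matplotlib's Agg backend
    when no display is configured.

    Hypothetical helper for illustration; it is not part of DynaMight or RELION.
    """
    env = dict(os.environ if base_env is None else base_env)
    has_display = bool(env.get("DISPLAY") or env.get("WAYLAND_DISPLAY"))
    if not has_display:
        # MPLBACKEND is read by Matplotlib at import time; an existing
        # user-provided value is left untouched.
        env.setdefault("MPLBACKEND", "Agg")
    return env


# Possible usage (not executed here), with the command from this thread:
#   import subprocess
#   subprocess.run(
#       ["dynamight", "optimize-deformations",
#        "--refinement-star-file", "Refine3D/job213/run_data.star",
#        "--output-directory", "DynaMight/job224/",
#        "--initial-model", "Refine3D/job213/run_class001.mrc",
#        "--n-gaussians", "10000",
#        "--mask-file", "mask_job213_run_class001.mrc",
#        "--gpu-id", "1"],
#       env=headless_env(),
#       check=True,
#   )
```

If the crash persists with MPLBACKEND set, that would suggest the Qt backend is being requested explicitly somewhere in the run, and the Xvfb approach above remains the workaround that was observed to help.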

3dem / relion

Testing DynaMight implementation in Relion - Qt error #1005