Cellverse / CryoGEM

Apache License 2.0

warnings.warn("Detected call of `lr_scheduler.step()` before `optimizer.step()`. " #1

Open von-elfen opened 1 week ago

von-elfen commented 1 week ago

Hi, I successfully installed CryoGEM and downloaded data.zip, then unzipped it under the testing folder. I tried to reproduce the output of the Ribosome (10028) dataset to test my installation, but something went wrong. I ran the following commands:

# training dataset
cryogem gen_data --mode homo --device cuda:0 \
  --input_map testing/data/exp_abinitio_volumes/densitymap.10028.90.mrc \
  --save_dir save_images/gen_data/Ribosome\(10028\)/training_dataset/ \
  --n_micrographs 100 --particle_size 90 --mask_threshold 0.9

# testing dataset
cryogem gen_data --mode homo --device cuda:0 \
  --input_map testing/data/exp_abinitio_volumes/densitymap.10028.90.mrc \
  --save_dir save_images/gen_data/Ribosome\(10028\)/testing_dataset/ \
  --n_micrographs 1000 --particle_size 90 --mask_threshold 0.9

cryogem esti_ice --apix 5.36 --device cuda:0 \
  --input_dir testing/data/Ribosome\(10028\)/real_data/ \
  --save_dir save_images/esti_ice/Ribosome\(10028\)/

The commands above ran well!

But when I run:

cryogem train --name empair-10028-test --max_dataset_size 100 --apix 5.36 --gpu_ids 0 \
  --real_dir testing/data/Ribosome\(10028\)/real_data/ \
  --sync_dir save_images/gen_data/Ribosome\(10028\)/training_dataset/mics_mrc \
  --mask_dir save_images/gen_data/Ribosome\(10028\)/training_dataset/particles_mask \
  --weight_map_dir save_images/esti_ice/Ribosome\(10028\)/

Loading real_A: 100%|████████████████████████████████████████████████████████████████| 100/100 [00:00<00:00, 210.73it/s]
Loading weight maps: 100%|███████████████████████████████████████████████████████████| 935/935 [00:03<00:00, 245.15it/s]
Loading real_B: 100%|█████████████████████████████████████████████████████████████████| 100/100 [00:01<00:00, 53.52it/s]
(INFO) (cryogem_model.py) (24-Oct-24 21:52:04) Sampler Type: mask_sample
(INFO) (cryogem_model.py) (24-Oct-24 21:52:05) 1,2,3,4,5
model [CryoGEMModel] was created
/home/von/anaconda3/envs/cryogem/lib/python3.11/site-packages/torch/optim/lr_scheduler.py:143: UserWarning: Detected call of `lr_scheduler.step()` before `optimizer.step()`. In PyTorch 1.1.0 and later, you should call them in the opposite order: `optimizer.step()` before `lr_scheduler.step()`. Failure to do this will result in PyTorch skipping the first value of the learning rate schedule. See more details at https://pytorch.org/docs/stable/optim.html#how-to-adjust-learning-rate
  warnings.warn("Detected call of `lr_scheduler.step()` before `optimizer.step()`. "

How can I solve this problem?

BW!

Jiakai-Zhang commented 4 days ago

Hi,

Thanks for your attention! Please try pulling the latest update of this repo. I updated the training code to suppress the warning and to show a training progress bar between iterations, which was missing in the previous version and may have been misleading. The warning should not affect the final performance. Also, please check whether there is any output in checkpoints/empair-10028-test (our example) or your customized output folder.
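For reference, the warning is only about call order. Here is a minimal generic PyTorch sketch (not CryoGEM's actual training loop) of the order PyTorch expects:

import torch

model = torch.nn.Linear(4, 2)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=10, gamma=0.5)

for epoch in range(3):
    optimizer.zero_grad()
    loss = model(torch.randn(8, 4)).sum()
    loss.backward()
    optimizer.step()   # update the weights first...
    scheduler.step()   # ...then advance the LR schedule; reversing these triggers the warning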

Cheers, Jiakai

von-elfen commented 3 days ago

Hello,

Sorry for the confusion. I forgot to copy the error message from the last line: Segmentation fault (core dumped). Here are my full reproduction steps:

git clone

git clone https://github.com/Cellverse/CryoGEM.git
cd CryoGEM

conda env and pip install

conda create -n cryogem python=3.11 -y
conda activate cryogem
pip install -e .

download data.zip and unzip it to ./testing

try to reproduce the following tutorial

1. prepare

cryogem gen_data --mode homo --device cuda:0 \
  --input_map testing/data/exp_abinitio_volumes/densitymap.10028.90.mrc \
  --save_dir save_images/gen_data/Ribosome\(10028\)/training_dataset/ \
  --n_micrographs 100 --particle_size 90 --mask_threshold 0.9

cryogem gen_data --mode homo --device cuda:0 \
  --input_map testing/data/exp_abinitio_volumes/densitymap.10028.90.mrc \
  --save_dir save_images/gen_data/Ribosome\(10028\)/testing_dataset/ \
  --n_micrographs 1000 --particle_size 90 --mask_threshold 0.9

2. ice layer

cryogem esti_ice --apix 5.36 --device cuda:0 \
  --input_dir testing/data/Ribosome\(10028\)/real_data/ \
  --save_dir save_images/esti_ice/Ribosome\(10028\)/ 

3. train

cryogem train --name empair-10028-test --max_dataset_size 100 --apix 5.36 --gpu_ids 0 \
  --real_dir testing/data/Ribosome\(10028\)/real_data/ \
  --sync_dir save_images/gen_data/Ribosome\(10028\)/training_dataset/mics_mrc \
  --mask_dir save_images/gen_data/Ribosome\(10028\)/training_dataset/particles_mask \
  --weight_map_dir save_images/esti_ice/Ribosome\(10028\)/ 

I got stuck at step 3 (training):

Loading real_A: 100%|█████████████████████████████████████████████████████████████████████████████████████████████| 100/100 [00:00<00:00, 223.02it/s]
Loading weight maps: 100%|████████████████████████████████████████████████████████████████████████████████████████| 935/935 [00:03<00:00, 248.08it/s]
Loading real_B: 100%|██████████████████████████████████████████████████████████████████████████████████████████████| 100/100 [00:01<00:00, 53.67it/s]
(INFO) (cryogem_model.py) (29-Oct-24 22:10:59) Sampler type: mask_sample
(INFO) (cryogem_model.py) (29-Oct-24 22:10:59) 1,2,3,4,5
model [CryoGEMModel] was created
Epoch 1/25, iters: 0/100:   0%|          | 0/100 [00:00<?, ?it/s]Segmentation fault (core dumped)

My platform is Ubuntu 22.04 with a Ryzen 7900X CPU and an RTX 4060 Ti GPU; the drivers and CUDA work fine.

I do not understand the "Segmentation fault (core dumped)" error. I tried reinstalling and running the command again, but the error was still there.

Looking forward to your professional solution.

Best Regards.

Jiakai-Zhang commented 2 days ago

Hi,

It seems you hit an error from some incompatible C or C++ code in the main training loop of CryoGEM; it may be caused by mismatched versions of PyTorch and CUDA in this project's environment.
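You can quickly check which builds are actually installed with a generic sanity check (nothing CryoGEM-specific, run inside the cryogem env):

import torch

print(torch.__version__)          # PyTorch build, e.g. 2.x.y+cu121
print(torch.version.cuda)         # CUDA toolkit this PyTorch was compiled against
print(torch.cuda.is_available())  # can PyTorch see the driver/runtime?
if torch.cuda.is_available():
    print(torch.cuda.get_device_name(0))  # should report your GPU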

I suggest that you:

1) try disabling CUDA and training CryoGEM on the CPU to see if it works, and try constructing a CUDA tensor in the terminal to see if anything goes wrong (see the sketch after this list);
2) try locating the error by setting CUDA_LAUNCH_BLOCKING=1 in your terminal and printing some output in the main loop of the training code commands/train.py (Line 68 - Line 115) to find which line gets stuck.
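For (1), a minimal sanity check could look like this (generic PyTorch, not CryoGEM code; if this already segfaults, the PyTorch/CUDA install itself is broken, not CryoGEM):

import os
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"  # surface asynchronous CUDA errors at the failing line

import torch

x = torch.randn(1024, 1024)    # CPU tensor: should always work
print("CPU ok:", (x @ x).shape)

if torch.cuda.is_available():
    xg = x.to("cuda:0")        # first CUDA allocation is a common crash point
    yg = xg @ xg               # launch a simple kernel
    torch.cuda.synchronize()   # force completion so errors surface here
    print("CUDA ok:", yg.shape)

If the CPU path works but the CUDA part crashes, reinstalling PyTorch with a CUDA build that matches your driver is the usual fix.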

Best,