compSPI / cryoAI

How to load a real world dataset? #3

Open dugu9sword opened 2 years ago

dugu9sword commented 2 years ago

Hi,

I have downloaded the EMPIAR 10049 data following this link: https://github.com/zhonge/cryodrgn_empiar

But loading it fails with the following error:

Traceback (most recent call last):
  File "/root/cryoAI/src/reconstruct/main.py", line 273, in <module>
    retval, status_message = main()
  File "/root/cryoAI/src/reconstruct/main.py", line 260, in main
    train(config)
  File "/root/cryoAI/src/reconstruct/train.py", line 20, in experiment
    dataset = StarfileDataLoader(config)
  File "/root/cryoAI/src/dataio.py", line 48, in __init__
    self.true_sidelen = self.df['optics']['rlnImageSize'][0]
  File "/root/miniconda3/envs/cryoai/lib/python3.7/site-packages/pandas/core/frame.py", line 3458, in __getitem__
    indexer = self.columns.get_loc(key)
  File "/root/miniconda3/envs/cryoai/lib/python3.7/site-packages/pandas/core/indexes/base.py", line 3363, in get_loc
    raise KeyError(key) from err
KeyError: 'optics'

Error: Training failed.

Thanks!

bHimes commented 2 years ago

Hi @dugu9sword

Based on your error message (KeyError: 'optics'), your star file is missing the optics group, which is where the 'rlnImageSize' column lives.

The easiest way around this would be to modify your star file. Optics groups are defined on the RELION wiki and have something like five columns, if my memory serves.

@fredericpoitevin I think I had run into this too, but haven't had time to work with cryoAI much :(

If I could suggest: this case could easily be caught using a dictionary's get method, which returns None when the key is missing instead of raising a KeyError. For some keys, like this one, the value is implicitly defined by the input data, so it could be deduced by cryoAI.
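A minimal sketch of that suggestion, not cryoAI's actual code. It assumes the parsed star file is a mapping of block name to columns, using the same nested {column: {row: value}} layout as the optics dict posted later in this thread. The names resolve_image_size and mrcs_sidelen are hypothetical; mrcs_sidelen stands for a side length read from the particle stack itself.

```python
def resolve_image_size(star_blocks, mrcs_sidelen=None):
    """Return the image side length, preferring the optics table.

    star_blocks:  hypothetical mapping of block name -> {column -> {row -> value}}.
    mrcs_sidelen: side length taken from the particle stack, if already known.
    """
    # dict.get returns None when the block is absent instead of raising KeyError.
    optics = star_blocks.get("optics")
    if optics is not None:
        column = optics.get("rlnImageSize")
        if column:
            return int(column[0])

    # The value is implied by the input images, so fall back to it when given.
    if mrcs_sidelen is not None:
        return int(mrcs_sidelen)

    raise ValueError(
        "rlnImageSize is not in the star file and no image size was supplied"
    )


# Star file with an optics block (RELION 3.1 style).
print(resolve_image_size({"optics": {"rlnImageSize": {0: 128}}}))

# Older star file without an optics block: use the size from the images.
print(resolve_image_size({"particles": {}}, mrcs_sidelen=128))
```

The point of the fallback is that a missing optics block turns into either a deduced value or a readable error message, rather than a bare KeyError from deep inside pandas.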

dugu9sword commented 2 years ago

Hi @bHimes,

Thanks a lot for your helpful suggestions! I checked RELION's wiki and found a workaround. I re-used the optics parameters that CryoAI uses for generating synthetic EMPIAR-10028 data, and trained on the real-world 10028 dataset (https://github.com/zhonge/cryodrgn_empiar).

"optics": {
    'rlnVoltage': {0: 300.0},
    'rlnSphericalAberration': {0: 2.7},
    'rlnAmplitudeContrast': {0: 0.1},
    'rlnOpticsGroup': {0: 1},
    'rlnImageSize': {0: 128},
    'rlnImagePixelSize': {0: 3.77}
},

The reconstructed volume (at step 52648) looks poor. That is acceptable, since amortized inference for pose estimation is still a proof-of-concept technology in this area.

Is there any advice or best practice for running CryoAI on real-world data? Thanks!

ff98li commented 1 year ago

I ran into the same KeyError for the missing optics group in the input .star file when trying to load a real EMPIAR dataset. Based on what I have read in RELION's docs, the .star file parser used by CryoAI assumes the optics group, a feature introduced in RELION 3.1, which can be missing for cryo-EM datasets released before 2020. For now, the best fix is probably to use RELION 3.1+ to convert the old file format into the new one (with the optics group added).

As for @dugu9sword's question regarding the reconstruction quality: if you take a look at the .star file of the 10028 dataset, you will find that rlnSphericalAberration is actually 2.000000 (the fifth column), rather than the 2.7 used in your input. I presume this could be one source of the poor reconstruction quality? I'm not a cryo-EM expert, but I hope this helps.
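One way to catch this kind of mismatch before training is to compare the hand-written optics values against what the dataset's own .star file records. This is only a sketch: optics_mismatches is a hypothetical helper, and star_row holds just the one value quoted in this thread, so the other labels are skipped.

```python
def optics_mismatches(override, star_row, tol=1e-6):
    """Return {label: (override_value, star_value)} for labels that disagree.

    override: optics dict in the {label: {0: value}} layout used above.
    star_row: {label: value} taken from one row of the dataset's .star file.
    """
    mismatches = {}
    for label, column in override.items():
        if label not in star_row:
            continue  # nothing in the star file to compare against
        ours = float(column[0])
        theirs = float(star_row[label])
        if abs(ours - theirs) > tol:
            mismatches[label] = (ours, theirs)
    return mismatches


# Values from the optics dict posted earlier in the thread.
override = {
    "rlnVoltage": {0: 300.0},
    "rlnSphericalAberration": {0: 2.7},
    "rlnAmplitudeContrast": {0: 0.1},
}

# Only the spherical aberration value is quoted in this thread.
star_row = {"rlnSphericalAberration": "2.000000"}

print(optics_mismatches(override, star_row))
# prints {'rlnSphericalAberration': (2.7, 2.0)}
```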

Edit: 2022.12.11

I have found a solution to the OP's KeyError: 'optics'. Again, the problem comes from CryoAI assuming that the .star file contains the optics group, a feature introduced in RELION 3.1. However, even if you convert your raw .star file (for example, shiny_2sets.star in EMPIAR-10028) to the updated format with the optics group by running relion_convert_star with RELION 3.1+, you will still get another KeyError for the missing rlnAngleRot, a parameter that exists only if you have performed 3D refinement beforehand... well, since this is supposed to be an ab initio reconstruction pipeline... ☹️

Luckily, among the preprocessed files of cryo-EM datasets provided by Zhong, the .cs file contains all the information you need for running CryoAI. To make CryoAI work with EMPIAR-10028, do the following:

Step 1. Install pyem

Step 2. Run csparc2star.py to convert cryosparc_P11_J4_003_particles.cs into a .star file (remember to update your .ini file to point to it).

Step 3. If you open the converted .star file, you will notice that the particle stacks (.mrcs) are referenced under a relative path, J1/imported/MRC_1901/. So in the directory where you saved the converted .star file, run mkdir -p J1/imported and mv the two particle stack directories inside.

Step 4. In your first run, you may encounter a warning about invalid particles. In my case, the images within the two particle stack files MRC_0601/095_particles_shiny_nb50_new.mrcs and MRC_0601/408_particles_shiny_nb50_new.mrcs are invalid. This is a simple fix: open your converted .star file, remove the records associated with these two files, and rerun CryoAI. It should then train without issues.
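Removing the records in Step 4 by hand is tedious, so here is a rough sketch of doing it with a plain text filter. It assumes each particle record sits on a single line that contains the stack file name, which is how RELION-style loops are usually written. drop_stack_records is a hypothetical helper, and the sample rows (including the 001_ stack and the defocus numbers) are made-up placeholders, not real data from the dataset.

```python
def drop_stack_records(star_text, bad_stacks):
    """Return star_text without the lines that mention any of bad_stacks."""
    kept = []
    for line in star_text.splitlines():
        if any(name in line for name in bad_stacks):
            continue  # skip records that point at an invalid particle stack
        kept.append(line)
    return "\n".join(kept) + "\n"


# The two stacks reported as invalid in Step 4.
BAD_STACKS = [
    "MRC_0601/095_particles_shiny_nb50_new.mrcs",
    "MRC_0601/408_particles_shiny_nb50_new.mrcs",
]

# Tiny made-up star file, for illustration only.
example = """data_particles

loop_
_rlnImageName #1
_rlnDefocusU #2
000001@J1/imported/MRC_0601/095_particles_shiny_nb50_new.mrcs 12000.0
000001@J1/imported/MRC_0601/001_particles_shiny_nb50_new.mrcs 11800.0
"""

cleaned = drop_stack_records(example, BAD_STACKS)
print(cleaned)
```

On a real file you would read the converted .star file into a string, pass it through the function, and write the result to a new file, so that the original is kept as a backup.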

Nevertheless, I'm also getting a poor reconstruction, like the OP's:

Reconstruction for empiar-10028 after 50 epochs (82000 steps): Volume

Losses over 50 epochs: loss