Selection of 2D classes returns much less particles than expected

RadoDanev commented 2 years ago

After VDAM 2D classification and manual selection of classes, the number of returned particles is much smaller than expected. In this case:

an apoferritin test dataset with extracted ~100k particles
50 classes VDAM with all other parameters at default
manually selected 22 good-looking classes with class distributions going from ~0.05 to ~0.03
the selection job returned only ~12k particles

The issue was confirmed also on a GPCR dataset.

Environment:

OS: Ubuntu 20.04 LTS
MPI runtime: OpenMPI 4.0.3
RELION version RELION-4.0-beta-1-devel-commit-0820b8
Memory: 256 GB
GPU: 4xGTX 2080Ti

Dataset:

Box size: 140 px
Pixel size: 1.48 Å/px
Number of particles: 100,000
Description: apoferritin

Job options:

Type of job: Subset selection
Number of MPI processes: -
Number of threads: -

Full command (see note.txt in the job directory):


`which relion_display` --gui --i Class2D/job011/run_it200_optimiser.star --allow_save --fn_parts Select/job012/particles.star --fn_imgs Select/job012/class_averages.star --recenter   --pipeline_control Select/job012/

RadoDanev commented 2 years ago

Also ran a legacy EM algorithm C2D. The class selection there worked correctly and returned ~90k good particles. It seems that the issue is with the class indexing/naming in the run_data.star of the VDAM run?!?

biochem-fan commented 2 years ago

What happens if you select just one class? Is the number of particles written in the output STAR file the same as the number in Show metadata? Is it also consistent with what you find by awk on the input STAR file? Please also try the same thing by selecting all particles.

Also note that if you stop a Class2D run before the final iteration, some particles might not have passed any iteration and remain unassigned to any class.

RadoDanev commented 2 years ago

Saving a single class shows a similar discrepancy between the rlnClassDistribution in the class metadata and the number of particles. The number of saved particles corresponds to the particle number for the class in the run_data.star. Therefore, the discrepancy is between the rlnClassDistribution and the number of particles assigned to the class in the run_data.star.

biochem-fan commented 2 years ago

the discrepancy is between the rlnClassDistribution and the number of particles assigned to the class in the run_data.star

This is expected. Please see https://www.jiscmail.ac.uk/cgi-bin/wa-jisc.exe?A2=ind1911&L=CCPEM&P=R61249.

The real question is, does the VDAM algorithm produce wider class posterior distributions than the EM algorithm?

RadoDanev commented 2 years ago

The discrepancy is HUGE compared to the Bayesian distribution assignments. For example, a class with ClassDistribution of ~0.005 contains ~25% of the particles when saving it!

biochem-fan commented 2 years ago

What happens if you select one class and relion_reconstruct it (you have to specify it is 2D)? If "the issue is with the class indexing/naming in the run_data.star", the result will be very different from the class average.

biochem-fan commented 2 years ago

Indeed something is very strange.

Total 100438 particles.

First 9 classes from VDAM run_data.star:

$ awk -e '/mrc/ {arr[$3]+=1} END {for (i in arr) {print i" "arr[i]}}' Class2D/first_try/run_it200_data.star 
1 341
2 33671 # the biggest class (33.5 %)
3 21
4 24634
5 374
6 9
7 208
8 7632
9 438
...

rlnClassDistribution from VDAM model.star:

$ awk -e '/mrcs/ {print $4}' Class2D/first_try/run_it200_model.star
0.034537
0.004869 # very small
0.002443
0.004771
0.039661
0.001895
0.002029
0.010276
0.036927
...
$ awk -e '/mrcs/ {print $4}' Class2D/first_try/run_it200_model.star | sort -n | tail
0.039269
0.039661
0.039727
0.039793
0.041427
0.041951
0.044189
0.044471
0.045258
0.048746 # this is the largest class
# there is no class with 33 % occupancy

$ awk -e '/mrcs/ { s+=$4} END{print s}'Class2D/first_try/run_it200_model.star
0.999997 # sum is OK

RadoDanev commented 2 years ago

Yes, the class distributions in the model.star and the visual quality of the corresponding classes are OK. The problem is with the class number assignment in the data.star.

biochem-fan commented 2 years ago

relion_reconstruct --ctf --refdim 2 --o vdam_classNNN.mrc --i Class2D/first_try/run_it200_data.star --class N --skip_gridding

is consistent with run_it200_classes.mrcs (tried class 2 and 14).

So only rlnClassDistribution is wrong?

RadoDanev commented 2 years ago

I think that rlnClassNumber in the data.star is wrong. The good-looking classes contain much fewer particles than they should according to the Class Distribution. For example, classes 24 and 26. Try a selection job and see how many good particles you get. The class auto-selection also returns very few (~10%) good particles, compared to the legacy EM alorithm + class selection.

biochem-fan commented 2 years ago

I think that rlnClassNumber in the data.star is wrong.

I am not so sure, because relion_reconstruct of class 14 reproduced the very bright dot seen in 14@run_it200_classes.mrcs.

But admittedly apoferritin is too symmetrical to judge. Can you try the above relion_reconstruct command line on your GPCR dataset?

RadoDanev commented 2 years ago

OK, we'll try.

RadoDanev commented 2 years ago

I made some plots. There is a very good agreement between the number of particles and rlnClassDistribution in the case of the classic EM algorithm. VDAM is very inconsistent, and the most populous classes are further down the list when sorted by class distribution.

If this is a feature of VDAM and not a bug - please inform users about it and possibly add an option to sort the classes by the number of particles instead of class distribution.

I'll make similar plots for the GPCR dataset and post them later.

RadoDanev commented 2 years ago

Here are the GPCR particle number vs class distribution plots. Sorry, can't show the 2D classes at this stage. Again, very irregular behavior with the VDAM. Almost 90% of the particles went into one class that does not look good.

biochem-fan commented 2 years ago

Did you run relion_reconstruct on GPCR? The most important thing is to see which is wrong, rlnClassDistribution or rlnClassNumber.

Is the selection by eye or by the automated procedure? If this is by eye, why did you avoid the 3rd row, the rightmost column in EM and blue and green classes in VDAM?

biochem-fan commented 2 years ago

@dkimanius just made an update b15acae3. Can you test this?

RadoDanev commented 2 years ago

We'll try running the reconstruct later (the data is being processed by a colleague). From running reconstruct on the apoF, it is not that the class assignments are wrong, just that the classes contain much fewer particles than what the class distribution indicates. The apoF classes have distributions of 0.03 ~ 0.04, but contain just a few hundred particles, approximately ten times less than what the class distribution indicates. And again, most of the particles endes up in a couple of classes that do not look that good.

Is the assigned class in data.star the one with the maximum probability for a given particle? If yes, then how come that a class that contains the majority of the particles has such a low class distributuon?!? Or vice versa, if a class has a 0.03 ~ 0.05 class distribution, why the particles "run away" in a different class with a CD of ~0.01?!?

Is there a way to save the full probability distributions for some particles for debug purposes?

P.S. Sorry, didn't notice the message about the update. Will test it out! P.P.S. About the class selection, yes by eye. For VDAM also tried the auto selector, but it selected exactly the same classes as me. The blue and green classes are quite blurry (not secondary features). The EM classes on the third row right get blurrier and do not contain that many particles. EM_2Dclasses_Screenshot from 2021-10-05 09-58-54 VDAM_2Dclasses_Screenshot from 2021-10-05 09-57-29

scheres commented 2 years ago

I'm looking into this. @dkimanius please have a look too. I cannot reproduce it with Niko's apoferritin test set that we used for the paper (see the table also of selected particles, comparing EM and VDAM in the preprint), but the problem is indeed reproduced with the apoferritin data from @RadoDanev. This is very odd. Thanks for sharing @RadoDanev!

RadoDanev commented 2 years ago

Thanks for checking on this Sjors. Could it be because I used the new 16bit MRC format for the extracted particles?!? I'll test with non-16bit MRCs later.

scheres commented 2 years ago

We used 16-bit float for all our tests too, so I doubt that is the problem.

scheres commented 2 years ago

Am running on the CPU as well, to see whether it might be a GPU-related problem.

scheres commented 2 years ago

@dkimanius just made an update b15acae. Can you test this?

OK, this changes this significantly! Now, the rlnClassDistribution and the number of assigned particles from the data.star file match much better. However, the optimisation solution is actually very bad rado .

RadoDanev commented 2 years ago

Another possible issue - I noticed that a few bad micrographs had an overestimated CTF resolution of 1.26 A and slipped through the CTF resolution filtering.

@scheres In the Relion 4 preprint, the panels in Fig.4 seem to be mislabeled, A&B look like CB1, and C&D like GHD.

biochem-fan commented 2 years ago

@RadoDanev Is it from CTFFIND or GCTF? If it is from CTFFIND, did you use pre-calculated power spectra from motion correction?

RadoDanev commented 2 years ago

@biochem-fan Gctf (blushes) on micrographs without dose-weighting. Also, the movie alignments were done with MotionCor2 (more blushing).

biochem-fan commented 2 years ago

@RadoDanev Look at individual GCTF log files. If it says 1.26 Å, the problem is in GCTF.

RadoDanev commented 2 years ago

@biochem-fan Yes, the overestimation is Gctf's fault, but does the rlnCtfResolution play a role during classification?

biochem-fan commented 2 years ago

No. rlnCtfResolution is only for users. RELION never uses it.

RadoDanev commented 2 years ago

Using 32 bit MRCs for the particles and running on GPUs did not resolve the issue. Will try removing the 1.26 A particles.

RadoDanev commented 2 years ago

Removing the bad micrographs produced even more good-looking classes, but unfortunately did not resolve the issue. :( 33 classes with only ~8k good particles. Edit: This is on commit 137xxxx

biochem-fan commented 2 years ago

@RadoDanev is this using the commit b15acae or later?

RadoDanev commented 2 years ago

@biochem-fan Ugggh, sorry, this is still on the old commit. Went home and forgot to re-compile ... :( ... Will redo the jobs.

scheres commented 2 years ago

@dkimanius and I are onto something: it's the use of multiple optics groups with the VDAM algorithm! All groups higher than 1 get zero maximum probabilities... Somehow the EM algorithm is not affected. More later: we're looking where this goes wrong. Am running one run with only a single optics group and that seems to be going ok (not finished yet).

RadoDanev commented 2 years ago

Re-run the job with the bad micrographs removed (from above) on commit 840cf9, still on GPUs. The classes look very much like the classes that @scheres got from running on CPUs, and the number of particles seems to match, but the top classes are far from great.

RadoDanev commented 2 years ago

@scheres Ouch, I created the optics groups with a script that multiplies the parameters of the single optical group for the imported movies and assigns numbers according to the image shift. Also, just noticed that the acceleration voltage is wrong!!! It should be 200 kV!

RadoDanev commented 2 years ago

I hope I didn't send everyone on a wild goose chase ... The GPCR dataset however was done by another person and is from another microscope (Krios).

scheres commented 2 years ago

But there is still a bug in our code with multiple optics groups. We have a similar problem with our CB1 data set in the paper (which has 2 groups). I have now rerun your data set with a single optics group and the results are great: better class averages and consistent classDistribution with assigned number of particles. We will try and fix the bug asap, but it still is eluding us where it lives exactly. That should be a matter of time though.

RadoDanev commented 2 years ago

Great, hope this leads to resolving an existing issue and not just wasting everyone's time ...

scheres commented 2 years ago

rado1 This is the result with 1 group

RadoDanev commented 2 years ago

This looks legit! :) Good that you mentioned the optics groups and I saw the wrong voltage ... would have banged my head later on and may have suspected Relion 4 ... brrrr

dkimanius commented 2 years ago

Commit 1710855ab1d665677d85b48f65e686ce38fbba05 and a few following commits should have fixed this issue.

Screenshot from 2021-10-05 20-34-45 90,759 particles in selection (attached image).

@RadoDanev Can you verify. Thanks!

RadoDanev commented 2 years ago

@dkimanius Just tested it and it seems to work as intended! Thanks Dari and Relion team!!! :)

3dem / relion

Selection of 2D classes returns much less particles than expected #815