Closed RadoDanev closed 2 years ago
Also ran a legacy EM algorithm C2D. The class selection there worked correctly and returned ~90k good particles. It seems that the issue is with the class indexing/naming in the run_data.star of the VDAM run?!?
What happens if you select just one class? Is the number of particles written in the output STAR file the same as the number in Show metadata
? Is it also consistent with what you find by awk
on the input STAR file? Please also try the same thing by selecting all particles.
Also note that if you stop a Class2D run before the final iteration, some particles might not have passed any iteration and remain unassigned to any class.
Saving a single class shows a similar discrepancy between the rlnClassDistribution in the class metadata and the number of particles. The number of saved particles corresponds to the particle number for the class in the run_data.star. Therefore, the discrepancy is between the rlnClassDistribution and the number of particles assigned to the class in the run_data.star.
the discrepancy is between the rlnClassDistribution and the number of particles assigned to the class in the run_data.star
This is expected. Please see https://www.jiscmail.ac.uk/cgi-bin/wa-jisc.exe?A2=ind1911&L=CCPEM&P=R61249.
The real question is, does the VDAM algorithm produce wider class posterior distributions than the EM algorithm?
The discrepancy is HUGE compared to the Bayesian distribution assignments. For example, a class with ClassDistribution of ~0.005 contains ~25% of the particles when saving it!
What happens if you select one class and relion_reconstruct
it (you have to specify it is 2D)? If "the issue is with the class indexing/naming in the run_data.star", the result will be very different from the class average.
Indeed something is very strange.
Total 100438 particles.
First 9 classes from VDAM run_data.star:
$ awk -e '/mrc/ {arr[$3]+=1} END {for (i in arr) {print i" "arr[i]}}' Class2D/first_try/run_it200_data.star
1 341
2 33671 # the biggest class (33.5 %)
3 21
4 24634
5 374
6 9
7 208
8 7632
9 438
...
rlnClassDistribution from VDAM model.star:
$ awk -e '/mrcs/ {print $4}' Class2D/first_try/run_it200_model.star
0.034537
0.004869 # very small
0.002443
0.004771
0.039661
0.001895
0.002029
0.010276
0.036927
...
$ awk -e '/mrcs/ {print $4}' Class2D/first_try/run_it200_model.star | sort -n | tail
0.039269
0.039661
0.039727
0.039793
0.041427
0.041951
0.044189
0.044471
0.045258
0.048746 # this is the largest class
# there is no class with 33 % occupancy
$ awk -e '/mrcs/ { s+=$4} END{print s}'Class2D/first_try/run_it200_model.star
0.999997 # sum is OK
Yes, the class distributions in the model.star and the visual quality of the corresponding classes are OK. The problem is with the class number assignment in the data.star.
relion_reconstruct --ctf --refdim 2 --o vdam_classNNN.mrc --i Class2D/first_try/run_it200_data.star --class N --skip_gridding
is consistent with run_it200_classes.mrcs
(tried class 2 and 14).
So only rlnClassDistribution
is wrong?
I think that rlnClassNumber in the data.star is wrong. The good-looking classes contain much fewer particles than they should according to the Class Distribution. For example, classes 24 and 26. Try a selection job and see how many good particles you get. The class auto-selection also returns very few (~10%) good particles, compared to the legacy EM alorithm + class selection.
I think that rlnClassNumber in the data.star is wrong.
I am not so sure, because relion_reconstruct
of class 14 reproduced the very bright dot seen in 14@run_it200_classes.mrcs
.
But admittedly apoferritin is too symmetrical to judge. Can you try the above relion_reconstruct
command line on your GPCR dataset?
OK, we'll try.
I made some plots. There is a very good agreement between the number of particles and rlnClassDistribution in the case of the classic EM algorithm. VDAM is very inconsistent, and the most populous classes are further down the list when sorted by class distribution.
If this is a feature of VDAM and not a bug - please inform users about it and possibly add an option to sort the classes by the number of particles instead of class distribution.
I'll make similar plots for the GPCR dataset and post them later.
Here are the GPCR particle number vs class distribution plots. Sorry, can't show the 2D classes at this stage. Again, very irregular behavior with the VDAM. Almost 90% of the particles went into one class that does not look good.
Did you run relion_reconstruct
on GPCR? The most important thing is to see which is wrong, rlnClassDistribution or rlnClassNumber.
Is the selection by eye or by the automated procedure? If this is by eye, why did you avoid the 3rd row, the rightmost column in EM and blue and green classes in VDAM?
@dkimanius just made an update b15acae3. Can you test this?
We'll try running the reconstruct later (the data is being processed by a colleague). From running reconstruct on the apoF, it is not that the class assignments are wrong, just that the classes contain much fewer particles than what the class distribution indicates. The apoF classes have distributions of 0.03 ~ 0.04, but contain just a few hundred particles, approximately ten times less than what the class distribution indicates. And again, most of the particles endes up in a couple of classes that do not look that good.
Is the assigned class in data.star the one with the maximum probability for a given particle? If yes, then how come that a class that contains the majority of the particles has such a low class distributuon?!? Or vice versa, if a class has a 0.03 ~ 0.05 class distribution, why the particles "run away" in a different class with a CD of ~0.01?!?
Is there a way to save the full probability distributions for some particles for debug purposes?
P.S. Sorry, didn't notice the message about the update. Will test it out! P.P.S. About the class selection, yes by eye. For VDAM also tried the auto selector, but it selected exactly the same classes as me. The blue and green classes are quite blurry (not secondary features). The EM classes on the third row right get blurrier and do not contain that many particles.
I'm looking into this. @dkimanius please have a look too. I cannot reproduce it with Niko's apoferritin test set that we used for the paper (see the table also of selected particles, comparing EM and VDAM in the preprint), but the problem is indeed reproduced with the apoferritin data from @RadoDanev. This is very odd. Thanks for sharing @RadoDanev!
Thanks for checking on this Sjors. Could it be because I used the new 16bit MRC format for the extracted particles?!? I'll test with non-16bit MRCs later.
We used 16-bit float for all our tests too, so I doubt that is the problem.
Am running on the CPU as well, to see whether it might be a GPU-related problem.
@dkimanius just made an update b15acae. Can you test this?
OK, this changes this significantly! Now, the rlnClassDistribution and the number of assigned particles from the data.star file match much better. However, the optimisation solution is actually very bad .
Another possible issue - I noticed that a few bad micrographs had an overestimated CTF resolution of 1.26 A and slipped through the CTF resolution filtering.
@scheres In the Relion 4 preprint, the panels in Fig.4 seem to be mislabeled, A&B look like CB1, and C&D like GHD.
@RadoDanev Is it from CTFFIND or GCTF? If it is from CTFFIND, did you use pre-calculated power spectra from motion correction?
@biochem-fan Gctf (blushes) on micrographs without dose-weighting. Also, the movie alignments were done with MotionCor2 (more blushing).
@RadoDanev Look at individual GCTF log files. If it says 1.26 Å, the problem is in GCTF.
@biochem-fan Yes, the overestimation is Gctf's fault, but does the rlnCtfResolution play a role during classification?
No. rlnCtfResolution
is only for users. RELION never uses it.
Using 32 bit MRCs for the particles and running on GPUs did not resolve the issue. Will try removing the 1.26 A particles.
Removing the bad micrographs produced even more good-looking classes, but unfortunately did not resolve the issue. :( 33 classes with only ~8k good particles. Edit: This is on commit 137xxxx
@RadoDanev is this using the commit b15acae or later?
@biochem-fan Ugggh, sorry, this is still on the old commit. Went home and forgot to re-compile ... :( ... Will redo the jobs.
@dkimanius and I are onto something: it's the use of multiple optics groups with the VDAM algorithm! All groups higher than 1 get zero maximum probabilities... Somehow the EM algorithm is not affected. More later: we're looking where this goes wrong. Am running one run with only a single optics group and that seems to be going ok (not finished yet).
Re-run the job with the bad micrographs removed (from above) on commit 840cf9, still on GPUs. The classes look very much like the classes that @scheres got from running on CPUs, and the number of particles seems to match, but the top classes are far from great.
@scheres Ouch, I created the optics groups with a script that multiplies the parameters of the single optical group for the imported movies and assigns numbers according to the image shift. Also, just noticed that the acceleration voltage is wrong!!! It should be 200 kV!
I hope I didn't send everyone on a wild goose chase ... The GPCR dataset however was done by another person and is from another microscope (Krios).
But there is still a bug in our code with multiple optics groups. We have a similar problem with our CB1 data set in the paper (which has 2 groups). I have now rerun your data set with a single optics group and the results are great: better class averages and consistent classDistribution with assigned number of particles. We will try and fix the bug asap, but it still is eluding us where it lives exactly. That should be a matter of time though.
Great, hope this leads to resolving an existing issue and not just wasting everyone's time ...
This is the result with 1 group
This looks legit! :) Good that you mentioned the optics groups and I saw the wrong voltage ... would have banged my head later on and may have suspected Relion 4 ... brrrr
Commit 1710855ab1d665677d85b48f65e686ce38fbba05 and a few following commits should have fixed this issue.
90,759 particles in selection (attached image).
@RadoDanev Can you verify. Thanks!
@dkimanius Just tested it and it seems to work as intended! Thanks Dari and Relion team!!! :)
After VDAM 2D classification and manual selection of classes, the number of returned particles is much smaller than expected. In this case:
The issue was confirmed also on a GPCR dataset.
Environment:
Dataset:
Job options:
note.txt
in the job directory):