Closed jbloom closed 8 months ago
I will leave this issue open for the time being.
I'm not sure of the right/most rigorous way to examine this, but I plotted effect_std versus times_seen to see if the error was unusually high for times_seen = 2. The data suggest there is no unusually high error when times_seen is allowed to drop to 2. As expected, effect_std drops as times_seen increases.
For this plot, I restricted effect to values above the mean effect of stop codons, to rule out stop codons as the cause of high error.
Sounds good. Given this, I would suggest dropping times_seen to 2 for all filters for now, as that is probably best (although not certain).
Then keep an eye on it going forward, and if we decide that is too lax at some point you can bump it up.
The quantitative plots like you have above are good, and we can also get a better sense of this when you have both libraries in there. But a lot of it is honestly just looking at results for specific mutations and asking if they "make sense", and if some don't, checking whether those are the ones with low times_seen.
fun_effects_config.yml has been updated to set min_pre_selection_frac: 0.000001. The previous requirement, min_pre_selection_frac: 0.00001, excluded a lot of variants and led to substantially incomplete heatmaps.
min_pre_selection_frac was recalculated assuming 90,000 variants in LibA, so 0.1/90000 ≈ 0.000001. This new parameter gives a much more complete heatmap whether times_seen is 2 or 3, so I am closing the matter for now.
Please reopen if this is not an appropriate resolution.
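The arithmetic behind the new threshold can be sketched as follows. The reading that 0.1 is a factor applied to the average per-variant frequency (1/90000) is an assumption, and the function name is hypothetical:

```python
def min_pre_selection_frac(n_variants, factor=0.1):
    """Threshold as `factor` times the average per-variant frequency (1 / n_variants)."""
    return factor / n_variants


n_liba = 90_000  # variant count for LibA, as stated in the thread

new_frac = min_pre_selection_frac(n_liba)
print(f"{new_frac:.2e}")  # 1.11e-06, rounded to 0.000001 in the config

# For comparison, the old threshold expressed relative to the average frequency.
old_frac = 0.00001
old_factor = old_frac * n_liba
print(round(old_factor, 3))  # 0.9
```

Under that assumption, the old value of 0.00001 required each variant to be at about 0.9 times the average pre-selection frequency, which would fit with it excluding many variants.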
@arjunaditham, I haven't looked enough to be completely certain which is best, but there is a tradeoff between a times_seen of 2 versus 3 in the default filters, including for your functional effects heatmaps.
A larger value (usually) gives more accurate measurements, but at the expense of missing some sites. Given how clean your data are and that you have a single-site mutation library (rather than a PCR one), I suspect that you could use a times_seen of 2 and have fewer missing mutations in your heatmap without much cost in accuracy. This is what Bernadeta did for the H5 library, which was also quite precise.
I'm not 100% sure which is best, and it can be assessed by a mix of quantitative metrics and just looking to see how well the data "make sense." But based on a very cursory look, I would lean towards 2 maybe being better than 3 for your data; at least put this on your radar to consider.
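One way to put a number on the coverage side of this tradeoff is to count how many mutations survive each cutoff. This is only a sketch: the mutation names and counts are invented, and it assumes the filter keeps mutations with times_seen greater than or equal to the cutoff:

```python
# Hypothetical times_seen values for a handful of mutations (toy data).
times_seen = {"A12G": 1, "K45R": 2, "L78P": 2, "S101T": 3, "D150N": 6}


def retained(times_seen, min_times_seen):
    """Mutations that pass a times_seen >= min_times_seen filter (assumed semantics)."""
    return sorted(m for m, n in times_seen.items() if n >= min_times_seen)


for cutoff in (2, 3):
    kept = retained(times_seen, cutoff)
    print(cutoff, len(kept), kept)
```

The mutations kept at 2 but dropped at 3 are the ones worth inspecting by eye to see whether their measured effects "make sense."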