consider if `times_seen` of 2 is better than 3

dms-vep / RABV_Pasteur_G_DMS

Deep mutational scan of the Rabies Virus Glycoprotein (RABV-G), Pasteur strain


consider if `times_seen` of 2 is better than 3 #4

Closed jbloom closed 8 months ago

jbloom commented 8 months ago

@arjunaditham, I haven't looked enough to be completely certain which is best, but there is a tradeoff between having a times_seen of 2 versus 3 in the default filters including for your functional effects heatmaps.

A larger value usually gives more accurate measurements, but at the expense of missing some mutations. Given how clean your data are and that you have a single-site mutation library (rather than a PCR one), I sort of suspect that you could use a times_seen of 2 and then you would have fewer missing mutations in your heatmap without much cost in accuracy. This is what Bernadeta did for the H5 library, which was also quite precise.

I'm not 100% sure which is best, and it can be assessed by a mix of quantitative metrics and just sort of looking to see how well the data "make sense." But based on a very cursory look, I would lean towards 2 maybe being better than 3 for your data---at least put this on your radar to consider.
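One quick way to quantify the completeness side of this tradeoff is to count how many mutations survive each threshold. The following is only a minimal sketch on made-up toy data: the record layout, the `mutation` key, and the `count_retained` helper are hypothetical and are not taken from the pipeline's actual output files.

```python
def count_retained(records, thresholds=(2, 3)):
    """Return {threshold: number of mutations with times_seen >= threshold}."""
    return {
        t: sum(1 for r in records if r["times_seen"] >= t)
        for t in thresholds
    }


# Hypothetical toy data; real values would come from the functional effects table.
toy = [
    {"mutation": "A10G", "times_seen": 1.0},
    {"mutation": "A10T", "times_seen": 2.0},
    {"mutation": "K25E", "times_seen": 2.5},
    {"mutation": "K25R", "times_seen": 3.0},
    {"mutation": "L40P", "times_seen": 6.0},
]

retained = count_retained(toy)
print(retained)  # {2: 4, 3: 2}
```

The difference between the two counts is the number of heatmap cells that would be filled in by relaxing the filter from 3 to 2.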

arjunaditham commented 8 months ago

[Image: Screen Shot 2024-02-14 at 4 18 34 PM]

I will leave this issue open for the time being.

I'm not sure of the right/most rigorous way to examine this, but I plotted effect_std versus times_seen to see if the error was unusually high for times_seen = 2. The data suggest that allowing times_seen to drop to 2 does not introduce unusually high error. As expected, effect_std drops as times_seen increases.

For this plot, I kept only mutations whose effect is above the mean effect of stop codons, to rule out stop codons as the cause of any high error.
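A numeric summary along the same lines could look like the sketch below, which reports the median effect_std for each times_seen value after applying the stop-codon cutoff. This is an illustration under assumptions, not the code used for the plot: the record layout, the `mutant` key with `*` marking stop codons, the integer times_seen values, and the `effect_std_by_times_seen` helper are all hypothetical, and the numbers are toy data.

```python
from collections import defaultdict
from statistics import mean, median


def effect_std_by_times_seen(records):
    """Median effect_std per times_seen, keeping effects above the mean stop-codon effect."""
    # Cutoff: mean effect over all stop-codon mutations (mutant == "*").
    cutoff = mean(r["effect"] for r in records if r["mutant"] == "*")
    grouped = defaultdict(list)
    for r in records:
        if r["effect"] > cutoff:
            grouped[r["times_seen"]].append(r["effect_std"])
    return {ts: median(stds) for ts, stds in sorted(grouped.items())}


# Hypothetical toy data.
records = [
    {"mutant": "*", "effect": -4.0, "effect_std": 0.9, "times_seen": 2},
    {"mutant": "*", "effect": -5.0, "effect_std": 1.1, "times_seen": 3},
    {"mutant": "G", "effect": -0.5, "effect_std": 0.4, "times_seen": 2},
    {"mutant": "T", "effect": 0.1, "effect_std": 0.6, "times_seen": 2},
    {"mutant": "E", "effect": -1.0, "effect_std": 0.3, "times_seen": 3},
    {"mutant": "R", "effect": -4.8, "effect_std": 1.5, "times_seen": 3},
    {"mutant": "P", "effect": -2.0, "effect_std": 0.2, "times_seen": 5},
]

summary = effect_std_by_times_seen(records)
print(summary)  # {2: 0.6, 3: 0.3, 5: 0.2}
```

If the median at times_seen = 2 is not far above the median at 3, that is consistent with the plot's conclusion.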

jbloom commented 8 months ago

Sounds good. Given this, I would suggest dropping times_seen to 2 for all filters for now, as that is probably best (although not certain).

Then keep an eye on it going forward, and if we decide that is too lax at some point you can bump it up.

The quantitative plots like you have above are good, and we can also get a better sense of this when you have both libraries in there. But a lot of it is honestly just looking at results for specific mutations and asking if they "make sense" and, if they don't, checking whether the ones that don't have low times_seen.

arjunaditham commented 8 months ago

fun_effects_config.yml has been updated such that min_pre_selection_frac: 0.000001. The previous requirement, min_pre_selection_frac: 0.00001, excluded a lot of variants and was leading to substantially incomplete heatmaps.

min_pre_selection_frac was recalculated assuming 90,000 variants in LibA, so 0.1/90000 ≈ 0.0000011, rounded to 0.000001. This new parameter is leading to a much more complete heatmap regardless of whether times_seen is 2 or 3, so I am closing the matter for now.
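The arithmetic can be checked with a few lines of Python. This is only a sketch: the `min_pre_selection_frac` function and its `scale` argument are hypothetical names for illustration, and reading the 0.1 as "one tenth of the frequency expected if all variants were equally abundant" is an assumption rather than something stated in the config.

```python
def min_pre_selection_frac(n_variants, scale=0.1):
    """Return scale / n_variants, i.e. a fraction of the even per-variant frequency."""
    return scale / n_variants


frac = min_pre_selection_frac(90_000)
print(f"{frac:.2e}")  # 1.11e-06, which rounds to the configured 0.000001

# The previous setting of 0.00001 was roughly nine times stricter.
print(round(0.00001 / frac))  # 9
```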

Please reopen if this is not an appropriate resolution.