Expected number of INDELs using genomic_distribution

cutleraging commented 9 months ago

Hello,

Thanks for such a great package!

In the genomic_distribution function, I understand that the expected amount of mutations for a region of interest is calculated as

n_muts / surveyed_length * surveyed_region_length

However, does this proved an accurate estimate when dealing with INDELs? I would not think so since n_muts is not equal to the amount of total mutated bases (such as for SNVs).

Any thoughts on a better way to calculate the expected number of INDELs?

One solution I have tried is to randomly shuffle the INDELs (accounting for sequence context) and then count how many are in the region of interest. When I do this, I get a observed/expected ratio of ~1, which is what I would expect. However, I am confused how then I would calculate if this is significant using the binomial_test function. Would it make sense to do something along these lines?

p = n_INDELs /  surveyed_length
n = surveyed_region_length
x = observed_INDELs # number of INDELs observed to land in region of interest from the randomly shuffled files
binomial_test(p, n, x)

Any input would be wonderful, thanks! Ronnie

FreekManders commented 9 months ago

Hi Ronnie,

Indels are generally caused by a single mutational event, so I think it is fair to count each indel as a separate mutation. That indels don't have a size of 1 shouldn't really impact the statistic if you are looking at relatively large regions. If you look at very small regions, then I can imagine that indels will often cross the boundaries of the region, which will influence your results.

Shuffling indels sounds like it is also a valid approach. To calculate statistics you could bootstrap it by repeating the shuffling 1000 times. You then count how many times you find equal or more indels in the region of interest in the bootstrapped data, than in the real data. This number divided by 1000 would then be your p-value.

cutleraging commented 9 months ago

Hi Freek,

Thanks for the insight! That makes sense.

Another question comes to mind... When the observed/expected ratio are calculated, this is in reference to a random distribution of mutations or what might be interpreted as the background distribution for the sample of interest. When I look at intergenic or heterochromatic regions, I usually find an enrichment of mutations (expected/observed > 1).

The question is, could what I am seeing in these intergenic regions actually be considered the background? Put another way, do you think that expected/observed ratios need to be normalized further? A way I can think of might be to use a binning approach across the covered regions of the genome to calculate expected/observed ratios in order to get an average background.

Thanks, Ronnie

FreekManders commented 9 months ago

Hi Ronnie, If you want to compare a specific intergenic region, then using the rest of the intergenic genome as the background might be a good idea. Approaches using binning or rolling averages can work. I have read papers that use binning approaches in the past, so you might be able to find some existing methods. Cancer driver discovery methods also often try to identify regions with more mutations than expected, so you could take a look at those too.

UMCUGenetics / MutationalPatterns

Expected number of INDELs using genomic_distribution #85