Determine the minimal length of a genomic region that has a probability larger than P of having at least a single insertion.

SATAY-LL / LaanLab-SATAY-DataAnalysis

This contains codes and workflows for data analysis regarding SATAY experiments.

Apache License 2.0

4 stars 3 forks source link

Determine the minimal length of a genomic region that has a probability larger than P of having at least a single insertion. #8

Closed Gregory94 closed 4 years ago

Gregory94 commented 4 years ago

I am developing a bar plot for showing the amount of reads for each transposon insertion in a gene. I have a question about determining the width of the bars. Currently I made the bars so wide that they include a fixed amount of insertions (e.g. each bar includes 8 transposon insertions). But for large regions that are without transposons, or for genes that do not have any insertions at all, I want to determine a maximum width of the bars.

I thought maybe I could set the maximum width such that the probability of finding a single transposon in the region spanned by the bar is greater than 99% (based on the average distance between insertions in the whole chromosome). For this I am using a Poisson distribution (P(x,k)=x^k e^(-x) / k! where x is the total number of insertions divided by the length of the chromosome and k is the actual number of events in the region spanned by the bar). To determine the length of the region that has a probability of 99% to have at least 1 insertion (i.e. 1 - the probability of having zero insertions in a region), I use the equation P(L)=1-e^(-xL) (this is derived from the Poisson distribution). If we want P(L)=0.99, we can solve this equation to L.

For example, for one chromosome I calculated x=31, so L than becomes 144bp. This mean that we need to have a bar width of 144bp to have a probability of 99% to include a transposon. But I am not sure if this mathematically sound. Does anyone have another (better) solution for this?

Gregory94 commented 4 years ago

random_interval_cumalative_plot This is a cumulative plot where I took 10000 runs where in each run I determined a random location in the chromosome and took a region calculated using using the Poisson distribution (i.e. 144bp long regions). Only 72% of the regions have 1 or more insertions.

leilaicruz commented 4 years ago

I am developing a bar plot for showing the amount of reads for each transposon insertion in a gene. I have a question about determining the width of the bars. Currently I made the bars so wide that they include a fixed amount of insertions (e.g. each bar includes 8 transposon insertions). But for large regions that are without transposons, or for genes that do not have any insertions at all, I want to determine a maximum width of the bars.

Question here: why not just to set the width of the bars as the length of the hit-free region per ORF? The width denote the transponson leaps if you like. For genes that do not have any insertion just leave that space empty , simple like this. This is an example I draw for you , example-of-the-bar-plot-for-greg-05052020

I thought maybe I could set the maximum width such that the probability of finding a single transposon in the region spanned by the bar is greater than 99% (based on the average distance between insertions in the whole chromosome).

Dont you have that data available? , I dont get why do you need to estimate a probability of finding a transposon, if you have already that information available

I recommend to go through the data analysis done by Segal et al in the paper: "Gene Essentiality Analyzed by In Vivo Transposon Mutagenesis and Machine Learning in a Stable Haploid Isolate of Candida albicans" where they look to a set of properties , that can guide you in the analysis ,like

(i) the total number of hits per ORF, (ii) the total number of sequence reads per ORF, and (iii) the total number of hits within 100 bp 5= to the ORF, as well as (iv) the ORF length, (v) the “neighborhood index” (the total number of hits per ORF normalized for the hits in surrounding intergenic sequences), and (vi) the longest hit-free region (normalized length of the largest ORF interval without hits)

Gregory94 commented 4 years ago

I dont get why do you need to estimate a probability of finding a transposon, if you have already that information available

I do this because I want to know the length in which the probability of finding a transposon is more than 99%. If I then encounter a region of two or more times this length without any insertions, we can say that this is probably not a coincidence, but something interesting that might indicate an essential region in the chromosome.

Question here: why not just to set the width of the bars as the length of the hit-free region per ORF? The width denote the transponson leaps if you like. For genes that do not have any insertion just leave that space empty

You mean determining the size of the largest region within an orf or gene that has no insertions and set that as the maximum width of the bars?

As an example, this the insertion plot for the HO locus. The little lines in the bar plot indicate the individual insertions. The maximum width of the bars is currently chosen by the average insertion frequency of the chromosome the gene is in times 8. But this I want to change to something else, which this question is about. HOlocus_TnInsertions

For comparison, this is a similar plot for Bem1, where the last part of the gene has hardly any reads. (The violinlot on the right is the number of basepairs between the insertions). bem1_TnInsertions

leilaicruz commented 4 years ago

I do this because I want to know the length in which the probability of finding a transposon is more than 99%. If I then encounter a region of two or more times this length without any insertions, we can say that this is probably not a coincidence, but something interesting that might indicate an essential region in the chromosome.

So, you want to find from the data , what is the length in bp such as the probability of finding NO or a single transposition is 99% , right? One think I can maybe suggest to do is to look into: Linear Regression or Logistic Regression models to fit your data, using the distribution of insertions length as the feature matrix and the number of transposon in each interval as the response variable. You can make it easier if you make your response variable binary ,lets say, 1 if NO or as single transposition is recorded or 0 if more are reported. In python you can use the sklearn package for that : https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html These are commonly used methods is you want to make inferences with your data.

You mean determining the size of the largest region within an orf or gene that has no insertions and set that as the maximum width of the bars?

No, I think the violinplot you showed in the right is what I had in mind . I guess you could plot the violin plot for one gene or chromosone separately where the x-axis indicate the basepair positions. So I guess, we wont need the left bar plot, you can encapsulate the information about transposition insertion distances vs where in the genome those transpositions events happen, just using the violinplot.

Just a detail : I dont get the differences in bar widths in your 1st plot

Gregory94 commented 4 years ago

So, you want to find from the data , what is the length in bp such as the probability of finding NO or a single transposition is 99%

I was actually thinking the other way around. So, what is the length in bp such as the probability of finding a single transposition or more is 99%. If you then get a region that has no transposon insertion in a region of more than this length, then this is most likely not a coincidence. I think that setting a probability of 99% of finding no insertions results in very small regions (like 1 or 2 bp regions or so). But maybe these regression models might be useful. I shall look into this.

I guess you could plot the violin plot for one gene or chromosone separately where the x-axis indicate the basepair positions.

You mean dividing up the chromosome in small regions and for each region determine the violinplot and compare this with the violiplot for a gene?

I dont get the differences in bar widths in your 1st plot

basically this is to indicate the density of transposons. Each bin cannot include more than 8 transposons. So if the bin is small, that measn that there many transposon insertions close to each other whereas widre bins indicate that there are transposons, but those well separated from each other. The reason you don't see different bar widths in the second plot is because not many insertions are close to each other. This is also why the median of the violinplot of Bem1 is much higher than that of the chromosome.

This also means that the bar plots might not be strictly necessary, like you mention. But I think it might be useful as a visualization to see where the transposons are to indicate potential essential domains in the gene.

leilaicruz commented 4 years ago

I was actually thinking the other way around. So, what is the length in bp such as the probability of finding a single transposition or more is 99%. If you then get a region that has no transposon insertion in a region of more than this length, then this is most likely not a coincidence.

Right, so lets start from the beginning, to get my head around :)

Primarily what we will get from satay sequencing will be transposition hits and reads per hit and their position in the genome.
Then we would like to quantify those transpositions events in order to classify some genes as prone to be essential or not.
- We are in this point now , right?
- so here , I will try to express what I understand from your thoughts : you want to calculate a kind of metric that will tell us , in each dataset (each strain), whether a gene can be regarded as likely essential/interesting to look into depending on the distances between one transposition hit and the next one. Here I have to questions: If that distance fall in the category (from your probabilistic method) where is very likely to have transpositions insertions there , but actually it was measured a single one ,what happen? or the opposite?
  - This is a kind of intermediate test to evaluate if an "attractive transposition profile" to detect essential genes is a coincidence or not?

You mean dividing up the chromosome in small regions and for each region determine the violinplot and compare this with the violiplot for a gene?

No, I meant more a violinplot per gene , more as a distribution of bp between insertions vs the position in the gene where those insertions fall into.

EKingma commented 4 years ago

I am developing a bar plot for showing the amount of reads for each transposon insertion in a gene. I have a question about determining the width of the bars. Currently I made the bars so wide that they include a fixed amount of insertions (e.g. each bar includes 8 transposon insertions). But for large regions that are without transposons, or for genes that do not have any insertions at all, I want to determine a maximum width of the bars.

I thought maybe I could set the maximum width such that the probability of finding a single transposon in the region spanned by the bar is greater than 99% (based on the average distance between insertions in the whole chromosome). For this I am using a Poisson distribution (P(x,k)=x^k e^(-x) / k! where x is the total number of insertions divided by the length of the chromosome and k is the actual number of events in the region spanned by the bar). To determine the length of the region that has a probability of 99% to have at least 1 insertion (i.e. 1 - the probability of having zero insertions in a region), I use the equation P(L)=1-e^(-xL) (this is derived from the Poisson distribution). If we want P(L)=0.99, we can solve this equation to L.

For example, for one chromosome I calculated x=31, so L than becomes 144bp. This mean that we need to have a bar width of 144bp to have a probability of 99% to include a transposon. But I am not sure if this mathematically sound. Does anyone have another (better) solution for this?

Hi guys, nice way to set up a forum! I had a question about this part Greg, because to me it looks like the x you refer to in the above equation is not the correct one. In your equation, the x should be the expectation value for the number of transposons in a stretch of DNA of length L right? Otherwise I think the k-events should also be expressed in number of transposon insertions per basepair. If you change the expectation value then the equation would become:

P(x,k)=(xL)^k e^(-x*L) / k!

If you use this equation and you would want P = 0.99, then you could solve for L, although from what I can quickly see there is no easy analytical solution for this, so you may have to solve it numerically...

Gregory94 commented 4 years ago

@leilaicruz Yes, what you explain is where we are at this point.

Here I have to questions: If that distance fall in the category (from your probabilistic method) where is very likely to have transpositions insertions there , but actually it was measured a single one ,what happen? or the opposite?

So, if we expect an insertion at a location based on the probability, but instead we find no insertions, than this might be an interesting location. If we do find (mulitple) insertions, than this is what we expect and this region is initially of less interest. For completeness we also have to check the amount of reads per insertion, because sometimes transposon insertions might be present essential regions, but they typically have only one or two reads.

@EKingma You might be right with your equation. But solving this analytically is indeed not trivial. But also numerically this is not always possible since it does not always have a solution. But maybe I need to think about another equation, since I have the feeling that this problem should not be this hard.

EKingma commented 4 years ago

@Gregory94 I had another look and I think we made another mistake there. I was a bit puzzled about the equation above (P = (xL)^k * e^(-xL) / k!) since in that equation the probability of having k-events starts to vanish as the size of the stretch of DNA increases. But actually, this makes complete sense since we use the probability and not the cumulative probability, so for larger chunks of DNA the chance of having just one transposon becomes smaller.

Now, what you could do is to use the cumulative probability function, which would mean you need to integrate the above equation over k from 1 to infinity.

However, to me this actually seems like the more complex approach, what I would actually think is easier to do is simply use the equation above and determine the value of L for which P(k=0|L) is smaller than 0.01. This is actually very easy to do and what you will get is in fact the equation you posted here at the very top of this issue (P(L)=1-e^(-x*L)). Strangely though,when I plug in the numbers I do not get the same value for L as you did, but other than that it seems like an OK approach to me.

Gregory94 commented 4 years ago

@EKingma You are right with the probability equation. The variable x should indeed be multiplied with the length L, also to make the units work out. That the numbers don't match makes sense, I made a mistake when I put the numbers in my initial comment. The variable x represents the number of insertions divided by the length of the genome, so this cannot be 31, but should be 1/31. Then when you plug in the numbers with P(L)=0.99, it should result in something like 143bp. So I should take 143bp regions to get, on average, a 99% chance of having at least 1 insertion in that region.