How to interpret SATAY data in order to have meaningful information from it?

Wteunisse commented 4 years ago

Some additional comments on how to interpret the data were made in the meeting with Werner.

There is a sequencing bias in the number of reads. We probably cannot correct for this, but it means the sequencing may add an extra layer of variance to the read counts.
Also, during sequencing there is a chance that a transposon is not observed at all. I don't think I fully understand this problem yet, but Werner suggested that we should look into a 'negative binomial distribution'. This is because we only know the observed number of transposons, which might not be equal to the actual number of transposons.

wdaalman commented 4 years ago

To help you further along: the cells that actually carry a transposon are converted into reads, each with some probability, so the number of reads follows a binomial distribution. We have the inverse problem: if you know the number of reads and you knew the probability that a cell with a transposon turns into a read, the actual number of cells with a transposon (including the unobserved ones) would follow a negative binomial distribution.
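As a minimal sketch of this inversion (not the script discussed in this thread): assume a flat prior on the true count, which is an assumption added here and not stated above. Then, given `r` reads and a known probability `p`, the number of unobserved cells follows a negative binomial distribution with `r + 1` successes. The names `binom_pmf`, `nbinom_pmf` and `posterior_true_count`, and the values of `r` and `p`, are hypothetical and chosen only for illustration.

```python
from math import comb

def binom_pmf(r, n, p):
    """P(R = r | N = n): each of n transposon-carrying cells yields a read with probability p."""
    return comb(n, r) * p**r * (1 - p)**(n - r)

def nbinom_pmf(k, successes, p):
    """P(K = k): probability of k failures before the given number of successes."""
    return comb(k + successes - 1, successes - 1) * p**successes * (1 - p)**k

def posterior_true_count(r, p, n_max):
    """P(N = n | R = r) for n = r..n_max, assuming a flat prior on N (assumption)."""
    weights = {n: binom_pmf(r, n, p) for n in range(r, n_max + 1)}
    total = sum(weights.values())
    return {n: w / total for n, w in weights.items()}

# Hypothetical example values: 12 observed reads, read-out probability 0.3.
r, p = 12, 0.3
post = posterior_true_count(r, p, n_max=400)

# The numerically normalized posterior agrees with the closed-form negative binomial,
# where the "failures" are the n - r unobserved cells.
for n in (r, 30, 40, 60):
    print(n, post[n], nbinom_pmf(n - r, r + 1, p))

# Posterior mean of the true count versus the negative binomial mean.
expected_true_count = sum(n * w for n, w in post.items())
print(expected_true_count, r + (r + 1) * (1 - p) / p)
```

The truncation at `n_max` is only there to make the numerical normalization finite; it should be chosen large enough that the remaining tail is negligible.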

However, we cannot easily use the negative binomial distribution to invert reads to actual transposons because, as Wessel mentioned today, we don't know that probability. So I thought you could try something else, namely finding the best-fitting binomial distribution. Unfortunately, Matlab's mle wants to have the probability parameter fixed, so I wrote a small script in Matlab that uses the generalized method of moments instead to fit simulated read data. This works reasonably well (run Reads_transposon_conversion_simulation_v2.m in the zip file). Reads transposon conversion v2.zip
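For readers without Matlab, the idea of fitting a binomial when both parameters are unknown can be sketched in Python. This is not the attached script: it uses the plain method of moments (matching the sample mean and variance) rather than the generalized method of moments, and the function name `fit_binomial_moments` and the simulated values are hypothetical.

```python
import random
from statistics import mean, pvariance

def fit_binomial_moments(reads):
    """Estimate (n, p) of a binomial from read counts by matching mean and variance.

    For Binomial(n, p): mean = n*p and variance = n*p*(1-p),
    so p = 1 - variance/mean and n = mean/p.
    """
    m = mean(reads)
    v = pvariance(reads, m)
    if m <= 0 or v >= m:
        # A binomial always has variance < mean; otherwise the moment equations have no valid solution.
        raise ValueError("counts are not under-dispersed; a binomial moment fit does not apply")
    p_hat = 1 - v / m
    n_hat = m / p_hat
    return n_hat, p_hat

# Simulated read data (hypothetical values, not taken from any SATAY dataset).
rng = random.Random(0)
true_n, true_p = 40, 0.3
reads = [sum(rng.random() < true_p for _ in range(true_n)) for _ in range(20000)]

n_hat, p_hat = fit_binomial_moments(reads)
print(f"n_hat = {n_hat:.1f}, p_hat = {p_hat:.3f}")
```

The function raises a `ValueError` when the sample variance is not smaller than the mean, since such counts cannot be described by a single binomial in this simple way.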

Two caveats: (Updated to v2 to resolve the first caveat: ~~1) we do not know whether regions have no reads because they are unlucky in read-out or because they are very unfit. This gives a bias.~~) 2) In constructing a read distribution across the DNA, including Wessel's normalization, we have not corrected for fitness bias. So an idea could be to first do this only for non-coding regions, get the probability parameter estimate, and then use that on the real genes to invert reads to transposons there using the negative binomial distribution.

Wteunisse commented 4 years ago

Very interesting, I will look into it! One thought I had about the probability is that we might be able to estimate the total number of cells during the SATAY experiment. I think Benoît also mentions a number in his paper; from this we would know how many transpositions have taken place. So maybe we can get a good estimate of the probability of actually reading a transposition.

wdaalman commented 4 years ago

That sounds good; it would be reassuring to see whether there is a reasonable match with the fitted estimate. Should you find that the probability is rather low, this implies that noise will be high (intuitively, if almost every transposon becomes a read, there is almost no noise). In that case, to distinguish noise from fitness effects of the transposon, you could think of increasing the duration of the growth phase to accentuate fitness effects.
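The intuition about noise can be made explicit under the binomial read-out model described earlier in this thread. As an illustrative assumption, let $N$ cells carry the transposon and let each become a read independently with probability $p$, giving $R$ reads (the symbols $N$, $p$ and $R$ are introduced here and are not from the original discussion):

$$
\mathbb{E}[R] = Np, \qquad \operatorname{Var}(R) = Np(1-p), \qquad \mathrm{CV}(R) = \frac{\sqrt{\operatorname{Var}(R)}}{\mathbb{E}[R]} = \sqrt{\frac{1-p}{Np}}
$$

At fixed $N$, the relative noise $\mathrm{CV}(R)$ goes to zero as $p \to 1$ and grows without bound as $p \to 0$.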

Gregory94 commented 4 years ago

I saw this paper that discusses normalization using various statistical approaches, for example the negative binomial distribution. Maybe it is useful.

leilaicruz commented 4 years ago

> I saw this paper that discusses normalization using various statistical approaches, for example the negative binomial distribution. Maybe it is useful.

Were you able to download the paper? I could not ...

Gregory94 commented 4 years ago

> > I saw this paper that discusses normalization using various statistical approaches, for example the negative binomial distribution. Maybe it is useful.
>
> Were you able to download the paper? I could not ...

Dejesus2016_NORMALIZATION OF TRANSPOSON-MUTANT LIBRARY SEQUENCING DATASETS TO IMPROVE IDENTIFICATION OF CONDITIONALLY ESSENTIAL GENES.pdf

leilaicruz commented 4 years ago

Interesting that both papers, "NORMALIZATION OF TRANSPOSON-MUTANT LIBRARY SEQUENCING DATASETS TO IMPROVE IDENTIFICATION OF CONDITIONALLY ESSENTIAL GENES" and "Statistical analysis of genetic interactions in Tn-Seq data", are by the same author: Michael A. DeJesus from the Department of Computer Science, Texas A&M University.

leilaicruz commented 4 years ago

@Gregory94 you should watch and take a look at the repo from the same author (Michael A. DeJesus): https://github.com/mad-lab/tools It seems very useful ....

Gregory94 commented 4 years ago

> @Gregory94 you should watch and take a look at the repo from the same author (Michael A. DeJesus): https://github.com/mad-lab/tools It seems very useful ....

Yes, indeed. But I think many of the tools they created are optimized for their experimental setup, which is different from ours. We should think about whether we want to use an experimental approach similar to theirs, or adapt their tools to our approach.

leilaicruz commented 4 years ago

Yes, they are optimized for the type of data they get and for the way they want to analyze those datasets. However, they can still be useful in terms of how they implemented things, and some parts of the statistical analyses could be adapted from their use case to ours. The code looks very organized at first glance, and in general it is always of great benefit to have good examples of well-organized and well-structured code from which we can learn, build and collaborate.

SATAY-LL / LaanLab-SATAY-DataAnalysis

How to interpret SATAY data in order to have meaningful information from it? #27