Flanking insertion sequences

jpalmer37 commented 5 years ago

This is a density curve showing the distribution of nucleotide differences (as proportions of insertion length) found in sequences flanking (before and after) insertion events.

[Figure: flanking-ins]

It looks somewhat bimodal as you predicted. Is there a next step you would perform with this data?

ArtPoon commented 5 years ago

Can you do a histogram instead of a density plot, without normalizing differences by length? Separate lengths as different plots.

jpalmer37 commented 5 years ago

How's this? There were 2 instances of insertion length 1 that I excluded. Aside from that, all other insertion lengths that were skipped had no counts in them.

[Figure: flanking-ins]

ArtPoon commented 5 years ago

Please use one bin per integer count, i.e., separate 0 and 1.
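
One way to get one bin per integer count in R is to tabulate the differences as a factor with explicit integer levels, so that 0 and 1 each get their own bar and empty counts still show up. This is only a sketch: `n_diff`, `ins_len` and `count_by_length` are hypothetical names, and the values are made up rather than taken from this analysis.

```r
# Hypothetical example data: number of nucleotide differences in the flanking
# sequence and the insertion length (nt) for each insertion event.
n_diff  <- c(0, 1, 1, 2, 0, 3, 0, 0, 1, 4)
ins_len <- c(3, 3, 3, 3, 6, 6, 6, 9, 9, 9)

# Tabulate differences separately for each insertion length. A factor with
# levels 0..max keeps one bin per integer, including empty bins.
count_by_length <- function(n_diff, ins_len) {
  bins <- 0:max(n_diff)
  lapply(split(n_diff, ins_len), function(d) table(factor(d, levels = bins)))
}

counts <- count_by_length(n_diff, ins_len)

# One bar plot per insertion length, side by side.
op <- par(mfrow = c(1, length(counts)))
for (len in names(counts)) {
  barplot(counts[[len]], main = paste("length", len),
          xlab = "nucleotide differences", ylab = "count")
}
par(op)
```

Using `barplot` on a table avoids the bin-edge choices that `hist` makes for integer data.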

jpalmer37 commented 5 years ago

Is this better? [Figure: flanking-ins-barplot]

ArtPoon commented 5 years ago

- How often is the closest match on the left (5') versus the right (3')?
- How does the above proportion vary with insertion length? With similarity of match?
- Search for the closest match within offsets 0 to +/-N nucleotides.
- Develop a model of insertion by replication slippage (RT):
  - estimate parameters of a slip distribution (centered at offset 0)
  - parameterize length distribution of insertion
  - search the literature for similar model
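
The offset search in the list above could be sketched roughly as follows. This is not code from the vindels repository: `hamming`, `closest_match`, `side` and `max_off` are hypothetical names, and the sketch assumes the insertion's start position within the full sequence is already known and that a match is a window of the same length as the insertion.

```r
# Hamming distance between two equal-length strings.
hamming <- function(a, b) {
  sum(strsplit(a, "")[[1]] != strsplit(b, "")[[1]])
}

# Hypothetical helper: find the closest copy of the inserted sequence on one
# side of the insertion.
#   vseq     full sequence containing the insertion
#   pos      1-based start position of the insertion within vseq
#   indel    inserted sequence
#   side     "5p" (upstream) or "3p" (downstream)
#   wobble   maximum number of mismatches allowed
#   max_off  largest offset (nt) from the insertion to search
closest_match <- function(vseq, pos, indel, side, wobble = 0, max_off = 3) {
  len <- nchar(indel)
  for (off in 0:max_off) {
    if (side == "5p") {
      start <- pos - len - off   # window ends `off` nt before the insertion
    } else {
      start <- pos + len + off   # window starts `off` nt after the insertion
    }
    end <- start + len - 1
    if (start < 1 || end > nchar(vseq)) break  # window runs off the sequence
    window <- substr(vseq, start, end)
    d <- hamming(window, indel)
    if (d <= wobble) return(list(offset = off, diff = d, seq = window))
  }
  NULL  # no match within max_off
}

# Made-up sequence with "GAA" inserted at position 8, right after a "GAA".
vseq <- "ACGTGAAGAACTTGG"
up   <- closest_match(vseq, 8, "GAA", "5p")
down <- closest_match(vseq, 8, "GAA", "3p")
```

Because the loop starts at offset 0 and returns on the first hit, the result is always the nearest match on that side, which is the quantity a slip distribution centered at offset 0 would be fitted to.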

jpalmer37 commented 5 years ago

This figure shows the distances from an insertion site where a matching sequence (exact match, 1 nt off, 2 nt off) can be found, both in the 5' and 3' directions from the site.

[Figure: insslip-dist]

A total of 111 insertions were retrieved, in case you'd like to convert any of the frequencies shown above into proportions. Some proportions that might be of interest:

direction	match type	distance from insertion site (nt)	proportion
3'	exact match	0	0.18
5'	exact match	0	0.081
3'	1 nt off	0	0.31
5'	1 nt off	0	0.19

It seems that matches downstream (3') of the insertion are more common than matches upstream (5'). When allowing a fuzzy match of 1 nt, adjacent downstream matches make up almost a third (0.31) of all insertion cases.

jpalmer37 commented 5 years ago

And here is a rough breakdown of how the proportion of upstream and downstream matches varies with insertion length. For each of these length groupings, I determined the proportion of insertions that had an exact (no fuzzy) match within 3 nt of the insertion site. This also suggests that matching sequences are more commonly found downstream from an insertion.

length group (nt)	proportion containing 5' match	proportion containing 3' match
0-3	0.2692308	0.4615385
4-6	0.05263158	0.1842105
7-12	0.04761905	0.1904762
12+	0.07692308	0.1153846
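
A breakdown like the table above can be produced by binning insertion lengths with `cut` and averaging a logical match indicator per bin. The sketch below uses made-up data and hypothetical column names (`len`, `match5`, `match3`). The bin edges are also an assumption: they are written so that the groups do not overlap, which may differ from the grouping used for the table.

```r
# Hypothetical per-insertion results: insertion length (nt) and whether an
# exact match was found within 3 nt on the 5' and 3' sides.
res <- data.frame(
  len    = c(3, 3, 6, 6, 9, 12, 15, 21),
  match5 = c(TRUE, FALSE, FALSE, FALSE, FALSE, FALSE, TRUE, FALSE),
  match3 = c(TRUE, TRUE, FALSE, TRUE, FALSE, TRUE, FALSE, FALSE)
)

# Non-overlapping length bins: (0,3], (3,6], (6,12], (12,Inf).
res$group <- cut(res$len, breaks = c(0, 3, 6, 12, Inf),
                 labels = c("1-3", "4-6", "7-12", "13+"))

# The mean of a logical vector is the proportion of TRUE values.
props <- data.frame(
  group = levels(res$group),
  prop5 = as.vector(tapply(res$match5, res$group, mean)),
  prop3 = as.vector(tapply(res$match3, res$group, mean))
)
print(props)
```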

jpalmer37 commented 5 years ago

For some context, I have a function that performs my analyses on flanking insertion sequences:

insCheck <- function(indel, pos, vseq, wobble, offset=0)

Using different values for wobble (number of fuzzy matches) and offset provides me with different data in the following format:

       before.bool before.offset before.diff      before.seq after.bool after.offset after.diff          after.seq
7494         FALSE            NA          NA                       TRUE            0          0       TTGGAATAGTAC
7641         FALSE            NA          NA                      FALSE           NA         NA                   
7746         FALSE            NA          NA                      FALSE           NA         NA                   
7871          TRUE             0           0             GAA       TRUE            0          0                GAA
7975         FALSE            NA          NA                       TRUE            0          0                AGA
8016         FALSE            NA          NA                      FALSE           NA         NA                   
8017         FALSE            NA          NA                      FALSE           NA         NA                   
8136          TRUE             0           0             AGA       TRUE            0          0                AGA
61000        FALSE            NA          NA                      FALSE           NA         NA                   
11100        FALSE            NA          NA                      FALSE           NA         NA                   
9010         FALSE            NA          NA                      FALSE           NA         NA                   
13110        FALSE            NA          NA                      FALSE           NA         NA                   
15010        FALSE            NA          NA                      FALSE           NA         NA                   
17110        FALSE            NA          NA                      FALSE           NA         NA                   
22610         TRUE             0           0             CTA      FALSE           NA         NA                   
32210        FALSE            NA          NA                       TRUE            0          0       GGCAACTCTAGT
34110        FALSE            NA          NA                      FALSE           NA         NA                   
34510         TRUE             0           0          ATACGG      FALSE           NA         NA                   
36010        FALSE            NA          NA                      FALSE           NA         NA                   
38510        FALSE            NA          NA                       TRUE            0          0             TACGGA
43510        FALSE            NA          NA                      FALSE           NA         NA                   
43710        FALSE            NA          NA                       TRUE            0          0             ACTCTA
51110         TRUE             0           0 AATGCTACTGCCAGC      FALSE           NA         NA                   
51510        FALSE            NA          NA                      FALSE           NA         NA                   
52110        FALSE            NA          NA                      FALSE           NA         NA                   
52610        FALSE            NA          NA                      FALSE           NA         NA                   
53010        FALSE            NA          NA                       TRUE            3          0                ACG
53610        FALSE            NA          NA                       TRUE            0          0       TACCAATGCTAC
19110        FALSE            NA          NA                      FALSE           NA         NA

Where:

- bool indicates presence of a match
- offset indicates how far the match was away from the insertion
- diff indicates how many nucleotides the match differed by
- seq is the retrieved sequence of the match

Just thought I'd show this to help us visualize it / in case you have suggestions.

ArtPoon commented 5 years ago

- Is the distribution of matches (fuzzy or exact) significantly narrower than we expect by chance?
- When the match is offset by 1 or more (not adjacent), is the match upstream (5') or downstream (3') of the insertion more often than we expect by chance? Does this vary with insertion length?
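
One possible way to approach the directional question is a binomial (sign) test on insertions that have a match on exactly one side. The sketch below assumes a null hypothesis of no directional preference, so each one-sided match is equally likely to be 5' or 3'. That null is an assumption made for illustration, and the data are made up. The columns follow the `before.bool` / `after.bool` layout shown earlier in the thread.

```r
# Hypothetical results: TRUE when a match was found on that side.
res <- data.frame(
  before.bool = c(FALSE, TRUE, FALSE, FALSE, TRUE, FALSE, FALSE, FALSE, TRUE, FALSE),
  after.bool  = c(TRUE, TRUE, TRUE, FALSE, FALSE, TRUE, TRUE, TRUE, TRUE, TRUE)
)

# Keep insertions with a match on exactly one side; cases with matches on
# both sides or on neither carry no information about direction.
only5 <- sum(res$before.bool & !res$after.bool)
only3 <- sum(!res$before.bool & res$after.bool)

# Two-sided exact binomial test against a 50:50 split.
bt <- binom.test(only3, only5 + only3, p = 0.5)
print(bt$p.value)
```

To see whether the preference varies with insertion length, the same test could be repeated within each length group, bearing in mind that the counts per group will be small.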

jpalmer37 commented 5 years ago

To test whether matches have a significantly narrower distribution than expected by chance, I performed a two-sample Kolmogorov-Smirnov (KS) test comparing the match distances observed for insertions against a null distribution (match distances in randomly rearranged vloop sequences, sampled 1000 times).

> ks.test(testDist, null_dist)

    Two-sample Kolmogorov-Smirnov test

data:  testDist and null_dist
D = 0.23496, p-value = 0.0002241
alternative hypothesis: two-sided

[Figures: null-distribution, exact-match-distribution]

The observed insertion data set is small, but I've been working to address this by adding Vlad's data to the analysis (almost complete, just working out bugs).
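
For reference, a shuffle-based null of this kind could be built along the following lines. This is only an illustration and not the code used for the test above: `match_offset` and `shuffle_seq` are hypothetical helpers, the flanks and insertions are made up, and only exact downstream matches are considered.

```r
set.seed(1)  # reproducible shuffles

# Offset (nt) of the nearest exact copy of `indel` in `flank`, where 0 means
# directly adjacent to the insertion; NA when there is no copy.
match_offset <- function(flank, indel) {
  hit <- regexpr(indel, flank, fixed = TRUE)
  if (hit == -1) NA_integer_ else as.integer(hit) - 1L
}

# Randomly rearrange the nucleotides of a sequence.
shuffle_seq <- function(s) paste(sample(strsplit(s, "")[[1]]), collapse = "")

# Hypothetical downstream flanks and their inserted sequences.
flanks <- c("GAATTCAGGAACCT", "ACTGAAGGTACTCA", "TTGAAACTGGAACC")
indels <- c("GAA", "ACT", "GAA")

observed <- unname(mapply(match_offset, flanks, indels))

# Null: offsets found after shuffling each flank, repeated 1000 times.
null_dist <- unlist(lapply(seq_along(flanks), function(i) {
  replicate(1000, match_offset(shuffle_seq(flanks[i]), indels[i]))
}))
null_dist <- null_dist[!is.na(null_dist)]  # drop shuffles with no match

ks <- ks.test(observed, null_dist)
print(ks$p.value)
```

The two-sided test only asks whether the two distributions differ. If the question is specifically whether observed matches sit closer to the insertion than under the null, `ks.test(observed, null_dist, alternative = "greater")` may be a closer fit, since that alternative corresponds to the first sample's CDF lying above the second's.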

jpalmer37 commented 5 years ago

This could be useful to see: [Figure: insertion-lengths]

ArtPoon commented 5 years ago

Investigate the small number of sequences carrying insertions whose lengths are not multiples of 3. Check the GenBank records for annotations of these being defective viral particles due to a frameshift. You may also need to dig into the associated publications or translate the sequence.
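
Picking out those insertions is a one-line check on length modulo 3. In the sketch below the labels and sequences are made up, and `ins` and `frameshift` are hypothetical names.

```r
# Hypothetical inserted sequences, named by made-up labels.
ins <- c(seqA = "GAA", seqB = "TACGGA", seqC = "AG",
         seqD = "AATGCTACTGCCAGC", seqE = "ACGT")

# A length that is not divisible by 3 shifts the reading frame.
frameshift <- nchar(ins) %% 3 != 0

# Labels of the insertions to look up.
to_check <- names(ins)[frameshift]
print(to_check)
```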

PoonLab / vindels

Flanking insertion sequences #75