Closed jpalmer37 closed 4 years ago
Can you do a histogram instead of a density plot, without normalizing differences by length? Separate lengths as different plots.
How's this? There were 2 instances of insertion length 1 that I excluded. Aside from that, all other insertion lengths that were skipped had no counts in them.
Please use one bin per integer count, i.e., separate 0 and 1.
Is this better?
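For reference, a minimal sketch of one way to get one bin per integer with base R `hist()`: put the bin edges on half-integers. The `distances` vector is made-up example data, not values from this analysis.

```r
# Hypothetical example data: match distances (nt) for a handful of insertions.
distances <- c(0, 0, 1, 0, 2, 3, 1, 0, 5, 0)

# Bin edges on half-integers, so 0 and 1 (and every other integer)
# each fall into their own bin.
breaks <- seq(min(distances) - 0.5, max(distances) + 0.5, by = 1)

# plot = FALSE returns the counts without drawing; drop it to draw the figure.
h <- hist(distances, breaks = breaks, plot = FALSE)

# Raw counts per integer distance (no normalization by insertion length).
counts <- setNames(h$counts, h$mids)
print(counts)
```

To get a separate plot per insertion length, the same call could be applied to each element of `split(distances, lengths)`, where `lengths` is an assumed vector of insertion lengths.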
This figure shows the distances from an insertion site where a matching sequence (exact match, 1 nt off, 2 nt off) can be found, both in the 5' and 3' directions from the site.
The total number of retrieved insertions is 111, if you'd like to calculate the proportion of any frequencies shown above. Some proportions that might be of interest:
group | proportion |
---|---|
3', exact match, 0 nt after | 0.18 |
5', exact match, 0 nt after | 0.081 |
3', 1 nt off, 0 nt after | 0.31 |
5', 1 nt off, 0 nt after | 0.19 |
It seems that matches on the 3' side (downstream of the insertion) are more common than matches on the 5' side (upstream). When a fuzzy match of 1 nt is factored in, 3' matches directly at the insertion site make up almost a third (0.31) of all insertion cases.
And here is a rough breakdown of how the proportion of upstream and downstream matches varies with insertion length. For each of these length groupings, I determined the proportion of insertions that had an exact (no fuzzy) match within 3 nt of the insertion site. This also suggests that matching sequences are more commonly found downstream from an insertion.
length group (nt) | proportion containing 5' match | proportion containing 3' match |
---|---|---|
0-3 | 0.2692308 | 0.4615385 |
4-6 | 0.05263158 | 0.1842105 |
7-12 | 0.04761905 | 0.1904762 |
12+ | 0.07692308 | 0.1153846 |
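A rough sketch of how a breakdown like this can be computed with `cut()` and `aggregate()`. The data frame `ins` and its columns are hypothetical stand-ins, and the bin edges are an assumption: "12+" is treated here as strictly greater than 12 so that the groups do not overlap.

```r
# Hypothetical input: one row per insertion, with its length (nt) and
# whether an exact match was found within 3 nt on each side.
ins <- data.frame(
  len    = c(3, 3, 6, 9, 12, 15, 21, 3),
  match5 = c(TRUE, FALSE, FALSE, FALSE, FALSE, TRUE, FALSE, FALSE),
  match3 = c(TRUE, TRUE, FALSE, TRUE, FALSE, FALSE, FALSE, TRUE)
)

# Assumed bin edges: right-closed intervals (-Inf,3], (3,6], (6,12], (12,Inf).
ins$group <- cut(ins$len, breaks = c(-Inf, 3, 6, 12, Inf),
                 labels = c("0-3", "4-6", "7-12", "12+"))

# The mean of a logical vector is the proportion of TRUE values.
props <- aggregate(ins[, c("match5", "match3")],
                   by = list(group = ins$group), FUN = mean)
print(props)
```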
For some context, I have a function that performs my analyses on flanking insertion sequences:
`insCheck <- function(indel, pos, vseq, wobble, offset=0)`
Using different values for `wobble` (number of fuzzy matches) and `offset` provides me with different data in the following format:
before.bool before.offset before.diff before.seq after.bool after.offset after.diff after.seq
7494 FALSE NA NA TRUE 0 0 TTGGAATAGTAC
7641 FALSE NA NA FALSE NA NA
7746 FALSE NA NA FALSE NA NA
7871 TRUE 0 0 GAA TRUE 0 0 GAA
7975 FALSE NA NA TRUE 0 0 AGA
8016 FALSE NA NA FALSE NA NA
8017 FALSE NA NA FALSE NA NA
8136 TRUE 0 0 AGA TRUE 0 0 AGA
61000 FALSE NA NA FALSE NA NA
11100 FALSE NA NA FALSE NA NA
9010 FALSE NA NA FALSE NA NA
13110 FALSE NA NA FALSE NA NA
15010 FALSE NA NA FALSE NA NA
17110 FALSE NA NA FALSE NA NA
22610 TRUE 0 0 CTA FALSE NA NA
32210 FALSE NA NA TRUE 0 0 GGCAACTCTAGT
34110 FALSE NA NA FALSE NA NA
34510 TRUE 0 0 ATACGG FALSE NA NA
36010 FALSE NA NA FALSE NA NA
38510 FALSE NA NA TRUE 0 0 TACGGA
43510 FALSE NA NA FALSE NA NA
43710 FALSE NA NA TRUE 0 0 ACTCTA
51110 TRUE 0 0 AATGCTACTGCCAGC FALSE NA NA
51510 FALSE NA NA FALSE NA NA
52110 FALSE NA NA FALSE NA NA
52610 FALSE NA NA FALSE NA NA
53010 FALSE NA NA TRUE 3 0 ACG
53610 FALSE NA NA TRUE 0 0 TACCAATGCTAC
19110 FALSE NA NA FALSE NA NA
Where:
Just thought I'd show this to help us visualize it / in case you have suggestions.
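To make the idea concrete, here is a minimal sketch of a downstream (3') flank check. It is not the `insCheck` function above: `afterCheck`, `hamming`, `start`, and `max_offset` are hypothetical names, and it assumes `start` is the 1-based position of the insertion's first nucleotide within `vseq`.

```r
# Count mismatches between two equal-length strings.
hamming <- function(a, b) {
  sum(strsplit(a, "")[[1]] != strsplit(b, "")[[1]])
}

# Hypothetical helper: look for a copy of `indel` downstream of the
# insertion, allowing up to `wobble` mismatches and shifting the
# comparison window by 0..max_offset nt.
afterCheck <- function(indel, start, vseq, wobble = 0, max_offset = 0) {
  len <- nchar(indel)
  for (off in 0:max_offset) {
    from <- start + len + off
    to <- from + len - 1
    if (to > nchar(vseq)) break  # window runs off the end of the sequence
    cand <- substr(vseq, from, to)
    d <- hamming(indel, cand)
    if (d <= wobble) {
      return(list(bool = TRUE, offset = off, diff = d, seq = cand))
    }
  }
  list(bool = FALSE, offset = NA, diff = NA, seq = "")
}

# Toy sequence: "GAA" inserted at position 4, followed by another "GAA".
res <- afterCheck("GAA", start = 4, vseq = "TTTGAAGAACCC")
print(res)
```

An upstream (5') check would mirror this by sliding the window leftward from `start - 1`.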
To test whether match distances are distributed more narrowly than expected by chance, I ran a two-sample KS test comparing the observed match distances in insertions against a null distribution of match distances (randomly rearranged vloop sequences, sampled 1,000 times).
> ks.test(testDist, null_dist)
Two-sample Kolmogorov-Smirnov test
data: testDist and null_dist
D = 0.23496, p-value = 0.0002241
alternative hypothesis: two-sided
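A minimal sketch of how a shuffled-sequence null like this could be built and compared, under stated assumptions: `matchDist` and `shuffleSeq` are hypothetical helpers, the sequence, motif, and observed distances are toy values, and only exact downstream matches are considered.

```r
set.seed(1)  # reproducible shuffles

# Distance (nt) from `site` to the nearest exact copy of `motif` at or
# downstream of `site`; NA when no copy is found.
matchDist <- function(motif, vseq, site) {
  hit <- regexpr(motif, substring(vseq, site), fixed = TRUE)
  if (hit == -1) NA_integer_ else as.integer(hit) - 1L
}

# Randomly rearrange the nucleotides of a sequence.
shuffleSeq <- function(vseq) {
  paste(sample(strsplit(vseq, "")[[1]]), collapse = "")
}

# Toy inputs (hypothetical, not real vloop data).
vseq  <- "ACGTTGCAAGGCTTACGGATCCAGTACGATTGCA"
motif <- "ACG"
site  <- 5

# Null distribution: match distances in 1000 shuffled sequences.
null_dist <- replicate(1000, matchDist(motif, shuffleSeq(vseq), site))
null_dist <- null_dist[!is.na(null_dist)]

# Stand-in for the observed distances.
testDist <- c(0, 0, 0, 1, 0, 3, 0, 0, 2, 0)

# Integer distances contain ties, so ks.test() warns that the p-value is
# approximate; the warning is suppressed here only to keep the output tidy.
res <- suppressWarnings(ks.test(testDist, null_dist))
print(res$p.value)
```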
The observed insertion data doesn't have many observations, but I've been working to address this by getting Vlad's data in the analysis (almost complete, just working out bugs).
Could be useful to see this information:
Investigate the small number of sequences carrying insertions that are not multiples of 3 - check the Genbank records for annotations of these being defective viral particles due to frame shift - may also need to dig into the associated publications or translate the sequence.
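As a starting point, the sequences to look up can be pulled out with a modulo filter. This is only a sketch: the `ins` data frame, its column names, and the accession placeholders are hypothetical.

```r
# Hypothetical table of insertions: accession and inserted sequence.
ins <- data.frame(
  accession = c("acc1", "acc2", "acc3", "acc4"),
  indel     = c("GAA", "TTGGAATAGTAC", "AG", "ACTCTAG"),
  stringsAsFactors = FALSE
)

# Insertions whose length is not a multiple of 3 would shift the reading frame.
ins$len <- nchar(ins$indel)
frameshift <- ins[ins$len %% 3 != 0, ]
print(frameshift)
```

The accessions in `frameshift` are the ones whose Genbank records would then be checked by hand.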
This is a density curve showing the distribution of nucleotide differences (as proportions of insertion length) found in sequences flanking (before and after) insertion events.
It looks somewhat bimodal as you predicted. Is there a next step you would perform with this data?