heterozygosity structure section

KamilSJaron commented 5 years ago

This section is a real pain in my butt. I find it less and less trustworthy, but at the same time, it's one of the very novel sections that is at least trying something new.

The original idea was to

get the heterozygosity structure of everything, look at those known (Meloidogyne hybrids; rotifers etc) and guess something about those that are unknown

We first had a problem with estimates in rotifers, as they were very off the reality. This is solved now thanks to @reubwn. The second problem was with controversy regarding the genome structure of root-knot nematodes (one of the reviewers did not like that we took for granted the 'one diverged hybrid copy present'). That does not affect the visualization, but it makes the interpretation of the plot harder.

However, with a fixed scale for rotifers, the others became unreadable - all squashed together.

Then we figured, that one needs to focus on relative sizes between different heterozygosity topologies anyway, so we divided each bar by it's size, getting

First I thought it's perfect, showing clearly both structure in triploids and tetraploids, while the absolute values are already shown in Figure 2. But then I realized that even the relative counts are absolute value dependent. We made the argument, that more equidistant genomic copies are in triploids, higher fraction of ABC loci there will be. However, even this expectation is dependent on the absolute value. I.e. how low heterozygosity organism, there will be very small fraction of ABC loci even if the genomic copies will be equidistant, contrasting to highly heterozygous species where even small divergence between the two more closely related haplotypes cause a relatively high proportion of ABC loci. I made some naive calculations for our species and it really does matter.

So, back to the original plot. However, now to make it more readable for triploids, I made it a two-panel plot for triploid and tetraploids. What do you think?

Figure_3_tetraploid_heterozygosity_two_plots

Aesthetics can be polished, the question is if the different scale axes are confusing or not. To make it more intuitive we can scale triploids from 0 to 10, so they appear visually smaller or something like that. Just to improve intuitive reading of the plot.

This is linked to, I am not sure anymore how to phrase this section. AGAIN! Suggestions welcomed.

KamilSJaron commented 5 years ago

The version of the first figure including annotaations. I also sorted columns so they make more sense (primary sorting still by taxa, but secondary sorting by heterozygosity)

Figure_3_tetraploid_heterozygosity_v3

KamilSJaron commented 5 years ago

TLDR; It would be good to generate a meaningful naive expectation for fraction of ABC loci detected given completely equidistant triploid genome and the total measured heterozygosity. Any ideas @tbenavi1 ?

I managed to brainstorm my labmates on this issue. Basically, we boiled it down to expressing an expectation of triploid loci given the genomic structure. On case, two identical, one different haplotype is trivial (ABC = 0), the equidistant scenario is more difficult, as our heterozygosity lumps together bi and triallelic loci we could express it as AB - AB * AC + BC = est_heterozygosity, i.e. if A is the reference haplotype, then the heterozygosity is the distance of haplotype A to B plus distance of A to C, minus the random overlap of the two AB * AC. With that, we can relatively easily calculate the naive expecation for triallelic loci and we get something like:

Then is sort of easy to compare the measured ABC heterozygosities with expected ABC heterozygosities in the most extreme (equidistant) case, that would look like this

Everything over the line is "above the expecation" (i.e. MORE heterozygous than the MOST EXTREME case by chance). However, this does not be something completely wrong as the distribution of alleles is very non-random on the genome. However, the deviation from the expecation could be used as a measure of "equdistantnes of genomic copies" corrected for heterozygosity.

However, now I am scratching my head whether the way I think of one genomic copy is appropriate or not...

KamilSJaron commented 5 years ago

Ha, I think I got it moreless right. In the model I wrote above, I did not consider that the two divergent genomic copies to the reference genomic copy could share the veriant and therefore being again bi-allelic instead of triallelic, but otherwise I think it's correct.

I made superexplicit simulations just to be sane.

genome_size = 10000

generate_derived_haplotype <- function(ancestor, genome_size, mutaion_p){
    muts <- rpois(1, round(mutaion_p * genome_size))
    mut_locations <- round(runif(muts, 1, genome_size))
    derived_seq <- ancestor
    for (pos in mut_locations){
        derived_seq[pos] <- sample(c('A', 'C', 'T', 'G'), 1)
    }
    derived_seq
}

calculate_divergence <- function(mutaion_p){
    ancestor <- sample(c('A', 'C', 'T', 'G'), size = genome_size, replace = T)

    h1 <- generate_derived_haplotype(ancestor, genome_size, mutaion_p)
    h2 <- generate_derived_haplotype(ancestor, genome_size, mutaion_p)
    h3 <- generate_derived_haplotype(ancestor, genome_size, mutaion_p)

    AB_dis = sum(h1 != h2)
    BC_dis = sum(h2 != h3)
    AC_dis = sum(h1 != h3)

    triallelic = sum(h1 != h2 & h2 != h3 & h1 != h3)
    het = sum(((h1 != h2) + (h2 != h3) + (h1 != h3)) > 1)

    c(mutaion_p, c(het, triallelic, AB_dis, BC_dis, AC_dis) / genome_size)
}

make_replicates <- function(mutaion_p, replicates){
    results_mat <- sapply(rep(mutaion_p, replicates), calculate_divergence)
    apply(results_mat, 1, median)
}

muts_to_simulate <- c(seq(0.001,0.01, by = 0.002),
                      seq(0.01,0.04, by = 0.005),
                      seq(0.05,0.3, by = 0.05))
summary_per_mutation_rate <- as.data.frame(t(sapply(, make_replicates, 200)))
colnames(summary_per_mutation_rate) <- c('mutation_p', 'het', 'triallelic', 'AB_dis', 'BC_dis', 'AC_dis')

more_data <- as.data.frame(t(sapply(seq(0.01,0.04, by = 0.01), make_replicates, 200)))
colnames(more_data) <- c('mutation_p', 'het', 'triallelic', 'AB_dis', 'BC_dis', 'AC_dis')

summary_per_mutation_rate <- rbind(more_data, summary_per_mutation_rate)

more_data <- as.data.frame(t(sapply(, make_replicates, 200)))
colnames(more_data) <- c('mutation_p', 'het', 'triallelic', 'AB_dis', 'BC_dis', 'AC_dis')

plot(c(sapply(heter, get_half)^2) ~ heter, type = 'l', col = 'green', xlab = 'heterozygosity [%]', ylab = 'naively expected overlap [%]', xlim = c(0, 0.1))
points(summary_per_mutation_rate$het, summary_per_mutation_rate$triallelic, pch = 20)

ANd it looks like this. Now, the previous model is in green and the sims are the black dots. The match is very reasonable, I guess the deviation is the consideration of mutation to the same position.

Anyway, so the expecation probably can be used.

KamilSJaron commented 5 years ago

Alright.

I phrased it WITHOUT the expected fraction of triallelic loci in the main text and added seventh supplementary material (S7). THe supplementary Figure 8 will get polished if we will agree to include this section too.

One important change is that AAB, ABC etc is used only to label genome structures, not individual locus topologies as in the previous versions, I refer to them as biallelic and triallelic topologies. I also used consistently genome copies instead of haplotypes (or homoeologs latter in the text about bdelloid possible hybrid origin).

Please, let me know if this makes sense to you.

KamilSJaron commented 5 years ago

I would hope that the section is actually quite readable now. If we will have to change something again, I will reopen this issue.

KamilSJaron / genomic-features-of-parthenogenetic-animals

heterozygosity structure section #25