three or more apparent haplotypes at repeats

ekg commented 8 years ago

A lot of our errors look like this:

But when we go to tview, we see the problem: the reads match the reference, but only when they don't fully overlap the locus.

There are a few possible solutions to this.

Detect the repeated sequences and require full overlap (similar to freebayes).

Include the alignment start and end coordinates as a feature.

@nikete thoughts?

nikete commented 8 years ago

A mix of the two might be good; I don't see how the coordinates would map linearly without exhaustively enumerating that space in training, which seems unlikely given the sub-1% errors.

Detecting the repeated seq and requiring full overlap seems like it would waste a lot of knowledge, right?

ekg commented 8 years ago

The coordinates would be relative to the window, or maybe to any underlying repeats.

The learner can't seem to figure out that there is a repeat and the sequence is the same as that in the reads.

We could add a feature which was the length beyond a repeat at which the read starts and ends.
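A minimal sketch of what such a feature could look like, assuming half-open coordinates and that the repeat boundaries are already known. The function names and the `min_flank` parameter are hypothetical and are not taken from hhga:

```python
def repeat_overhang(read_start, read_end, repeat_start, repeat_end):
    """Return (left, right): how far the read extends past each repeat boundary.

    All coordinates are half-open [start, end) on the reference.
    A positive value means the read reaches that many bases beyond the
    repeat on that side; zero or negative means the read starts or ends
    at the repeat boundary or inside the repeat.
    """
    left = repeat_start - read_start
    right = read_end - repeat_end
    return left, right


def fully_spans(read_start, read_end, repeat_start, repeat_end, min_flank=1):
    """True if the read covers the repeat plus at least min_flank bases on both sides."""
    left, right = repeat_overhang(read_start, read_end, repeat_start, repeat_end)
    return left >= min_flank and right >= min_flank
```

The two overhang values could be emitted as numeric features per read, and the boolean could be used to mark reads that belong in an "incomplete" group.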

ekg commented 8 years ago

By the way, we do retain information from the alignments to the graph, so we're not necessarily throwing these all out. We might just want to mark which reads don't completely overlap the locus. Maybe we put them in the incomplete pile. And we should be exposing the repeat structures in some way. Otherwise it would seem to have no mechanism to learn them.

nikete commented 8 years ago

Even with relative coordinates, wouldn't they have to be quadratic in the reference for a linear learner to express this?

Marking the reads that don't overlap the locus seems like it is asking a lot of the linear learner. In particular, here those are the ones with all the info, and in other cases it is the opposite; I'm afraid it would average out.

Exposing the repeat structure seems important, agreed.

ekg commented 8 years ago

Another example:

In tview it's clear that the reads supporting the reference aren't fully overlapping the repeat.

ekg commented 8 years ago

In freebayes we can exclude these. In fact, although that isn't the default, there is discussion and support from folks like @chapmanb for doing so by default, as it improves performance. The --no-partial-observations and --min-repeat-entropy options are used to change this behavior.

@nikete to explain: freebayes decides on a haplotype window over which it infers the genotype(s) for the samples in the analysis. In cases where there is an exact repeat or the sequence at the locus is a short repeat followed by low-complexity sequence, we use a haplotype window long enough to reach one shannon per base (--min-repeat-entropy 1), and exclude any reads that only partially overlap the resulting window (--no-partial-observations). These are rather difficult to use correctly, but many people would be interested if we can figure out a nice way to do so.
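To make the idea concrete, here is a rough sketch of the two steps described above. It is not freebayes' actual implementation: it assumes entropy is measured over the single-base composition of the window, it only extends the window to the right, and the function names are made up for illustration:

```python
import math
from collections import Counter


def shannon_entropy(seq):
    """Shannon entropy (in bits) of the base composition of seq."""
    counts = Counter(seq)
    n = len(seq)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def extend_window(ref, start, end, min_entropy=1.0):
    """Grow the half-open window [start, end) to the right, one base at a
    time, until its entropy reaches min_entropy or the reference ends."""
    while end < len(ref) and shannon_entropy(ref[start:end]) < min_entropy:
        end += 1
    return start, end


def full_observations(reads, start, end):
    """Keep only reads, given as (read_start, read_end) half-open intervals,
    that cover the whole window; partial overlaps are dropped."""
    return [r for r in reads if r[0] <= start and r[1] >= end]
```

For a homopolymer run, the window keeps growing until enough flanking sequence is included for the composition to pass the threshold, and reads that end inside that window are then excluded rather than counted as reference support.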

The graph feature should be capturing even the stuff that doesn't fully overlap the locus.

ekg commented 8 years ago

In the last case the graph feature doesn't help us because we've inappropriately broken the site into two. That's another problem that I find a bit confusing... I thought I'd resolved this as well but apparently not enough.

ekg commented 8 years ago

Calling reference isn't the right thing to do here. In these examples we have non-reference genotypes.

nikete commented 8 years ago

Just as a note for the future: this will work well on 50X stuff, but for MinION or low-coverage methods it might be better not to take them. To center, we could make the middle window the high-entropy region.

ekg / hhga

three or more apparent haplotypes at repeats #29