collaborativebioinformatics / STRdust

MIT License

Find heuristics to pre-filter insertions that we don't want spoa to spend too much time on #33

Open wdecoster opened 2 years ago

wdecoster commented 2 years ago

For example, based on length and/or repetitiveness.

wdecoster commented 2 years ago

Currently, if any of the insertions is longer than 7500 bp, the expansion is simply discarded. That is not desirable. A less bad solution would be to pick just one of the insertions and use that as the sequence. It would not be polished by the consensus step, but at least it would not be lost.
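
A minimal sketch of that fallback, in Python for illustration only. The function name, the `consensus_fn` callable, and the choice of the median-length insertion as the one to keep are all assumptions, not how STRdust currently does it.

```python
MAX_INSERTION_LEN = 7500  # threshold quoted above


def consensus_or_fallback(insertions, consensus_fn):
    """Return a consensus, or one raw insertion if any insertion is too long.

    `consensus_fn` is a hypothetical stand-in for the consensus step (e.g. spoa).
    """
    if not insertions:
        return None
    if any(len(seq) > MAX_INSERTION_LEN for seq in insertions):
        # Assumption: keep the median-length insertion, unpolished,
        # instead of discarding the whole expansion.
        ranked = sorted(insertions, key=len)
        return ranked[len(ranked) // 2]
    return consensus_fn(insertions)
```

Which insertion to keep is an open choice; the median-length one is used here only because it is unlikely to be the extreme read.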

PavelAvdeyev commented 2 years ago

I agree that the current strategy is bad.

I think we should develop some sort of classic outlier detection algorithm. When I was debugging the script, I observed that many of the insertions have very similar lengths. That is, in some sense, expected, and it makes the picture simpler than for short reads. So we could calculate the mean length and then disregard the insertions that are much longer. We would never disregard shorter ones, since they can be produced from soft-clipped sequence.
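
A rough sketch of such a one-sided length filter, in Python for illustration. Note one deliberate swap: it uses the median and the median absolute deviation (MAD) rather than the mean, because a single very long insertion among a handful of reads pulls the mean and standard deviation up enough to hide itself. The function name and the parameters `k` and `min_slack` are assumptions that would need tuning on real data.

```python
from statistics import median


def drop_long_outliers(insertions, k=5.0, min_slack=50):
    """Drop insertions that are much longer than the typical length.

    Shorter insertions are never dropped, since they can come from
    soft-clipped reads that only partly span the expansion.
    """
    if not insertions:
        return []
    lengths = [len(seq) for seq in insertions]
    med = median(lengths)
    mad = median([abs(n - med) for n in lengths])
    # min_slack keeps the cutoff sensible when all lengths are nearly identical
    upper = med + max(k * mad, min_slack)
    return [seq for seq in insertions if len(seq) <= upper]
```

With lengths of 100, 105, 98, 102 and 5000 bp, the cutoff lands at 152 bp, so only the 5000 bp insertion is removed.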

Overall, this is a very interesting question, since we use MSA in the later step. If I understand everything correctly, it would in some sense always report a consensus sequence of maximal length. So it is crucial to filter on insertion length here. From some perspective, we are doing genotyping at this step; later, we just find a sequence.

Some additional ideas: Calculate the most common substring for the set of insertions. This gives us a rough estimate of the motif length. After that, we can use the interval [0, mean length + k * rough motif length) to filter out anything too long.
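
The interval idea could be sketched roughly as follows, in Python for illustration. "Most common substring" is not pinned down above, so this sketch swaps in a simpler proxy for the motif length: the smallest shift at which a sequence mostly matches itself. All function names and the parameters `max_period`, `min_identity` and `k` are assumptions.

```python
from collections import Counter
from statistics import mean


def rough_motif_length(seq, max_period=20, min_identity=0.8):
    """Smallest shift at which seq mostly matches itself, or None."""
    for period in range(1, min(max_period, len(seq) - 1) + 1):
        matches = sum(a == b for a, b in zip(seq, seq[period:]))
        if matches / (len(seq) - period) >= min_identity:
            return period
    return None


def filter_by_motif_interval(insertions, k=10):
    """Keep insertions whose length lies in [0, mean length + k * motif length)."""
    if not insertions:
        return []
    periods = [p for p in map(rough_motif_length, insertions) if p is not None]
    if not periods:
        return list(insertions)  # no repeat structure found, so filter nothing
    motif_len = Counter(periods).most_common(1)[0][0]
    upper = mean([len(seq) for seq in insertions]) + k * motif_len
    return [seq for seq in insertions if len(seq) < upper]
```

Because the mean includes the long insertions themselves, the bound gets looser as more outliers are present, which is one reason a median-based centre might be worth considering.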

Calculate the most common substring for the set of insertions and the mean number of repeat units. Then the tool tries to find parameters (via maximum likelihood), assuming a nanopore error model (if any), that would generate something similar to the observations (means or some nice distribution). After that, we disregard everything else.

wdecoster commented 2 years ago

What I would like to add to this is that there could be long soft clips that are actually unrelated to the expansion and are just normal sequence, e.g. from a chimeric molecule. Those are rare, but removing them would be a good thing.

Those would probably be outliers.

PavelAvdeyev commented 2 years ago

@wdecoster I am also thinking that we should parse the MSA more carefully and evaluate the support for each letter. If, for example, some letters are supported by just one or two sequences in the alignment, they are good candidates for removal.
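
One possible shape for that check, sketched in Python for illustration. It assumes the MSA is available as equal-length strings with `-` as the gap character. The function name and the `min_support` default are hypothetical.

```python
def trim_weak_columns(msa_rows, min_support=3, gap="-"):
    """Drop alignment columns in which fewer than min_support rows carry a base."""
    if not msa_rows:
        return []
    width = len(msa_rows[0])
    if any(len(row) != width for row in msa_rows):
        raise ValueError("all MSA rows must have the same length")
    keep = [
        col
        for col in range(width)
        if sum(row[col] != gap for row in msa_rows) >= min_support
    ]
    return ["".join(row[col] for col in keep) for row in msa_rows]
```

With the default of 3, a column backed by only one or two reads is removed before the consensus is called. This counts non-gap support per column only; it does not check whether the supporting reads agree on the base.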