WIP : playing with more aggressive normalization

Looking at simple ideas to reduce the number of blocks with taffy norm.

For instance, trying to remove all rows that

- exceed the gap length, and
- are duplicates of a sample that doesn't exceed the gap length

The motivation is that we're going to ignore duplicates anyway (goal is a MAF for the browser), so why not choose the copy to keep by how it fits with the previous block.
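The filter described above can be sketched roughly as follows. This is a minimal illustration in Python, not the taffy implementation: the `Row` type, the `gap_to_previous` and `filter_dupes` helpers, and the definition of "gap" (distance from the end of a matching row in the previous block to the start of the current row) are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Row:
    """Hypothetical stand-in for one alignment row of a block."""
    sample: str        # genome name, e.g. "hg38"
    contig: str        # sequence name, e.g. "chr22"
    start: int         # 0-based start on the given strand
    length: int        # number of aligned (non-gap) bases
    strand: str = "+"

    @property
    def end(self) -> int:
        return self.start + self.length


def gap_to_previous(row: Row, prev_block: List[Row]) -> Optional[int]:
    """Smallest non-negative distance from a row of the same
    sample/contig/strand in the previous block, or None if there is none."""
    gaps = [
        row.start - p.end
        for p in prev_block
        if (p.sample, p.contig, p.strand) == (row.sample, row.contig, row.strand)
        and row.start >= p.end
    ]
    return min(gaps) if gaps else None


def filter_dupes(prev_block: List[Row], block: List[Row], max_gap: int) -> List[Row]:
    """Drop rows that exceed max_gap when the same sample has another row
    in this block that does not exceed it (assumed reading of the rule)."""

    def fits(row: Row) -> bool:
        gap = gap_to_previous(row, prev_block)
        return gap is not None and gap <= max_gap

    # Samples with at least one copy that continues the previous block.
    samples_with_fit = {r.sample for r in block if fits(r)}
    # Keep fitting rows, plus rows of samples that have no fitting copy at all.
    return [r for r in block if fits(r) or r.sample not in samples_with_fit]
```

In this sketch a sample with no well-fitting copy keeps all of its rows, so the filter only ever chooses between duplicates and never removes a sample from a block entirely.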

On a tiny test case from chr22 this gives the following for each iteration (0=raw input) of taffy norm | taffy view -m.

Current master

With dupe filter

Stripping all dupes at the end with mafDuplicateFilter then rerunning the norm gives

Current master

With dupe filter

Just running mafDuplicateFilter then norm (without the previous iterations of norm) gives

Current master

With dupe filter

Conclusion: without looking at much else, the dupe filter as implemented so far reduces the block count by 50%. Ideally we want to get to about 75%, but it's a start. I will run and post whole-chromosome numbers. I'm also hopeful that improvements to chaining in cactus will eventually help these stats.
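For anyone wanting to reproduce this kind of comparison, here is a rough sketch of how block counts and the percentage reduction could be computed from MAF text. It relies only on the MAF convention that each alignment block begins with a line whose first token is `a`. The helper names `count_maf_blocks` and `percent_reduction` are hypothetical and are not part of taffy.

```python
def count_maf_blocks(maf_text: str) -> int:
    """Count alignment blocks in MAF text.

    Each block starts with a line whose first whitespace-separated
    token is exactly "a" (optionally followed by key=value pairs).
    """
    return sum(1 for line in maf_text.splitlines() if line.split()[:1] == ["a"])


def percent_reduction(before: int, after: int) -> float:
    """Percentage by which the block count dropped from before to after."""
    if before <= 0:
        raise ValueError("before must be a positive block count")
    return 100.0 * (before - after) / before
```

The MAF text could come from the output of the pipeline above, saved once per iteration and once per branch, with the two counts then passed to `percent_reduction`.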
