Closed glennhickey closed 1 year ago
It took a while but on the big chr22 I have the number of blocks as follows:
Cactus (after `taffy norm`): 4,805,087
Cactus (after subsequent dupe norm): 1,173,851
MultiZ (different input, but kind of similar): 1,331,045
Something's probably mucked up, but this looks encouraging.
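For anyone wanting to reproduce counts like these, here is a minimal sketch of how blocks in a MAF file could be counted. It relies only on the MAF convention that each alignment block begins with a line starting with `a`; the function name `count_maf_blocks` is made up for illustration and is not part of taffy or cactus.

```python
def count_maf_blocks(lines):
    """Count alignment blocks in an iterable of MAF lines.

    Assumes standard MAF layout: every block opens with an 'a' line
    (e.g. 'a score=0'), followed by its 's' rows.
    """
    count = 0
    for line in lines:
        fields = line.split()
        # An 'a' line is the block header; header/comment lines start with '#'.
        if fields and fields[0] == "a":
            count += 1
    return count


if __name__ == "__main__":
    import sys

    # Usage (hypothetical file name): python count_blocks.py < chr22.maf
    print(count_maf_blocks(sys.stdin))
```

Piping the MAF through stdin keeps memory flat, which matters for whole-chromosome files.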
Looking at simple ideas to reduce blocks with `taf norm`. For instance, trying to remove all rows that
The motivation is that we're going to ignore duplicates anyway (goal is a MAF for the browser), so why not choose the copy to keep by how it fits with the previous block.
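To make the idea concrete, here is a rough sketch of choosing which duplicate to keep based on fit with the previous block. This is not the code in this PR: the `Row` type, `fit_cost`, and `filter_dupes` are hypothetical names, and "fit" is assumed to mean same contig and strand as the row kept in the previous block, with the smallest non-negative gap.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Row:
    species: str
    contig: str
    start: int   # start on the given strand, as in a MAF 's' line
    length: int  # number of aligned bases (no gaps)
    strand: str


def fit_cost(prev, row):
    """Lower is better. (0, gap) if row continues prev collinearly, else (1, 0)."""
    if prev is None or prev.contig != row.contig or prev.strand != row.strand:
        return (1, 0)
    gap = row.start - (prev.start + prev.length)
    if gap < 0:
        return (1, 0)  # overlaps or runs backwards: treat as not fitting
    return (0, gap)


def filter_dupes(prev_kept, block):
    """Keep one row per species, preferring the copy that best follows prev_kept.

    prev_kept maps species -> Row kept in the previous block (may be empty).
    Ties keep the first copy seen; output order follows first appearance.
    """
    best = {}
    for row in block:
        cost = fit_cost(prev_kept.get(row.species), row)
        if row.species not in best or cost < best[row.species][0]:
            best[row.species] = (cost, row)
    return [row for _, row in best.values()]
```

With one row per species and consistent coordinates across neighbouring blocks, adjacent blocks have a better chance of being merged by the norm step, which is the point of filtering this way rather than picking a copy arbitrarily.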
On a tiny test case from chr22 this gives the following for each iteration (0=raw input) of `taffy norm | taffy view -m`.
Current master
With dupe filter
Stripping all dupes at the end with `mafDuplicateFilter` then rerunning the norm gives
Current master
dupe filter
Just running `mafDuplicateFilter` then norm (without the previous iterations of norm) gives
Current master
dupe filter
Conclusion: without looking at much else, the dupe filter as implemented so far reduces the block count by 50%. Ideally we want to get to about 75% but it's a start -- will run and post whole-chromosome numbers... Also hopeful that improvements to chaining in cactus will eventually help these stats.