ComparativeGenomicsToolkit / taffy

This is a C/Python/CLI library for working with TAF (.taf, .taf.gz) and MAF (.maf) alignment files
MIT License

WIP : playing with more aggressive normalization #15

Closed glennhickey closed 1 year ago

glennhickey commented 1 year ago

Looking at simple ideas to reduce the number of blocks with taffy norm.

For instance, trying to remove all rows that

The motivation is that we're going to ignore duplicates anyway (goal is a MAF for the browser), so why not choose the copy to keep by how it fits with the previous block.
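One way to read "how it fits with the previous block" is as a contiguity check: among duplicate rows for the same genome, prefer the copy that starts exactly where that genome's row in the previous block ended. The sketch below only illustrates that idea and is not the code in this PR. The `Row` type, the `continues` and `pick_rows` helpers, and the tie-breaking rule (fall back to the first copy seen) are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Row:
    """Hypothetical stand-in for one alignment row (not a taffy type)."""
    genome: str   # e.g. "mm10"
    contig: str   # e.g. "chr15"
    start: int    # start coordinate of the aligned segment
    length: int   # number of non-gap bases in the segment
    strand: str   # "+" or "-"

    @property
    def end(self) -> int:
        return self.start + self.length


def continues(prev: Row, cur: Row) -> bool:
    """True if cur picks up exactly where prev ended (same contig and strand)."""
    return (prev.contig == cur.contig
            and prev.strand == cur.strand
            and prev.end == cur.start)


def pick_rows(prev_block: List[Row], block: List[Row]) -> List[Row]:
    """Keep one row per genome, preferring a copy contiguous with prev_block.

    Assumption: if no copy is contiguous, the first copy seen is kept.
    """
    prev_by_genome = {r.genome: r for r in prev_block}
    chosen: Dict[str, Row] = {}
    for row in block:
        kept = chosen.get(row.genome)
        if kept is None:
            chosen[row.genome] = row  # first copy of this genome
            continue
        prev = prev_by_genome.get(row.genome)
        # Swap in the duplicate only if it is contiguous and the kept one is not.
        if prev is not None and continues(prev, row) and not continues(prev, kept):
            chosen[row.genome] = row
    return list(chosen.values())
```

With one row per genome and those rows contiguous across neighbouring blocks, a later normalization pass has a better chance of merging the blocks, which is the effect the counts in this thread are measuring.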

On a tiny test case from chr22 this gives the following block counts for each iteration (0 = raw input) of taffy norm | taffy view -m.

Current master

0: 2660
1: 1040
2: 974
3: 970
4: 970

With dupe filter

0: 2660
1: 547
2: 447
3: 433
4: 429
5: 427
6: 427
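For anyone wanting to reproduce per-iteration counts like the ones above, here is a minimal sketch. It relies on the MAF convention that every alignment block starts with an "a" line. The `normalize_once` argument is a hypothetical callable standing in for one pass of the taffy norm | taffy view -m pipeline (for example a subprocess wrapper); it is not part of taffy, and the stopping rule (stop when the count repeats) is an assumption based on how the lists above end.

```python
from typing import Callable, List


def count_maf_blocks(maf_text: str) -> int:
    """Count MAF alignment blocks by counting lines whose first token is 'a'."""
    return sum(1 for line in maf_text.splitlines() if line.split()[:1] == ["a"])


def counts_until_stable(
    maf_text: str,
    normalize_once: Callable[[str], str],  # hypothetical: one normalization pass
    max_iters: int = 20,
) -> List[int]:
    """Return block counts per iteration, with index 0 being the raw input.

    Iteration stops when a pass leaves the block count unchanged,
    or after max_iters passes.
    """
    counts = [count_maf_blocks(maf_text)]
    for _ in range(max_iters):
        maf_text = normalize_once(maf_text)
        counts.append(count_maf_blocks(maf_text))
        if counts[-1] == counts[-2]:
            break
    return counts
```

The final repeated value in the returned list corresponds to the last line of each listing above, where one more pass no longer changes the count.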

Stripping all dupes at the end with mafDuplicateFilter then rerunning the norm gives

Current master

855

With dupe filter

418

Just running mafDuplicateFilter then norm (without the previous iterations of norm) gives

Current master

1558

With dupe filter

1558

Conclusion: without looking at much else, the dupe filter as implemented so far reduces the block count by about 50% relative to current master (970 to 427 after iterating, 855 to 418 after the final mafDuplicateFilter pass). Ideally we want to get to about 75%, but it's a start. Will run and post whole-chromosome numbers. Also hopeful that improvements to chaining in cactus will eventually help these stats.
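As a quick check of the arithmetic behind that estimate, the snippet below computes the reduction relative to current master from the counts posted above. The `percent_reduction` helper is just for illustration and not part of taffy.

```python
def percent_reduction(before: int, after: int) -> float:
    """Percentage by which the block count drops going from before to after."""
    return 100.0 * (before - after) / before


# Converged counts on the tiny chr22 test case: master vs. dupe filter.
iterated = percent_reduction(970, 427)   # about 56.0
# After the final mafDuplicateFilter pass and one more norm.
stripped = percent_reduction(855, 418)   # about 51.1

print(f"iterated: {iterated:.1f}%  after final dupe strip: {stripped:.1f}%")
```

Both figures land a little above 50%, consistent with the rough number quoted.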

glennhickey commented 1 year ago

It took a while but on the big chr22 I have the number of blocks as follows:

Cactus (after taffy norm): 4,805,087
Cactus (after subsequent dupe norm): 1,173,851
MultiZ (different input, but kind of similar): 1,331,045

Something's probably mucked up, but this looks encouraging.