Closed glennhickey closed 8 months ago
Some data from the Zoonomia "10"-way test alignment. In all cases Red masks considerably more than lastz. It doesn't get everything lastz finds (as evidenced by the BOTH column being a bit bigger than the RED column), but I think using it alone should be fine for most cases. I just need to run a few bigger tests (if the cluster ever frees up) to make sure sensitivity isn't reduced. I will also check where some of the added masking is coming from.
| Genome | INPUT | LASTZ | RED | BOTH |
| --- | --- | --- | --- | --- |
| bosTau8 | 0.490842 | 0.507322 | 0.534462 | 0.540855 |
| canFam3 | 0.434422 | 0.441001 | 0.52214 | 0.523653 |
| dipOrd1 | 0.386965 | 0.533592 | 0.542276 | 0.562005 |
| equCab3 | 0.446994 | 0.458693 | 0.557123 | 0.559708 |
| felCat8 | 0.452909 | 0.464814 | 0.525026 | 0.52738 |
| hg38_without_alts | 0.545004 | 0.556511 | 0.583615 | 0.595528 |
| mm10 | 0.467471 | 0.496097 | 0.515172 | 0.519506 |
| panTro6 | 0.540358 | 0.550028 | 0.584369 | 0.586595 |
| rheMac8 | 0.556622 | 0.565577 | 0.616582 | 0.618699 |
| rn6 | 0.460768 | 0.483429 | 0.517443 | 0.52143 |
| susScr11 | 0.459816 | 0.471296 | 0.535106 | 0.538855 |
| tupChi1 | 0.429745 | 0.443084 | 0.506009 | 0.513617 |
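To make the comparison easier to eyeball, here is a rough sketch that computes two per-genome differences from the numbers above: how much more Red masks than lastz (RED minus LASTZ), and how much lastz still adds on top of Red (BOTH minus RED). It assumes the columns are masked fractions of each genome; the `MASKED` dictionary and the `masking_deltas` helper are made up for illustration and are not part of this branch.

```python
# Values copied from the table above, as (INPUT, LASTZ, RED, BOTH).
# Assumption: each value is the fraction of the genome that is masked.
MASKED = {
    "bosTau8": (0.490842, 0.507322, 0.534462, 0.540855),
    "canFam3": (0.434422, 0.441001, 0.52214, 0.523653),
    "dipOrd1": (0.386965, 0.533592, 0.542276, 0.562005),
    "equCab3": (0.446994, 0.458693, 0.557123, 0.559708),
    "felCat8": (0.452909, 0.464814, 0.525026, 0.52738),
    "hg38_without_alts": (0.545004, 0.556511, 0.583615, 0.595528),
    "mm10": (0.467471, 0.496097, 0.515172, 0.519506),
    "panTro6": (0.540358, 0.550028, 0.584369, 0.586595),
    "rheMac8": (0.556622, 0.565577, 0.616582, 0.618699),
    "rn6": (0.460768, 0.483429, 0.517443, 0.52143),
    "susScr11": (0.459816, 0.471296, 0.535106, 0.538855),
    "tupChi1": (0.429745, 0.443084, 0.506009, 0.513617),
}


def masking_deltas(masked):
    """Return {genome: (red_minus_lastz, both_minus_red)}."""
    deltas = {}
    for genome, (_input, lastz, red, both) in masked.items():
        deltas[genome] = (red - lastz, both - red)
    return deltas


if __name__ == "__main__":
    # Sort by what lastz still adds on top of Red, largest first.
    rows = sorted(masking_deltas(MASKED).items(),
                  key=lambda kv: kv[1][1], reverse=True)
    print(f"{'genome':20s} {'RED-LASTZ':>10s} {'BOTH-RED':>10s}")
    for genome, (red_gain, lastz_extra) in rows:
        print(f"{genome:20s} {red_gain:10.6f} {lastz_extra:10.6f}")
```

Both differences are positive for every genome in the table, which matches the reading above: Red masks more than lastz everywhere, and lastz still contributes a small amount that Red alone misses.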
Red is a fairly general-purpose repeat masker. I'm interested in it because, in my (very limited) tests so far, it is fast and sensitive without specifying any parameters.
The current lastz-based repeat masking, on the other hand, is causing problems with newer assemblies. Even with RepeatMasked/Modelled input genomes, it's both very slow and, apparently, insufficient on some genomes. This leads to giant pairwise alignments (from all-to-all repeat copy collapses) which bog down bar to the point of crashing (perhaps too many rows going into abpoa? I haven't confirmed), if the paffy chaining stuff beforehand doesn't run out of memory first.
In theory, this shouldn't happen, since the lastz masker should be able to filter out anything too repetitive in lastz (the parameters are a bit different but the seeding should be the same). I don't know if `proportionToSample="0.2"` is at play here, or if it boils down to the difference in parameters, but something isn't working out.
Anyway, there's not much to lose by trying another masker, hence this branch. Red is fast enough that it can be added in before lastz with negligible cost (which is the default logic as I write this). I think it will be merge-worthy if we can then drop lastz without noticing a decrease in alignment quality (a big win in running time), but it will also be worth it if, either alone or combined with lastz, it helps get some of these trickier genomes through the pipeline.