ComparativeGenomicsToolkit / cactus

Official home of genome aligner based upon notion of Cactus graphs

[experimental] add RED as a preprocessor masking option #1296

Closed glennhickey closed 8 months ago

glennhickey commented 8 months ago

RED is a fairly general-purpose repeat masker. I'm interested in it because in my (very limited) tests so far it is fast and sensitive without specifying any parameters.

The current lastz-based repeatmasking, on the other hand, is causing problems with newer assemblies. Even with RepeatMasked/Modelled input genomes, it's both very slow and apparently insufficient on some genomes. This leads to giant pairwise alignments (from all-to-all repeat copy collapses) which bog down bar to the point of crashing (perhaps too many rows into abpoa? I haven't confirmed), if the paffy chaining stuff beforehand doesn't run out of memory first.

In theory, this shouldn't happen since the lastz masker should be able to filter out anything too repetitive in lastz (the parameters are a bit different but the seeding should be the same). I don't know if proportionToSample="0.2" is at play here, or it boils down to the difference in parameters, but something isn't working out.

Anyway, there's not much to lose by trying another masker -- hence this branch. Red's fast enough that it can be added in before lastz with negligible cost (which is the default logic as I write this). I think it will be merge-worthy if we can then drop lastz without noticing a decrease in alignment quality (big win in running time), but it will also be worth it if, either alone or combined with lastz, it helps get some of these trickier genomes through the pipeline.

glennhickey commented 8 months ago

Some data from the zoonomia "10"-way test alignment. In all cases Red masks considerably more than lastz. It doesn't get everything lastz finds (as evidenced by the BOTH column being a bit bigger), but I think using it alone should be fine for most cases. Just need to run a few bigger tests (if the cluster ever frees up) to make sure sensitivity isn't reduced. Will also check to see where some of the added masking is coming from.

| Genome | INPUT | LASTZ | RED | BOTH |
| -- | -- | -- | -- | -- |
| bosTau8 | 0.490842 | 0.507322 | 0.534462 | 0.540855 |
| canFam3 | 0.434422 | 0.441001 | 0.52214 | 0.523653 |
| dipOrd1 | 0.386965 | 0.533592 | 0.542276 | 0.562005 |
| equCab3 | 0.446994 | 0.458693 | 0.557123 | 0.559708 |
| felCat8 | 0.452909 | 0.464814 | 0.525026 | 0.52738 |
| hg38_without_alts | 0.545004 | 0.556511 | 0.583615 | 0.595528 |
| mm10 | 0.467471 | 0.496097 | 0.515172 | 0.519506 |
| panTro6 | 0.540358 | 0.550028 | 0.584369 | 0.586595 |
| rheMac8 | 0.556622 | 0.565577 | 0.616582 | 0.618699 |
| rn6 | 0.460768 | 0.483429 | 0.517443 | 0.52143 |
| susScr11 | 0.459816 | 0.471296 | 0.535106 | 0.538855 |
| tupChi1 | 0.429745 | 0.443084 | 0.506009 | 0.513617 |
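A minimal sketch of how a masked fraction like those in the table could be measured. It assumes the numbers are the proportion of soft-masked (lowercase) bases in each FASTA, which the thread does not state explicitly; `masked_fraction` is a hypothetical helper, not part of Cactus.

```python
import io


def masked_fraction(fasta_handle):
    """Return the fraction of sequence characters that are lowercase.

    Assumption: soft-masking is encoded as lowercase bases, and every
    sequence character (including N) counts towards the total.
    """
    masked = 0
    total = 0
    for line in fasta_handle:
        if line.startswith(">"):
            continue  # skip FASTA header lines
        seq = line.strip()
        total += len(seq)
        masked += sum(1 for base in seq if base.islower())
    return masked / total if total else 0.0


# Tiny in-memory example: 8 of the 16 bases are lowercase.
example = io.StringIO(">chr1\nACGTacgt\nNNNNacgt\n")
print(masked_fraction(example))
```

Running the same function on the input, lastz-masked, and Red-masked versions of a genome would give one row of a table like the one above, under the stated assumption.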

glennhickey commented 8 months ago

Here are stats for the 8-way t2t apes alignment

hs1 coverage with lastz masking

GCA_028858775.2, 904679, 26398, 14845, 10558, 7648, 5600, 4456, 3835, 3287, 2935, 2333, 1847, 1491, 1273, 1110, 1078, 943, 852, 767, 560, 311, 306, 268, 267, 260, 259, 225, 223, 194, 189, 184, 183, 182, 181, 176, 160, 30, 19, 9, 9, 8, 8, 0, 0, 0, 0, 0
GCA_028878055.2, 834107, 16858, 6329, 3952, 2712, 2041, 1495, 1207, 915, 827, 746, 665, 620, 559, 533, 491, 491, 491, 486, 472, 445, 420, 352, 330, 310, 240, 235, 213, 213, 211, 209, 155, 136, 119, 119, 113, 110, 109, 106, 106, 106, 106, 105, 97, 62, 54, 3
GCA_028885625.2, 873401, 23453, 11500, 8163, 6183, 4595, 3579, 2464, 2007, 1660, 1318, 1130, 995, 973, 884, 822, 708, 652, 552, 515, 332, 270, 188, 183, 128, 88, 76, 59, 16, 11, 8, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0
GCA_028885655.2, 873616, 23450, 11110, 7454, 5565, 4406, 3755, 3157, 2635, 2010, 1580, 1338, 1079, 909, 831, 809, 774, 705, 644, 604, 493, 474, 440, 420, 401, 368, 342, 270, 254, 222, 196, 164, 133, 128, 128, 124, 118, 105, 77, 52, 1, 0, 0, 0, 0, 0, 0
GCA_029281585.2, 898540, 32974, 18157, 12538, 9281, 7001, 5716, 4534, 3423, 2676, 2288, 2021, 1839, 1674, 1391, 1340, 1232, 1067, 957, 894, 786, 690, 503, 493, 453, 430, 427, 343, 275, 145, 130, 109, 80, 42, 3, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0
GCA_029289425.2, 904715, 26311, 15115, 10246, 7569, 5579, 4351, 3442, 2868, 2500, 2149, 1549, 1360, 1194, 879, 685, 558, 508, 495, 438, 414, 353, 228, 188, 48, 33, 17, 16, 16, 16, 16, 16, 16, 16, 15, 15, 12, 9, 9, 9, 9, 9, 9, 8, 8, 6, 0
hg38, 928939, 31630, 10339, 6150, 4144, 2943, 2047, 1315, 532, 438, 197, 163, 115, 88, 61, 59, 44, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0
hs1, 1000000, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0

hs1 coverage with red masking

GCA_028858775.2, 899305, 21726, 10156, 5661, 3505, 1841, 1214, 767, 598, 472, 350, 39, 14, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0
GCA_028878055.2, 830565, 14827, 5027, 2597, 1422, 797, 451, 274, 172, 125, 101, 39, 17, 17, 17, 14, 12, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0
GCA_028885625.2, 869086, 19628, 8368, 5136, 3248, 1802, 1197, 703, 386, 231, 67, 12, 10, 8, 8, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0
GCA_028885655.2, 869411, 19661, 7975, 4826, 2971, 1862, 1257, 872, 519, 281, 152, 78, 46, 22, 7, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0
GCA_029281585.2, 893581, 27964, 13719, 8498, 5675, 3537, 2307, 1588, 945, 455, 266, 162, 109, 18, 18, 12, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0
GCA_029289425.2, 899301, 21498, 10191, 5251, 3047, 1575, 787, 268, 95, 71, 61, 5, 5, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0
hg38, 925770, 10242, 3497, 1621, 779, 359, 133, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 37
hs1, 1000000, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0

Coverage is lower, but only by about half a percent.
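The "about half a percent" figure can be checked from the first numeric column of the two hs1 tables. The sketch below assumes that column is the number of sampled hs1 bases (out of the 1000000 shown on the hs1 row) covered at least once by each genome; the variable and function names are illustrative only.

```python
SAMPLED_BASES = 1_000_000  # first column of the hs1 row in both tables

# First numeric column of each row, copied from the tables above.
lastz_covered = {
    "GCA_028858775.2": 904679,
    "GCA_028878055.2": 834107,
    "GCA_028885625.2": 873401,
    "GCA_028885655.2": 873616,
    "GCA_029281585.2": 898540,
    "GCA_029289425.2": 904715,
    "hg38": 928939,
}
red_covered = {
    "GCA_028858775.2": 899305,
    "GCA_028878055.2": 830565,
    "GCA_028885625.2": 869086,
    "GCA_028885655.2": 869411,
    "GCA_029281585.2": 893581,
    "GCA_029289425.2": 899301,
    "hg38": 925770,
}


def coverage_drop_points(before, after, sampled=SAMPLED_BASES):
    """Coverage lost going from `before` to `after`, in percentage points."""
    return {name: (before[name] - after[name]) / sampled * 100 for name in before}


drops = coverage_drop_points(lastz_covered, red_covered)
for name, drop in sorted(drops.items()):
    print(f"{name}\t{drop:.2f}")
```

Under that reading, every genome loses between roughly 0.3 and 0.55 percentage points of hs1 coverage when switching from lastz to Red masking.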

Ancestor sizes (lastz)

Anc0, 2, 2621303571, 4457, 0, 18106805
Anc1, 2, 2750525449, 1810, 21691233, 39686202
Anc2, 2, 2989900301, 1688, 40775105, 31488453
Anc3, 2, 2812845431, 4226, 39462021, 30997293
Anc4, 2, 2806166526, 1801, 30469142, 20823225
Anc5, 2, 2947834679, 5705, 21358897, 18137607
Anc6, 2, 2882020460, 1170, 21389405, 14483755

Ancestor sizes (red)

Anc0, 2, 2588464981, 1119, 0, 17651453
Anc1, 2, 2748056534, 1554, 21622578, 39511448
Anc2, 2, 2942365383, 821, 40085301, 29988501
Anc3, 2, 2805135350, 2100, 39352926, 30460457
Anc4, 2, 2803591636, 1453, 30309872, 20678654
Anc5, 2, 2895573863, 1041, 20910344, 16837025
Anc6, 2, 2871730368, 723, 21037788, 14363921

These seem in line with the coverage (slightly smaller) -- though a relatively larger reduction in number of ancestral contigs.
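To put numbers on that comparison, here is a small sketch that computes the relative change in ancestor size and contig count. It assumes the third column of each row is the total ancestor length and the fourth is the number of contigs; the column meanings are inferred from the comment above, not stated in the output itself.

```python
# (size, contigs) per ancestor, copied from columns 3 and 4 of the rows above.
lastz = {
    "Anc0": (2621303571, 4457),
    "Anc1": (2750525449, 1810),
    "Anc2": (2989900301, 1688),
    "Anc3": (2812845431, 4226),
    "Anc4": (2806166526, 1801),
    "Anc5": (2947834679, 5705),
    "Anc6": (2882020460, 1170),
}
red = {
    "Anc0": (2588464981, 1119),
    "Anc1": (2748056534, 1554),
    "Anc2": (2942365383, 821),
    "Anc3": (2805135350, 2100),
    "Anc4": (2803591636, 1453),
    "Anc5": (2895573863, 1041),
    "Anc6": (2871730368, 723),
}


def percent_change(old, new):
    """Relative change from old to new, in percent (negative means smaller)."""
    return (new - old) / old * 100


def compare(before, after):
    """Map each ancestor to (size change %, contig-count change %)."""
    return {
        name: (
            percent_change(before[name][0], after[name][0]),
            percent_change(before[name][1], after[name][1]),
        )
        for name in before
    }


changes = compare(lastz, red)
for name, (size_pct, contig_pct) in changes.items():
    print(f"{name}\tsize {size_pct:+.2f}%\tcontigs {contig_pct:+.1f}%")
```

With these inputs the size changes stay within a couple of percent, while the contig counts fall much more sharply for every ancestor.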

glennhickey commented 8 months ago

Stats for the "10"-way zoonomia test alignment.

human lastz-masking

Cat, 450867, 16350, 1649, 227, 58, 27, 12, 7, 5, 1, 0, 0, 0, 0, 0, 0, 0, 0
Chimp, 903786, 24619, 10001, 5323, 3164, 1705, 942, 673, 419, 230, 117, 30, 6, 6, 6, 5, 3, 0
Cow, 399819, 2808, 449, 170, 65, 31, 19, 10, 2, 2, 0, 0, 0, 0, 0, 0, 0, 0
Dog, 444130, 3152, 646, 285, 156, 76, 50, 38, 26, 15, 4, 4, 3, 2, 2, 2, 0, 0
hg38, 1000000, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0
Horse, 481402, 11674, 5943, 2188, 483, 274, 169, 105, 65, 42, 29, 21, 14, 6, 4, 0, 0, 0
Kangaroo_rat, 221255, 2446, 214, 69, 32, 22, 16, 11, 9, 8, 4, 3, 2, 0, 0, 0, 0, 0
Mouse, 263116, 1690, 422, 157, 98, 60, 38, 19, 11, 3, 0, 0, 0, 0, 0, 0, 0, 0
Pig, 413760, 3857, 658, 266, 156, 74, 41, 28, 18, 8, 6, 6, 5, 4, 4, 4, 4, 4
Rat, 262243, 8620, 673, 158, 58, 24, 13, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0
Rhesus, 811316, 22064, 2637, 1037, 489, 245, 163, 94, 61, 40, 23, 9, 7, 0, 0, 0, 0, 0
Tree_shrew, 427998, 10087, 787, 237, 88, 41, 23, 13, 4, 4, 2, 2, 2, 1, 0, 0, 0, 0

human red-masking

Cat, 452681, 14564, 1147, 127, 24, 5, 3, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0
Chimp, 903386, 20801, 7290, 2973, 1615, 677, 338, 148, 63, 26, 7, 4, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0
Cow, 401046, 2542, 335, 105, 46, 31, 16, 7, 4, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0
Dog, 445422, 2894, 443, 139, 32, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0
Horse, 480807, 9548, 3995, 1175, 325, 185, 114, 71, 56, 39, 30, 12, 12, 12, 10, 10, 9, 3, 3, 3, 3, 3, 3, 2
Human, 1000000, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0
Kangaroo_rat, 223011, 2432, 149, 38, 11, 5, 2, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0
Mouse, 264918, 1472, 368, 104, 59, 21, 12, 5, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0
Pig, 413911, 3487, 499, 162, 62, 23, 13, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0
Rat, 263785, 7921, 449, 83, 16, 3, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0
Rhesus, 813330, 19846, 2321, 729, 312, 172, 108, 83, 57, 37, 25, 23, 22, 14, 12, 3, 2, 0, 0, 0, 0, 0, 0, 0
Tree_shrew, 429003, 9472, 652, 180, 79, 44, 28, 11, 3, 3, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0

mouse lastz-masking

Cat, 277886, 9557, 946, 98, 25, 14, 5, 4, 3, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0
Chimp, 298705, 4152, 1095, 446, 242, 137, 82, 47, 35, 26, 20, 13, 6, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0
Cow, 257971, 1583, 254, 81, 38, 17, 12, 6, 2, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0
Dog, 274382, 1701, 343, 170, 83, 38, 25, 20, 10, 6, 4, 2, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0
hg38, 299088, 7800, 2035, 1144, 741, 560, 409, 229, 131, 91, 52, 40, 26, 22, 12, 11, 9, 6, 4, 3, 2, 2, 2, 2, 2, 2
Horse, 289734, 6400, 3036, 1132, 224, 115, 74, 52, 32, 25, 19, 14, 5, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0
Kangaroo_rat, 205615, 2711, 296, 121, 77, 50, 29, 13, 9, 4, 3, 2, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0
Mouse, 1000000, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0
Pig, 263779, 1683, 295, 108, 53, 25, 12, 4, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0
Rat, 638643, 21616, 2060, 514, 207, 91, 47, 20, 7, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0
Rhesus, 296601, 7692, 729, 210, 80, 27, 15, 7, 3, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0
Tree_shrew, 259700, 6675, 460, 132, 57, 32, 10, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0

mouse red-masking

Cat, 275766, 8757, 706, 61, 20, 7, 4, 3, 1, 0, 0, 0, 0, 0, 0, 0
Chimp, 296184, 3886, 898, 355, 176, 105, 38, 16, 9, 5, 4, 2, 1, 1, 0, 0
Cow, 255127, 1557, 247, 73, 32, 21, 11, 4, 2, 1, 0, 0, 0, 0, 0, 0
Dog, 272379, 1646, 266, 92, 37, 4, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0
Horse, 286628, 5411, 2187, 623, 154, 80, 45, 29, 18, 7, 2, 0, 0, 0, 0, 0
Human, 296509, 3836, 1027, 485, 246, 143, 102, 57, 25, 10, 3, 0, 0, 0, 0, 0
Kangaroo_rat, 205403, 2564, 274, 110, 73, 45, 31, 20, 14, 7, 1, 1, 1, 1, 1, 1
Mouse, 1000000, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0
Pig, 260962, 1589, 232, 76, 26, 17, 8, 0, 0, 0, 0, 0, 0, 0, 0, 0
Rat, 636432, 20973, 1666, 357, 118, 41, 26, 13, 8, 6, 0, 0, 0, 0, 0, 0
Rhesus, 294325, 7439, 660, 209, 94, 45, 22, 12, 7, 3, 2, 1, 0, 0, 0, 0
Tree_shrew, 256781, 6488, 450, 148, 64, 41, 20, 4, 1, 0, 0, 0, 0, 0, 0, 0

lastz ancestor sizes

Anc00, 2, 1683128530, 742, 0, 34877006
Anc01, 2, 1734418456, 2117, 38355684, 91373438
Anc02, 2, 1974709513, 1138, 44667465, 113919848
Anc03, 2, 1852444713, 4156, 94412817, 126967391
Anc04, 2, 1148410508, 21349, 70707466, 82001309
Anc05, 2, 2055313406, 1373, 111138083, 138956539
Anc06, 2, 2023562392, 1635, 113721366, 111386546
Anc07, 2, 2558871283, 1294, 128203552, 116004698
Anc08, 2, 1870860341, 5900, 76263182, 106633627
Anc09, 2, 1877861362, 2121, 102281098, 130898254
Anc10, 2, 2801350060, 1996, 119268294, 54021438

red ancestor sizes

Anc00, 2, 1668090829, 672, 0, 34689274
Anc01, 2, 1718122809, 1617, 38172938, 90432580
Anc02, 2, 1961320928, 1048, 44449663, 112007569
Anc03, 2, 1836542960, 3204, 93500302, 125492452
Anc04, 2, 1135217994, 17694, 69902710, 81463289
Anc05, 2, 2046101515, 1097, 109424795, 137154443
Anc06, 2, 2014252975, 1670, 111792980, 110482953
Anc07, 2, 2553413622, 1077, 126859401, 114413890
Anc08, 2, 1849877018, 3816, 75848304, 104899354
Anc09, 2, 1867211401, 1974, 101423161, 129398977
Anc10, 2, 2796020125, 1551, 117665741, 53886732

glennhickey commented 8 months ago

For the primates, lastz masking took 2474 cpu hours and 24 wall hours on the cluster, compared to roughly 16 cpu hours and 2 wall hours for red.

Also, when trying to compute a 7-way alignment after dropping hg38, the lastz-masked alignment fails completely in bar. The red-masked alignment runs fine.

Conclusion: the improvements in speed and robustness are well worth making the switch, even if coverage drops a tiny bit in some cases (and this coverage drop is surely in extremely repetitive regions whose alignment is of questionable quality to begin with).

So I think I'm ready to merge, unless @benedictpaten you have any objections...

glennhickey commented 8 months ago

Here are some consolidated running times, as extracted from the log with `grep Succ | awk '{print $5 "\t" $38}' | sort`.
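For anyone who prefers to do the same extraction in Python, the following is a rough equivalent of that pipeline. It only mirrors the shell logic (keep lines containing "Succ", print whitespace-separated fields 5 and 38, sort); it makes no assumption about what the other log fields contain, and `extract_runtimes` is a hypothetical name.

```python
def extract_runtimes(log_lines, pattern="Succ", name_field=5, time_field=38):
    """Mimic: grep Succ | awk '{print $5 "\t" $38}' | sort

    Field numbers are 1-based, as in awk. Unlike awk, lines with too few
    fields are skipped rather than printed with empty values, and the sort
    is by code point rather than by the shell's locale.
    """
    rows = []
    for line in log_lines:
        if pattern not in line:
            continue  # same filter as grep
        fields = line.split()  # awk's default whitespace splitting
        if len(fields) < max(name_field, time_field):
            continue
        rows.append(f"{fields[name_field - 1]}\t{fields[time_field - 1]}")
    return sorted(rows)


# Synthetic 38-field line, used only to show which fields are picked.
fake_line = "Successfully " + " ".join(f"f{i}" for i in range(2, 39))
print(extract_runtimes([fake_line, "an unrelated line"]))
```

On a real log, the function would be called with the lines of the log file, for example via `open(path)`.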

They show modest speedup in 10-way but massive speedup in t2t apes.

primates lastz

cactus_consolidated(Anc2):  15151.6102
cactus_consolidated(Anc3):  14124.735
cactus_consolidated(Anc4):  5456.394
cactus_consolidated(Anc5):  41010.9872
cactus_consolidated(Anc6):  6321.4298

primates red

cactus_consolidated(Anc2):  9308.8388
cactus_consolidated(Anc3):  9173.2259
cactus_consolidated(Anc4):  4775.8141
cactus_consolidated(Anc5):  6954.8602
cactus_consolidated(Anc6):  5804.2141

10-way lastz

cactus_consolidated(Anc02): 21001.3839
cactus_consolidated(Anc03): 51222.4407
cactus_consolidated(Anc04): 48323.943
cactus_consolidated(Anc05): 34137.6974
cactus_consolidated(Anc06): 30040.258
cactus_consolidated(Anc07): 25834.582
cactus_consolidated(Anc08): 38496.5654
cactus_consolidated(Anc09): 45839.7909
cactus_consolidated(Anc10): 11674.8884

10-way red

cactus_consolidated(Anc02): 20682.3916
cactus_consolidated(Anc03): 47745.1372
cactus_consolidated(Anc04): 50047.3168
cactus_consolidated(Anc05): 34607.4769
cactus_consolidated(Anc06): 30442.5855
cactus_consolidated(Anc07): 26893.5983
cactus_consolidated(Anc08): 37809.4971
cactus_consolidated(Anc09): 44850.9519
cactus_consolidated(Anc10): 11383.4624
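As a quick check on "modest" versus "massive", this sketch sums the `cactus_consolidated` times listed above and reports the lastz/Red ratio for each alignment. The values are copied from the lists; their unit is whatever the log reports, which is not stated here.

```python
primates_lastz = [15151.6102, 14124.735, 5456.394, 41010.9872, 6321.4298]
primates_red = [9308.8388, 9173.2259, 4775.8141, 6954.8602, 5804.2141]

tenway_lastz = [21001.3839, 51222.4407, 48323.943, 34137.6974, 30040.258,
                25834.582, 38496.5654, 45839.7909, 11674.8884]
tenway_red = [20682.3916, 47745.1372, 50047.3168, 34607.4769, 30442.5855,
              26893.5983, 37809.4971, 44850.9519, 11383.4624]


def total_speedup(lastz_times, red_times):
    """Ratio of summed lastz-masked time to summed Red-masked time."""
    return sum(lastz_times) / sum(red_times)


primates_ratio = total_speedup(primates_lastz, primates_red)
tenway_ratio = total_speedup(tenway_lastz, tenway_red)

print(f"primates: {primates_ratio:.2f}x")
print(f"10-way:   {tenway_ratio:.3f}x")
```

Summed this way, the primates runs come out a bit over twice as fast with Red masking, driven mostly by Anc5, while the 10-way totals differ by about one percent.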

benedictpaten commented 8 months ago

Brilliant. I think this is ready to merge.