Open thiagogenez opened 4 months ago
This is generally caused by unmasked repeats in the input. Are your input genomes repeat masked? You can get a sense of how masked they are (if you don't know already) by grepping assembly stats from the cactus log.
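If the log is not at hand, the masking level can also be estimated directly from a FASTA file. The sketch below is a minimal, independent check and not how Cactus computes its stats: it assumes soft-masking (repeats in lowercase) and counts lowercase `n` as both masked and N, so its numbers may differ slightly from the log. The function name `masked_fraction` is made up for this example.

```python
import string
from typing import Dict, Iterable

# Translation table that deletes every lowercase ASCII letter.
_DROP_LOWER = {ord(c): None for c in string.ascii_lowercase}


def masked_fraction(lines: Iterable[str]) -> Dict[str, float]:
    """Return total bases plus soft-masked and N proportions for FASTA lines."""
    total = lower = n_count = 0
    for line in lines:
        if line.startswith(">"):  # skip sequence headers
            continue
        seq = line.strip()
        total += len(seq)
        # Lowercase bases = original length minus length with lowercase removed.
        lower += len(seq) - len(seq.translate(_DROP_LOWER))
        n_count += seq.count("N") + seq.count("n")
    if total == 0:
        return {"total": 0, "masked": 0.0, "ns": 0.0}
    return {"total": total, "masked": lower / total, "ns": n_count / total}


if __name__ == "__main__":
    import sys

    # Usage: python masked_fraction.py genome.fa
    with open(sys.argv[1]) as handle:
        print(masked_fraction(handle))
```

The file is streamed line by line, so memory use stays small even for multi-gigabase inputs.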
Thanks, @glennhickey, for your thoughts.
I double-checked my input genomes, and they are repeat-masked. Here is the output of the assembly stats.
| Input Sample | Total Sequences | Total Length | Proportion Repeat Masked | Proportion Ns | Total Ns | N50 | Median Sequence Length | Max Sequence Length | Min Sequence Length |
|---------------------------------------|-----------------|----------------|--------------------------|---------------|-----------|--------------|------------------------|---------------------|---------------------|
| triticum_aestivum_A.fa | 7 | 4,934,891,648 | 0.885057 | 0.014803 | 73,053,193| 736,706,236 | 736,706,236 | 780,798,557 | 594,102,056 |
| triticum_aestivum_arinalrfor_A.fa | 7 | 4,969,946,898 | 0.880639 | 0.005448 | 27,077,568| 743,084,022 | 743,084,022 | 784,661,008 | 602,900,890 |
| triticum_aestivum_arinalrfor_B.fa | 7 | 5,250,395,981 | 0.875950 | 0.007224 | 37,926,871| 810,500,911 | 716,573,881 | 977,471,539 | 480,767,623 |
| triticum_aestivum_arinalrfor_D.fa | 7 | 3,974,021,563 | 0.870773 | 0.007369 | 29,286,018| 578,021,311 | 578,021,311 | 655,314,739 | 476,726,550 |
| triticum_aestivum_B.fa | 7 | 5,180,314,468 | 0.880186 | 0.016106 | 83,436,260| 720,988,478 | 720,988,478 | 830,829,764 | 673,617,499 |
| triticum_aestivum_D.fa | 7 | 3,951,074,735 | 0.874260 | 0.017538 | 69,292,437| 566,080,677 | 566,080,677 | 651,852,609 | 473,592,718 |
| triticum_aestivum_jagger_A.fa | 7 | 4,983,156,636 | 0.884702 | 0.014038 | 69,951,710| 743,847,818 | 743,847,818 | 804,285,258 | 596,211,899 |
| triticum_aestivum_jagger_B.fa | 7 | 5,219,166,998 | 0.879437 | 0.015615 | 81,496,077| 721,110,502 | 721,110,502 | 855,759,449 | 673,340,788 |
| triticum_aestivum_jagger_D.fa | 7 | 3,970,003,109 | 0.874087 | 0.015582 | 61,862,308| 570,159,854 | 570,159,854 | 673,981,989 | 459,355,444 |
| triticum_aestivum_julius_A.fa | 7 | 4,964,574,427 | 0.880889 | 0.009726 | 48,286,526| 745,978,486 | 745,978,486 | 791,475,352 | 586,755,746 |
| triticum_aestivum_julius_B.fa | 7 | 5,222,063,627 | 0.875711 | 0.011149 | 58,222,616| 727,285,804 | 727,285,804 | 858,776,195 | 670,301,833 |
| triticum_aestivum_julius_D.fa | 7 | 3,981,035,100 | 0.870365 | 0.011883 | 47,306,422| 575,129,590 | 575,129,590 | 661,246,824 | 479,660,269 |
| triticum_aestivum_kariega_A.fa | 7 | 5,033,091,561 | 0.877517 | 0.000541 | 2,722,485 | 755,457,679 | 755,457,679 | 794,474,755 | 613,662,638 |
| triticum_aestivum_kariega_B.fa | 7 | 5,333,683,798 | 0.873526 | 0.001459 | 7,780,157 | 738,041,677 | 738,041,677 | 864,624,966 | 701,857,263 |
| triticum_aestivum_kariega_D.fa | 7 | 4,086,356,048 | 0.867702 | 0.000152 | 621,384 | 584,285,409 | 584,285,409 | 662,526,948 | 504,659,958 |
| triticum_aestivum_lancer_A.fa | 7 | 4,907,196,294 | 0.883395 | 0.005918 | 29,042,896| 734,536,914 | 734,536,914 | 769,338,634 | 595,297,365 |
| triticum_aestivum_lancer_B.fa | 7 | 5,013,902,246 | 0.876464 | 0.006710 | 33,641,783| 702,438,406 | 702,438,406 | 839,470,345 | 665,179,885 |
| triticum_aestivum_lancer_D.fa | 7 | 3,950,540,886 | 0.872088 | 0.007500 | 29,627,862| 568,126,671 | 568,126,671 | 646,400,022 | 465,558,328 |
| triticum_aestivum_landmark_A.fa | 7 | 4,966,053,268 | 0.881559 | 0.016428 | 81,581,959| 740,148,362 | 740,148,362 | 791,748,890 | 595,339,094 |
| triticum_aestivum_landmark_B.fa | 7 | 5,204,724,784 | 0.876961 | 0.017895 | 93,136,106| 710,493,282 | 710,493,282 | 845,838,138 | 689,709,469 |
| triticum_aestivum_landmark_D.fa | 7 | 3,982,871,035 | 0.871485 | 0.019372 | 77,156,960| 570,643,040 | 570,643,040 | 656,817,438 | 484,551,304 |
| triticum_aestivum_mace_A.fa | 7 | 4,897,709,906 | 0.882735 | 0.006520 | 31,935,133| 732,118,298 | 732,118,298 | 782,694,893 | 590,561,804 |
| triticum_aestivum_mace_B.fa | 7 | 5,127,197,460 | 0.878159 | 0.007373 | 37,801,337| 704,156,067 | 704,156,067 | 848,590,828 | 667,607,564 |
| triticum_aestivum_mace_D.fa | 7 | 3,937,477,063 | 0.873358 | 0.008423 | 33,164,482| 567,265,955 | 567,265,955 | 650,274,702 | 475,327,881 |
| triticum_aestivum_mattis_A.fa | 7 | 4,933,556,187 | 0.883870 | 0.004501 | 22,204,811| 735,408,736 | 735,408,736 | 794,150,360 | 600,654,286 |
| triticum_aestivum_mattis_B.fa | 7 | 5,126,139,104 | 0.879247 | 0.005484 | 28,110,261| 799,857,935 | 698,878,671 | 969,998,116 | 467,876,140 |
| triticum_aestivum_mattis_D.fa | 7 | 3,938,862,683 | 0.873429 | 0.005638 | 22,207,647| 566,465,558 | 566,465,558 | 655,329,108 | 480,431,564 |
| triticum_aestivum_norin61_A.fa | 7 | 4,921,847,059 | 0.875677 | 0.005924 | 29,155,163| 723,255,126 | 723,255,126 | 781,462,734 | 594,006,513 |
| triticum_aestivum_norin61_B.fa | 7 | 5,194,186,346 | 0.868098 | 0.007617 | 39,566,671| 715,454,519 | 715,454,519 | 850,623,622 | 669,876,730 |
| triticum_aestivum_norin61_D.fa | 7 | 3,941,626,919 | 0.865354 | 0.007639 | 30,108,644| 564,869,106 | 564,869,106 | 650,275,864 | 478,264,344 |
| triticum_aestivum_paragon_A.fa | 7 | 5,016,927,533 | 0.878688 | 0.000004 | 18,600 | 759,055,895 | 759,055,895 | 795,989,443 | 599,230,268 |
| triticum_aestivum_paragon_B.fa | 7 | 5,310,532,019 | 0.875044 | 0.000014 | 73,000 | 733,835,468 | 733,835,468 | 872,909,281 | 688,536,368 |
| triticum_aestivum_paragon_D.fa | 7 | 4,092,665,763 | 0.870697 | 0.000002 | 8,800 | 586,077,705 | 586,077,705 | 670,531,570 | 499,575,344 |
| triticum_aestivum_renan_A.fa | 7 | 4,966,282,335 | 0.877081 | 0.014966 | 74,324,342| 746,502,734 | 746,502,734 | 792,837,209 | 593,930,347 |
| triticum_aestivum_renan_B.fa | 7 | 5,216,673,246 | 0.872227 | 0.016818 | 87,731,823| 717,542,863 | 717,542,863 | 854,463,248 | 673,746,810 |
| triticum_aestivum_renan_D.fa | 7 | 4,012,688,034 | 0.868981 | 0.022570 | 90,566,151| 569,771,178 | 569,771,178 | 661,835,603 | 493,761,083 |
| triticum_aestivum_stanley_A.fa | 7 | 4,993,448,364 | 0.885270 | 0.014144 | 70,628,727| 742,917,797 | 742,917,797 | 803,232,604 | 591,313,643 |
| triticum_aestivum_stanley_B.fa | 7 | 5,227,912,278 | 0.880709 | 0.015717 | 82,167,782| 715,714,221 | 715,714,221 | 856,542,542 | 697,113,365 |
| triticum_aestivum_stanley_D.fa | 7 | 3,986,277,988 | 0.874331 | 0.017540 | 69,917,336| 572,943,128 | 572,943,128 | 657,494,025 | 483,823,121 |
| triticum_dicoccoides_A.fa | 7 | 4,899,336,816 | 0.883402 | 0.014168 | 69,415,818| 726,427,787 | 726,427,787 | 775,183,943 | 593,586,810 |
| triticum_dicoccoides_B.fa | 7 | 5,179,702,578 | 0.878532 | 0.015650 | 81,060,827| 712,180,895 | 712,180,895 | 841,096,276 | 673,896,466 |
| triticum_spelta_A.fa | 7 | 4,900,765,403 | 0.880041 | 0.005361 | 26,271,701| 737,453,356 | 737,453,356 | 782,685,093 | 583,494,258 |
| triticum_spelta_B.fa | 7 | 5,134,000,283 | 0.873542 | 0.006527 | 33,510,942| 708,205,786 | 708,205,786 | 835,583,350 | 669,032,550 |
| triticum_spelta_D.fa | 7 | 3,965,570,715 | 0.869875 | 0.007229 | 28,668,590| 573,398,137 | 573,398,137 | 648,139,033 | 471,251,328 |
| triticum_timopheevii_A.fa | 7 | 4,849,233,683 | 0.874839 | 0.000010 | 46,400 | 694,350,238 | 694,350,238 | 771,176,557 | 585,824,631 |
| triticum_timopheevii_B.fa | 7 | 4,403,617,647 | 0.853180 | 0.000047 | 205,500 | 643,128,204 | 643,128,204 | 692,654,486 | 495,016,746 |
| triticum_urartu.fa | 10,284 | 4,851,895,022 | 0.903672 | 0.006242 | 30,285,589| 661,480,603 | 12,085 | 753,719,114 | 1,728 |
| aegilops_tauschii.fa | 109,583 | 4,224,915,394 | 0.867977 | 0.022705 | 95,926,476| 577,375,663 | 612 | 651,661,114 | 384 |
| brachypodium_distachyon.fa | 10 | 271,163,419 | 0.380832 | 0.001563 | 423,958 | 59,130,575 | 28,630,136 | 75,071,545 | 1,933 |
| hordeum_vulgare.fa | 290 | 4,225,577,519 | 0.853346 | 0.000314 | 1,325,794 | 610,333,535 | 76,179 | 665,585,731 | 50,002 |
| secale_cereale.fa | 8 | 6,735,227,109 | 0.881450 | 0.023124 | 155,746,534| 899,925,126 | 899,925,126 | 965,754,312 | 528,437,893 |
Do you know why `paffy tile` (from `cactus-blast`) takes so long (4+ days) and consumes more than 1TB of memory? Given the table above, is this expected behaviour?
On our previous wheat alignment, we saw faster execution times (1.5 on average) for `cactus-blast` and lower memory consumption with similar assembly stats.
Thanks
`paffy tile` does not scale well with very large sets of pairwise alignments, e.g. #905 and #877. For vertebrates, the solution has always been better masking. Switching to the RED preprocessor further helped for some T2T genomes.
I have no experience aligning wheat, but obviously there is an issue here. How were the genomes masked? Is this data you've aligned successfully with an older version of Cactus? If so, perhaps activating the lastz preprocessor could help (though it would probably add much more compute than a 4-day `paffy tile`). Can you share a subset of this data to reproduce the error? If so, I'll put it in the debugger and can work with @benedictpaten to hopefully add a more aggressive filter or something to prevent these super long runtimes. Thanks. It's pretty easy to imagine this type of problem coming up somewhere in the VGP tree, so it's best to get it sorted out as soon as possible.
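To gauge how large the set of pairwise alignments is before tiling, one option is to tally records per sequence pair. The sketch below is a rough diagnostic that assumes the alignments are available as a standard PAF file (12 tab-separated mandatory columns); the function name `summarize_paf` and the way the file is supplied are hypothetical and not part of Cactus or paffy.

```python
from collections import Counter
from typing import Iterable, Tuple


def summarize_paf(lines: Iterable[str]) -> Tuple[Counter, Counter]:
    """Count PAF records and aligned query bases per (query, target) pair."""
    records: Counter = Counter()
    aligned: Counter = Counter()
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 12:  # skip blank or malformed lines
            continue
        # PAF columns: 1 query name, 3 query start, 4 query end, 6 target name.
        q_name, t_name = fields[0], fields[5]
        q_start, q_end = int(fields[2]), int(fields[3])
        key = (q_name, t_name)
        records[key] += 1
        aligned[key] += q_end - q_start
    return records, aligned


if __name__ == "__main__":
    import sys

    # Usage: python summarize_paf.py alignments.paf
    with open(sys.argv[1]) as handle:
        recs, bases = summarize_paf(handle)
    for pair, count in recs.most_common(10):
        print(pair, count, bases[pair])
```

Pairs whose aligned bases far exceed the query length are a hint that repeats are still producing many overlapping alignments.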
Looking up wheat it's apparently about 17Gb hexaploid. If your inputs are 5Gb, that would make them diploid? I could definitely see input genomes being diploid exacerbating the pairwise alignment situation.
Cactus has some support for diploid genomes, but you have to divide them up into haploid fasta files. You can see an example of a diploid primate alignment here: https://cgl.gi.ucsc.edu/data/cactus/t2t-apes/16-t2t-apes-2023v2/
So if your genomes are indeed diploid, splitting them up in this way is the first thing I'd try.
Hi @glennhickey... @thiagogenez has asked me to add my two cents here. Unfortunately these wheat genomes are not diploid in the sense that you describe: each of the files, even the 5Gb ones, already represents a haploid subgenome. So we wouldn't be able to split them up in that way.
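As a quick sanity check on the sizes, the three subgenome files can be summed. The sketch below uses the Total Length values from the table above for `triticum_aestivum_A.fa`, `triticum_aestivum_B.fa` and `triticum_aestivum_D.fa`; treating those three files as one complete hexaploid assembly is an assumption based on the file names.

```python
from typing import Dict

# Total Length values copied from the assembly stats table above.
SUBGENOME_LENGTHS: Dict[str, int] = {
    "A": 4_934_891_648,  # triticum_aestivum_A.fa
    "B": 5_180_314_468,  # triticum_aestivum_B.fa
    "D": 3_951_074_735,  # triticum_aestivum_D.fa
}


def total_length(lengths: Dict[str, int]) -> int:
    """Sum the per-subgenome lengths in base pairs."""
    return sum(lengths.values())


if __name__ == "__main__":
    total = total_length(SUBGENOME_LENGTHS)
    print(f"combined: {total:,} bp ({total / 1e9:.2f} Gb)")
    for name, length in SUBGENOME_LENGTHS.items():
        print(f"subgenome {name}: {length / total:.1%} of combined length")
```

The combined length is about 14.07 Gb, so each roughly 4 to 5 Gb file is about a third of the assembled genome, which is what one would expect from one subgenome per file.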
@twalsh-ebi thanks for the confirmation. How were they masked? Can they be shared with me to reproduce?
I didn't do the masking myself, @glennhickey, but my understanding is that they were masked using RED.
All but one of the soft-masked genome sequences are already on the Ensembl FTP site. I trust @thiagogenez will get in touch with you about the one that is not listed here.
Are there RepeatMasker libraries for wheat? Masking with RepeatMasker may be what's required to efficiently align these genomes with Cactus.
Issue: Long Running Time and High Memory Consumption for `paffy` in `cactus-blast`
Hello,
I am aligning wheat genomes using the `cactus-blast` workflow, and I have encountered significant issues with the `paffy` step. Specifically, the running time and memory consumption appear to be unusually high.
Details
Log Snippet
Description of the Issue
The `paffy` step from `cactus-blast` consumes a substantial amount of memory (ranging from 600GB to 1000GB) and takes around 3.5 days to complete. Given the task, this seems excessively long and resource-intensive.
Questions
`paffy` when aligning wheat genomes of this size?
Any guidance or suggestions would be greatly appreciated.
Thank you so much for your help.