ComparativeGenomicsToolkit / cactus

Official home of genome aligner based upon notion of Cactus graphs

Long Running Time and High Memory Consumption for `paffy` in `cactus-blast` #1420

Open thiagogenez opened 1 week ago

thiagogenez commented 1 week ago

Issue: Long Running Time and High Memory Consumption for paffy in cactus-blast

Hello,

I am aligning wheat genomes using the cactus-blast workflow, and I have encountered significant issues with the paffy step. Specifically, the running time and memory consumption appear to be unusually high.

Details

Log Snippet

[2024-06-20T18:32:53+0100] [MainThread] [I] [toil-rt] 2024-06-20 18:32:53.289192: Running the command: "paffy tile -i /tmp/toilwf-925ffc07a6be59998a17139eaf96f9df/7fe2/f5c6/tmp9vkub48a/chained_Anc06 --logLevel DEBUG"

Description of the Issue

The paffy step from cactus-blast consumes a substantial amount of memory (ranging from 600GB to 1000GB) and takes around 3.5 days to complete. Given the task, this seems excessively long and resource-intensive.

Questions

  1. Is this expected behaviour for paffy when aligning wheat genomes of this size?
  2. Are there any recommended optimizations or configurations to reduce memory usage and computation time?

Any guidance or suggestions would be greatly appreciated.

Thank you so much for your help.

glennhickey commented 1 week ago

This is generally caused by unmasked repeats in the input. Are your input genomes repeat masked? You can get a sense of how masked they are (if you don't know already) by grepping assembly stats from the cactus log.
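As a cross-check that does not depend on the log format, the masked fraction can also be estimated directly from a soft-masked FASTA. The following is a minimal sketch, assuming soft-masking is encoded as lowercase bases; the function name `masking_stats` is hypothetical, and the numbers may differ slightly from what Cactus reports if it counts Ns differently.

```python
import io
import string

# Translation table that deletes every lowercase ASCII letter.
_DROP_LOWER = str.maketrans("", "", string.ascii_lowercase)

def masking_stats(fasta_lines):
    """Return (total_bases, soft_masked_bases, n_bases) for FASTA text lines.

    Assumption: soft-masked bases are lowercase. A lowercase 'n' is counted
    both as masked and as an N.
    """
    total = masked = ns = 0
    for line in fasta_lines:
        if line.startswith(">"):
            continue  # header line, not sequence
        seq = line.strip()
        total += len(seq)
        # Lowercase count = length minus length after deleting lowercase letters.
        masked += len(seq) - len(seq.translate(_DROP_LOWER))
        ns += seq.count("N") + seq.count("n")
    return total, masked, ns

# Tiny in-memory example; for a real genome, pass an open file handle instead.
demo = io.StringIO(">chr1A\nACGTacgtNN\nacgtACGTac\n")
total, masked, ns = masking_stats(demo)
print(f"masked={masked / total:.3f} Ns={ns / total:.3f}")
```

For a real input, something like `with open("genome.fa") as fh: masking_stats(fh)` would stream the file line by line rather than loading it into memory.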

thiagogenez commented 3 days ago

Thanks, @glennhickey, for your thoughts.

I double-checked my input genomes, and they are repeat-masked. Here is the output of the assembly stats.

| Input Sample                          | Total Sequences | Total Length   | Proportion Repeat Masked | Proportion Ns | Total Ns  | N50          | Median Sequence Length | Max Sequence Length | Min Sequence Length |
|---------------------------------------|-----------------|----------------|--------------------------|---------------|-----------|--------------|------------------------|---------------------|---------------------|
| triticum_aestivum_A.fa                | 7               | 4,934,891,648  | 0.885057                 | 0.014803      | 73,053,193| 736,706,236  | 736,706,236            | 780,798,557         | 594,102,056         |
| triticum_aestivum_arinalrfor_A.fa     | 7               | 4,969,946,898  | 0.880639                 | 0.005448      | 27,077,568| 743,084,022  | 743,084,022            | 784,661,008         | 602,900,890         |
| triticum_aestivum_arinalrfor_B.fa     | 7               | 5,250,395,981  | 0.875950                 | 0.007224      | 37,926,871| 810,500,911  | 716,573,881            | 977,471,539         | 480,767,623         |
| triticum_aestivum_arinalrfor_D.fa     | 7               | 3,974,021,563  | 0.870773                 | 0.007369      | 29,286,018| 578,021,311  | 578,021,311            | 655,314,739         | 476,726,550         |
| triticum_aestivum_B.fa                | 7               | 5,180,314,468  | 0.880186                 | 0.016106      | 83,436,260| 720,988,478  | 720,988,478            | 830,829,764         | 673,617,499         |
| triticum_aestivum_D.fa                | 7               | 3,951,074,735  | 0.874260                 | 0.017538      | 69,292,437| 566,080,677  | 566,080,677            | 651,852,609         | 473,592,718         |
| triticum_aestivum_jagger_A.fa         | 7               | 4,983,156,636  | 0.884702                 | 0.014038      | 69,951,710| 743,847,818  | 743,847,818            | 804,285,258         | 596,211,899         |
| triticum_aestivum_jagger_B.fa         | 7               | 5,219,166,998  | 0.879437                 | 0.015615      | 81,496,077| 721,110,502  | 721,110,502            | 855,759,449         | 673,340,788         |
| triticum_aestivum_jagger_D.fa         | 7               | 3,970,003,109  | 0.874087                 | 0.015582      | 61,862,308| 570,159,854  | 570,159,854            | 673,981,989         | 459,355,444         |
| triticum_aestivum_julius_A.fa         | 7               | 4,964,574,427  | 0.880889                 | 0.009726      | 48,286,526| 745,978,486  | 745,978,486            | 791,475,352         | 586,755,746         |
| triticum_aestivum_julius_B.fa         | 7               | 5,222,063,627  | 0.875711                 | 0.011149      | 58,222,616| 727,285,804  | 727,285,804            | 858,776,195         | 670,301,833         |
| triticum_aestivum_julius_D.fa         | 7               | 3,981,035,100  | 0.870365                 | 0.011883      | 47,306,422| 575,129,590  | 575,129,590            | 661,246,824         | 479,660,269         |
| triticum_aestivum_kariega_A.fa        | 7               | 5,033,091,561  | 0.877517                 | 0.000541      | 2,722,485 | 755,457,679  | 755,457,679            | 794,474,755         | 613,662,638         |
| triticum_aestivum_kariega_B.fa        | 7               | 5,333,683,798  | 0.873526                 | 0.001459      | 7,780,157 | 738,041,677  | 738,041,677            | 864,624,966         | 701,857,263         |
| triticum_aestivum_kariega_D.fa        | 7               | 4,086,356,048  | 0.867702                 | 0.000152      | 621,384   | 584,285,409  | 584,285,409            | 662,526,948         | 504,659,958         |
| triticum_aestivum_lancer_A.fa         | 7               | 4,907,196,294  | 0.883395                 | 0.005918      | 29,042,896| 734,536,914  | 734,536,914            | 769,338,634         | 595,297,365         |
| triticum_aestivum_lancer_B.fa         | 7               | 5,013,902,246  | 0.876464                 | 0.006710      | 33,641,783| 702,438,406  | 702,438,406            | 839,470,345         | 665,179,885         |
| triticum_aestivum_lancer_D.fa         | 7               | 3,950,540,886  | 0.872088                 | 0.007500      | 29,627,862| 568,126,671  | 568,126,671            | 646,400,022         | 465,558,328         |
| triticum_aestivum_landmark_A.fa       | 7               | 4,966,053,268  | 0.881559                 | 0.016428      | 81,581,959| 740,148,362  | 740,148,362            | 791,748,890         | 595,339,094         |
| triticum_aestivum_landmark_B.fa       | 7               | 5,204,724,784  | 0.876961                 | 0.017895      | 93,136,106| 710,493,282  | 710,493,282            | 845,838,138         | 689,709,469         |
| triticum_aestivum_landmark_D.fa       | 7               | 3,982,871,035  | 0.871485                 | 0.019372      | 77,156,960| 570,643,040  | 570,643,040            | 656,817,438         | 484,551,304         |
| triticum_aestivum_mace_A.fa           | 7               | 4,897,709,906  | 0.882735                 | 0.006520      | 31,935,133| 732,118,298  | 732,118,298            | 782,694,893         | 590,561,804         |
| triticum_aestivum_mace_B.fa           | 7               | 5,127,197,460  | 0.878159                 | 0.007373      | 37,801,337| 704,156,067  | 704,156,067            | 848,590,828         | 667,607,564         |
| triticum_aestivum_mace_D.fa           | 7               | 3,937,477,063  | 0.873358                 | 0.008423      | 33,164,482| 567,265,955  | 567,265,955            | 650,274,702         | 475,327,881         |
| triticum_aestivum_mattis_A.fa         | 7               | 4,933,556,187  | 0.883870                 | 0.004501      | 22,204,811| 735,408,736  | 735,408,736            | 794,150,360         | 600,654,286         |
| triticum_aestivum_mattis_B.fa         | 7               | 5,126,139,104  | 0.879247                 | 0.005484      | 28,110,261| 799,857,935  | 698,878,671            | 969,998,116         | 467,876,140         |
| triticum_aestivum_mattis_D.fa         | 7               | 3,938,862,683  | 0.873429                 | 0.005638      | 22,207,647| 566,465,558  | 566,465,558            | 655,329,108         | 480,431,564         |
| triticum_aestivum_norin61_A.fa        | 7               | 4,921,847,059  | 0.875677                 | 0.005924      | 29,155,163| 723,255,126  | 723,255,126            | 781,462,734         | 594,006,513         |
| triticum_aestivum_norin61_B.fa        | 7               | 5,194,186,346  | 0.868098                 | 0.007617      | 39,566,671| 715,454,519  | 715,454,519            | 850,623,622         | 669,876,730         |
| triticum_aestivum_norin61_D.fa        | 7               | 3,941,626,919  | 0.865354                 | 0.007639      | 30,108,644| 564,869,106  | 564,869,106            | 650,275,864         | 478,264,344         |
| triticum_aestivum_paragon_A.fa        | 7               | 5,016,927,533  | 0.878688                 | 0.000004      | 18,600    | 759,055,895  | 759,055,895            | 795,989,443         | 599,230,268         |
| triticum_aestivum_paragon_B.fa        | 7               | 5,310,532,019  | 0.875044                 | 0.000014      | 73,000    | 733,835,468  | 733,835,468            | 872,909,281         | 688,536,368         |
| triticum_aestivum_paragon_D.fa        | 7               | 4,092,665,763  | 0.870697                 | 0.000002      | 8,800     | 586,077,705  | 586,077,705            | 670,531,570         | 499,575,344         |
| triticum_aestivum_renan_A.fa          | 7               | 4,966,282,335  | 0.877081                 | 0.014966      | 74,324,342| 746,502,734  | 746,502,734            | 792,837,209         | 593,930,347         |
| triticum_aestivum_renan_B.fa          | 7               | 5,216,673,246  | 0.872227                 | 0.016818      | 87,731,823| 717,542,863  | 717,542,863            | 854,463,248         | 673,746,810         |
| triticum_aestivum_renan_D.fa          | 7               | 4,012,688,034  | 0.868981                 | 0.022570      | 90,566,151| 569,771,178  | 569,771,178            | 661,835,603         | 493,761,083         |
| triticum_aestivum_stanley_A.fa        | 7               | 4,993,448,364  | 0.885270                 | 0.014144      | 70,628,727| 742,917,797  | 742,917,797            | 803,232,604         | 591,313,643         |
| triticum_aestivum_stanley_B.fa        | 7               | 5,227,912,278  | 0.880709                 | 0.015717      | 82,167,782| 715,714,221  | 715,714,221            | 856,542,542         | 697,113,365         |
| triticum_aestivum_stanley_D.fa        | 7               | 3,986,277,988  | 0.874331                 | 0.017540      | 69,917,336| 572,943,128  | 572,943,128            | 657,494,025         | 483,823,121         |
| triticum_dicoccoides_A.fa             | 7               | 4,899,336,816  | 0.883402                 | 0.014168      | 69,415,818| 726,427,787  | 726,427,787            | 775,183,943         | 593,586,810         |
| triticum_dicoccoides_B.fa             | 7               | 5,179,702,578  | 0.878532                 | 0.015650      | 81,060,827| 712,180,895  | 712,180,895            | 841,096,276         | 673,896,466         |
| triticum_spelta_A.fa                  | 7               | 4,900,765,403  | 0.880041                 | 0.005361      | 26,271,701| 737,453,356  | 737,453,356            | 782,685,093         | 583,494,258         |
| triticum_spelta_B.fa                  | 7               | 5,134,000,283  | 0.873542                 | 0.006527      | 33,510,942| 708,205,786  | 708,205,786            | 835,583,350         | 669,032,550         |
| triticum_spelta_D.fa                  | 7               | 3,965,570,715  | 0.869875                 | 0.007229      | 28,668,590| 573,398,137  | 573,398,137            | 648,139,033         | 471,251,328         |
| triticum_timopheevii_A.fa             | 7               | 4,849,233,683  | 0.874839                 | 0.000010      | 46,400    | 694,350,238  | 694,350,238            | 771,176,557         | 585,824,631         |
| triticum_timopheevii_B.fa             | 7               | 4,403,617,647  | 0.853180                 | 0.000047      | 205,500   | 643,128,204  | 643,128,204            | 692,654,486         | 495,016,746         |
| triticum_urartu.fa                    | 10,284          | 4,851,895,022  | 0.903672                 | 0.006242      | 30,285,589| 661,480,603  | 12,085                 | 753,719,114         | 1,728               |

Do you know why paffy tile (from cactus-blast) takes so long (4+ days) and consumes more than 1TB of memory? Given the table above, is this expected behaviour?

On our previous wheat alignment, cactus-blast ran faster (1.5 on average) and used less memory, with similar assembly stats.

Thanks

glennhickey commented 3 days ago

paffy tile does not scale well with very large sets of pairwise alignments, e.g. #905 and #877. For vertebrates the solution has always been better masking. Switching to the RED preprocessor further helped for some t2t-genomes.
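One way to gauge how large the pairwise alignment set is before it reaches paffy tile is to summarise the PAF file. The sketch below assumes the input is standard 12-column, tab-separated PAF (query name, length, start and end in the first four columns); the function name `summarize_paf` is hypothetical and is not part of paffy or Cactus. A very high mean depth on a query sequence would suggest that repeats are piling up alignments.

```python
from collections import defaultdict

def summarize_paf(lines):
    """Return (record_count, {query_name: mean_alignment_depth}).

    Depth is the sum of aligned query intervals divided by the query length.
    Assumes standard PAF columns: qname, qlen, qstart, qend, ...
    """
    records = 0
    covered = defaultdict(int)  # query name -> summed aligned query bases
    lengths = {}                # query name -> query sequence length
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 12:
            continue  # skip blank or malformed lines
        records += 1
        qname, qlen = fields[0], int(fields[1])
        qstart, qend = int(fields[2]), int(fields[3])
        lengths[qname] = qlen
        covered[qname] += qend - qstart
    depth = {q: covered[q] / lengths[q] for q in covered}
    return records, depth

# Two made-up records for illustration only.
demo = [
    "chr1A\t1000\t0\t500\t+\tchr1B\t1200\t100\t600\t450\t500\t60",
    "chr1A\t1000\t250\t750\t-\tchr1D\t900\t0\t500\t400\t500\t60",
]
records, depth = summarize_paf(demo)
print(records, depth)
```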

I have no experience aligning wheat, but obviously there is an issue here. How were the genomes masked? Is this data you've aligned successfully with an older version of Cactus? If so, perhaps activating the lastz preprocessor could help (though it would probably add much more compute than a 4-day paffy tile). Can you share a subset of this data to reproduce the error? If so I'll put it in the debugger and can work with @benedictpaten to hopefully add a more aggressive filter or something to prevent these super long runtimes. Thanks. It's pretty easy to imagine this type of problem coming up somewhere in the VGP tree, so it's best to get it sorted out as soon as possible.

glennhickey commented 3 days ago

Looking up wheat it's apparently about 17Gb hexaploid. If your inputs are 5Gb, that would make them diploid? I could definitely see input genomes being diploid exacerbating the pairwise alignment situation.

Cactus has some support for diploid genomes, but you have to divide them up into haploid fasta files. You can see an example of a diploid primate alignment here: https://cgl.gi.ucsc.edu/data/cactus/t2t-apes/16-t2t-apes-2023v2/

So if your genomes are indeed diploid, splitting them up in this way is the first thing I'd try.

twalsh-ebi commented 2 days ago

Hi @glennhickey... @thiagogenez has asked me to add my two cents here. Unfortunately, these wheat genomes are not diploid in the sense that you describe: each of the files, even the 5Gb ones, already represents a haploid subgenome. So we wouldn't be able to split them up in that way.
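The assembly stats table is consistent with this. As a quick worked check, assuming the `_A`, `_B` and `_D` suffixes denote the three subgenomes of one cultivar, summing the total lengths of the three `triticum_aestivum` files gives about 14 Gb, the same order as the ~17Gb hexaploid figure mentioned above:

```python
# Total lengths (bp) copied from the assembly stats table above.
subgenomes = {
    "triticum_aestivum_A.fa": 4_934_891_648,
    "triticum_aestivum_B.fa": 5_180_314_468,
    "triticum_aestivum_D.fa": 3_951_074_735,
}

total = sum(subgenomes.values())
print(f"A+B+D = {total:,} bp ({total / 1e9:.2f} Gb)")
# prints: A+B+D = 14,066,280,851 bp (14.07 Gb)
```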

glennhickey commented 2 days ago

@twalsh-ebi thanks for the confirmation. How were they masked? Can they be shared with me to reproduce?

twalsh-ebi commented 1 day ago

I didn't do the masking myself, @glennhickey, but my understanding is that they were masked using RED.

All but one of the soft-masked genome sequences are already on the Ensembl FTP site. I trust @thiagogenez will get in touch with you about the one that is not listed here.

glennhickey commented 1 day ago

Are there RepeatMasker libraries for wheat? This may be what's required (masking with RepeatMasker) to efficiently align these genomes with Cactus.