Gaius-Augustus / GALBA

GALBA is a pipeline for fully automated prediction of protein coding gene structures with AUGUSTUS in novel eukaryotic genomes for the scenario where high quality proteins from one or several closely related species are available.

Long runtime for 3 Gb genome #34

Closed ASLeonard closed 1 year ago

ASLeonard commented 1 year ago

Hi Katharina,

I had some questions regarding optimal running of GALBA for large mammalian genomes (~3 Gb). Currently GALBA downsamples to 8000 training genes if too many are identified from the miniprot output:

https://github.com/Gaius-Augustus/GALBA/blob/2b1f8253b326cbe47d3a6f2acbadcf08cb7a6420/scripts/galba.pl#L4123

I came across this post (https://darencard.net/blog/2020-07-23-augustus-optimization/) suggesting that training with more than 1000 genes has limited benefit, but not many other claims supporting it. Do you know if that is still generally true, or could the downsampling limit be exposed as a parameter? I'd rather get a "pretty good" annotation done in reasonable time than have the "best possible" annotation killed after 24 hours on 8 cores.

The other question: the manuscript discusses runtimes when using 72 cores, but the README seems to suggest that running with more than 8 has limited benefit because optimize_augustus.pl and other steps use at most 8 cores. Which of the two statements is the more useful one to follow?

@CEPHAS-01, did you get GALBA to finish on your assemblies (and if so, what was the CPU time)?

Best, Alex

KatharinaHoff commented 1 year ago

Dear Alex,

In the Supplementary of the BRAKER2 paper, Figure S3 answers the question of how much "incremental improvement" you get by going beyond 1000 training genes:

[Figure S3 from the BRAKER2 supplementary: prediction accuracy as a function of the number of training genes, A. thaliana]

This is data from A. thaliana; we did the same experiment with other species, and the picture looks similar (though not identical) across species. Accuracy improvements are possible if you go beyond 1000 training genes. We used to make the 1000-genes claim ourselves, but that was years ago, and my perspective now is that we had limited data when we came up with the idea that 1000 genes were "usually enough".

optimize_augustus.pl splits the training genes into buckets for k-fold cross validation. If you have a large number of training genes, it is usually fine to split into more than 8 buckets. GALBA sets k so that each bucket contains a minimum of 200 genes; with 8000 training genes it will therefore create at most 40 buckets and occupy at most 40 threads. I routinely run GALBA (and BRAKER) with 48 threads (because that's our infrastructure on the small nodes; going to e.g. 256 threads is indeed a waste of resources).
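To make the arithmetic explicit, here is a minimal sketch of that rule. It is an illustration only, not the actual logic in galba.pl or optimize_augustus.pl; it only assumes the 200-genes-per-bucket minimum described above:

```python
# Illustration only: how many cross-validation buckets (and hence threads)
# the optimization step can usefully occupy, assuming each bucket needs
# at least 200 training genes.
def useful_buckets(n_training_genes: int, min_genes_per_bucket: int = 200) -> int:
    return max(1, n_training_genes // min_genes_per_bucket)

print(useful_buckets(8000))  # 40 -> threads beyond ~40 would sit idle here
print(useful_buckets(1000))  # 5
```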

ASLeonard commented 1 year ago

Thanks for the detailed response. I'll bump the threads up towards 40.

Also, looking at BRAKER2 Figure S3, I (naively) would be happy getting ~54% accuracy with 1000 training genes compared to ~58% accuracy with 8000 training genes. Assuming runtime scales roughly linearly with the number of training genes, and the optimising/training stages are the bottleneck, I would happily trade 4% accuracy for a 5x speed-up. I'm looking at annotation as a bonus on multiple new assemblies, rather than annotating a single reference to high quality.

Do you think subsetting to 1000 training genes would be appropriate for that goal? Hopefully I can get the normal version with 8k to finish and then compare with 1k.
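For reference, this is roughly what I mean by "subsetting": randomly sampling records from the AUGUSTUS GenBank-format training file (records end with a line containing only //). A rough sketch only, with made-up file and function names, not a GALBA feature:

```python
# Sketch: randomly subsample an AUGUSTUS GenBank-format training file
# to at most n gene records. Illustration only; GALBA already does its
# own downsampling internally (currently to 8000).
import random

def subsample_gb(in_path: str, out_path: str, n: int, seed: int = 42) -> None:
    with open(in_path) as fh:
        records = [r for r in fh.read().split("//\n") if r.strip()]
    if len(records) > n:
        random.seed(seed)
        records = random.sample(records, n)
    with open(out_path, "w") as out:
        for r in records:
            out.write(r.rstrip("\n") + "\n//\n")

# e.g. subsample_gb("train.gb", "train.1k.gb", 1000)
```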

KatharinaHoff commented 1 year ago

It depends on who wants to work with the annotation in the future... some people are unhappy if something is wrong in the structural annotation (and a lot will be wrong either way; we can't fix everything, regardless of the number of training genes).

Extrinsic evidence will not fully restore a loss in ab initio accuracy.

CEPHAS-01 commented 1 year ago

@CEPHAS-01, did you get GALBA to finish on your assemblies (and if so, what was the CPU time)?

Hi Alex,

The run did not complete due to the error I described earlier in this thread, so it is difficult to estimate the CPU time for the run. You may be able to complete the run successfully, and I would love to hear your feedback on this.

Warm regards, Temitayo

ASLeonard commented 1 year ago

In the end, with 32 threads this took about 44 wall-clock hours (1086 CPU hours), peaking at 115 GB of RAM, so increasing the thread count was quite useful. It looks like GALBA predicts too many genes, although some script (maybe from a BRAKER issue?) reports that about 1/3 of the genes have low/no evidence from the hints, which brings the count closer to expectations.
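For completeness, once you have two gene-ID lists (all predicted genes, and genes with at least some hint support, however that support list was produced), the check itself is simple bookkeeping. The file names and function below are made up, purely to illustrate:

```python
# Sketch: report what fraction of predicted genes lack hint support.
# Assumes two plain-text files with one gene ID per line; producing the
# "supported" list (e.g. by intersecting predictions with hints) is a
# separate step -- this is only the counting.
def unsupported_fraction(all_genes_file: str, supported_genes_file: str) -> float:
    with open(all_genes_file) as fh:
        all_ids = {line.strip() for line in fh if line.strip()}
    with open(supported_genes_file) as fh:
        supported = {line.strip() for line in fh if line.strip()}
    if not all_ids:
        raise ValueError(f"no gene IDs found in {all_genes_file}")
    return 1.0 - len(all_ids & supported) / len(all_ids)

# e.g. print(f"{unsupported_fraction('all_genes.txt', 'supported_genes.txt'):.1%}")
```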