EI-CoreBioinformatics / reat

Robust Eukaryotic Annotation Toolkit
https://reat.readthedocs.io/en/latest/
MIT License

Fix minimap index issue with large genomes #30 #31

Closed gemygk closed 2 years ago

gemygk commented 2 years ago

Thanks @ljyanesm for adding the changes.

I will merge this into main.

ljyanesm commented 2 years ago

Is there a chance you could spend a bit of time checking why the pipeline failed? I suspect the parameter may be causing mm2 to use more memory, so it's not completing correctly now, but I haven't had a chance to check.

On Fri, 24 Jun 2022, 13:55 Gemy George Kaithakottil wrote:

Merged #31 https://github.com/EI-CoreBioinformatics/reat/pull/31 into main.


gemygk commented 2 years ago

As David mentioned in issue #30, minimap2 creates a multi-part index when the genome is larger than 4 gigabases (the default value of -I), which causes mangled SAM headers. When David tested it, increasing -I to 128G kept a single index, as before, and minimap2 worked fine. The memory used to index wheat was ~100-125 GB with Hisat and is ~90-100 GB with minimap2, which is expected in this case.

Having said that, I have not tried to fix the SAM headers with --split-prefix (plus any additional steps that requires), which would be ideal to do at a later stage.
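
For reference, here is a minimal sketch of the check described above. It is not the code merged in #31. The helper names (`parse_size`, `count_bases`, `needs_multipart_index`, `build_index_command`) are hypothetical, and the "genome larger than -I" rule is an approximation of minimap2's batching. The only minimap2 options used are `-I` (index batch size) and `-d` (write the index to a file). Size suffixes are treated as decimal (K = 10^3, M = 10^6, G = 10^9), which is my understanding of how minimap2 parses them.

```python
# Sketch only: estimates whether minimap2 would split its index and builds
# an indexing command with a larger -I. Helper names are hypothetical.

_SUFFIXES = {"k": 10**3, "m": 10**6, "g": 10**9}  # assumed decimal multipliers


def parse_size(text):
    """Convert a size string such as '4G' or '128g' into a number of bases."""
    text = text.strip()
    suffix = text[-1].lower()
    if suffix in _SUFFIXES:
        return int(float(text[:-1]) * _SUFFIXES[suffix])
    return int(float(text))


def count_bases(fasta_path):
    """Count sequence characters in a FASTA file, skipping header lines."""
    total = 0
    with open(fasta_path) as handle:
        for line in handle:
            if not line.startswith(">"):
                total += len(line.strip())
    return total


def needs_multipart_index(genome_bases, batch_size="4G"):
    """Approximate test: the index is split once the genome exceeds -I."""
    return genome_bases > parse_size(batch_size)


def build_index_command(reference, index_path, batch_size="128G"):
    """Return an argument list that indexes `reference` using one -I batch."""
    return ["minimap2", "-I", batch_size, "-d", index_path, reference]
```

A caller could run `count_bases` on the reference and raise `-I` only when `needs_multipart_index` returns true, so that small genomes keep the default. The returned list can be passed to `subprocess.run`. A larger `-I` trades memory for a single-part index, which matches the memory figures quoted above.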

ljyanesm commented 2 years ago

Sorry, my bad, I didn't realise the pipeline error was an indentation issue. You've already fixed this.

Many thanks!


gemygk commented 2 years ago

Oh yes, I thought you were talking about the actual pipeline 🙂.