Open fubar2 opened 1 month ago
It might be quite reasonable to split by chromosomes; this tool should be able to do it: https://usegalaxy.org/?tool_id=toolshed.g2.bx.psu.edu%2Frepos%2Fbgruening%2Fsplit_file_to_collection%2Fsplit_file_to_collection%2F0.5.2&version=latest That would be both faster and more memory efficient.
That's an interesting option to pursue @mvdbeek. Thanks! Fasta contigs can be concatenated, but joining dozens of GFF files with headers will probably need a new tool, so it's probably not practicable for me - but if someone wants to take care of that and prepare a demonstration, it could be a solution.
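For anyone wanting to try the split idea outside Galaxy, here is a minimal Python sketch of splitting a multi-FASTA into one file per record. It is not the implementation of the linked split_file_to_collection tool; the function name `split_fasta_by_record` and the `<seq_id>.fasta` output naming are assumptions made for illustration, and it assumes sequence IDs are safe to use as file names.

```python
from pathlib import Path


def split_fasta_by_record(fasta_path, out_dir):
    """Write each record of a multi-FASTA file to its own file.

    Hypothetical helper for illustration. Streams line by line, so memory
    use stays small regardless of the input size. Returns the output paths
    in the order the records were seen.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    handle = None
    try:
        with open(fasta_path) as src:
            for line in src:
                if line.startswith(">"):
                    # A new record starts: close the previous output file.
                    if handle is not None:
                        handle.close()
                    # Assumption: the first token after ">" is a usable file name.
                    seq_id = line[1:].split()[0]
                    path = out_dir / f"{seq_id}.fasta"
                    handle = open(path, "w")
                    written.append(path)
                # Lines before the first header (if any) are skipped.
                if handle is not None:
                    handle.write(line)
    finally:
        if handle is not None:
            handle.close()
    return written
```

Each output file could then be masked independently, which is where the expected speed and memory gains would come from.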
Since it works fine on EU and there are other things to do, I'll remove it from the workflow for now, until that's done.
@mvdbeek: Here's why that job failed OOM with 59GB. Run on a local Galaxy with 12 cores, it uses < 1 GB RAM for most of the run, but right at the end RAM seems to blow out, peaking at ~63 GB or so - just a few more GB would probably work on .org
@mvdbeek: TreeValGal ignores the fasta output, which you might be assuming it uses, and only uses the GFF3.
A test at the GMOD gff3 tester shows that concatenating 2 or more GFF3, each with correct headers, will create an invalid GFF3. The message explains that it can be fixed and correctly ordered with one of their tools. If someone wants to wrap that new tool, it could be a solution. Sounds like more work than getting the allocation right.
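To illustrate why plain concatenation breaks, and what a merge has to do at minimum, here is a rough Python sketch. It keeps a single `##gff-version 3` line at the top, keeps each distinct `##sequence-region` directive once, and drops the repeated headers from the other inputs. The function name `merge_gff3` is hypothetical; this is not the GMOD tool or the merge-gff3 workflow. It drops other directives and comments, does not sort features, and does not resolve duplicate `ID=` values across files, so whether the output passes a given validator is not guaranteed.

```python
def merge_gff3(paths, out_path):
    """Merge several GFF3 files into one with a single version header.

    Hypothetical helper for illustration only. Feature lines are held in
    memory, so very large inputs would need a two-pass or streaming variant.
    """
    regions = []   # distinct ##sequence-region directives, first-seen order
    seen = set()
    features = []  # tab-separated feature lines from all inputs
    for path in paths:
        with open(path) as src:
            for raw in src:
                line = raw.rstrip("\n")
                if line.startswith("##FASTA"):
                    # Embedded sequence sections are not carried over here.
                    break
                if line.startswith("##sequence-region"):
                    if line not in seen:
                        seen.add(line)
                        regions.append(line)
                elif line.startswith("#") or not line.strip():
                    # Drops repeated ##gff-version lines, ### markers,
                    # other directives, comments and blank lines.
                    continue
                else:
                    features.append(line)
    with open(out_path, "w") as out:
        out.write("##gff-version 3\n")
        for region in regions:
            out.write(region + "\n")
        for feature in features:
            out.write(feature + "\n")
```

If the inputs come from a per-contig split, each file should only describe its own sequence, which makes ID clashes less likely but does not rule them out.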
Do you maybe have the top 100 lines of 2 valid GFF files? Nothing I find on the web actually validates against https://genometools.org/cgi-bin/gff3validator.cgi. The workflow at https://usegalaxy.org/u/marius/w/merge-gff3 probably works, but it's hard to test if nothing actually validates. And the one file I fixed up manually complains about overlapping ids when I duplicate it 😆
Ugh, this was hard, but finally I got 2 input files that actually validate. Here's an example run https://usegalaxy.org/workflows/invocations/84e15596bd4fc608?from_panel=true
@mvdbeek: Thanks! Will give that a try tomorrow.
@mvdbeek: More and more layers - it's not that simple of course. Ignoring the gff fixer for a moment for simplicity, a contig-split repeatmasker test with a 500MB fish fasta fails red on usegalaxy.org.
I can increase this of course but I'm very confused since afaict EU allocates only 40 GB (it is in their local tools.yml but it doesn't look like they override memory).
@fubar do you have a run on EU you can check the memory allocation/usage of?
Ah I forgot about their automatic resubmission.
Bumped to 76GB.
For efficiency, @mvdbeek's solution for getting a valid GFF after splitting into contigs could be very helpful. Now that it seems to have enough RAM, the WF starts and some parts run, but it does not end well. Repeatmasker is a very unruly tool but not sure how much more effort it deserves - unless this stress test provides a useful edge case for workflow job submission?
@natefoo: Sadly https://vgp.usegalaxy.org/datasets/f9cad7b01a4721353343582b8c4d1cc2/preview job ended green but with empty outputs ~28 hours after starting with mongo RAM allocation. See @mvdbeek's sensible map reduce suggestion and the conclusion of an attempt at implementing it above.
No need for more effort trying to tame this unruly tool for VGP scale operation. TreeValGal still has a windowmasker model free repeat density bigwig - so not crucial.
OTOH: If repeatmasker's dodgy code is effectively and properly isolated as a tool, maybe the failing workflow here is useful as an edge case for testing extremely resource hungry hammering during workflow invocation over a collection.
Currently, repeatmasker_wrapper has
A single chromosome works but a whole VGP haplotype fails OOM. Currently trying to get a RAM graph from running the same job but will take a while.