emo-bon / MetaGOflow

MGnify oriented implementation for the Marine Genomic Observatories oriented pipeline, developed in the framework of an EOSC-Life funded project
https://metagoflow.readthedocs.io
Apache License 2.0

Multiple tmp/data folders #16

Open jprmachado opened 1 year ago

jprmachado commented 1 year ago

When running in parallel from cwltool, after multiple instances of IPS (InterProScan) have run, the tmp directory is left holding all the leftover data that was required for the analyses.

Adjusting the parameters to:

+protein_chunk_size_IPS: 20000
+protein_chunk_size_eggnog: 1000
+protein_chunk_size_hmm: 500

This forces the analysis to run 19 IPS Docker instances, rather than the single instance used with the default values, when testing with the input:

test????_HWLTKDRXY_600000.fastq.gz

The leftover data from the analyses jumped from ~0.3 TB to ~1.5 TB. As far as I can see, the contents of all the tmp/*/data folders are the same.
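One way to check whether the tmp/*/data folders really are copies of each other is to fingerprint each one from its relative file paths and file sizes, then group the folders that match. This is only a sketch: the function names are made up for illustration, and comparing sizes rather than file contents is an assumption chosen to keep the scan fast on terabyte-scale folders.

```python
import hashlib
from pathlib import Path


def fingerprint_dir(root: Path) -> tuple[str, int]:
    """Return (digest, total_bytes) built from relative paths and file sizes under root."""
    digest = hashlib.sha256()
    total = 0
    for path in sorted(p for p in root.rglob("*") if p.is_file()):
        size = path.stat().st_size
        total += size
        # Hash the relative path and size only; file contents are not read.
        digest.update(f"{path.relative_to(root).as_posix()}\0{size}\n".encode())
    return digest.hexdigest(), total


def group_data_dirs(tmp_root: Path) -> dict[str, list[tuple[Path, int]]]:
    """Group every tmp_root/*/data folder by fingerprint."""
    groups: dict[str, list[tuple[Path, int]]] = {}
    for data_dir in sorted(tmp_root.glob("*/data")):
        if data_dir.is_dir():
            digest, total = fingerprint_dir(data_dir)
            groups.setdefault(digest, []).append((data_dir, total))
    return groups
```

If `group_data_dirs(Path("tmp"))` returns a single group holding all the folders, they have the same file listing and sizes, and the sum of the byte counts in that group is the space taken up by the copies.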

Making every instance use the same folder would probably have to be done in the cwltool source code; the other possible solution is to clean up after each IPS instance is destroyed.
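The clean-up option could be sketched as a small post-run step that removes every tmp/*/data folder. This is not part of MetaGOflow or cwltool; the function name and the dry-run default are assumptions for illustration, and it should only be run once no cwltool job is still using the folders.

```python
import shutil
from pathlib import Path


def remove_leftover_data_dirs(tmp_root: Path, dry_run: bool = True) -> list[Path]:
    """Delete every tmp_root/*/data folder and return the paths that matched.

    With dry_run=True (the default) nothing is deleted, only listed.
    Symlinked folders are skipped so that a shared database is never removed.
    """
    matched = [
        p for p in sorted(tmp_root.glob("*/data"))
        if p.is_dir() and not p.is_symlink()
    ]
    if not dry_run:
        for path in matched:
            shutil.rmtree(path)
    return matched
```

Calling `remove_leftover_data_dirs(Path("tmp"))` first shows what would be deleted; passing `dry_run=False` performs the deletion.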

cymon commented 9 months ago

The solution to this is to delete the InterProScan databases that are not needed. This drastically reduces the amount of data that is replicated for each scatter/chunk analysis; the total /tmp output is then in the tens of GB rather than 1-2 TB.

(Amusingly, the newer InterProScan data DBs include some 80K files in one (unused) DB. In an analysis where I had many chunks, the file count exceeded the ulimit (4 million) on the HPC I was using.)

We should include a line to delete unneeded IPS DBs at the time of installation.
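A minimal sketch of what such an install-time step might look like, assuming the InterProScan data directory holds one top-level subdirectory per database. That layout, the function name, and the database names in the usage note are all assumptions; the thread does not say which databases the pipeline actually needs, so the keep list would have to come from the pipeline configuration.

```python
import shutil
from pathlib import Path


def prune_databases(data_dir: Path, keep: set[str], dry_run: bool = True) -> list[str]:
    """Remove top-level subdirectories of data_dir whose names are not in keep.

    Returns the names selected for removal. With dry_run=True (the default)
    nothing is deleted.
    """
    unneeded = sorted(
        p.name for p in data_dir.iterdir()
        if p.is_dir() and not p.is_symlink() and p.name not in keep
    )
    if not dry_run:
        for name in unneeded:
            shutil.rmtree(data_dir / name)
    return unneeded
```

For example, `prune_databases(Path("interproscan/data"), keep={"db_a", "db_b"})` would list every other database folder without deleting anything; the path and the names `db_a` and `db_b` are placeholders, not real InterProScan database names.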