gem-pasteur / Integron_Finder

Bioinformatics tool to find integrons in bacterial genomes
GNU General Public License v3.0

Integron_Finder bogs down the server's I/O when run on metagenomes with about a million contigs #90

Open JuntaoZhong opened 3 years ago

JuntaoZhong commented 3 years ago

Version of Integron_Finder:

version 2

OS

Linux

Problem 1: I/O is overloaded for metagenomes with hundreds of thousands of contigs.

Usually, only a small subset of the contigs contain integrons. However, integron_finder creates a separate tmp file for each contig, regardless of whether it finds an integron in it.

That means hundreds of thousands of files end up in one folder by the end of the run. When the script then tries to coalesce these files, it overloads the I/O, which prevents other users from running basic commands like ls/cd/rm.

My temporary solution (included in the script attached):

I looked into /integron_finder/scripts/finder.py and modified the script so that it does not create tmp files for a contig unless it finds an integron in it.
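For readers who cannot open the attachment, the idea can be sketched roughly as follows. This is not the code from finder.py or from the attached script; `write_contig_results`, the `.integrons` file suffix, and the tuple layout of a hit are all hypothetical names chosen for illustration.

```python
import os
import tempfile


def write_contig_results(tmp_dir, contig_id, integrons):
    """Write a per-contig result file only when integrons were found.

    Hypothetical helper: `integrons` is assumed to be a list of tuples,
    one per integron. Returns the file path, or None if nothing was written.
    """
    if not integrons:
        # No hit for this contig: skip file creation entirely.
        return None
    path = os.path.join(tmp_dir, f"{contig_id}.integrons")
    with open(path, "w") as handle:
        for row in integrons:
            handle.write("\t".join(map(str, row)) + "\n")
    return path


def demo():
    """Run the helper on three made-up contigs, only one of which has a hit."""
    hits = {
        "contig_1": [],
        "contig_2": [("integron_01", 120, 980)],
        "contig_3": [],
    }
    with tempfile.TemporaryDirectory() as tmp_dir:
        written = [
            write_contig_results(tmp_dir, contig_id, rows)
            for contig_id, rows in hits.items()
        ]
        return sorted(os.listdir(tmp_dir)), written.count(None)
```

With a metagenome where most contigs have no hit, the number of files in the tmp folder then scales with the number of integron-bearing contigs rather than with the total number of contigs.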

Problem 2: The program fails to remove tmp folders because the files are still open at the time of deletion.

My tmp solution:

Execute the remove command at the very end of the for loop, when I am sure that the program has closed all files in the tmp folder. [lines 604-615 in my script]
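The ordering described above (close every file first, delete the folder last) might look like the sketch below. It is a generic pattern, not the code at the referenced lines; `process_contig`, the `scratch.txt` file name, and the `tmp_` prefix are hypothetical.

```python
import os
import shutil
import tempfile


def process_contig(work_dir, contig_id):
    """Create a per-contig tmp folder, use it, and remove it last.

    Hypothetical helper: the `with` block guarantees the file handle is
    closed before the `finally` clause deletes the folder.
    """
    tmp_dir = tempfile.mkdtemp(prefix=f"tmp_{contig_id}_", dir=work_dir)
    try:
        with open(os.path.join(tmp_dir, "scratch.txt"), "w") as handle:
            handle.write("intermediate data\n")
        # The handle is closed here, so nothing in tmp_dir is still open.
        return contig_id
    finally:
        # Runs at the very end of each iteration, even if an error occurred.
        shutil.rmtree(tmp_dir, ignore_errors=True)


def run_all(work_dir, contig_ids):
    """Process every contig in turn; no tmp folder outlives its iteration."""
    return [process_contig(work_dir, contig_id) for contig_id in contig_ids]
```

Opening files through `with` rather than bare `open()` calls is what makes the deletion safe; on network filesystems an open handle can otherwise leave the folder undeletable.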

Feel free to laugh at/despise my rudimentary coding skills!

finder_jimmy.py.txt

bgruening commented 1 year ago

@bneron we see this as well with large Galaxy input files. Millions of files in one directory are hard for any filesystem to handle. Is there any way we could do less I/O in the first place? For example, cache the results in memory and only write them out if there are any, or enough of them?
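One way to read this suggestion is sketched below, assuming per-contig results fit in memory as small tuples. `write_merged_results`, the column names, and the `(contig_id, hits)` input shape are hypothetical and do not reflect Integron_Finder's actual data structures or output format.

```python
import csv
import os
import tempfile


def write_merged_results(results_by_contig, out_path):
    """Collect hits in memory and write one merged file, if there are any.

    Hypothetical helper: `results_by_contig` is an iterable of
    (contig_id, hits) pairs, where each hit is (integron_id, start, end).
    Returns the number of rows written; writes nothing when there are none.
    """
    rows = [
        (contig_id, *hit)
        for contig_id, hits in results_by_contig
        for hit in hits
    ]
    if not rows:
        return 0
    with open(out_path, "w", newline="") as handle:
        writer = csv.writer(handle, delimiter="\t")
        writer.writerow(["contig", "integron_id", "start", "end"])
        writer.writerows(rows)
    return len(rows)
```

This replaces one file per contig with a single write at the end. For very large inputs the same idea could be applied in batches, flushing the in-memory list once it reaches some size.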

Matt-BF commented 1 year ago

Hi! I am having similar problems with metagenomic contigs. Currently Integron_Finder generates summary and output files even when no integrons are found, and it is unable to remove the tmp directories, resulting in thousands of files and directories. I tried JuntaoZhong's solution, but I assume the script was written for an older version of Integron_Finder, and I wasn't able to make it work for my use case.

jeanrjc commented 12 months ago

Hello, we'll look into that once we have some time. Meanwhile, you can submit pull requests if you can. Thanks for reporting.