endixk / ezaai

EzAAI - High Throughput Prokaryotic AAI Calculator
http://leb.snu.ac.kr/ezaai
GNU General Public License v3.0

AAI calculation error: tmp file issue? #6

Closed bluegenes closed 2 years ago

bluegenes commented 2 years ago

Hi folks,

I'm trying to use a snakemake workflow to run a number of EzAAI jobs. However, I'm getting an error during extract that makes me worried about whether the output files will always be correct.

For extract, I'm generating a temporary decompressed file, since all my references are available only as .fna.gz. Here's an example command.

gunzip -c /path/to/ref/files/GCA_009909065.1_genomic.fna.gz > GCA_009909065.1.tmp.fna

java -jar EzAAI_latest.jar extract -i GCA_009909065.1.tmp.fna -o GCA_009909065.1.db -l GCA_009909065.1 > GCA_009909065.1.extract.log

rm GCA_009909065.1.tmp.fna

The error I'm (sometimes!) getting is the following:

java.io.FileNotFoundException: /tmp/prodigal.faa (No such file or directory)
        at java.base/java.io.FileInputStream.open0(Native Method)
        at java.base/java.io.FileInputStream.open(FileInputStream.java:219)
        at java.base/java.io.FileInputStream.<init>(FileInputStream.java:157)
        at java.base/java.io.FileReader.<init>(FileReader.java:75)
        at leb.process.ProcCDSPredictionByProdigal.execute(ProcCDSPredictionByProdigal.java:147)
        at leb.main.EzAAI.runExtract(EzAAI.java:224)
        at leb.main.EzAAI.run(EzAAI.java:482)
        at leb.main.EzAAI.main(EzAAI.java:518)

If I'm running more than one EzAAI job on the same node, how does that affect the /tmp/prodigal.faa file for each extraction? Will a later job fail, or will it overwrite the first job's /tmp/prodigal.faa? Re-running failed jobs a second time usually resolves the issue.

Note, I am running EzAAI in the following conda environment:

conda list -p  /home/ntpierce/miniconda3/7841dd127abab0c21fbc5a5b78f2aefd
# packages in environment at /home/ntpierce/miniconda3/7841dd127abab0c21fbc5a5b78f2aefd:
#
# Name                    Version                   Build  Channel
_libgcc_mutex             0.1                 conda_forge    conda-forge
_openmp_mutex             4.5                       1_gnu    conda-forge
bzip2                     1.0.8                h7f98852_4    conda-forge
ca-certificates           2021.10.8            ha878542_0    conda-forge
gawk                      5.1.0                h7f98852_0    conda-forge
gettext                   0.19.8.1          h73d1719_1008    conda-forge
libffi                    3.4.2                h7f98852_5    conda-forge
libgcc-ng                 11.2.0              h1d223b6_12    conda-forge
libgomp                   11.2.0              h1d223b6_12    conda-forge
libidn2                   2.3.2                h7f98852_0    conda-forge
libstdcxx-ng              11.2.0              he4da1e4_12    conda-forge
libunistring              0.9.10               h7f98852_0    conda-forge
libzlib                   1.2.11            h36c2ea0_1013    conda-forge
mmseqs2                   13.45111             h95f258a_1    bioconda
openssl                   3.0.0                h7f98852_2    conda-forge
prodigal                  2.6.3                h779adbc_3    bioconda
wget                      1.20.3               ha35d2d1_1    conda-forge
zlib                      1.2.11            h36c2ea0_1013    conda-forge
bluegenes commented 2 years ago

Upon further testing with calculate, I do think something is going wrong with db generation using this strategy. When I use the db files generated by simultaneous snakemake jobs distributed across the cluster, I often get the error below. When I regenerate the db files in an interactive session, without running simultaneous jobs, calculate works as intended.

So it seems multiple extract jobs cannot be run simultaneously. Please let me know if you have any suggestions.

calculate error:

java.lang.ArithmeticException: / by zero
        at leb.process.ProcCalcPairwiseAAI.calcIdentityWithDetails(ProcCalcPairwiseAAI.java:462)
        at leb.process.ProcCalcPairwiseAAI.pairwiseMmseqs(ProcCalcPairwiseAAI.java:643)
        at leb.process.ProcCalcPairwiseAAI.calculateProteomePairWithDetails(ProcCalcPairwiseAAI.java:250)
        at leb.main.EzAAI.runCalculate(EzAAI.java:351)
        at leb.main.EzAAI.run(EzAAI.java:483)
        at leb.main.EzAAI.main(EzAAI.java:518)

Follow-up question: I assume this error means there is no shared similarity, so AAI cannot be calculated. Is this the case, or do you report no similarity as 0 in the output file? It would be great to have a 0 value reported (and normal program exit) if it's not a different error! Happy to drop this in a separate issue if it would be helpful.

endixk commented 2 years ago

Hello, thank you so much for these detailed reports.

The first issue was caused by a simple code mistake that gave the temporary files common names. Because of this, the error you mentioned occurred when one of your sessions accessed and wiped the Prodigal output produced by a completely different session.

I fixed the code to give temporary files properly randomized names so that sessions won't interfere with each other.
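The same idea can be sketched at the shell level with mktemp(1), which generates a collision-free name on each invocation (an illustration only, not the actual Java change):

```shell
# Each call to mktemp produces a unique path, so two concurrent
# sessions never share a single fixed /tmp/prodigal.faa-style file.
a=$(mktemp /tmp/prodigal.faa.XXXXXX)
b=$(mktemp /tmp/prodigal.faa.XXXXXX)
[ "$a" != "$b" ] && echo "unique temp names"
rm -f "$a" "$b"
```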

Also, for the second issue, I added a few lines of fail-safe code for zero-division cases.
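For illustration, the fail-safe amounts to guarding the denominator before dividing (a shell sketch with made-up variable names, not the actual Java code):

```shell
# Stand-ins for summed identities and the shared-hit count; here no
# reciprocal hits were found, so the denominator would be zero.
sum=0; n=0
if [ "$n" -eq 0 ]; then
  aai=0             # no shared hits: report 0 instead of dividing by zero
else
  aai=$((sum / n))
fi
echo "AAI=$aai"
```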

The new version of EzAAI has been uploaded to our website; it can also be downloaded using the following link: Download

Thanks!

agg437 commented 2 years ago

Maybe it is something about .fna files: I got similar errors with .fna files, but when I tried .faa files from Prokka it worked without errors, although I was not running simultaneous jobs. I hope this is helpful.

bluegenes commented 2 years ago

Thanks for the fixes, folks! I ended up running all my extract steps independently, but I'll make sure to go back and double-check your fix with some additional files.

One more follow-up question:

Temp filenames seem random for calculate, but I am occasionally running into a similar error:

java.io.FileNotFoundException: /tmp/4b5f93eeab2eefee_faa/j0.faa (No such file or directory)
        at java.base/java.io.FileInputStream.open0(Native Method)
        at java.base/java.io.FileInputStream.open(FileInputStream.java:219)
        at java.base/java.io.FileInputStream.<init>(FileInputStream.java:157)
        at java.base/java.io.FileInputStream.<init>(FileInputStream.java:112)
        at java.base/java.io.FileReader.<init>(FileReader.java:60)
        at leb.process.ProcCalcPairwiseAAI.pairwiseMmseqs(ProcCalcPairwiseAAI.java:579)
        at leb.process.ProcCalcPairwiseAAI.calculateProteomePairWithDetails(ProcCalcPairwiseAAI.java:250)
        at leb.main.EzAAI.runCalculate(EzAAI.java:361)
        at leb.main.EzAAI.run(EzAAI.java:493)
        at leb.main.EzAAI.main(EzAAI.java:528)

Do you think a similar issue could be happening, e.g. if filenames are not fully randomized? I was running ~30 jobs at once, and this error was cropping up pretty often. Again, re-running usually "solves" the issue (program exits without error).

Note this is with EzAAI_v1.11.jar; running a single instance at a time results in no errors.

endixk commented 2 years ago

The .faa generation issue can occur when you run multiple calculate modules simultaneously, because the module was not designed to handle multiple sessions.

The .db file, which is the input to the calculate module, is simply a compressed tarball of mmseqs output files with fixed, common names. Multiple calculate sessions unpack their .db files into the same directory, so any session can easily overwrite or remove files belonging to another session.
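The collision can be reproduced in miniature with two tarballs whose contents share a name, unpacked into one directory (a toy illustration; the file name mmseqs.out is made up):

```shell
work=$(mktemp -d)
mkdir "$work/a" "$work/b"
echo "session A" > "$work/a/mmseqs.out"
echo "session B" > "$work/b/mmseqs.out"
tar -cf "$work/a.db" -C "$work/a" mmseqs.out   # two ".db" tarballs with
tar -cf "$work/b.db" -C "$work/b" mmseqs.out   # identically named contents
mkdir "$work/shared_tmp"                       # one directory shared by both sessions
tar -xf "$work/a.db" -C "$work/shared_tmp"
tar -xf "$work/b.db" -C "$work/shared_tmp"     # silently clobbers session A's file
result=$(cat "$work/shared_tmp/mmseqs.out")
echo "$result"
rm -rf "$work"
```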

To prevent this, please run the calculate module with the -t [THREAD] argument instead of running multiple sessions simultaneously. The multi-threading option removes the risk of this issue while maintaining the throughput of the analysis.
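In command form, that pattern looks roughly like the following (the pair list, output naming, and the stub standing in for the real java -jar invocation are hypothetical; check the flag names against your version's help text):

```shell
# Run pairwise calculate jobs one at a time, letting -t provide the
# parallelism. run_calculate is a stub; replace the echo with e.g.
#   java -jar EzAAI_latest.jar calculate -i "$1" -j "$2" -o "$3" -t 8
run_calculate() {
  echo "calculate -i $1 -j $2 -o $3 -t 8"
}
printf 'a.db b.db\na.db c.db\n' > pairs.txt    # hypothetical pair list
while read -r q t; do
  run_calculate "$q" "$t" "${q%.db}_vs_${t%.db}.tsv"
done < pairs.txt
rm -f pairs.txt
```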

I plan to add a multi-threading option to the extract module as well, to make the pipeline consistent.

Thanks again, and any further feedback is welcome!

bluegenes commented 2 years ago

Ok, thanks - this is very important to know!

I'm not sure how common my use case is relative to others': I have a series of specific pairwise comparisons I'm interested in, rather than a large all-vs-all comparison. In any case, multithreading and running each process sequentially worked, though it was slower than spamming jobs across a large cluster :). Thanks!