LeeBergstrand / BackBLAST_Reciprocal_BLAST

This repository contains a reciprocal BLAST program for filtering down BLAST results to best bidirectional hits. It also contains a toolkit for finding and visualizing BLAST hits for gene clusters within multiple bacterial genomes.
MIT License

Unusual behaviour in BackBLAST.py when run in multiple instances #34

Closed (jmtsuji closed 5 years ago)

jmtsuji commented 5 years ago

@LeeBergstrand This is a bit of a mystery to me.

Background

BackBLAST.py runs with 1 thread. In the snakemake workflow, I allow the user to run multiple instances of BackBLAST.py at a time (via the snakemake --jobs flag) to perform analyses in parallel.

Problem description

When --jobs is 1 (i.e., 1 process runs at a time), pipeline results match what I'd expect. For example: [image: BackBLAST_heatmap_1thread]

This figure is reproducible over multiple runs.

However, when --jobs is > 1, I get a different result each time. Here are three runs with identical settings, each run with --jobs 4 (i.e., 4 single-threaded processes can run at once): [images: BackBLAST_heatmap_4jobs_run1, BackBLAST_heatmap_4jobs_run2, BackBLAST_heatmap_4jobs_run4]

Specifically, it appears that some genomes randomly have empty results from BackBLAST.py. BackBLAST.py successfully ran (i.e., the log file looks fine), but the output file is empty.

Any idea what could be behind this issue (or keywords to look up)? Thanks.

jmtsuji commented 5 years ago

@LeeBergstrand Just had a brainwave. I haven't tested this yet, but it's probably the tempQuery.faa file generated when BackBLAST.py runs. When two instances of BackBLAST.py run in the same folder at the same time, their tempQuery.faa files would collide.

Are you okay if I modify BackBLAST.py so that it adds a random string at the end of the tempQuery.faa filename to avoid the collision?
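A minimal sketch of the random-suffix idea, for illustration only. This is not the actual BackBLAST.py code: the helper name `unique_temp_query_path` and its parameters are hypothetical, and using `uuid` for the random string is an assumption.

```python
import os
import uuid


def unique_temp_query_path(directory=".", base="tempQuery", ext=".faa"):
    """Return a temp FASTA path with a random suffix (hypothetical helper)."""
    # A random hex string, so concurrent instances running in the same
    # folder are very unlikely to pick the same filename.
    suffix = uuid.uuid4().hex[:12]
    return os.path.join(directory, f"{base}_{suffix}{ext}")


if __name__ == "__main__":
    # Two calls should give two different paths.
    print(unique_temp_query_path())
    print(unique_temp_query_path())
```

Python's standard `tempfile` module (e.g., `tempfile.mkstemp`) would be another way to get a unique file, and it also creates the file atomically.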

LeeBergstrand commented 5 years ago

Yes! I agree, that’s what I thought it was. That should work.

(Replied by email to the Aug 7, 2019, 6:55 AM notification of jmtsuji's comment above.)

LeeBergstrand commented 5 years ago

@jmtsuji You might be able to use the subject file name + temp.faa or something like that, as each temp file should be unique to each subject.

jmtsuji commented 5 years ago

@LeeBergstrand PR created that addresses this issue. I used a thread identifier to make each temp filename unique. I worry that the subject file name might still result in a collision one day (e.g., if we eventually add support for multiple query files). The temp filename is specified in the logfile so that it can be traced to a specific run during debugging, if needed.
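A rough sketch of the approach described above, not the code from the PR itself. The helper name `temp_query_path_for_run` and the logger name are hypothetical. Including the process ID alongside the thread identifier is an extra assumption of this sketch, since Python documents the thread identifier as having no direct meaning and does not promise that it differs between separate processes.

```python
import logging
import os
import threading

# Hypothetical logger name, used only for this sketch.
logger = logging.getLogger("backblast_sketch")


def temp_query_path_for_run(directory="."):
    """Build a per-run temp FASTA path and log it (hypothetical helper)."""
    pid = os.getpid()            # process ID (assumption: added for safety)
    tid = threading.get_ident()  # identifier of the current thread
    path = os.path.join(directory, f"tempQuery_{pid}_{tid}.faa")
    # Record the filename so it can be traced back to a specific run.
    logger.info("Temporary query file: %s", path)
    return path


if __name__ == "__main__":
    logging.basicConfig(level=logging.INFO)
    temp_query_path_for_run()
```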

jmtsuji commented 5 years ago

P.S. I've tested this fix, and it solves the problem. Now getting reproducible results!