Closed jmtsuji closed 5 years ago
@LeeBergstrand Just had a brainwave. I haven't tested this yet, but it's probably the tempQuery.faa file generated when BackBLAST.py runs. There would be a file collision of two tempQuery.faa files when two simultaneous instances of BackBLAST.py run in the same folder at the same time.
Are you okay if I modify BackBLAST.py so that it adds a random string at the end of the tempQuery.faa filename to avoid the collision?
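A minimal sketch of the random-string idea, not the actual BackBLAST.py change: Python's standard tempfile module can create a uniquely named file in the working folder. The helper name make_temp_query_path and the tempQuery_ prefix are hypothetical and only serve the illustration.

```python
import os
import tempfile


def make_temp_query_path(directory="."):
    """Create an empty, uniquely named query file and return its path.

    Hypothetical helper for illustration; not part of BackBLAST.py.
    """
    # mkstemp creates the file atomically with a random infix, so two
    # processes calling it in the same folder receive different files.
    fd, path = tempfile.mkstemp(prefix="tempQuery_", suffix=".faa", dir=directory)
    os.close(fd)  # the caller reopens the path when writing the query
    return path
```

Because the file is created at the moment the name is chosen, two simultaneous instances should not end up sharing a name, which is the collision described above.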
Yes! I agree; that's what I thought it was. That should work.
@jmtsuji You might be able to use the subject file name + temp.faa or something like that, as each temp file should be unique to each subject.
@LeeBergstrand PR created that addresses this issue. I used a thread identifier to make each temp filename unique. I worry that the subject file name might still result in a collision one day (e.g., if we eventually add support for multiple query files). The temp filename is specified in the logfile so that it can be traced to a specific run during debugging, if needed.
P.S. I've tested this fix, and it solves the problem. Now getting reproducible results!
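For illustration only, here is one way an identifier-based temp filename could be built and logged. This is a sketch under assumptions, not the code from the PR: the thread does not say how the identifier is generated, so the process ID plus a short UUID fragment is assumed here, and the helper name unique_temp_name is hypothetical.

```python
import logging
import os
import uuid

logger = logging.getLogger(__name__)


def unique_temp_name(base="tempQuery", ext=".faa"):
    """Build a temp filename that is unique per call and log it.

    Hypothetical helper; the identifier scheme is an assumption.
    """
    # The process ID separates concurrent instances; the UUID fragment
    # guards against PID reuse or repeated calls within one process.
    run_id = f"{os.getpid()}_{uuid.uuid4().hex[:8]}"
    name = f"{base}_{run_id}{ext}"
    # Logging the name lets a temp file be traced back to a specific run.
    logger.info("Temporary query file: %s", name)
    return name
```

Unlike a name derived from the subject file, this kind of identifier does not depend on the inputs, so it would stay unique even if multiple query files were supported later.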
@LeeBergstrand This is a bit of a mystery to me.
Background
BackBLAST.py runs with 1 thread. In the snakemake workflow, I allow the user to run multiple instances of BackBLAST.py at a time (via the snakemake --jobs flag) to perform analyses in parallel.
Problem description
When --jobs is 1 (i.e., 1 process runs at a time), pipeline results match what I'd expect. For example:
[figure]
This figure is reproducible over multiple runs.
However, when --jobs is > 1, I get a different result each time. Here are three runs with identical settings run with --jobs 4 (i.e., 4 single-threaded processes can run at once):
[figures]
Specifically, it appears that some genomes randomly have empty results from BackBLAST.py. BackBLAST.py successfully ran (i.e., the log file looks fine), but the output file is empty.
Any idea what could be behind this issue (or keywords to look up)? Thanks.