Hey there, @susheelbhanu
Thanks for the kind words :)
Sorry it's giving you trouble in your nice workflow!
I have to admit, despite my love of snakemake, I've never put GToTree in one yet, ha. As soon as I can I'm going to test it out and hopefully I'll be able to recreate this and track the problem down. If not, maybe I can just add an argument to be able to specify the temp working directory we want ahead of time.
Thanks for writing in with the problem. I'll get back to you asap.
Thanks a lot Mike!
I think having a TMPDIR option may alleviate this issue, but for the record and for your tests, the following gave me the same error:
shell:
"(date && TMPDIR=tmp GToTree -H {params.hmm} -o {output} -f {input.list} -m {input.mapping} -j {params.jobs} -G {params.threshold} {params.tree} && date) &> {log}"
Hey Mike. Sorry for the spam, but thanks to a colleague who is a wizard at troubleshooting, @vgalata, we did some digging, and there may be issues with the naming of the tmp_dir in the bin/GToTree and bin/gtt-pfam-search files.
The idea of using snakemake was to launch multiple jobs in parallel, and the tmp_dir in the GToTree file is created like so:
tmp_dir=$(date +%s).gtotree.tmpdir
mkdir $tmp_dir 2> /dev/null
If I run 3 samples in parallel, snakemake launches all of them at the same time. Since date +%s only has one-second resolution, the timestamp in the tmpdir name is the same for every sample, so the jobs end up sharing one directory and interfere with each other.
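To make the collision concrete, here is a minimal sketch that reuses the naming scheme quoted above inside a throwaway scratch directory. The variable names (scratch, stamp, first, second) are only illustrative, and the timestamp is captured once to simulate two jobs starting within the same second:

```shell
#!/usr/bin/env bash
# Illustrative sketch only: reproduces the naming scheme in a scratch directory.
scratch=$(mktemp -d)

# Captured once to simulate two jobs launched within the same second.
stamp=$(date +%s)

# First "job": the directory does not exist yet, so mkdir succeeds.
if mkdir "${scratch}/${stamp}.gtotree.tmpdir" 2> /dev/null; then first=0; else first=$?; fi

# Second "job": same name, so mkdir fails, but the error goes to /dev/null
# and the job would carry on using the first job's directory.
if mkdir "${scratch}/${stamp}.gtotree.tmpdir" 2> /dev/null; then second=0; else second=$?; fi

echo "first mkdir exit status:  ${first}"
echo "second mkdir exit status: ${second}"

rm -rf "${scratch}"
```

The second mkdir fails silently, which is why nothing points at the shared directory until the jobs start overwriting each other's files.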
Potential workarounds:
1. As you suggested, being able to set a tmp_dir variable for each sample, or modifying the original code such that a unique name is created every time a tmp_dir is made.
2. Modifying the snakemake rule (shown below), which resolves the issue. Please note the change in the directory where GTT is run.
rule gtotree:
    input:
        list=rules.mag_list.output.list,
        mapping=rules.mag_list.output.mapping
    output:
        directory(os.path.join(RESULTS_DIR, "gtotree/{sid}/gtt_results"))
    log:
        os.path.join(RESULTS_DIR, "logs/{sid}_gtotree.log")
    conda:
        os.path.join(ENV_DIR, "gtotree.yaml")
    threads:
        config["gtotree"]["threads"]
    params:
        hmm=config["gtotree"]["hmm"],
        jobs=config["gtotree"]["jobs"],
        threshold=config["gtotree"]["threshold"],
        tree=config["gtotree"]["tree"]
    wildcard_constraints:
        sid="|".join(CAMI_SAMPLES)
    message:
        "Checking quality of CAMI genomes from {wildcards.sid}"
    shell:
        "log_file=$(realpath {log}) && mkdir -p $(dirname {output}) && cd $(dirname {output}) && "
        "(date && GToTree -H {params.hmm} -o $(basename {output}) -f {input.list} -m {input.mapping} -n {threads} -j {params.jobs} -G {params.threshold} {params.tree} && date) &> ${{log_file}}"
Hope this helps with your troubleshooting, and thanks for a nice tool.
Best, Susheel
P.S. Will leave comment open till you've had a chance to verify the resolution ;)
I was literally just copying the line in the code to note to you that GToTree doesn't use the system temp (I did this in case someone wants to keep the working directory), and to ask if multiple runs might be launching at the same time, haha :) Thanks @VGalata for the help!
Maybe I should move the temp directory into the output directory, that would keep the path unique even in cases like this.
Thanks again!
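As a rough illustration of that idea (not the actual GToTree code), the sketch below keeps the timestamp naming but nests the temp directory under each run's output directory. The helper tmp_path_for and the results/... paths are hypothetical:

```shell
#!/usr/bin/env bash
# Hypothetical helper: builds the temp path inside a given output directory.
# The timestamp-based name from the original script is kept on purpose.
tmp_path_for() {
    local out_dir=$1 stamp=$2
    echo "${out_dir}/${stamp}.gtotree.tmpdir"
}

# Worst case: both jobs start within the same second.
stamp=$(date +%s)

tmp_s1=$(tmp_path_for "results/sample1/gtt_results" "$stamp")
tmp_s2=$(tmp_path_for "results/sample2/gtt_results" "$stamp")

echo "$tmp_s1"
echo "$tmp_s2"
```

The two paths differ because the output directories differ. Two runs pointed at the same output directory in the same second would still collide, so this only helps when each job has its own output location.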
Hahaha, perfect timing then. And yes, moving the tmpdir to the outdir might be the easiest way to circumvent this. Thanks again for the prompt help with this!
Dear @AstrobioMike,
I would suggest using mktemp to create temporary directories. This should be a safer option irrespective of where exactly the directory is created (/tmp, PWD, or the output folder). By default the prefix is TMPDIR, and you could set its value to the output folder if this variable is empty. This would give the user more control over the created temp files.
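A minimal sketch of this suggestion, assuming a bash script: out_dir and base_dir are hypothetical names rather than GToTree variables, and out_dir is a scratch directory standing in for the real output folder.

```shell
#!/usr/bin/env bash
# Hypothetical names: out_dir and base_dir are illustrative, not GToTree variables.
out_dir=$(mktemp -d)    # stand-in for the tool's output directory

# Use TMPDIR when the user has set it; otherwise fall back to the output folder.
base_dir="${TMPDIR:-$out_dir}"

# mktemp replaces the trailing XXXXXX with random characters and creates the
# directory itself, so two calls never hand back the same path.
tmp_a=$(mktemp -d "${base_dir}/gtotree.tmpdir.XXXXXX")
tmp_b=$(mktemp -d "${base_dir}/gtotree.tmpdir.XXXXXX")

echo "$tmp_a"
echo "$tmp_b"

# Clean up everything this sketch created.
rm -rf "$tmp_a" "$tmp_b" "$out_dir"
```

Passing a full path template works with both GNU and BSD mktemp, which avoids relying on flags that differ between the two.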
Hi @AstrobioMike,
I recently discovered this issue myself when running multiple GToTree instances in parallel via snakemake. I was wondering if you had made any updates with regard to the suggestions made by yourself and @VGalata. I would find this really helpful in my current and future projects ;)
Thanks for creating and supporting this tool! I'm really enjoying it!
cheers, Dean
Heya, @deanpettinga :)
I did not end up implementing the more appropriate mechanism that @VGalata so kindly informed me about :/
I'm kind of surprised I just closed this when @susheelbhanu had a workaround...
Thanks for pinging this issue thread about it again. I will implement the 'mktemp' method ASAP and let you know when a new version is ready – hopefully today or tomorrow
@AstrobioMike
great news! thanks for continuing to support this tool!
In the meantime, I also ended up writing a workaround into my snakemake workflow, but I think the fix will really improve the experience for folks who try to run this tool in parallel in the future.
thank you!
Totally agree, @deanpettinga!
Creating the temp directory is now (more appropriately) implemented with mktemp as of v1.7.08 👍
Thanks again for writing in about it :)
Hi @AstrobioMike... Firstly, thanks for this super intuitive tool. I've run gtotree in the past on the command line and everything worked without issues. However, now I'm trying to run it in a snakemake workflow and am getting the following error:
The necessary permissions to the folder exist, so I'm not sure why this is the case. And for the record, below is the snakemake rule:
Thank you!