AstrobioMike / GToTree

A user-friendly workflow for phylogenomics
GNU General Public License v3.0
192 stars 25 forks

Failed to make temporary directory #46

Closed susheelbhanu closed 1 year ago

susheelbhanu commented 2 years ago

Hi @AstrobioMike... Firstly, thanks for this super intuitive tool. I've run GToTree on the command line in the past and everything worked without issues. However, I'm now trying to run it in a snakemake workflow and getting the following error:

                                  GToTree v1.6.12
                         (github.com/AstrobioMike/GToTree)

 ---------------------------------  RUN INFO  ---------------------------------

    Input genome sources include:
      - Fasta files listed in /mnt/data/sbusi/CoRefine/results/gtotree/H_S002_list.txt (110 genomes)

                             Total input genomes: 110

    HMM source to be used:
      - /home/susheel.busi/miniconda3/envs/gtotree/share/gtotree/hmm_sets/Bacteria.hmm (74 targets)

    Options set:
      - The output directory has been set to "/mnt/data/sbusi/CoRefine/results/gtotree/H_S002/".
      - The file "/mnt/data/sbusi/CoRefine/results/gtotree/H_S002_mapping.txt" will be used to modify labels of the specified genomes.
      - Only generating alignment, no tree, as "-N" option has been provided.
      - Genome minimum gene-copy threshold ("-G") has been set to 0.1.
      - Number of jobs to run during parallelizable steps has been set to 12.

  Tried to make temporary directory named 1643149839.gtotree.tmpdir but failed, this shouldn't happen :(

I have the necessary permissions on the folder, so I'm not sure why this happens. For the record, below is the snakemake rule:

rule gtotree:
    input:
        list=rules.mag_list.output.list,
        mapping=rules.mag_list.output.mapping
    output:
        directory(os.path.join(RESULTS_DIR, "gtotree/{sid}"))
    log:
        os.path.join(RESULTS_DIR, "logs/{sid}_gtotree.log")
    conda:
        os.path.join(ENV_DIR, "gtotree.yaml")
    threads:
        config["checkm"]["threads"]
    params:
        hmm=config["gtotree"]["hmm"],
        jobs=config["gtotree"]["jobs"],
        threshold=config["gtotree"]["threshold"],
        tree=config["gtotree"]["tree"]
    wildcard_constraints:
        sid="|".join(SAMPLES)
    message:
        "Checking quality of genomes from {wildcards.sid}"
    shell:
        "(date && GToTree -H {params.hmm} -o {output} -f {input.list} -m {input.mapping} -j {params.jobs} -G {params.threshold} {params.tree} && date) &> {log}"

Thank you!

AstrobioMike commented 2 years ago

Hey there, @susheelbhanu

Thanks for the kind words :)

Sorry it's giving you trouble in your nice workflow!

I have to admit, despite my love of snakemake, I've never put GToTree in one yet, ha. As soon as I can I'm going to test it out and hopefully I'll be able to recreate this and track the problem down. If not, maybe I can just add an argument to be able to specify the temp working directory we want ahead of time.

Thanks for writing in with the problem. I'll get back to you asap.

susheelbhanu commented 2 years ago

Thanks a lot Mike!

I think having a TMPDIR option may alleviate this issue, but for the record and for your tests, doing the following gave me the same error:

shell:
        "(date && TMPDIR=tmp GToTree -H {params.hmm} -o {output} -f {input.list} -m {input.mapping} -j {params.jobs} -G {params.threshold} {params.tree} && date) &> {log}"

susheelbhanu commented 2 years ago

Hey Mike, sorry for the spam. Thanks to my colleague @vgalata, who is a wizard at troubleshooting, we did some digging, and there may be issues with how the tmp_dir is named in the bin/GToTree and bin/gtt-pfam-search files.

The idea of using snakemake was to launch multiple jobs in parallel, and the tmp_dir in the GToTree file looks like so:

tmp_dir=$(date +%s).gtotree.tmpdir
mkdir $tmp_dir 2> /dev/null

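If I run 3 samples in parallel, snakemake launches all of them at the same time from the same working directory. Since date +%s only has one-second resolution, the tmpdir name is identical for every sample: the first mkdir succeeds and the others fail because the directory already exists.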
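The collision can be reproduced outside of GToTree. The sketch below is only an illustration: it reuses the naming scheme from the two lines quoted above, works in a throwaway scratch directory, and simulates two jobs starting in the same second by reusing one precomputed name.

```shell
# Illustration only: mimic two jobs that start in the same second
# and share one working directory.
scratch=$(mktemp -d)   # throwaway directory for the demo
cd "$scratch" || exit 1

# Same naming scheme as in bin/GToTree; computed once so that both
# "jobs" see the same timestamp, as they would when launched together.
name="$(date +%s).gtotree.tmpdir"

# First job: the directory does not exist yet, so mkdir succeeds.
if mkdir "$name" 2> /dev/null; then first=ok; else first=failed; fi

# Second job: same name, directory already exists, so mkdir fails.
if mkdir "$name" 2> /dev/null; then second=ok; else second=failed; fi

echo "first=$first second=$second"   # prints: first=ok second=failed

cd / && rm -rf "$scratch"
```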

Potential workarounds:

  1. As you suggested, being able to set a tmp_dir variable for each sample, or modifying the original code so that a unique name is created every time a tmp_dir is made.

  2. Modifying the snakemake rule (shown below), which resolves the issue. Please note the change in the directory where GTT is run.

rule gtotree:
    input:
        list=rules.mag_list.output.list,
        mapping=rules.mag_list.output.mapping
    output:
        directory(os.path.join(RESULTS_DIR, "gtotree/{sid}/gtt_results"))
    log:
        os.path.join(RESULTS_DIR, "logs/{sid}_gtotree.log")
    conda:
        os.path.join(ENV_DIR, "gtotree.yaml")
    threads:
        config["gtotree"]["threads"]
    params:
        hmm=config["gtotree"]["hmm"],
        jobs=config["gtotree"]["jobs"],
        threshold=config["gtotree"]["threshold"],
        tree=config["gtotree"]["tree"]
    wildcard_constraints:
        sid="|".join(CAMI_SAMPLES)
    message:
        "Checking quality of CAMI genomes from {wildcards.sid}"
    shell:
        "log_file=$(realpath {log}) && mkdir -p $(dirname {output}) && cd $(dirname {output}) && "
        "(date && GToTree -H {params.hmm} -o $(basename {output}) -f {input.list} -m {input.mapping} -n {threads} -j {params.jobs} -G {params.threshold} {params.tree} && date) &> ${{log_file}}"
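Why the second workaround helps can be sketched in plain shell. This is an illustration rather than part of the workflow: the sample IDs and the scratch directory are made up, and the timestamp is computed once to stand in for jobs that start in the same second. Because each job first changes into its own per-sample directory, the identically named temp directories no longer clash.

```shell
# Illustration only: same temp-directory name, but one working
# directory per sample, as in the modified rule.
base=$(mktemp -d)      # throwaway stand-in for RESULTS_DIR
stamp=$(date +%s)      # shared timestamp, as if all jobs started together
failures=0

for sid in S1 S2 S3; do                      # hypothetical sample IDs
    mkdir -p "$base/gtotree/$sid"
    # Subshell so the cd does not leak into the next iteration.
    ( cd "$base/gtotree/$sid" && mkdir "$stamp.gtotree.tmpdir" ) \
        || failures=$((failures + 1))
done

echo "failures=$failures"   # prints: failures=0

rm -rf "$base"
```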

Hope this helps with your troubleshooting, and thanks for a nice tool.

Best, Susheel

P.S. Will leave the issue open till you've had a chance to verify the resolution ;)

AstrobioMike commented 2 years ago

I was literally just copying that line from the code to point out that GToTree doesn't use the system temp location (I did this in case someone wants to keep the working directory), and to ask whether multiple runs might be launching at the same time, haha :) Thanks @VGalata for the help!

Maybe I should move the temp directory into the output directory, that would keep the path unique even in cases like this.

Thanks again!

susheelbhanu commented 2 years ago

Hahaha perfect timing then. And yes, moving the tmpdir to the outdir might be the easiest way to circumvent this. Thanks again for the prompt help with this!

VGalata commented 2 years ago

Dear @AstrobioMike,

I would suggest using mktemp to create temporary directories. This should be a safer option irrespective of where exactly the directory is created (/tmp, PWD or the output folder). By default the parent directory is taken from TMPDIR, and you could fall back to the output folder if this variable is empty. This would give the user more control over the created temp files.
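A minimal sketch of this suggestion, under stated assumptions: the helper name make_tmp_dir, the gtotree.tmpdir.XXXXXX template and the demo output directory are all hypothetical, and this is not the code that ended up in GToTree. The temp directory goes under TMPDIR when that variable is set and non-empty, and under the output directory otherwise; mktemp replaces the trailing X's with random characters, so concurrent runs get distinct names.

```shell
# Hypothetical helper, not GToTree's actual implementation.
# $1: output directory, used as the parent when TMPDIR is unset or empty.
make_tmp_dir() {
    mkdir -p "$1" || return 1
    base_dir="${TMPDIR:-$1}"
    # ${base_dir%/} drops one trailing slash, if present.
    mktemp -d "${base_dir%/}/gtotree.tmpdir.XXXXXX"
}

out_dir="$(mktemp -d)/gtt_results"   # throwaway output directory for the demo

# Two calls in quick succession, as with jobs launched in parallel.
tmp_a=$(make_tmp_dir "$out_dir") || { echo "could not create temp dir" >&2; exit 1; }
tmp_b=$(make_tmp_dir "$out_dir") || { echo "could not create temp dir" >&2; exit 1; }

echo "$tmp_a"
echo "$tmp_b"

rm -rf "$tmp_a" "$tmp_b"
```

The two printed paths differ even when both calls happen within the same second, which is the property the timestamp-based name lacked.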

deanpettinga commented 1 year ago

Hi @AstrobioMike,

I recently discovered this issue myself when running multiple GToTree instances in parallel via snakemake. I was wondering if you had made any updates with regard to the suggestions made by yourself and @VGalata. I would find this really helpful in my current and future projects ;)

Thanks for creating and supporting this tool! I'm really enjoying it!

cheers, Dean

AstrobioMike commented 1 year ago

Heya, @deanpettinga :)

I did not end up implementing the more appropriate mechanism that @VGalata so kindly informed me about :/

I'm kind of surprised I just closed this when @susheelbhanu had a workaround...

Thanks for pinging this issue thread about it again. I will implement the 'mktemp' method ASAP and let you know when a new version is ready – hopefully today or tomorrow

deanpettinga commented 1 year ago

@AstrobioMike

great news! thanks for continuing to support this tool!

In the meantime, I also ended up writing a workaround into my snakemake workflow, but I think the fix will really improve the experience for folks who try to run this tool in parallel in the future.

thank you!

AstrobioMike commented 1 year ago

Totally agree, @deanpettinga!

Creating the temp directory is now (more appropriately) implemented with mktemp as of v1.7.08 👍

Thanks again for writing in about it :)