cgroza / GraffiTE

GraffiTE is a pipeline that finds polymorphic transposable elements in genome assemblies and/or long reads, and genotypes the discovered polymorphisms in read sets using genome-graphs.

Question about setting own tmp dir #23

Closed: xxYaaoo closed this issue 4 months ago

xxYaaoo commented 7 months ago

Hi~

Recently, I've been struggling with having to set my own tmp directory while running GraffiTE, because of limited access permissions on our group server. I used 'export NXF_TEMP=' in my SLURM script to set the tmp dir. However, squeue showed that my job was running normally, yet the output dir contained nothing. I also tried revising nextflow.config the way you mention in the 'important note', but the SLURM job errored out the moment I sbatched it. Any idea how to figure this out?
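
(For context, a SLURM submission script following this approach would look roughly like the sketch below; the job name, resources, and paths are placeholders rather than the actual script used here, and NXF_TEMP is expected to redirect Nextflow's own temporary files rather than the temp directory seen inside each task's container.)

#!/bin/bash
#SBATCH --job-name=graffite        # placeholder job name
#SBATCH --time=24:00:00            # placeholder walltime
#SBATCH --mem=20G                  # placeholder memory request

# Point Nextflow's own temporary files at a writable location (placeholder path).
export NXF_TEMP=/path/to/writable/tmp

# Launch the pipeline; the main.nf location and profile are placeholders.
nextflow run GraffiTE/main.nf -profile cluster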

Thank you so much!

xxYaaoo commented 7 months ago

Dear Professor Cristian,

An update on my error: the output dir finally contained two folders (SV_search and Repeat_Filtering), but my task still failed (screenshots attached). Any suggestions to solve this problem?

Very thankful~!

cgroza commented 7 months ago

Hi

May I see your nextflow.config?

cgroza commented 7 months ago

Also, I pushed a commit that may fix the error in the tsd_report step. Can you please pull the latest version and try again?

xxYaaoo commented 7 months ago

Hi~

This is my nextflow.config file (screenshot attached). My latest try did not change the contents of the config file and hit the error I showed above. Sure, I will pull the latest version and try again!

Thank you so much!

cgroza commented 7 months ago

I see you were on an older version of the config.

Also try this nextflow.config:

manifest.defaultBranch = 'main'
singularity.enabled = true
singularity.autoMounts = true
singularity.runOptions = '--contain --bind $(pwd):/tmp'

profiles {
    standard {
        process.executor = 'local'
        process.container = 'library://cgroza/collection/graffite:latest'
    }

    cluster {
        process.executor = 'slurm'
        process.container = 'library://cgroza/collection/graffite:latest'
        process.scratch = '$SLURM_TMPDIR'
    }

    cloud {
        process.executor = 'aws'
        process.container = 'library://cgroza/collection/graffite:latest'
    }

}
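
(With this saved as nextflow.config in the launch directory, the cluster block is selected with -profile cluster; a minimal launch sketch, where the main.nf path and input files are placeholders following the same flags used later in this thread:)

nextflow run GraffiTE/main.nf \
    --reference reference.fa \
    --TE_library TE_library.fa \
    --vcf variants.vcf \
    -profile cluster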
xxYaaoo commented 7 months ago

OK, thank you for your help! !

Do I need to add 'export NXF_TEMP=' in my SLURM script when using this new nextflow.config?

cgroza commented 7 months ago

I don't touch that variable when I run Nextflow on my cluster. However, it may be different for you. Try without it first.

xxYaaoo commented 7 months ago

Ok, really appreciate your help!

clemgoub commented 6 months ago

Hi @xxYaaoo, are you still having trouble with this issue? Let us know if you need further assistance!

amnghn commented 4 months ago

Hi @cgroza and @clemgoub, I've been struggling with the same tmp dir problem. I checked issues #8, #12, #31, and #24, and the "important-note", but couldn't figure out how to solve it. Here is the command I'm using to run GraffiTE on our SLURM cluster.

nextflow run /lisc/scratch/botany/amin/te_detection/pME/GraffiTE/main.nf \
    --vcf /lisc/scratch/botany/amin/te_detection/pME/test_run/results/1_SV_search/svim-asm_variants.vcf \
    --reference input/vieillardii1167c.asm.bp.p_ctg.fa \
    --TE_library input/vieillardii.fasta.mod.EDTA.TElib.fa \
    --out results \
    --genotype false \
    -profile cluster \
    -with-report reports/report_${SLURM_JOB_ID}.html \
    -resume

I used --vcf instead of --assemblies as @clemgoub explained here.

And here is the nextflow.config file

manifest.defaultBranch = 'main'
singularity.enabled = true
singularity.autoMounts = true
singularity.runOptions = '--contain --bind /lisc/scratch/botany/amin/te_detection/pME/test_run/temp_dir:/tmp'

profiles {
    standard {
        process.executor = 'local'
        process.container = '/lisc/scratch/botany/amin/te_detection/pME/graffite_latest.sif'
    }

    cluster {
        process.executor = 'slurm'
        process.container = '/lisc/scratch/botany/amin/te_detection/pME/graffite_latest.sif'
        process.scratch = '$SLURM_TMPDIR'
    }

    cloud {
        process.executor = 'aws'
        process.container = '/lisc/scratch/botany/amin/te_detection/pME/graffite_latest.sif'
    }

}
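
(One way to sanity-check this bind is a quick manual test outside Nextflow; a sketch reusing the container and temp_dir paths from the config above:)

# Check that the bound temp_dir appears as a writable /tmp inside the container.
singularity exec --contain \
    --bind /lisc/scratch/botany/amin/te_detection/pME/test_run/temp_dir:/tmp \
    /lisc/scratch/botany/amin/te_detection/pME/graffite_latest.sif \
    sh -c 'touch /tmp/.write_test && echo "/tmp is writable inside the container"'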

temp_dir is writable and is used by the repeatmask_VCF process (I checked this while the job was running; there were a lot of tmp files in it). temp_dir currently has three empty subdirectories: nxf.j8zh7vIZHc, slurm-2228294 and slurm-2297514. The last one is the one the repeatmask_VCF process used.

I ran the pipeline after changing process.scratch = '$SLURM_TMPDIR' to process.scratch = '/lisc/scratch/botany/amin/te_detection/pME/test_run/temp_dir' in the nextflow.config file, but I got the exact same error. Setting singularity.runOptions = '--contain --bind $(pwd):/tmp' did not help either.

The pipeline stops running about half an hour after submitting the tsd_prep process and doesn't generate the 3_TSD_search directory.

These are the last lines in the .nextflow.log file

~> TaskHandler[jobId: 2297514; id: 1; name: repeatmask_VCF (1); status: RUNNING; exit: -; error: -; workDir: /lisc/scratch/botany/amin/te_detection/pME/test_run/work/50/98452e410be717b0d27a72b3705134 started: 1720609826603; exited: -; ]
Jul-10 21:17:29.919 [Task monitor] DEBUG n.processor.TaskPollingMonitor - Task completed > TaskHandler[jobId: 2297514; id: 1; name: repeatmask_VCF (1); status: COMPLETED; exit: 0; error: -; workDir: /lisc/scratch/botany/amin/te_detection/pME/test_run/work/50/98452e410be717b0d27a72b3705134 started: 1720609826603; exited: 2024-07-10T19:17:28Z; ]
Jul-10 21:17:29.927 [Task monitor] DEBUG nextflow.util.ThreadPoolBuilder - Creating thread pool 'TaskFinalizer' minSize=10; maxSize=10; workQueue=LinkedBlockingQueue[10000]; allowCoreThreadTimeout=false
Jul-10 21:17:30.511 [TaskFinalizer-1] DEBUG nextflow.util.ThreadPoolBuilder - Creating thread pool 'PublishDir' minSize=10; maxSize=10; workQueue=LinkedBlockingQueue[10000]; allowCoreThreadTimeout=false
Jul-10 21:17:31.302 [Task submitter] DEBUG nextflow.executor.GridTaskHandler - [SLURM] submitted process tsd_prep (1) > jobId: 2299973; workDir: /lisc/scratch/botany/amin/te_detection/pME/test_run/work/3d/2b7f19fd8b29d774c934fbaa358251
Jul-10 21:17:31.304 [Task submitter] INFO  nextflow.Session - [3d/2b7f19] Submitted process > tsd_prep (1)
Jul-10 21:18:04.894 [Task monitor] DEBUG n.processor.TaskPollingMonitor - Task completed > TaskHandler[jobId: 2299973; id: 2; name: tsd_prep (1); status: COMPLETED; exit: 0; error: -; workDir: /lisc/scratch/botany/amin/te_detection/pME/test_run/work/3d/2b7f19fd8b29d774c934fbaa358251 started: 1720639059895; exited: 2024-07-10T19:18:01Z; ]
Jul-10 21:18:05.002 [main] DEBUG nextflow.Session - Session await > all processes finished
Jul-10 21:18:09.889 [Task monitor] DEBUG n.processor.TaskPollingMonitor - <<< barrier arrives (monitor: slurm) - terminating tasks monitor poll loop
Jul-10 21:18:09.891 [main] DEBUG nextflow.Session - Session await > all barriers passed
Jul-10 21:18:09.908 [main] DEBUG nextflow.util.ThreadPoolManager - Thread pool 'TaskFinalizer' shutdown completed (hard=false)
Jul-10 21:18:09.925 [main] DEBUG nextflow.util.ThreadPoolManager - Thread pool 'PublishDir' shutdown completed (hard=false)
Jul-10 21:18:09.977 [main] DEBUG n.trace.WorkflowStatsObserver - Workflow completed > WorkflowStats[succeededCount=2; failedCount=0; ignoredCount=0; cachedCount=0; pendingCount=0; submittedCount=0; runningCount=0; retriesCount=0; abortedCount=0; succeedDuration=5d 9h 51m 38s; failedDuration=0ms; cachedDuration=0ms;loadCpus=0; loadMemory=0; peakRunning=1; peakCpus=16; peakMemory=20 GB; ]
Jul-10 21:18:09.979 [main] DEBUG nextflow.trace.ReportObserver - Workflow completed -- rendering execution report
Jul-10 21:18:19.223 [main] DEBUG nextflow.cache.CacheDB - Closing CacheDB done
Jul-10 21:18:19.489 [main] DEBUG nextflow.util.ThreadPoolManager - Thread pool 'FileTransfer' shutdown completed (hard=false)
Jul-10 21:18:19.510 [main] DEBUG nextflow.script.ScriptRunner - > Execution complete -- Goodbye

Here is the .command.log in the work/3d directory:

##LiSC job info: the temporary directory of your job is also available read-only until 3 days after job end on the login nodes (login01/login02) under this path: /lisc/slurm/node-b07/tmp/slurm-2299973
##LiSC job info: Temporary folders of finished jobs are offline when their compute node went into power-saving sleep. For access to these folders, please contact the helpdesk.
INFO:    Environment variable SINGULARITYENV_TMPDIR is set, but APPTAINERENV_TMPDIR is preferred
INFO:    Environment variable SINGULARITYENV_NXF_TASK_WORKDIR is set, but APPTAINERENV_NXF_TASK_WORKDIR is preferred
INFO:    Environment variable SINGULARITYENV_NXF_DEBUG is set, but APPTAINERENV_NXF_DEBUG is preferred
INFO:    gocryptfs not found, will not be able to use gocryptfs
extracting flanking...
sort: cannot create temporary file in '/tmp/slurm-2299973': No such file or directory
index file vieillardii1167c.asm.bp.p_ctg.fa.fai not found, generating...
extracting SVs' 5' and 3' ends...
sort: cannot create temporary file in '/tmp/slurm-2299973': No such file or directory

I would be grateful if you could help me fix this issue.

clemgoub commented 4 months ago

Hello @amnghn ! I'm really sorry you are stuck with this mktemp error.

I'm looking forward to hearing @cgroza's opinion. Do you have an empty VCF after the RepeatMasker process? Often mktemp will fail at this stage, but the pipeline keeps going until the TSD process and then crashes.

Could you send us the complete .command.log and .command.err for the RepeatMasker and TSD processes?

Meanwhile, have you tried running with the standard Nextflow profile? Since the main task of your job is RepeatMasker, this shouldn't affect the speed much.

Also, if you haven't, I'd check with your system admins whether the process.scratch = variable carries over to the node the process is dispatched to. Perhaps it is interpreted on the shell/node where you run the main command, but not on the shell/node that runs the process.
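
(A quick way to check this, as a sketch; the partition name is a placeholder and srun options may differ on your cluster:)

# Print the temp-related variables as actually seen by a job on a compute node.
srun --partition=your_partition --pty bash -c 'echo "SLURM_TMPDIR=$SLURM_TMPDIR"; echo "TMPDIR=$TMPDIR"'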

Thanks,

Clément

amnghn commented 4 months ago

Hi @clemgoub, thanks a lot for your reply. I finally managed to fix this issue by changing process.scratch = '$SLURM_TMPDIR' to process.scratch = '$TMPDIR'. On our cluster, the SLURM_ prefix should be omitted. I'm very glad that I got the final GraffiTE.merged.genotypes.vcf.gz and all the individual VCF files.
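
(For reference, the cluster profile that worked then looks like this; a sketch based on the config above, with only the scratch line changed:)

cluster {
    process.executor = 'slurm'
    process.container = '/lisc/scratch/botany/amin/te_detection/pME/graffite_latest.sif'
    // On this cluster the node-local temp dir is exposed as $TMPDIR rather than $SLURM_TMPDIR.
    process.scratch = '$TMPDIR'
}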

The VCF file generated by RepeatMasker was not empty, even when I had issues with the TSD processes.

Thanks a lot for developing this great pipeline. This was a test run (3 species, 48 samples); I'm planning to run it on 370 individuals from ca. 30 species.

clemgoub commented 4 months ago

Amazing! Thanks a lot for your kind words and for sharing your solution! I'm sure it'll help more users as well!

Cheers,

Clément