cgroza / GraffiTE

GraffiTE is a pipeline that finds polymorphic transposable elements in genome assemblies and/or long reads, and genotypes the discovered polymorphisms in read sets using genome-graphs.

Process `tsd_search` terminated even with enough memory #46

Closed amnghn closed 1 week ago

amnghn commented 2 weeks ago

Hi, I used GraffiTE to find pMEs across three genomes, and it worked smoothly. Recently, I started to include more input data (11 genome assemblies and a reference genome), and consequently, I allocated more resources. However, I am getting an error in the tsd_search process even with more than enough allocated CPU and RAM.

Nov-06 14:54:59.774 [TaskFinalizer-2] ERROR nextflow.processor.TaskProcessor - Error executing process > 'tsd_search (12)'

Caused by:
  Process `tsd_search (12)` terminated with an error exit status (140)

Command executed:

  cp repeatmasker_dir/repeatmasker_dir/* .
  TSD_Match_v2.sh SV_sequences_L_R_trimmed_WIN.fa flanking_sequences.fasta input.1

Command exit status:
  140

Command output:
  (empty)

Command error:
  INFO:    Environment variable SINGULARITYENV_TMPDIR is set, but APPTAINERENV_TMPDIR is preferred
  INFO:    Environment variable SINGULARITYENV_NXF_TASK_WORKDIR is set, but APPTAINERENV_NXF_TASK_WORKDIR is preferred
  INFO:    Environment variable SINGULARITYENV_NXF_DEBUG is set, but APPTAINERENV_NXF_DEBUG is preferred
  INFO:    gocryptfs not found, will not be able to use gocryptfs

Work dir:
  /lisc/scratch/botany/amin/te_detection/pME/2nd_run/work/8c/24d8f5de7d0e0ab8105ec41a6bf048

Container:
  /lisc/scratch/botany/amin/te_detection/pME/graffite_latest.sif

As far as I know, exit status 140 indicates that a job exceeded its allocated resources, yet some tsd_search processes finish with the same resources. I can change the memory with params.tsd_memory in the main.nf file, but there are no parameters to modify the time or CPU (or I don't see them!). The resource-usage HTML report generated by the pipeline shows that tsd_search processes shorter than 1 hour complete successfully, but as soon as they reach 59 min, they are all aborted by the Slurm cluster.

So, first, I wanted to ask what is causing the error. Second, if this is the allocated time, how can I modify the allocated time for the tsd_search process?

Thank you so much in advance for your help.
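
The exit status 140 reported above can be decoded with a little shell arithmetic. This is a minimal sketch, not part of GraffiTE: shells report 128 + N when a process is killed by signal N, so 140 corresponds to signal 12, which is SIGUSR2 on common Linux platforms (x86, ARM). Whether a given cluster sends that signal when a job hits its time limit depends on the site configuration, so treat that link as an assumption to verify.

```shell
# Decode a shell exit status above 128 into the signal number that ended the process.
status=140                 # exit status reported by Nextflow in this thread
if [ "$status" -gt 128 ]; then
    sig=$((status - 128))  # shells report 128 + N when a process dies from signal N
    echo "signal number: $sig"
    kill -l "$sig"         # prints the name of that signal on this system
fi
```

On a Slurm cluster, `sacct -j <jobid> --format=JobID,State,Elapsed,Timelimit` (with the real job ID in place of `<jobid>`) should show a `TIMEOUT` state if the wall-time limit was the cause.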

clemgoub commented 2 weeks ago

Hello @amnghn,

I think you are indeed on the right track. I looked at the code, and there is currently no option to increase the default time allocation for the TSD process. I will add one now and let you know when you can pull the repo and try again!

Cheers,

Clément

clemgoub commented 2 weeks ago

Dear Amin,

I have updated the repo with a new --tsd_time parameter (default 1h). Alternatively, you can reduce the batch size for the TSD analysis with --tsd_batch_size (default: 100 SVs), which should reduce the time spent by each of the parallel TSD processes.

Let me know if this solves your issue!

Cheers,

Clément
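
The two options mentioned above could be combined as in the following sketch. It is a dry run that only prints the command. The path in `GRAFFITE_MAIN` is a placeholder, the batch size of 50 is an illustrative value, and passing the time as `3h` on the command line is an assumption based on the `params.tsd_time = "3h"` setting reported later in this thread.

```shell
# Dry run: build the command and print it instead of launching the pipeline.
GRAFFITE_MAIN="GraffiTE/main.nf"   # placeholder path to the pulled repo
TSD_TIME="3h"                      # assumed format, mirrors params.tsd_time = "3h"
TSD_BATCH_SIZE=50                  # illustrative value; the stated default is 100
CMD="nextflow run $GRAFFITE_MAIN --tsd_time $TSD_TIME --tsd_batch_size $TSD_BATCH_SIZE"
# Append the usual inputs (--vcf, --reference, --TE_library, --out, -profile) before running.
echo "$CMD"
```

Raising the time limit keeps the number of jobs unchanged, while a smaller batch size produces more, shorter jobs; which suits a given cluster better depends on its queue limits.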

amnghn commented 2 weeks ago

Dear @clemgoub, Thanks a lot for the prompt reply and updating this great pipeline!

I pulled the repo and resumed the run using this command:

nextflow run /lisc/scratch/botany/amin/te_detection/pME/GraffiTE/main.nf \
    --vcf /lisc/scratch/botany/amin/te_detection/pME/2nd_run/results/1_SV_search/svim-asm_variants.vcf \
    --reference input/vieillardii1167c.asm.bp.p_ctg.fa \
    --TE_library input/vieillardii.fasta.mod.EDTA.TElib.fa \
    --out results \
    --genotype false \
    -profile cluster \
    -with-report reports/report_${SLURM_JOB_ID}.html \
    -resume

However, I noticed that it restarted the pipeline from the RepeatMasker process. I already had the second directory (2_Repeat_Filtering) in the results folder with all its sub-directories, and I wanted to resume from the TSD part. Last time, repeat masking took three days, and I didn't want to run it again. Anyway, it's running now, and I'll let you know whether it finishes the TSD search successfully with params.tsd_time = "3h"; hopefully, it will finish by Monday.

Best, Amin

clemgoub commented 2 weeks ago

Dear Amin,

Thanks for your comment. I agree this is annoying. There is actually an option for that, but it is not documented because it has not been extensively tested. If you have all of your 2_Repeat_Filtering outputs, you can start at the TSD process with:

--RM_dir <output_dir>/2_Repeat_Filtering/repeatmasker_dir  --RM_vcf <output_dir>/2_Repeat_Filtering/genotypes_repmasked_filtered.vcf

along with your other options. There is no need to add -resume with these options; they act as an alternative way to provide the input.

Let me know if you have any issue!

Cheers,

Clément
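
Before restarting at the TSD step as described above, it may help to confirm that the repeat-filtering outputs are actually in place. The helper below is a hypothetical sketch (`check_rm_outputs` is not part of GraffiTE); it only uses the two paths named in the comment above and prints the matching options when both exist.

```shell
# Check that the repeat-filtering outputs exist before restarting at the TSD step.
# check_rm_outputs is a hypothetical helper, not a GraffiTE command.
check_rm_outputs() {
    out_dir=$1
    rm_dir="$out_dir/2_Repeat_Filtering/repeatmasker_dir"
    rm_vcf="$out_dir/2_Repeat_Filtering/genotypes_repmasked_filtered.vcf"
    if [ -d "$rm_dir" ] && [ -s "$rm_vcf" ]; then
        # Print the two options to add to the nextflow command line.
        echo "--RM_dir $rm_dir --RM_vcf $rm_vcf"
    else
        echo "missing repeat-filtering outputs under $out_dir" >&2
        return 1
    fi
}

# "results" is the --out value used earlier in this thread; adjust it to your run.
check_rm_outputs results || echo "repeat-filtering outputs not found; cannot skip RepeatMasker"
```

The printed options would then be appended to the usual `nextflow run` command in place of `-resume`.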

amnghn commented 2 weeks ago

Dear @clemgoub, Thanks a lot. The pipeline completed successfully with the new TSD parameter. It generated the pangenome.vcf and all other files (up to genotyping). Some tsd_search processes took up to 2h 50min!

It took about two days from the beginning of the repeat masking until the end of the tsd search. My previous comment about the length of the repeat masking process was not accurate. I'll send you the HTML report via email.

clemgoub commented 1 week ago

Thank you for letting us know, this is useful information! I'm glad you got the pipeline to work for you!

Cheers,

Clément