Closed (amnghn closed this 1 week ago)
Hello @amnghn,
I think you are indeed on the right track. I looked at the code and, yes, there is currently no option to increase the default time allocation for the TSD process. I am adding one now and will let you know when you can pull the repo and try again!
Cheers,
Clément
Dear Amin,
I have updated the repo with a new --tsd_time parameter (default 1h).
Alternatively, you can reduce the batch size for the TSD analysis with --tsd_batch_size (default: 100 SVs), which should reduce the time spent by each of the parallel TSD processes.
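As a rough illustration of how the two options could be combined, here is a minimal dry-run sketch. It only prints the command instead of launching Nextflow. The paths, the 3h and 50 values, and the SV count are placeholders chosen for the example, and the batch arithmetic assumes the SVs are simply split into consecutive batches of --tsd_batch_size.

```sh
#!/bin/sh
# Dry-run sketch: prints the command instead of launching Nextflow.
# All file paths below are placeholders (assumptions), not real inputs.
TSD_TIME="3h"        # example value for --tsd_time (pipeline default: 1h)
TSD_BATCH_SIZE=50    # example value for --tsd_batch_size (pipeline default: 100 SVs)

# Hypothetical SV count, used only to show the trade-off:
# smaller batches mean more TSD jobs, each with less work.
N_SV=1000
N_BATCHES=$(( (N_SV + TSD_BATCH_SIZE - 1) / TSD_BATCH_SIZE ))   # ceiling division
echo "${N_SV} SVs in batches of ${TSD_BATCH_SIZE}: ${N_BATCHES} TSD jobs"

CMD="nextflow run GraffiTE/main.nf \
--vcf variants.vcf \
--reference reference.fa \
--TE_library TE_library.fa \
--out results \
--genotype false \
--tsd_time ${TSD_TIME} \
--tsd_batch_size ${TSD_BATCH_SIZE} \
-profile cluster"

# Inspect the printed line, then paste it into your job script.
echo "$CMD"
```

Raising --tsd_time keeps the number of jobs unchanged, while lowering --tsd_batch_size trades longer jobs for more of them; which one suits a given cluster depends on its queue limits.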
Let me know if this solves your issue!
Cheers,
Clément
Dear @clemgoub, Thanks a lot for the prompt reply and updating this great pipeline!
I pulled the repo and resumed the run using this command:
nextflow run /lisc/scratch/botany/amin/te_detection/pME/GraffiTE/main.nf \
--vcf /lisc/scratch/botany/amin/te_detection/pME/2nd_run/results/1_SV_search/svim-asm_variants.vcf \
--reference input/vieillardii1167c.asm.bp.p_ctg.fa \
--TE_library input/vieillardii.fasta.mod.EDTA.TElib.fa \
--out results \
--genotype false \
-profile cluster \
-with-report reports/report_${SLURM_JOB_ID}.html \
-resume
However, I noticed that it restarted the pipeline from the repeat masking process. I already had the second directory (2_Repeat_Filtering) in the results folder with all its sub-directories, and I wanted to resume from the TSD step. Last time, repeat masking took three days, and I didn't want to run it again. Anyway, it's running now with params.tsd_time = "3h", and I'll let you know whether the TSD search finishes successfully; hopefully, it will be done by Monday.
Best, Amin
Dear Amin,
Thanks for your comment. I agree this is annoying. There is actually an option for that, but it is not documented because it has not been extensively tested. So, if you have all of your 2_Repeat_Filtering outputs, you can start at the TSD process with:
--RM_dir <output_dir>/2_Repeat_Filtering/repeatmasker_dir --RM_vcf <output_dir>/2_Repeat_Filtering/genotypes_repmasked_filtered.vcf
as well as the other options. There is no need to add -resume with these options; they work as an alternative way to provide input.
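For illustration, a restart along those lines might be assembled as in the dry-run sketch below. It checks that the earlier outputs exist and then prints the command without running it. The OUT_DIR value and the input file names are placeholders, and which of the "other options" are still required is an assumption: --vcf is left out here on the guess that --RM_vcf takes its place, which is not confirmed in this thread.

```sh
#!/bin/sh
# Dry-run sketch: verifies the earlier outputs, then prints the restart command.
OUT_DIR="results"   # placeholder: the --out directory of the earlier run
RM_DIR="${OUT_DIR}/2_Repeat_Filtering/repeatmasker_dir"
RM_VCF="${OUT_DIR}/2_Repeat_Filtering/genotypes_repmasked_filtered.vcf"

# Warn early if the outputs from the repeat filtering step are missing.
[ -d "$RM_DIR" ] || echo "warning: ${RM_DIR} not found" >&2
[ -f "$RM_VCF" ] || echo "warning: ${RM_VCF} not found" >&2

# Assumption: --vcf is omitted because --RM_vcf replaces it (unconfirmed).
# No -resume here, since these two options act as an alternative input.
CMD="nextflow run GraffiTE/main.nf \
--RM_dir ${RM_DIR} \
--RM_vcf ${RM_VCF} \
--reference reference.fa \
--TE_library TE_library.fa \
--out ${OUT_DIR} \
--genotype false \
--tsd_time 3h \
-profile cluster"

echo "$CMD"
```

Printing the command first makes it easy to confirm that both paths point into the existing 2_Repeat_Filtering directory before submitting the job.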
Let me know if you have any issue!
Cheers,
Clément
Dear @clemgoub,
Thanks a lot. The pipeline completed successfully with the new TSD parameter. It generated the pangenome.vcf and all other files (up to genotyping). Some tsd_search processes took up to 2h 50min!
It took about two days from the beginning of the repeat masking until the end of the tsd search. My previous comment about the length of the repeat masking process was not accurate. I'll send you the HTML report via email.
Thank you for letting us know, this is useful information! I'm glad you got the pipeline to work for you!
Cheers,
Clément
Hi, I used GraffiTE to find pMEs across three genomes, and it worked smoothly. Recently, I started to include more input data (11 genome assemblies and a reference genome), and consequently, I allocated more resources. However, I am getting an error in the tsd_search process even with more than enough allocated CPU and RAM.
As far as I know, error exit status (140) indicates insufficient resource allocation, but some tsd_search processes finish with the same resources. I can change the memory with params.tsd_memory in the main.nf file, but there are no parameters to modify the time and CPU (or I don't see them!). Monitoring the resource usage (HTML report generated by the pipeline) shows that tsd_search processes shorter than 1 hour complete successfully, but as soon as they reach 59 min, they are all aborted by the SLURM cluster. So, first, I wanted to ask what is causing the error. Second, if it is the allocated time, how can I modify it for the tsd_search process?
Thank you so much in advance for your help.
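One way to read the 140 status: by shell convention, an exit status above 128 means the process was terminated by a signal, with the signal number equal to the status minus 128. The sketch below just does that arithmetic. That the signal came from the scheduler enforcing a time limit is an inference from the jobs stopping at the one-hour mark, and which signal a cluster sends on timeout depends on its configuration.

```sh
#!/bin/sh
# Exit statuses above 128 conventionally mean "killed by signal (status - 128)".
STATUS=140
SIGNAL_NUMBER=$((STATUS - 128))
echo "exit status ${STATUS} -> signal ${SIGNAL_NUMBER}"

# Print the name of that signal on this machine (on Linux, 12 is USR2).
kill -l "$SIGNAL_NUMBER"
```

A task killed from outside by a signal, rather than failing on its own, fits jobs that run fine until they hit a wall-clock limit.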