NBChub / bgcflow

Snakemake workflow for the analysis of biosynthetic gene clusters across large collections of genomes (pangenomes)
https://github.com/NBChub/bgcflow/wiki
MIT License

Error in Rule Roary_Out on Personal Dataset - branch 0.6.1 #239

Closed: andrekind17 closed this issue 1 year ago

andrekind17 commented 1 year ago

Hi, I ran Roary on the example dataset and it was successful. On my personal dataset, however, I get this error for the roary_out rule (see the attached log; it seems the pan_genome_reference.fa file cannot be located): roary-out-Actinoallomurus_86_Genomes.log

    Activating conda environment: .snakemake/conda/784396d771d3e2189839e7baa154b27d_
    [Wed Apr 26 19:26:37 2023]
    Finished job 1236.
    1 of 4 steps (25%) done
    Select jobs to execute...

    [Wed Apr 26 19:26:39 2023]
    rule roary_out:
        input: data/interim/roary/Actinoallomurus_86_Genomes, data/processed/Actinoallomurus_86_Genomes/automlst_wrapper
        output: data/processed/Actinoallomurus_86_Genomes/roary, data/processed/Actinoallomurus_86_Genomes/roary/df_gene_presence_binary.csv
        log: workflow/report/logs/roary/roary-out-Actinoallomurus_86_Genomes.log
        jobid: 1235
        reason: Missing output files: data/processed/Actinoallomurus_86_Genomes/roary/df_gene_presence_binary.csv; Input files updated by another job: data/interim/roary/Actinoallomurus_86_Genomes
        wildcards: name=Actinoallomurus_86_Genomes
        resources: tmpdir=/tmp

    Activating conda environment: .snakemake/conda/03517672abe9c665423eb3b1c199390f_
    [Wed Apr 26 19:26:54 2023]
    Error in rule roary_out:
        jobid: 1235
        input: data/interim/roary/Actinoallomurus_86_Genomes, data/processed/Actinoallomurus_86_Genomes/automlst_wrapper
        output: data/processed/Actinoallomurus_86_Genomes/roary, data/processed/Actinoallomurus_86_Genomes/roary/df_gene_presence_binary.csv
        log: workflow/report/logs/roary/roary-out-Actinoallomurus_86_Genomes.log (check log file(s) for error details)
        conda-env: /bigdata/home/WIN.DTU.DK/gentile/bgcflow/.snakemake/conda/03517672abe9c665423eb3b1c199390f
        shell:
            python workflow/bgcflow/bgcflow/data/make_pangenome_dataset.py data/interim/roary/Actinoallomurus_86_Genomes data/processed/Actinoallomurus_86_Genomes/roary data/processed/Actinoallomurus_86_Genomes/automlst_wrapper 2>> workflow/report/logs/roary/roary-out-Actinoallomurus_86_Genomes.log
            (one of the commands exited with non-zero exit code; note that snakemake uses bash strict mode!)
OmkarSaMo commented 1 year ago

Hi @Androx0, Roary has a limit of 50,000 genes in the pangenome. I think your dataset is very diverse and has therefore produced more gene clusters in the pangenome.

By default, if there are more than 50,000 clusters, Roary will not create the core alignment. You can increase the maximum number of allowed clusters with the -g parameter (e.g. -g 100000). You can check the Roary log file to see how many clusters were detected in your dataset.
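For illustration, here is a minimal sketch of running Roary directly with a raised cluster limit and of checking the cluster count in an existing log. The output folder, thread count, and log path below are placeholders, not bgcflow's actual paths:

    # Standalone Roary call with a raised cluster limit:
    # -p = threads, -f = output folder, -g = maximum number of clusters (default 50000).
    roary -p 8 -f roary_output -g 100000 path/to/gff/*.gff

    # Check how many clusters Roary reported in an existing run
    # (log location is an assumption; use the roary log attached above):
    grep "Number of clusters" workflow/report/logs/roary/roary-Actinoallomurus_86_Genomes.log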

andrekind17 commented 1 year ago

Hi, I have modified the -g parameter to 150000 in the roary.smk file, but when I relaunch the rule I still get exactly the same error. I can also see that no changes are made in either the roary.log or the roary-out.log, and the roary.log from the first run ended with this error message:

"Number of clusters (105350) exceeds limit (60000). Multifastas not created. Please check the spreadsheet for contamination from different species or increase the --group_limit parameter. 2023/04/26 19:26:27 Exiting early because number of clusters is too high"

I guess the job is being treated as already completed, so the command will not redo those tasks. Do you know anything I could try to fix this without rerunning everything from the beginning? All the previous processing already took a few days.

Please find the logs attached: roary-Actinoallomurus_86_Genomes.log roary-out-Actinoallomurus_86_Genomes.log

OmkarSaMo commented 1 year ago

Hi @Androx0, I assume you changed the parameter here, and that should be enough: https://github.com/NBChub/bgcflow/blob/b8afea678eb8d8fe44642dff4d1e320836ba3040/workflow/rules/roary.smk#L10
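For orientation, here is a hypothetical sketch of where such a flag typically sits in a Snakemake rule; the rule name, inputs, and other options below are illustrative and not the actual contents of roary.smk:

    # Illustrative rule only; the real workflow/rules/roary.smk differs.
    rule roary:
        input:
            gff="data/interim/prokka/{name}"   # assumed location of annotated .gff files
        output:
            directory("data/interim/roary/{name}")
        threads: 8
        shell:
            """
            roary -p {threads} -f {output} -g 150000 {input.gff}/*.gff
            """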

The roary rule was already completed in a prior run, which created many of the important files, so by default it will not be rerun.

You will need to force a rerun of the rule with Snakemake. Alternatively, you can simply delete the data/interim/roary/Actinoallomurus_86_Genomes folder; deleting it tells Snakemake that this rule needs to be run again.
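As a hedged example of the first option (the rule name and target path are taken from the discussion and logs above; the cores and conda options are placeholders for your setup), a direct Snakemake invocation could look like this:

    # Put the target before --forcerun so it is not consumed by that flag's argument list.
    # --forcerun roary marks the roary rule as needing to run again without
    # invalidating other rules that already finished.
    snakemake data/processed/Actinoallomurus_86_Genomes/roary/df_gene_presence_binary.csv \
        --use-conda --cores 16 --forcerun roary

Adding -n first performs a dry run so you can confirm which jobs would be executed.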

Always try bgcflow run -n first to see which rules are planned; roary should show up in that list as well.

andrekind17 commented 1 year ago

Thanks, would you have a suggestion for the command to force the rerun with Snakemake? Deleting the data/interim/roary/Actinoallomurus_86_Genomes folder would make the job run again from the beginning, and I would like to avoid that if possible.

andrekind17 commented 1 year ago

Thanks, it worked after rerunning Roary from the beginning with -g 150000!