NBISweden / GenErode

GitHub repository for GenErode, a Snakemake pipeline for the analysis of whole-genome sequencing data from historical and modern samples to study patterns of genome erosion.

pipeline stuck at 'Building DAG of jobs' #47

Closed · ndussex closed this issue 1 year ago

ndussex commented 1 year ago

Hi,

After modifying a rule (in common.smk) used by the GERP step to split the genome into more chunks, so that the step runs more quickly, the pipeline (dry or main run) gets stuck at

'Config file config/config.yaml is extended by additional config specified via the command line. Building DAG of jobs...'

The only job that runs is the one splitting the genome into chunk*.bed files, and it behaves as modified in my rule (i.e. it divides the genome into 2,000 instead of 200 chunks). The first time I did this, I had to cancel the run, delete the 'my_Ref_path/gerp' directory and restart, because I noticed that those bed files were still from the original setup (i.e. 200 chunks). Other than that, the pipeline was running as it was supposed to. But now, the run gets stuck as described above.

So I then did the following to try and fix it:

The only file that I edited was common.smk, so I am not sure what else could have gone wrong. Are there any other files/logs I should delete, or something else I could check, to try and fix this?

Thanks a lot! Nic

verku commented 1 year ago

Hi Nic! The reason is that the DAG is getting very big with 2,000 chunks - some rules in the GERP step are run per sample and per chunk, so if you have 10 samples, each of these rules is now run 20,000 times. Once the DAG is eventually built, Snakemake submits all of these as jobs to the slurm queue (with -j 100 it would submit 100 jobs at a time).
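For reference, a Snakemake dry run prints a per-rule job count table, which makes this kind of blow-up visible before anything is submitted. A minimal sketch (any GenErode-specific config flags are omitted here and would need to be added; note that the DAG still has to be resolved, so with very many chunks even the dry run takes a long time):

```bash
# Dry run (-n): build the DAG without executing or submitting anything;
# --quiet suppresses the per-job output so only the job summary remains.
snakemake -n --quiet
```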

I remember you had a genome assembled to chromosome level with several thousand unplaced contigs. So with 2,000 chunks you'd have a long waiting time for the DAG to resolve, and then long waiting times until all the jobs have been queued and run. Sorry that I didn't consider this when I suggested increasing the number of chunks.

A final suggestion would be to run GERP only for the chromosomes, i.e. to remove all unplaced scaffolds from the reference fasta file. With the original code (200 chunks), this would place each chromosome into a separate chunk so run times would be acceptable. If you want to avoid re-mapping, you could try to remove the unplaced scaffolds from the repeat-masked per-sample BCF files:

"results/{dataset}/vcf/" + REF_NAME + "/{sample}.merged.rmdup.merged.{processed}.snps5.noIndel.QUAL30.dp.AB.repma.bcf"

and the corresponding index files:

"results/{dataset}/vcf/" + REF_NAME + "/{sample}.merged.rmdup.merged.{processed}.snps5.noIndel.QUAL30.dp.AB.repma.bcf.csi"

I would make a copy of them as a backup, and then remove the unplaced scaffolds, keeping the same paths and file names so that Snakemake finds the files and does not attempt to re-run previous rules.
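A sketch of that filtering step with bcftools, assuming a file chromosomes.txt with one chromosome name per line (the path below is a placeholder, not the exact wildcard values from your run):

```bash
# Placeholder path; substitute your actual {dataset}, REF_NAME, sample
# and {processed} values.
bcf="results/mydataset/vcf/myref/sample1.merged.rmdup.merged.PROCESSED.snps5.noIndel.QUAL30.dp.AB.repma.bcf"

# Back up the BCF and its index first.
cp "$bcf" "$bcf.bak"
cp "$bcf.csi" "$bcf.csi.bak"

# Keep only records on the chromosomes listed in chromosomes.txt
# (region lookup uses the existing .csi index).
bcftools view -R chromosomes.txt -O b -o "$bcf.tmp" "$bcf"

# Put the filtered file back under the original name and rebuild the
# index, so Snakemake sees the paths it expects.
mv "$bcf.tmp" "$bcf"
bcftools index -f "$bcf"
```

One caveat: this removes the records but leaves the unplaced scaffolds in the header contig lines; whether that matters depends on how the downstream GERP rules read the files.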

ndussex commented 1 year ago

Hi Verena,

Sorry, I forgot to say that I tried reverting back to 200 chunks (and deleting the gerp dir), but yesterday it stayed idle for a few hours, so I gave up. Now it has started after a few minutes.

I had thought about the option of removing the unplaced scaffolds, so I could do that. Would it be possible to use a 'shortcut' and remove the chunks for all the unplaced scaffolds? And can I do this on the merged VCF/BCF file with all my genomes?

Thanks!

verku commented 1 year ago

Hi!

I realised that you'd have to modify more files and that this will unfortunately trigger a re-run of the pipeline:

GERP takes the repeat-masked per-sample BCF files as input; it does not use the merged VCF/BCF file. That part would be fine, but you would also need to remove the unplaced scaffolds from the reference genome fasta file. Even if you keep the same file name, the time stamp would be updated, which would cause Snakemake to re-run everything that is based on the reference genome fasta file.

So if you decide to remove the unplaced scaffolds, I'd suggest re-running the pipeline from the start with only the GERP step set to "True". If you decide to do that in the same directory as your previous pipeline runs, you'd have to rename the reference genome fasta file without the unplaced scaffolds, so that GenErode does not overwrite your results that are based on the full reference genome fasta file.
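If you go that route, one way to build the chromosome-only fasta under a new name, assuming the same chromosomes.txt list as above (file names are placeholders):

```bash
# Index the full assembly (writes reference.fasta.fai).
samtools faidx reference.fasta

# Extract only the listed chromosomes into a NEW file, so that GenErode
# treats it as a separate reference and earlier results are not overwritten.
xargs samtools faidx reference.fasta < chromosomes.txt > reference_chrom_only.fasta
```

The reference path in config/config.yaml would then need to point to the new file.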

ndussex commented 1 year ago

Hi,

Yes, I also figured that I would have to fix and remove those scaffolds in several files. I should have thought about it earlier and removed those ~5,000 unplaced scaffolds at the beginning.

It seems to be running with 250 chunks, and I should be able to run those jobs within 8-10 days. So that will do for this time.

Thanks again!

verku commented 1 year ago

I'm glad to hear that you found a solution, fingers crossed for your remaining analyses!