umasstr opened this issue 3 years ago
hmm... yeah, I can't really seem to find anything wrong with what you have, either. This might sound obvious, but I can't think of anything else: could you try checking that the samples file is tab-separated?
cat -t DATA/data/samples.tsv
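If it's easier, a quick programmatic check works too. A minimal sketch, assuming the path from your setup and that each row should have exactly three tab-separated columns (name, BAM, BED):

```python
# Sketch: flag any row of samples.tsv that doesn't split into exactly
# three tab-separated fields. Path and column count are assumptions
# based on your setup, not something the pipeline defines.
with open("DATA/data/samples.tsv") as f:
    for i, line in enumerate(f, start=1):
        fields = line.rstrip("\n").split("\t")
        if len(fields) != 3:
            print(f"line {i}: expected 3 tab-separated fields, got {len(fields)}")
```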
Also, it might help debug things if you comment out the SAMP_NAMES config option. Then you'll be able to see what it's parsing the samples as.
Also, regarding the output directory being ignored: it's because I've set up the run.bash script to override the output directory that you specify in the config file. So you should specify the output directory there rather than in the config. I know this is confusing and unintuitive, but at the time, it was the only way to put the log files in the same directory as the output. I've already changed how this is handled for an upcoming release!
Thanks for the quick reply. Tabs look okay to me:
2294^IDATA/2294.dup.fix.bam^IDATA/2294.bed
2296^IDATA/2296.dup.fix.bam^IDATA/2296.bed
I'm not very familiar with Snakemake, so trying to execute the Snakefile line-by-line to debug provided limited insight: object "config" may not be defined at the sample indexing step in the Snakefile.
FWIW, the pipeline ran when I moved all required files into the same varCA subdirectory, "data" ("DATA" prompted a similar error message), made sample names non-numerical, and removed any "." not preceding a required extension (i.e., name.extension).
Unfortunately, the run failed, but I'm guessing this isn't related?
[Thu Sep 9 13:58:39 2021]
rule vcf2tsv:
    input: out/callers/Two/illumina-strelka/illumina-strelka.vcf.gz, out/callers/Two/illumina-strelka/illumina-strelka.vcf.gz.tbi
    output: out/callers/Two/illumina-strelka/illumina-strelka.cd856d8a.tsv
    jobid: 81
    wildcards: sample=Two, caller=illumina-strelka, hash=cd856d8a
Activating conda environment: /varCA/.snakemake/conda/e6058793
Removing temporary output file out/callers/Two/illumina-strelka/illumina-strelka.vcf.gz.
Removing temporary output file out/callers/Two/illumina-strelka/illumina-strelka.vcf.gz.tbi.
[Thu Sep 9 13:58:43 2021]
Finished job 81.
48 of 157 steps (31%) done
Removing temporary output file out/callers/Two/varscan-indel/varscan-indel.vcf.gz.tbi.
Removing temporary output file out/callers/Two/varscan-indel/varscan-indel.vcf.gz.
[Thu Sep 9 13:58:48 2021]
Finished job 75.
49 of 157 steps (31%) done
Removing temporary output file out/callers/Two/varscan-snp/varscan-snp.vcf.gz.tbi.
Removing temporary output file out/callers/Two/varscan-snp/varscan-snp.vcf.gz.
[Thu Sep 9 13:59:16 2021]
Finished job 67.
50 of 157 steps (32%) done
Exiting because a job execution failed. Look above for error message
ah, I just thought of something! When you write 2294 in a YAML file, it automatically gets parsed as an integer. But 2294 is getting parsed as a string when it's being read from the samples file. So you just needed to put quotes around the sample names in the config file:
SAMP_NAMES: ['2294', '2296']
instead of
SAMP_NAMES: [2294, 2296]
It probably would have also worked if you had commented out the SAMP_NAMES config option completely:
# SAMP_NAMES: [2294, 2296]
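To see the type mismatch concretely, here's a minimal sketch (my illustration, not varCA code; requires PyYAML):

```python
import yaml  # PyYAML

# Unquoted scalars in YAML are parsed as integers...
config = yaml.safe_load("SAMP_NAMES: [2294, 2296]")
print(type(config["SAMP_NAMES"][0]))  # <class 'int'>

# ...but sample names read from a TSV file are strings, so a
# membership check (presumably what the pipeline does) comes up empty:
samples_from_tsv = ["2294", "2296"]
print(2294 in samples_from_tsv)    # False: int != str
print("2294" in samples_from_tsv)  # True once the names are quoted
```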
I moved all required files into the same varCA subdirectory
I'm not sure why it helped that you moved all of the files into the same subdirectory. The paths in the samples file are supposed to be interpreted relative to the directory that you execute the pipeline from.
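For what it's worth, a minimal sketch of how relative paths behave (illustration only, not varCA code):

```python
import os

# A relative path from samples.tsv resolves against the working
# directory you launch the pipeline from, not against the directory
# that contains samples.tsv itself.
rel_path = "DATA/2294.dup.fix.bam"
print(os.getcwd())                # e.g. /varCA if launched from there
print(os.path.abspath(rel_path))  # e.g. /varCA/DATA/2294.dup.fix.bam
```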
There might be an issue with string-splitting paths that contain nested directories or filenames with multiple "." characters
removed "." not preceding required extensions (i.e. name.extension)
I'm also not sure why it's having trouble with the multiple dots "." in your file names. I've looked through the code again, and there's nothing I can think of that would affect that. In fact, if anything, I think it should fail if you're missing the dot before the extension:
https://github.com/aryarm/varCA/blob/fcf59091a2032fa0126b446991d68de6683f73e3/rules/prepare.smk#L92
If you have DATA/2294.dup.fixbam instead of DATA/2294.dup.fix.bam in your samples file, it will treat the file as a FASTQ file instead of a BAM file. I'll keep thinking about it and let you know if I come up with something.
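For context, here's a minimal sketch of that kind of extension check. It's my illustration of the idea, not the actual code from prepare.smk, and the function name is made up:

```python
# Sketch: only a final ".bam" suffix marks a BAM file; a name like
# "2294.dup.fixbam" lacks it, so a check like this would fall through
# to the FASTQ branch.
def infer_input_type(path: str) -> str:
    if path.endswith(".bam"):
        return "bam"
    return "fastq"

print(infer_input_type("DATA/2294.dup.fix.bam"))  # bam
print(infer_input_type("DATA/2294.dup.fixbam"))   # fastq
```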
object "config" may not be defined at the sample indexing step in snakefile
The config dictionary should be defined as early as this line: https://github.com/aryarm/varCA/blob/fcf59091a2032fa0126b446991d68de6683f73e3/Snakefile#L7
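For reference, a minimal sketch of how that works in a Snakefile (an assumed example, not the actual varCA Snakefile):

```python
# Snakefile (sketch): the `configfile:` directive parses the YAML file
# and exposes it as the global dict `config` for everything below it.
configfile: "configs/config.yaml"

SAMP_NAMES = config.get("SAMP_NAMES")  # `config` is already defined here
```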
Unfortunately, the run failed, but I'm guessing this isn't related?
Without more info, I can't be sure what the cause is. Are there any error messages in the qlog file?
Regarding the formatting issues, I will try to re-run using your suggestions after troubleshooting the issues below. For now, the samples below worked well:
Two	data/two.bam	data/two.bed
Three	data/three.bam	data/three.bed
Regarding the run failure, I don't see any obvious errors in qlog. I ran twice and the same step (51) failed both times.
(snakemake) root@9fadc1b19969:/varCA# tail -50 out/qlog
sorting RP complete. Reads_RP.size(): 100
sorting read-pair
sorting read-pair finished.
Modify RP complete.
adding BD from RP.
modify and summarize interchr RP.
adding BD from interChr RP.
summarize BP as BD complete.
Now start sorting BD...
sorting BD... done.
external BD events: 0
Added BreakDancer-like events: 0
The number of one end mapped read: 41630
The number of one end mapped read: 42503
The number of one end mapped read: 42503
There are 268098 reads supporting the reference allele.
There are 1 samples.
SampleName2Index done
declaring g_RefCoverageRegion for 1 samples and 20568 positions.
There are 0 split-reads mapped by aligner.
Added:
Sorting and outputing deletions with non-template sequences ...
Successfully created workflow run script.
To execute the workflow, run the following script and set appropriate options: /varCA/out/callers/Three/illumina-strelka/runWorkflow.py
Successfully created workflow run script.
To execute the workflow, run the following script and set appropriate options: /varCA/out/callers/Two/illumina-strelka/runWorkflow.py
hmm... yeah, I can't tell which job is failing from the log outputs that you included.
Here are some debugging strategies:

1. Search the qlog file for the error message. You can use grep to search for rror, or use less to search for it with the forward slash. (Note that I recommend rror instead of Error, in case the error is mentioned with a lowercase e: error.)
2. Run run.bash with the -npr options to see what shell code it is running. Then, you can activate the conda environment that it is using and run the shell code manually, instead of letting Snakemake do it for you. That way, it will be easier to debug the code and catch any errors or warnings.

One of the requirements for the BAM files is that they have read-group information. Do yours?
Hi @aryarm,
Just getting back to this. The pipeline runs past the previously mentioned errors by adding read groups. Thanks for the tips!
Unfortunately, I'm confronted with a more cryptic error later on. Log here
It looks like bcftools may not be installed as a dependency? Total runtime is a few hours per sample, so I wanted to check in to see if there was anything else I'm missing in the log.
Thanks!
After preparing the required input, the pipeline can't seem to find the specified files or output directory. I don't see in the log files whether or not my sample file is recognized. I am hoping that there is an obvious issue with my file paths or the config, but I'm just not seeing it. Any help would be much appreciated.
Also, sample data ran to completion.
Log file:
(snakemake) root@c844f1072fc5:/varCA# cat out/*
My config (note that the output directory I specify is ignored):
(snakemake) root@c844f1072fc5:/varCA# grep -vF '#' configs/config.yaml | sed '/^$/d'
Sample file:
(snakemake) root@c844f1072fc5:/varCA# cat DATA/data/samples.tsv
BAM and bed files referenced in samples.tsv are present:
(snakemake) root@c844f1072fc5:/varCA# ls DATA/229* | xargs -n 1 basename
As are indexes:
root@c844f1072fc5:/varCA# ls DATA/bwa | xargs -n 1 basename
And models:
(snakemake) root@c844f1072fc5:/varCA# ls DATA/data | xargs -n 1 basename