aryarm / varCA

Use an ensemble of variant callers to call variants from ATAC-seq data
MIT License

Not all of the samples requested have provided input #40

Open umasstr opened 2 years ago

umasstr commented 2 years ago

After preparing the required input, the pipeline can't seem to find the specified files or output directory. I don't see in the log files whether or not my sample file is recognized. I am hoping that there is an obvious issue with my file paths or the config, but I'm just not seeing it. Any help would be much appreciated.

Also, the sample data ran to completion.

Log file: (snakemake) root@c844f1072fc5:/varCA# cat out/*

/varCA/Snakefile:51: UserWarning: Not all of the samples requested have provided input. Proceeding with as many samples as is possible...
  rule all:
Building DAG of jobs...
Using shell: /bin/bash
Provided cores: 8
Rules claiming more threads will be scaled down.
Singularity containers: ignored
Job counts:
        count   jobs
        1       all
        1
[Thu Sep 9 11:59:06 2021]
localrule all:
    jobid: 0
[Thu Sep 9 11:59:06 2021]
Finished job 0.
1 of 1 steps (100%) done
Complete log: /varCA/.snakemake/log/2021-09-09T115906.454263.snakemake.log

My config (note that the output directory I specify is ignored):

(snakemake) root@c844f1072fc5:/varCA# grep -vF '#' configs/config.yaml | sed '/^$/d'

sample_file: DATA/data/samples.tsv
SAMP_NAMES: [2294, 2296]
genome: DATA/bwa/genome.fa
out: DATA/out_01
snp_callers: [gatk-snp, varscan-snp, vardict-snp]
indel_callers: [gatk-indel, varscan-indel, vardict-indel, pindel, illumina-strelka]
snp_filter: ['gatk-snp~DP>10']
indel_filter: ['gatk-indel~DP>10']
snp_model: DATA/data/snp.rda
indel_model: DATA/data/indel.rda

Sample file:

(snakemake) root@c844f1072fc5:/varCA# cat DATA/data/samples.tsv

2294	DATA/2294.dup.fix.bam	DATA/2294.bed
2296	DATA/2296.dup.fix.bam	DATA/2296.bed

BAM and bed files referenced in samples.tsv are present:

(snakemake) root@c844f1072fc5:/varCA# ls DATA/229* | xargs -n 1 basename

2294.bam
2294.bam.bai
2294.bed
2294.dup.bam
2294.dup.bam.bai
2294.dup.fix.bam
2294.dup.fix.bam.bai
2294_peaks.narrowPeak
2296.bam
2296.bam.bai
2296.bed
2296.dup.bam
2296.dup.bam.bai
2296.dup.fix.bam
2296.dup.fix.bam.bai
2296_peaks.narrowPeak

As are indexes:

root@c844f1072fc5:/varCA# ls DATA/bwa | xargs -n 1 basename

genome.dict
genome.fa
genome.fa.amb
genome.fa.ann
genome.fa.bwt
genome.fa.fai
genome.fa.pac
genome.fa.sa

And models:

(snakemake) root@c844f1072fc5:/varCA# ls DATA/data | xargs -n 1 basename

README.md
indel.rda
indel.tsv.gz
samples.tsv
snp.rda
snp.tsv.gz

aryarm commented 2 years ago

hmm... yeah, I can't really seem to find anything wrong with what you have, either. This might sound obvious, but I can't think of anything else: could you try checking that the samples file is tab-separated?

cat -t DATA/data/samples.tsv

Also, it might help debug things if you comment out the SAMP_NAMES config option. Then you'll be able to see what it's parsing the samples as.

Also, regarding the output directory being ignored: that's because I've set up the run.bash script to override the output directory you specify in the config file. So you should specify the output directory there (in run.bash) rather than in the config. I know this is confusing and unintuitive, but at the time, it was the only way to put the log files in the same directory as the output. I've already changed the way this is handled for an upcoming release!
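In the meantime, look for the output-path assignment near the top of run.bash and edit it there (a sketch; the exact variable name in your copy of the script may differ):

# near the top of run.bash -- the output directory is set here and
# overrides the "out" option in configs/config.yaml
# (hypothetical variable name; check your copy of the script)
out_path="DATA/out_01"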

umasstr commented 2 years ago

Thanks for the quick reply. Tabs look okay to me:

2294^IDATA/2294.dup.fix.bam^IDATA/2294.bed
2296^IDATA/2296.dup.fix.bam^IDATA/2296.bed

I'm not very familiar with Snakemake, so trying to execute the Snakefile line by line to debug provided only limited insight:

  1. There might be an issue with string-splitting paths that contain nested directories or filenames with multiple "." characters (see the quick check after this list)
  2. Line 69 in the Snakefile may not tolerate my use of numerical sample names (i.e. adding # to characters)
  3. The "config" object may not be defined at the sample-indexing step in the Snakefile
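As a sanity check on the first point, this is how Python itself splits each line of the samples file on tabs (not the pipeline's exact parsing code, just an illustration):

python3 -c "print([line.strip().split('\t') for line in open('DATA/data/samples.tsv')])"
# expected: [['2294', 'DATA/2294.dup.fix.bam', 'DATA/2294.bed'], ['2296', 'DATA/2296.dup.fix.bam', 'DATA/2296.bed']]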

FWIW, the pipeline ran when I moved all required files into the varCA subdirectory "data" ("DATA" prompted a similar error message), made the sample names non-numerical, and removed any "." not preceding a required extension (i.e. name.extension).

Unfortunately, the run failed, but I'm guessing this isn't related?

[Thu Sep 9 13:58:39 2021]
rule vcf2tsv:
    input: out/callers/Two/illumina-strelka/illumina-strelka.vcf.gz, out/callers/Two/illumina-strelka/illumina-strelka.vcf.gz.tbi
    output: out/callers/Two/illumina-strelka/illumina-strelka.cd856d8a.tsv
    jobid: 81
    wildcards: sample=Two, caller=illumina-strelka, hash=cd856d8a
Activating conda environment: /varCA/.snakemake/conda/e6058793
Removing temporary output file out/callers/Two/illumina-strelka/illumina-strelka.vcf.gz.
Removing temporary output file out/callers/Two/illumina-strelka/illumina-strelka.vcf.gz.tbi.
[Thu Sep 9 13:58:43 2021]
Finished job 81.
48 of 157 steps (31%) done
Removing temporary output file out/callers/Two/varscan-indel/varscan-indel.vcf.gz.tbi.
Removing temporary output file out/callers/Two/varscan-indel/varscan-indel.vcf.gz.
[Thu Sep 9 13:58:48 2021]
Finished job 75.
49 of 157 steps (31%) done
Removing temporary output file out/callers/Two/varscan-snp/varscan-snp.vcf.gz.tbi.
Removing temporary output file out/callers/Two/varscan-snp/varscan-snp.vcf.gz.
[Thu Sep 9 13:59:16 2021]
Finished job 67.
50 of 157 steps (32%) done
Exiting because a job execution failed. Look above for error message

aryarm commented 2 years ago

ah, I just thought of something! When you write 2294 in a YAML file, it automatically gets parsed as an integer. But 2294 is getting parsed as a string when it's being read from the samples file. So you just needed to put quotes around the sample names in the config file:

SAMP_NAMES: ['2294', '2296']

instead of

SAMP_NAMES: [2294, 2296]
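You can reproduce the difference with PyYAML (which, as far as I know, is what Snakemake uses under the hood to read the config):

python3 -c 'import yaml; print(yaml.safe_load("SAMP_NAMES: [2294, 2296]"))'
# {'SAMP_NAMES': [2294, 2296]}          <- integers
python3 -c "import yaml; print(yaml.safe_load(\"SAMP_NAMES: ['2294', '2296']\"))"
# {'SAMP_NAMES': ['2294', '2296']}      <- strings, matching the names read from samples.tsv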

It probably would have also worked if you had commented out the SAMP_NAMES config option completely:

# SAMP_NAMES: [2294, 2296]

> I moved all required files into the same varCA subdirectory

I'm not sure why it helped that you moved all of the files into the same subdirectory. The paths in the samples file are supposed to be interpreted relative to the directory that you execute the pipeline from.
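One way to double-check what a relative path from the samples file resolves to is to expand it from the directory you run the pipeline in, for example:

readlink -f DATA/2294.dup.fix.bam
# should print /varCA/DATA/2294.dup.fix.bam if you're running from /varCA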

> There might be an issue string splitting paths containing nested directories or filenames with multiple "."

> removed "." not preceding required extensions (i.e. name.extension)

I'm also not sure why it's having trouble with the multiple dots "." in your file names. I've looked through the code again, and there's nothing I can think of that would affect that. In fact, if anything, I think it should fail if you're missing the dot before the extension: https://github.com/aryarm/varCA/blob/fcf59091a2032fa0126b446991d68de6683f73e3/rules/prepare.smk#L92 If you have DATA/2294.dup.fixbam instead of DATA/2294.dup.fix.bam in your samples file, it will treat the file as a FASTQ file instead of a BAM file. I'll keep thinking about it and let you know if I come up with something.
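For what it's worth, standard extension handling in Python only looks at the text after the last dot, so extra dots earlier in the name shouldn't matter (a general illustration, not necessarily the exact call the pipeline makes):

python3 -c "import os; print(os.path.splitext('DATA/2294.dup.fix.bam'))"
# ('DATA/2294.dup.fix', '.bam')   <- only the last dot determines the extension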

> object "config" may not be defined at the sample indexing step in snakefile

The config dictionary should be defined as early as this line: https://github.com/aryarm/varCA/blob/fcf59091a2032fa0126b446991d68de6683f73e3/Snakefile#L7

> Unfortunately, the run failed, but I'm guessing this isn't related?

Without more info, I can't be sure what the cause is. Are there any error messages in the qlog file?

umasstr commented 2 years ago

Regarding the formatting issues, I will try to re-run with your suggestions after troubleshooting the issues below. For now, the samples below worked well:

Two	data/two.bam	data/two.bed
Three	data/three.bam	data/three.bed

Regarding the run failure, I don't see any obvious errors in qlog. I ran twice, and the same step (51) failed both times.

(snakemake) root@9fadc1b19969:/varCA# tail -50 out/qlog

sorting RP complete.
Reads_RP.size(): 100
sorting read-pair
sorting read-pair finished.
Modify RP complete.
adding BD from RP.
modify and summarize interchr RP.
adding BD from interChr RP.
summarize BP as BD complete.
Now start sorting BD...
sorting BD... done.
external BD events: 0
Added BreakDancer-like events: 0
The number of one end mapped read: 41630
The number of one end mapped read: 42503
The number of one end mapped read: 42503
There are 268098 reads supporting the reference allele.
There are 1 samples.
SampleName2Index done
declaring g_RefCoverageRegion for 1 samples and 20568 positions.
There are 0 split-reads mapped by aligner.
Added:
Sorting and outputing deletions with non-template sequences ...
Successfully created workflow run script.
To execute the workflow, run the following script and set appropriate options:
/varCA/out/callers/Three/illumina-strelka/runWorkflow.py
Successfully created workflow run script.
To execute the workflow, run the following script and set appropriate options:
/varCA/out/callers/Two/illumina-strelka/runWorkflow.py

aryarm commented 2 years ago

hmm... yeah, I can't tell which job is failing from the log output you included.

Here are some debugging strategies:

  1. Check whether any of the output files have weird file sizes. This can indicate that a job failed early on but Snakemake didn't catch it. If so, delete those files and rerun Snakemake after fixing the issue; otherwise, Snakemake will keep restarting the workflow from the corrupted files. Every time you run the workflow, it resumes from the last step that failed (exited with a non-zero exit code), but the real issue might have occurred earlier.
  2. You can search the log files for errors by grepping for rror, or by opening them in less and searching with the forward slash. (I recommend searching for rror rather than Error, in case the error is written with a lowercase e.)
  3. Once you figure out which job failed, you can execute run.bash with the -npr options to see what shell code it runs. Then you can activate the conda environment it uses and run that shell code manually, instead of letting Snakemake do it for you; that makes it much easier to catch any errors or warnings. (A sketch of all three checks follows this list.)
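Something like this (adjust the paths and the size threshold to your setup):

# 1) flag suspiciously small output files (< 1 KB here; possibly truncated by a silent failure)
find out -type f -size -1024c -exec ls -lh {} +

# 2) search every log for errors, uppercase or lowercase, skipping binary files
grep -rnI rror out/

# 3) dry run: print the jobs and shell commands Snakemake would execute
./run.bash -npr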

Also, one of the requirements for the BAM files is that they have read-group information. Do yours?
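You can check with samtools (assuming your BAMs are at the paths from your samples file):

# print any read-group header lines; no output means the BAM lacks read groups
samtools view -H DATA/2294.dup.fix.bam | grep '^@RG'

If they're missing, Picard's AddOrReplaceReadGroups is one way to add them.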

umasstr commented 1 year ago

Hi @aryarm,

Just getting back to this. After adding read groups, the pipeline now runs past the previously mentioned errors. Thanks for the tips!

Unfortunately, I'm confronted with a more cryptic error later on. Log here

It looks like bcftools may not be installed as a dependency? Total runtime is a few hours per sample, so I wanted to check in to see if there was anything else I'm missing in the log.

Thanks!