This is odd. What's happening is that one read from R2 is missing in R1. This should not happen, since bbmap re-pairs the reads from R1 and R2 before mapping, ensuring both files contain exactly the same reads. Did you interact with those files manually in any way?
edit: Adding the read id to the exit message so that it is easier to debug. You can use the develop branch to get this info.
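If you want to double-check on your side, a standalone sanity check along these lines (hypothetical file names, not part of dropSeqPipe) would confirm whether both fastq files really contain the same read ids:

```python
# Standalone sanity check: confirm both fastq files contain exactly the same
# read ids, ignoring the /1 and /2 mate markers. File names are hypothetical.
import gzip

from Bio import SeqIO


def read_ids(path):
    with gzip.open(path, "rt") as handle:
        # rec.id is the first token of the header, e.g. C7JT8:1:1102:13714:1859/1
        return {rec.id.rsplit("/", 1)[0] for rec in SeqIO.parse(handle, "fastq")}


r1_ids = read_ids("RA0449.0_R1.fastq.gz")
r2_ids = read_ids("RA0449.0_R2.fastq.gz")
print("only in R1:", len(r1_ids - r2_ids))
print("only in R2:", len(r2_ids - r1_ids))
```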
Thanks for this! Now I have the helpful error message:
Read C7JT8:1:1102:13714:1859 from mapped file is missing in reference fastq file!
I'm checking it out and that read is definitely in both the input _R1.fastq.gz and _R2.fastq.gz files:
Read 1:
@C7JT8:1:1102:13714:1859/1
GTGGTGGGTTACAGTGAGCT
+
CCCCCGGGGGGGGGGGGGGG
Read 2:
@C7JT8:1:1102:13714:1859/2
GGCCAGGCTGGTCTCAAACTCCTGACCTCAGGCAATCCGCCCACCTTGGCCTCCCAAAGTGCTGAGGAACCCAGTTTGAAAACCATTC
+
CCCCCFGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGG
It is also in aligned.out.bam:
C7JT8:1:1102:13714:1859 0 10 63052789 255 66M22S * 0 0 GGCCAGGCTGGTCTCAAACTCCTGACCTCAGGCAATCCGCCCACCTTGGCCTCCCAAAGTGCTGAGGAACCCAGTTTGAAAACCATTC CCCCCFGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGG NH:i:1 HI:i:1 AS:i:64 nM:i:0
and in trimmmed_repaired_R1.fastq.gz, where it doesn't appear to have been trimmed at all, so it is identical to the read in the input _R1.fastq.gz.
Any idea what could be going on? I would try to debug further myself, but the relevant Python file /home/dropSeqPipe/.snakemake/scripts/tmpw0q_o4n2.merge_bam.py seems to disappear.
I guess it comes from the /1 and /2 at the end of the read id. R1 is read by a fastq parser from SeqIO and the BAM file is read by a pysam parser. I think pysam is not keeping the /2 at the end. This would explain the problem of not finding the read although it is there.
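A minimal sketch of the suspected mismatch, using the file names from this thread (the lookup logic is illustrative, not the pipeline's actual merge_bam.py):

```python
# Illustrative sketch only: fastq ids from SeqIO keep the /1 mate suffix,
# while the read names seen in the aligned BAM do not, so a plain dict/set
# lookup fails. Stripping the suffix before comparing avoids the mismatch.
import gzip

import pysam
from Bio import SeqIO


def strip_mate_suffix(name):
    """Drop a trailing /1 or /2 mate marker if present."""
    return name[:-2] if name.endswith(("/1", "/2")) else name


# Read ids as SeqIO reports them from the repaired R1 fastq (keeps the /1 suffix).
with gzip.open("trimmmed_repaired_R1.fastq.gz", "rt") as handle:
    r1_ids = {strip_mate_suffix(rec.id) for rec in SeqIO.parse(handle, "fastq")}

# Read names as pysam reports them from the aligned BAM (no mate suffix),
# looked up against the fastq ids.
with pysam.AlignmentFile("Aligned.out.bam", "rb") as bam:
    for read in bam:
        if read.query_name not in r1_ids:
            raise SystemExit(
                f"Read {read.query_name} from mapped file is missing "
                "in reference fastq file!"
            )
```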
I tried a quick fix that strips the /1 or /2 from the end of the read id, in the branch feature/debugging_mergeBam.
Try it out and let me know if it works.
edit: The tmp scripts are deleted by snakemake. That is normal, don't worry.
That seems to have fixed that issue! But now it is crashing at the repair_barcodes step with a very uninformative error message:
Building DAG of jobs...
Creating conda environment https://bitbucket.org/snakemake/snakemake-wrappers/raw/0.27.1/bio/fastqc/environment.yaml...
Downloading remote packages.
Environment for ../../tmp/tmp1bgk4k5i.yaml created (location: .snakemake/conda/73b3d757)
Creating conda environment envs/plots_ext.yaml...
Downloading remote packages.
Environment for envs/plots_ext.yaml created (location: .snakemake/conda/1290ea5a)
Creating conda environment envs/cutadapt.yaml...
Downloading remote packages.
Environment for envs/cutadapt.yaml created (location: .snakemake/conda/7dc41205)
Creating conda environment envs/star.yaml...
Downloading remote packages.
Environment for envs/star.yaml created (location: .snakemake/conda/fe1064ae)
Creating conda environment envs/dropseq_tools.yaml...
Downloading remote packages.
Environment for envs/dropseq_tools.yaml created (location: .snakemake/conda/dd296d1f)
Creating conda environment https://bitbucket.org/snakemake/snakemake-wrappers/raw/0.21.0/bio/multiqc/environment.yaml...
Downloading remote packages.
Environment for ../../tmp/tmp2g315g_j.yaml created (location: .snakemake/conda/81acb004)
Creating conda environment envs/merge_bam.yaml...
Downloading remote packages.
Environment for envs/merge_bam.yaml created (location: .snakemake/conda/4b9c1953)
Creating conda environment envs/plots.yaml...
Downloading remote packages.
Environment for envs/plots.yaml created (location: .snakemake/conda/840da6c6)
Creating conda environment https://bitbucket.org/snakemake/snakemake-wrappers/raw/0.27.1/bio/star/align/environment.yaml...
Downloading remote packages.
Environment for ../../tmp/tmp6n23qi3n.yaml created (location: .snakemake/conda/54fabd57)
Creating conda environment envs/bbmap.yaml...
Downloading remote packages.
Environment for envs/bbmap.yaml created (location: .snakemake/conda/8c850d6e)
Creating conda environment envs/picard.yaml...
Downloading remote packages.
Environment for envs/picard.yaml created (location: .snakemake/conda/8331163d)
Using shell: /bin/bash
Provided cores: 6
Rules claiming more threads will be scaled down.
Job counts:
count jobs
1 DetectBeadSubstitutionErrors
1 MergeBamAlignment
1 STAR_align
1 SingleCellRnaSeqMetricsCollector
1 TagReadWithGeneExon
1 all
1 bam_hist
1 bead_errors_metrics
1 clean_cutadapt
2 convert_long_to_mtx
1 create_dict
1 create_intervals
1 create_refFlat
1 create_star_index
1 curate_annotation
1 cutadapt_R1
1 cutadapt_R2
1 extend_barcode_whitelist
1 extract_reads_expression
1 extract_umi_expression
1 fastqc_barcodes
1 fastqc_reads
2 merge_long
1 multiqc_cutadapt_RNA
1 multiqc_cutadapt_barcodes
1 multiqc_fastqc_barcodes
1 multiqc_fastqc_reads
1 multiqc_star
1 plot_adapter_content
1 plot_knee_plot
1 plot_rna_metrics
1 plot_yield
1 reduce_gtf
1 repair
1 repair_barcodes
1 violine_plots
38
[Mon Dec 31 04:38:24 2018]
localrule extend_barcode_whitelist:
input: /home/barcode_whitelist.txt
output: /home/results/samples/RA0449.0/barcodes.csv, /home/results/samples/RA0449.0/barcode_ref.pkl, /home/results/samples/RA0449.0/barcode_ext_ref.pkl, /home/results/samples/RA0449.0/empty_barcode_mapping.pkl
jobid: 24
wildcards: results_dir=/home/results, sample=RA0449.0
[Mon Dec 31 04:38:24 2018]
rule fastqc_reads:
input: /home/data/RA0449.0_R2.fastq.gz
output: /home/results/logs/fastqc/RA0449.0_R2_fastqc.html, /home/results/logs/fastqc/RA0449.0_R2_fastqc.zip
jobid: 18
wildcards: results_dir=/home/results, sample=RA0449.0
[Mon Dec 31 04:38:24 2018]
rule fastqc_barcodes:
input: /home/data/RA0449.0_R1.fastq.gz
output: /home/results/logs/fastqc/RA0449.0_R1_fastqc.html, /home/results/logs/fastqc/RA0449.0_R1_fastqc.zip
jobid: 19
wildcards: results_dir=/home/results, sample=RA0449.0
[Mon Dec 31 04:38:24 2018]
localrule create_dict:
input: /home/ref/MmulKitwit_8_92/genome.fa
output: /home/ref/MmulKitwit_8_92/genome.dict
jobid: 35
wildcards: ref_path=/home/ref, species=MmulKitwit, build=8, release=92
Activating conda environment: /home/dropSeqPipe/.snakemake/conda/8331163d
[Mon Dec 31 04:38:24 2018]
localrule curate_annotation:
input: /home/dropSeqPipe/templates/gtf_biotypes.yaml, /home/ref/MmulKitwit_8_92/annotation.gtf
output: /home/ref/MmulKitwit_8_92/curated_annotation.gtf
jobid: 17
wildcards: ref_path=/home/ref, species=MmulKitwit, build=8, release=92
[Mon Dec 31 04:38:24 2018]
Finished job 24.
1 of 38 steps (3%) done
Activating conda environment: /home/dropSeqPipe/.snakemake/conda/73b3d757
Activating conda environment: /home/dropSeqPipe/.snakemake/conda/73b3d757
[Mon Dec 31 04:38:26 2018]
Finished job 17.
2 of 38 steps (5%) done
[Mon Dec 31 04:38:54 2018]
Finished job 19.
3 of 38 steps (8%) done
[Mon Dec 31 04:38:54 2018]
localrule multiqc_fastqc_barcodes:
input: /home/results/logs/fastqc/RA0449.0_R1_fastqc.html
output: /home/results/reports/fastqc_barcodes.html
jobid: 3
wildcards: results_dir=/home/results
Conda environment defines Python version < 3.5. Using Python of the master process to execute script. Note that this cannot be avoided, because the script uses data structures from Snakemake which are Python >=3.5 only.
Activating conda environment: /home/dropSeqPipe/.snakemake/conda/81acb004
[Mon Dec 31 04:38:55 2018]
Finished job 35.
4 of 38 steps (11%) done
[Mon Dec 31 04:38:55 2018]
localrule create_refFlat:
input: /home/ref/MmulKitwit_8_92/genome.dict, /home/ref/MmulKitwit_8_92/curated_annotation.gtf
output: /home/ref/MmulKitwit_8_92/curated_annotation.refFlat
jobid: 32
wildcards: ref_path=/home/ref, species=MmulKitwit, build=8, release=92
Activating conda environment: /home/dropSeqPipe/.snakemake/conda/dd296d1f
[Mon Dec 31 04:38:55 2018]
localrule reduce_gtf:
input: /home/ref/MmulKitwit_8_92/genome.dict, /home/ref/MmulKitwit_8_92/curated_annotation.gtf
output: /home/ref/MmulKitwit_8_92/curated_reduced_annotation.gtf
jobid: 36
wildcards: ref_path=/home/ref, species=MmulKitwit, build=8, release=92
Activating conda environment: /home/dropSeqPipe/.snakemake/conda/dd296d1f
[Mon Dec 31 04:38:58 2018]
Finished job 3.
5 of 38 steps (13%) done
[Mon Dec 31 04:39:27 2018]
Finished job 18.
6 of 38 steps (16%) done
[Mon Dec 31 04:39:27 2018]
localrule multiqc_fastqc_reads:
input: /home/results/logs/fastqc/RA0449.0_R2_fastqc.html
output: /home/results/reports/fastqc_reads.html
jobid: 2
wildcards: results_dir=/home/results
Conda environment defines Python version < 3.5. Using Python of the master process to execute script. Note that this cannot be avoided, because the script uses data structures from Snakemake which are Python >=3.5 only.
Activating conda environment: /home/dropSeqPipe/.snakemake/conda/81acb004
[Mon Dec 31 04:39:30 2018]
Finished job 2.
7 of 38 steps (18%) done
[Mon Dec 31 04:39:38 2018]
Finished job 36.
8 of 38 steps (21%) done
[Mon Dec 31 04:39:38 2018]
localrule create_intervals:
input: /home/ref/MmulKitwit_8_92/curated_reduced_annotation.gtf, /home/ref/MmulKitwit_8_92/genome.dict
output: /home/ref/MmulKitwit_8_92/annotation.rRNA.intervals
jobid: 33
wildcards: ref_path=/home/ref, species=MmulKitwit, build=8, release=92
Activating conda environment: /home/dropSeqPipe/.snakemake/conda/dd296d1f
[Mon Dec 31 04:39:40 2018]
Finished job 32.
9 of 38 steps (24%) done
[Mon Dec 31 04:40:20 2018]
Finished job 33.
10 of 38 steps (26%) done
[Mon Dec 31 04:40:20 2018]
rule create_star_index:
input: /home/ref/MmulKitwit_8_92/genome.fa, /home/ref/MmulKitwit_8_92/curated_annotation.gtf
output: /home/ref/MmulKitwit_8_92/STAR_INDEX/SA_88/SA
jobid: 1
wildcards: ref_path=/home/ref, species=MmulKitwit, build=8, release=92, read_length=88
threads: 6
Activating conda environment: /home/dropSeqPipe/.snakemake/conda/fe1064ae
Removing temporary output file /home/ref/MmulKitwit_8_92/curated_annotation.gtf.
[Mon Dec 31 05:47:25 2018]
Finished job 1.
11 of 38 steps (29%) done
[Mon Dec 31 05:47:25 2018]
rule cutadapt_R2:
input: /home/data/RA0449.0_R2.fastq.gz, /home/NexteraPE-SeqWell-PE-fastqc.fa
output: /home/results/samples/RA0449.0/trimmmed_R2.fastq.gz
log: /home/results/logs/cutadapt/RA0449.0_R2.qc.txt
jobid: 22
wildcards: results_dir=/home/results, sample=RA0449.0
threads: 6
Activating conda environment: /home/dropSeqPipe/.snakemake/conda/7dc41205
[Mon Dec 31 05:51:12 2018]
Finished job 22.
12 of 38 steps (32%) done
[Mon Dec 31 05:51:12 2018]
rule cutadapt_R1:
input: /home/data/RA0449.0_R1.fastq.gz, /home/NexteraPE-SeqWell-PE-fastqc.fa
output: /home/results/samples/RA0449.0/trimmmed_R1.fastq.gz
log: /home/results/logs/cutadapt/RA0449.0_R1.qc.txt
jobid: 21
wildcards: results_dir=/home/results, sample=RA0449.0
threads: 6
Activating conda environment: /home/dropSeqPipe/.snakemake/conda/7dc41205
[Mon Dec 31 05:53:15 2018]
Finished job 21.
13 of 38 steps (34%) done
[Mon Dec 31 05:53:15 2018]
localrule clean_cutadapt:
input: /home/results/logs/cutadapt/RA0449.0_R1.qc.txt, /home/results/logs/cutadapt/RA0449.0_R2.qc.txt
output: /home/results/logs/cutadapt/RA0449.0.clean_qc.csv
jobid: 20
wildcards: results_dir=/home/results, sample=RA0449.0
[Mon Dec 31 05:53:15 2018]
rule repair:
input: /home/results/samples/RA0449.0/trimmmed_R1.fastq.gz, /home/results/samples/RA0449.0/trimmmed_R2.fastq.gz
output: /home/results/samples/RA0449.0/trimmmed_repaired_R1.fastq.gz, /home/results/samples/RA0449.0/trimmmed_repaired_R2.fastq.gz
log: /home/results/logs/bbmap/RA0449.0_repair.txt
jobid: 26
wildcards: results_dir=/home/results, sample=RA0449.0
threads: 4
[Mon Dec 31 05:53:15 2018]
localrule multiqc_cutadapt_RNA:
input: /home/results/logs/cutadapt/RA0449.0_R2.qc.txt
output: /home/results/reports/RNA_filtering.html
jobid: 6
wildcards: results_dir=/home/results
Activating conda environment: /home/dropSeqPipe/.snakemake/conda/8c850d6e
[Mon Dec 31 05:53:16 2018]
Finished job 20.
14 of 38 steps (37%) done
[Mon Dec 31 05:53:16 2018]
localrule multiqc_cutadapt_barcodes:
input: /home/results/logs/cutadapt/RA0449.0_R1.qc.txt
output: /home/results/reports/barcode_filtering.html
jobid: 5
wildcards: results_dir=/home/results
Conda environment defines Python version < 3.5. Using Python of the master process to execute script. Note that this cannot be avoided, because the script uses data structures from Snakemake which are Python >=3.5 only.
Activating conda environment: /home/dropSeqPipe/.snakemake/conda/81acb004
Conda environment defines Python version < 3.5. Using Python of the master process to execute script. Note that this cannot be avoided, because the script uses data structures from Snakemake which are Python >=3.5 only.
Activating conda environment: /home/dropSeqPipe/.snakemake/conda/81acb004
[Mon Dec 31 05:53:19 2018]
Finished job 6.
15 of 38 steps (39%) done
[Mon Dec 31 05:53:19 2018]
localrule plot_adapter_content:
input: /home/results/logs/cutadapt/RA0449.0.clean_qc.csv
output: /home/results/plots/adapter_content.pdf
jobid: 4
wildcards: results_dir=/home/results
[Mon Dec 31 05:53:19 2018]
Finished job 5.
Activating conda environment: /home/dropSeqPipe/.snakemake/conda/840da6c6
16 of 38 steps (42%) done
[Mon Dec 31 05:53:26 2018]
Finished job 4.
17 of 38 steps (45%) done
Removing temporary output file /home/results/samples/RA0449.0/trimmmed_R1.fastq.gz.
Removing temporary output file /home/results/samples/RA0449.0/trimmmed_R2.fastq.gz.
[Mon Dec 31 05:53:47 2018]
Finished job 26.
18 of 38 steps (47%) done
[Mon Dec 31 05:53:47 2018]
rule STAR_align:
input: /home/results/samples/RA0449.0/trimmmed_repaired_R2.fastq.gz, /home/ref/MmulKitwit_8_92/STAR_INDEX/SA_88/SA
output: /home/results/samples/RA0449.0/Aligned.out.bam
log: /home/results/samples/RA0449.0/Log.final.out
jobid: 25
wildcards: results_dir=/home/results, sample=RA0449.0
threads: 6
Activating conda environment: /home/dropSeqPipe/.snakemake/conda/54fabd57
[Mon Dec 31 05:59:07 2018]
Finished job 25.
19 of 38 steps (50%) done
[Mon Dec 31 05:59:07 2018]
localrule multiqc_star:
input: /home/results/samples/RA0449.0/Log.final.out
output: /home/results/reports/star.html
jobid: 8
wildcards: results_dir=/home/results
[Mon Dec 31 05:59:07 2018]
rule MergeBamAlignment:
input: /home/results/samples/RA0449.0/Aligned.out.bam, /home/results/samples/RA0449.0/trimmmed_repaired_R1.fastq.gz
output: /home/results/samples/RA0449.0/Aligned.merged.bam
jobid: 39
wildcards: results_dir=/home/results, sample=RA0449.0
[Mon Dec 31 05:59:07 2018]
localrule plot_yield:
input: /home/results/logs/cutadapt/RA0449.0_R1.qc.txt, /home/results/logs/cutadapt/RA0449.0_R2.qc.txt, /home/results/logs/bbmap/RA0449.0_repair.txt, /home/results/samples/RA0449.0/Log.final.out
output: /home/results/plots/yield.pdf
jobid: 9
wildcards: results_dir=/home/results
Activating conda environment: /home/dropSeqPipe/.snakemake/conda/4b9c1953
Activating conda environment: /home/dropSeqPipe/.snakemake/conda/840da6c6
Conda environment defines Python version < 3.5. Using Python of the master process to execute script. Note that this cannot be avoided, because the script uses data structures from Snakemake which are Python >=3.5 only.
Activating conda environment: /home/dropSeqPipe/.snakemake/conda/81acb004
[Mon Dec 31 05:59:12 2018]
Finished job 8.
20 of 38 steps (53%) done
[Mon Dec 31 05:59:15 2018]
Finished job 9.
21 of 38 steps (55%) done
Removing temporary output file /home/results/samples/RA0449.0/Aligned.out.bam.
[Mon Dec 31 06:01:43 2018]
Finished job 39.
22 of 38 steps (58%) done
[Mon Dec 31 06:01:43 2018]
rule repair_barcodes:
input: /home/results/samples/RA0449.0/Aligned.merged.bam, /home/results/samples/RA0449.0/barcode_ref.pkl, /home/results/samples/RA0449.0/barcode_ext_ref.pkl, /home/results/samples/RA0449.0/empty_barcode_mapping.pkl
output: /home/results/samples/RA0449.0/Aligned.repaired.bam, /home/results/samples/RA0449.0/barcode_mapping_counts.pkl
jobid: 38
wildcards: results_dir=/home/results, sample=RA0449.0
Activating conda environment: /home/dropSeqPipe/.snakemake/conda/4b9c1953
[Mon Dec 31 06:01:44 2018]
Error in rule repair_barcodes:
jobid: 38
output: /home/results/samples/RA0449.0/Aligned.repaired.bam, /home/results/samples/RA0449.0/barcode_mapping_counts.pkl
conda-env: /home/dropSeqPipe/.snakemake/conda/4b9c1953
RuleException:
CalledProcessError in line 72 of /home/dropSeqPipe/rules/cell_barcodes.smk:
Command 'source activate /home/dropSeqPipe/.snakemake/conda/4b9c1953; set -euo pipefail; python /home/dropSeqPipe/.snakemake/scripts/tmpfqi965ma.repair_barcodes.py ' returned non-zero exit status 1.
File "/home/dropSeqPipe/rules/cell_barcodes.smk", line 72, in __rule_repair_barcodes
File "/opt/miniconda3/lib/python3.6/concurrent/futures/thread.py", line 56, in run
Removing output files of failed job repair_barcodes since they might be corrupted:
/home/results/samples/RA0449.0/Aligned.repaired.bam
Shutting down, this might take some time.
Exiting because a job execution failed. Look above for error message
Complete log: /home/dropSeqPipe/.snakemake/log/2018-12-31T042813.325812.snakemake.log
I'll be out of town until 1/8 but I am happy to try to debug this further when I get back.
Ok have fun.
When you come back, can you find out the name of the sequencer that produced the data? That might help implement support for a different read-name standard.
I have modified the same branch again; it might have fixed the issue. I think it comes from a split I did on the read name to get the lane of the sequencer.
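For context, a hypothetical illustration (not the pipeline's exact code) of how a lane-extraction split can break when the read-name layout differs:

```python
# Hypothetical illustration of a lane-extraction split going wrong when the
# read-name layout differs. The first name below is made up; the second is
# the read id from this thread.

# bcl2fastq2-style names follow instrument:run:flowcell:lane:tile:x:y,
# so the lane sits at index 3 after splitting on ':'.
bcl2fastq_name = "M00123:45:000000000-ABCDE:1:1102:13714:1859"  # made-up example
print(bcl2fastq_name.split(":")[3])  # -> '1', the lane

# The read names in this dataset have fewer fields, so the same split
# returns a coordinate instead of the lane.
thread_name = "C7JT8:1:1102:13714:1859"  # the read id from this thread
print(thread_name.split(":")[3])  # -> '13714', not a lane
```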
That was data from a MiSeq. We ultimately traced it to a read-name issue arising from the fact that the data was demultiplexed with Picard rather than bcl2fastq2. Several steps didn't like having /1 and /2 in the read names to indicate which mate they belonged to, and the lane information also occurred in a different place in the read name. We want to use bcl2fastq2 going forward anyway, so we have been avoiding this problem, but it is worth letting you know the source.
Thanks @dylkot. I'll close this now since it has been fixed.
I'm getting an error message at the MergeBamAlignment phase of the pipeline.
Maybe I am missing something about how the pipeline works, but when I try to investigate /home/dropSeqPipe/.snakemake/scripts/tmpw0q_o4n2.merge_bam.py, the file doesn't seem to be there. The input files:
/home/results/samples/RA0449.0/Aligned.out.bam, /home/results/samples/RA0449.0/trimmmed_repaired_R1.fastq.gz
seem normal as far as I can tell, except that there are a fair number of heavily trimmed reads in Aligned.out.bam. We started out with 88 bp libraries, but we are having an adapter contamination issue where many of the reads are made up predominantly of SeqB_rc. I'm not sure whether that could be relevant to the issue at all. Thanks!