Hoohm / dropSeqPipe

A SingleCell RNASeq pre-processing snakemake workflow
Creative Commons Attribution Share Alike 4.0 International
147 stars 46 forks source link

help undestanding output #103

Open RichardCorbett opened 4 years ago

RichardCorbett commented 4 years ago

Hi dropseqpipers,

I have downsampled some single cell data to test for the quality of the results using the same number of reads per cell. I have given each of my 100 cells 10,000 reads in the fastq files.

After analysis I see that only have about ~3000 reads per cell that are used for the anlysis and I am trying to diagnose where the reads might have been filtered out. One possibility is that cutadapt is filtering reads for not having the correct structure, or perhaps too much adapter or similar.

Here is the cutadapt multiqc file:

Sample  r_processed r_with_adapters bp_processed    quality_trimmed bp_written  percent_trimmed
sample1_R2  950971  584570  118871375   12147260    77382403    34.90240774955283
sample2_R2  1058522 655457  132315250   12524937    90626084    31.507453600397533
sample3_R2  929245  557125  116155625   8035848 74496049    35.865310870653055
sample4_R2  958210  563527  119776250   9272229 83291408    30.46083175921771

I'm not very familiar with the output of cutadapt. Does the above indicate that there are many reads that are being filtered out?

Hoohm commented 4 years ago

Hey @RichardCorbett I would rather see the log file we create that is called clean_qc or even better, the yield plot if you got it.

RichardCorbett commented 4 years ago

Thanks @Hoohm,

I don't seem to have either of those:

find . ./samples ./samples/sample1 ./samples/sample1/top_barcodes.csv ./samples/sample1/barcodes.csv ./samples/sample1/trimmed_repaired_R1.fastq.gz ./samples/sample1/trimmed_repaired_R2.fastq.gz ./samples/sample1/empty_barcode_mapping.pkl ./samples/sample1/barcode_ref.pkl ./samples/sample1/barcode_ext_ref.pkl ./samples/sample1/Log.final.out ./samples/sample1/Log.out ./samples/sample1/_STARtmp find:./samples/sample1/_STARtmp': Permission denied ./samples/sample1/Log.progress.out ./samples/sample1/read ./samples/sample1/read/barcodes.tsv ./samples/sample1/read/genes.tsv ./samples/sample1/read/matrix.mtx ./samples/sample1/barcode_mapping_counts.pkl ./samples/sample1/SJ.out.tab ./samples/sample1/Unmapped.out.mate1.gz ./samples/sample1/final.bam ./samples/sample1/umi ./samples/sample1/umi/expression.long ./samples/sample1/umi/barcodes.tsv ./samples/sample1/umi/genes.tsv ./samples/sample1/umi/matrix.mtx ./logs ./logs/cutadapt ./logs/cutadapt/sample1_R2.qc.txt ./logs/cutadapt/sample1_R1.qc.txt ./logs/cutadapt/sample1.clean_qc.csv ./logs/fastqc ./logs/fastqc/sample1_R1_fastqc.html ./logs/fastqc/sample1_R1_fastqc.zip ./logs/fastqc/sample1_R2_fastqc.html ./logs/fastqc/sample1_R2_fastqc.zip ./logs/bbmap ./logs/bbmap/sample1_repair.txt ./logs/dropseq_tools ./logs/dropseq_tools/sample1_beadSubstitutionSummary.txt ./logs/dropseq_tools/sample1_beadSubstitutionReport.txt ./logs/dropseq_tools/sample1_synthesis_stats_summary.txt ./logs/dropseq_tools/sample1_hist_out_cell.txt ./logs/dropseq_tools/sample1_rna_metrics.txt ./reports ./reports/fastqc_barcodes_data ./reports/fastqc_barcodes_data/multiqc_sources.txt ./reports/fastqc_barcodes_data/multiqc_data.json ./reports/fastqc_barcodes_data/multiqc_general_stats.txt ./reports/fastqc_barcodes_data/multiqc_fastqc.txt ./reports/fastqc_barcodes_data/multiqc.log ./reports/fastqc_reads_data ./reports/fastqc_reads_data/multiqc_sources.txt ./reports/fastqc_reads_data/multiqc_data.json ./reports/fastqc_reads_data/multiqc_general_stats.txt ./reports/fastqc_reads_data/multiqc_fastqc.txt ./reports/fastqc_reads_data/multiqc.log ./reports/fastqc_barcodes.html ./reports/fastqc_reads.html ./reports/RNA_filtering_data ./reports/RNA_filtering_data/multiqc_sources.txt ./reports/RNA_filtering_data/multiqc_data.json ./reports/RNA_filtering_data/multiqc_cutadapt.txt ./reports/RNA_filtering_data/multiqc_general_stats.txt ./reports/RNA_filtering_data/multiqc.log ./reports/RNA_filtering.html ./reports/barcode_filtering_data ./reports/barcode_filtering_data/multiqc_sources.txt ./reports/barcode_filtering_data/multiqc_data.json ./reports/barcode_filtering_data/multiqc_cutadapt.txt ./reports/barcode_filtering_data/multiqc_general_stats.txt ./reports/barcode_filtering_data/multiqc.log ./reports/barcode_filtering.html ./reports/star_data ./reports/star_data/multiqc_sources.txt ./reports/star_data/multiqc_star.txt ./reports/star_data/multiqc_data.json ./reports/star_data/multiqc_general_stats.txt ./reports/star_data/multiqc.log ./reports/star.html ./plots ./plots/rna_metrics ./plots/knee_plots ./plots/knee_plots/sample1_knee_plot.pdf ./summary ./summary/read ./summary/read/barcodes.tsv ./summary/read/genes.tsv ./summary/read/matrix.mtx ./summary/umi ./summary/umi/barcodes.tsv ./summary/umi/genes.tsv ./summary/umi/matrix.mtx `

Hoohm commented 4 years ago

This is the one /logs/cutadapt/sample1.clean_qc.csv What about the plots folder?

On Wed, Apr 1, 2020 at 4:16 PM Richard Corbett notifications@github.com wrote:

Thanks @Hoohm https://github.com/Hoohm,

I don't seem to have either of those:

find . ./samples ./samples/sample1 ./samples/sample1/top_barcodes.csv ./samples/sample1/barcodes.csv ./samples/sample1/trimmed_repaired_R1.fastq.gz ./samples/sample1/trimmed_repaired_R2.fastq.gz ./samples/sample1/empty_barcode_mapping.pkl ./samples/sample1/barcode_ref.pkl ./samples/sample1/barcode_ext_ref.pkl ./samples/sample1/Log.final.out ./samples/sample1/Log.out ./samples/sample1/_STARtmp find: ./samples/sample1/_STARtmp': Permission denied ./samples/sample1/Log.progress.out ./samples/sample1/read ./samples/sample1/read/barcodes.tsv ./samples/sample1/read/genes.tsv ./samples/sample1/read/matrix.mtx ./samples/sample1/barcode_mapping_counts.pkl ./samples/sample1/SJ.out.tab ./samples/sample1/Unmapped.out.mate1.gz ./samples/sample1/final.bam ./samples/sample1/umi ./samples/sample1/umi/expression.long ./samples/sample1/umi/barcodes.tsv ./samples/sample1/umi/genes.tsv ./samples/sample1/umi/matrix.mtx ./logs ./logs/cutadapt ./logs/cutadapt/sample1_R2.qc.txt ./logs/cutadapt/sample1_R1.qc.txt ./logs/cutadapt/sample1.clean_qc.csv ./logs/fastqc ./logs/fastqc/sample1_R1_fastqc.html ./logs/fastqc/sample1_R1_fastqc.zip ./logs/fastqc/sample1_R2_fastqc.html ./logs/fastqc/sample1_R2_fastqc.zip ./logs/bbmap ./logs/bbmap/sample1_repair.txt ./logs/dropseq_tools ./logs/dropseq_tools/sample1_beadSubstitutionSummary.txt ./logs/dropseq_tools/sample1_beadSubstitutionReport.txt ./logs/dropseq_tools/sample1_synthesis_stats_summary.txt ./logs/dropseq_tools/sample1_hist_out_cell.txt ./logs/dropseq_tools/sample1_rna_metrics.txt ./reports ./reports/fastqc_barcodes_data ./reports/fastqc_barcodes_data/multiqc_sources.txt ./reports/fastqc_barcodes_data/multiqc_data.json ./reports/fastqc_barcodes_data/multiqc_general_stats.txt ./reports/fastqc_barcodes_data/multiqc_fastqc.txt ./reports/fastqc_barcodes_data/multiqc.log ./reports/fastqc_reads_data ./reports/fastqc_reads_data/multiqc_sources.txt ./reports/fastqc_reads_data/multiqc_data.json ./reports/fastqc_reads_data/multiqc_general_stats.txt ./reports/fastqc_reads_data/multiqc_fastqc.txt ./reports/fastqc_reads_data/multiqc.log ./reports/fastqc_barcodes.html ./reports/fastqc_reads.html ./reports/RNA_filtering_data ./reports/RNA_filtering_data/multiqc_sources.txt ./reports/RNA_filtering_data/multiqc_data.json ./reports/RNA_filtering_data/multiqc_cutadapt.txt ./reports/RNA_filtering_data/multiqc_general_stats.txt ./reports/RNA_filtering_data/multiqc.log ./reports/RNA_filtering.html ./reports/barcode_filtering_data ./reports/barcode_filtering_data/multiqc_sources.txt ./reports/barcode_filtering_data/multiqc_data.json ./reports/barcode_filtering_data/multiqc_cutadapt.txt ./reports/barcode_filtering_data/multiqc_general_stats.txt ./reports/barcode_filtering_data/multiqc.log ./reports/barcode_filtering.html ./reports/star_data ./reports/star_data/multiqc_sources.txt ./reports/star_data/multiqc_star.txt ./reports/star_data/multiqc_data.json ./reports/star_data/multiqc_general_stats.txt ./reports/star_data/multiqc.log ./reports/star.html ./plots ./plots/rna_metrics ./plots/knee_plots ./plots/knee_plots/sample1_knee_plot.pdf ./summary ./summary/read ./summary/read/barcodes.tsv ./summary/read/genes.tsv ./summary/read/matrix.mtx ./summary/umi ./summary/umi/barcodes.tsv ./summary/umi/genes.tsv ./summary/umi/matrix.mtx `

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/Hoohm/dropSeqPipe/issues/103#issuecomment-607276186, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAJVO2A4J4JKBWIGIZGK3Q3RKNEELANCNFSM4LYGPNSQ .

--

Roelli Patrick Division of Animal Physiology and Immunology TUM School of Life Sciences Weihenstephan Technische Universität München Weihenstephaner Berg 3 85354 Freising GermanyBCF, Swiss Institute of BioinformaticsBâtiment Génopode, Quartier SorgeUniversity of Lausanne1015 LausanneSwitzerland

https://github.com/Hoohm https://github.com/Hoohm

RichardCorbett commented 4 years ago

The plots folder is there, I think that some plots failed, perhaps due to an R library problem on the server. I do have the plots for the same data pre-downsampled (ie. not just 100 cells at 10K reads, but all the data for all the cells:

image

clean_qc file:

Adapter,Sequence,Pair,Count Illumina_Universal,AGATCGGAAGAG,R1,5 PrefixNX/1,AGATGTGTATAAGAGACAG,R1,0 Trans1,TCGTCGGCAGCGTCAGATGTGTATAAGAGACAG,R1,0 Trans1_rc,CTGTCTCTTATACACATCTGACGCTGCCGACGA,R1,697 Trans2,GTCTCGTGGGCTCGGAGATGTGTATAAGAGACAG,R1,0 Trans2_rc,CTGTCTCTTATACACATCTCCGAGCCCACGAGAC,R1,62525 polyA,AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA,R1,6129 polyT,TTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTT,R1,780065 polyC,CCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCC,R1,5133 polyG,GGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGG,R1,434 drop-seq,GTACTCTGCGTTGATACCACTGCTTCCGCGGACAGGC,R1,0 Nextera,CTGTCTCTTATACACATCT,R1,104 Illumina_Universal,AGATCGGAAGAG,R2,843 PrefixNX/1,AGATGTGTATAAGAGACAG,R2,627 Trans1,TCGTCGGCAGCGTCAGATGTGTATAAGAGACAG,R2,44 Trans1_rc,CTGTCTCTTATACACATCTGACGCTGCCGACGA,R2,815 Trans2,GTCTCGTGGGCTCGGAGATGTGTATAAGAGACAG,R2,31 Trans2_rc,CTGTCTCTTATACACATCTCCGAGCCCACGAGAC,R2,25 polyA,AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA,R2,505027 polyT,TTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTT,R2,2125 polyC,CCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCC,R2,7866 polyG,GGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGG,R2,3347 drop-seq,GTACTCTGCGTTGATACCACTGCTTCCGCGGACAGGC,R2,63726 Nextera,CTGTCTCTTATACACATCT,R2,94

thanks!

RichardCorbett commented 4 years ago

Hi @Hoohm, Do the plot or data above help you understand how we're only seeing about 3500 reads per cell in our read-based matrix.mtx file? If I understand the plot correctly I expected to see about twice the number that we're seeing. Any thoughts would be appreciated.

Hoohm commented 4 years ago

The best plot to help would be the yield plot, but for some reason it crashed. What about the star multiqc report?

On Fri, 3 Apr 2020, 20:27 Richard Corbett, notifications@github.com wrote:

Hi @Hoohm https://github.com/Hoohm, Do the plot or data above help you understand how we're only seeing about 3500 reads per cell in our read-based matrix.mtx file? If I understand the plot correctly I expected to see about twice the number that we're seeing. Any thoughts would be appreciated.

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/Hoohm/dropSeqPipe/issues/103#issuecomment-608592913, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAJVO2GXOHI5ZNV7SBQKU4DRKYTBLANCNFSM4LYGPNSQ .

RichardCorbett commented 4 years ago

Star results... image

Hoohm commented 4 years ago

When you say 3000 reads, to mean that's the total for each cell? 3000 reads or umi?

On Fri, 3 Apr 2020, 23:10 Richard Corbett, notifications@github.com wrote:

Star results... [image: image] https://user-images.githubusercontent.com/9285915/78404934-5347d100-75b4-11ea-9678-a71fe39f6d26.png

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/Hoohm/dropSeqPipe/issues/103#issuecomment-608667744, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAJVO2F6ZEHJ4M7DTIB2BT3RKZGD7ANCNFSM4LYGPNSQ .

RichardCorbett commented 4 years ago

The 3000 is the total reads per cell. I also have the UMI counts which average down around 1000 counts per cell.

Hoohm commented 4 years ago

From the cleanqc I see that around 800'000 of your R1 has been trimmed, that means your left with around 200'000 reads after trimming. From there, you loose around 20% so you end up with 150'000 so, assuming similar distribution of your data for all the cells, this would be around 1500 reads per cell.

RichardCorbett commented 4 years ago

Great. Would it be correct to interpret this as the reads have lots of polyA/T included that don't contain enough 3' RNA to be useful?

RichardCorbett commented 4 years ago

I just heard that our libraries for this test were 100-150bp shorter than intended which seems to be concordant with the interpretation that many of our read pairs did not contain significant mRNA signal adjacent to the polyA sequence as shown by these lines from the clean_qc file:

polyT,TTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTT,R1,780065 polyA,AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA,R2,505027

Does this seem to make sense to you?

thanks RIchard

Hoohm commented 4 years ago

Yes, seems to be inline with my interpretation.