Problem of quantification with htseq

lkw159159 commented 2 years ago

Dear UMI-tools group,

hello, when I tried to use UMI-tools to deduplicate and quantificated with htseq, there are some issues in my result.

Like this image, there are so many missing mate- reads that are not quantified.

here is my flow & code

1. twice fastp (Adapter & low quality reads trimming -> global 11bp trimming because of our own pipeline) fastp -i CTR12_1.fastq.gz -I CTR12_2.fastq.gz -o CTR12_1.tr1.fastq.gz -O CTR12_2.tr1.fastq.gz -h ../UMI-tools/CTR12.tr1.fastp.html -j ../UMI-tools/CTR12.tr1.fastp.json fastp -i CTR12_1.tr1.fastq.gz -I CTR12_2.tr1.fastq.gz -o ../UMI-tools/CTR12_1.tr11.fastq.gz -O ../UMI-tools/CTR12_2.tr11.fastq.gz -t 11 -h ../UMI-tools/CTR12.tr11.fastp.html -j ../UMI-tools/CTR12.tr11.fastp.json

2. UMI-tools extract umi_tools extract -I SMC1_1.tr11.fastq.gz -S SMC1_1.pr.fastq.gz --read2-in=SMC1_2.tr11.fastq.gz --read2-out=SMC1_2.pr.fastq.gz --log=SMC1_processed.log --either-read --bc-pattern="(?P.{1})(?P.{6}){s<=1}(?P.{4}).+" --bc-pattern2="(?P.{1})(?P.{6}){s<=1}(?P.{4}),+" --extract-method=regex

3. hisat2 alignment hisat2 -q -x ../../../ref/hisat2ref/genome_snp_tran -1 SMC1_1.pr.fastq.gz -2 SMC1_2.pr.fastq.gz -S SMC1.sam

4. samtools (sam to bam, sort, index) samtools view -bS SMC1.sam > SMC1.bam samtools sort SMC1.bam -o SMC1.sorted.bam samtools index SMC1.sorted.bam

5. UMI-tools dedup umi_tools dedup -I SMC1.sorted.bam --paired --output-stats=SMC1_ded -S SMC1_ded.bam

6. htseq-count quantification htseq-count -f bam -q -s no -i gene_id SMC4_ded.sorted.bam SMC5_ded.sorted.bam Homo_sapiens.GRCh38.106.chr.gtf.gz >counts.txt

+I read another issue about RSEM quantification and tried umi_tools prepare-for-rsem -I SMC4_ded_sort_n.bam --stdout SMC4_ded_ready.bam

but there was a new error message

Thanks

IanSudbery commented 2 years ago

From the htseq-count manual:

-r , --order= For paired-end data, the alignment have to be sorted either by read name or by alignment position. If your data is not sorted, use the samtools sort function of samtools to sort it. Use this option, with name or pos for to indicate how the input data has been sorted. The default is name. If name is indicated, htseq-count expects all the alignments for the reads of a given read pair to appear in adjacent records in the input data. For pos, this is not expected; rather, read alignments whose mate alignment have not yet been seen are kept in a buffer in memory until the mate is found. While, strictly speaking, the latter will also work with unsorted data, sorting ensures that most alignment mates appear close to each other in the data and hence the buffer is much less likely to overflow.

UMI-tools outputs reads in position sorted order, so you either need to tell htseq that they are sorted that way with -r pos or sort the BAM output by UMI-tools into name order with samtools sort -n

lkw159159 commented 2 years ago

Dear Ian,

Unfortunately, I've not solved this problem.

It's the data that were generated by UMI-tools index --> Hisat2 alignment --> UMI-tools dedup --> HTseq count assignment.

Still, I got only 50% of assignment rate from them.

even I tried using samtools sort -n, I got about 2 times size-increased BAM file and it's assignment read count is similar to others.

IanSudbery commented 2 years ago

Sorry I've not replied quicker, i've been on holiday.

Can we check that the reads are paired going in?

What do your read headers look like?

i.e:

$ zcat CTR12_1.fastq.gz | head -n1
$ zcat CTR12_2.fastq.gz | head -n1

$zcat ../UMI-tools/CTR12_1.tr11.fastq.gz | head -n1
$zcat  ../UMI-tools/CTR12_2.tr11.fastq.gz | head -n1

$ zcat SMC1_1.pr.fastq.gz | head -n1
$ SMC1_2.pr.fastq.gz | head -n1

CGATOxford / UMI-tools

Problem of quantification with htseq #530