UMI counting bug when reads are duplicated

ShuyangXu commented 2 years ago

Sorry for publishing this issue before I had finished editing it.

Hi,

Recently, I ran cellranger on a faulty FASTQ file that contained some duplicated reads (same ID, same sequence).

I then filtered them out and reran cellranger, but the UMI counts in the two results differ. This is odd because, as is well known, a UMI count is the number of unique UMIs, not the number of reads. The duplicated reads should therefore collapse into the same UMI and contribute nothing extra.
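The expectation above can be sketched as follows. This is a toy illustration, not Cell Ranger's implementation; the function name `count_unique_umis` and the barcodes, UMIs, and gene name are made up for the example.

```python
from collections import defaultdict


def count_unique_umis(reads):
    """Count distinct UMIs per (barcode, gene).

    `reads` is an iterable of (barcode, umi, gene) tuples. This is a toy
    model of the expected behaviour, not Cell Ranger's actual code.
    """
    umis = defaultdict(set)
    for barcode, umi, gene in reads:
        umis[(barcode, gene)].add(umi)  # a set ignores repeated UMIs
    return {key: len(values) for key, values in umis.items()}


# Hypothetical reads: two share a UMI, one has a different UMI.
reads = [
    ("AAACCTGA", "TTGACC", "GeneA"),
    ("AAACCTGA", "TTGACC", "GeneA"),
    ("AAACCTGA", "CCGTAA", "GeneA"),
]

original = count_unique_umis(reads)
doubled = count_unique_umis(reads + reads)  # every read appears twice
print(original == doubled)  # prints True
```

Under this model, feeding every read twice leaves the counts unchanged, which is the behaviour I expected from cellranger.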

I then used the example FASTQ files that ship with cellranger to reproduce the issue.

reproduce issue

create a fastq file that contains duplicated reads

cat /path/to/cellranger/external/cellranger_tiny_fastq/tinygex_S1_L001_R1_001.fastq.gz /path/to/cellranger/external/cellranger_tiny_fastq/tinygex_S1_L001_R1_001.fastq.gz > dup_S1_L001_R1_001.fastq.gz
cat /path/to/cellranger/external/cellranger_tiny_fastq/tinygex_S1_L001_R2_001.fastq.gz /path/to/cellranger/external/cellranger_tiny_fastq/tinygex_S1_L001_R2_001.fastq.gz > dup_S1_L001_R2_001.fastq.gz

run cellranger

/path/to/cellranger/cellranger count --transcriptome /path/to/cellranger/external/cellranger_tiny_ref/3.0.0/ --fastqs ./ --sample dup --id dup
/path/to/cellranger/cellranger count --transcriptome /path/to/cellranger/external/cellranger_tiny_ref/3.0.0/ --fastqs /path/to/cellranger/external/cellranger_tiny_fastq --sample tinygex --lanes 1 --id normal

result

normal

web_summary.html

matrix.mtx.gz

dup

web_summary.html

matrix.mtx.gz

discussion

a) First, as expected, dup's Number of Reads is doubled. However, the UMI counts are doubled as well.

Then I looked in dup's molecule_info.h5

molecule_info.h5/umi

molecule_info.h5/barcode_idx

The same UMI within the same barcode is doubled.
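One way to quantify this is to look for repeated (barcode_idx, umi, feature_idx) rows. A minimal sketch, assuming the three datasets have already been loaded from molecule_info.h5 as equal-length sequences (for example with h5py, which is not used here); the function name and the sample values are hypothetical, and other columns such as gem_group are ignored.

```python
from collections import Counter


def repeated_molecules(barcode_idx, umi, feature_idx):
    """Return {(barcode_idx, umi, feature_idx): n} for rows seen more than once."""
    rows = Counter(zip(barcode_idx, umi, feature_idx))
    return {row: n for row, n in rows.items() if n > 1}


# Made-up values standing in for the molecule_info.h5 datasets.
barcode_idx = [0, 0, 1, 1]
umi = [101, 101, 101, 202]
feature_idx = [7, 7, 7, 7]

repeats = repeated_molecules(barcode_idx, umi, feature_idx)
print(repeats)
```

An empty result would mean every molecule row is unique; any entry in the result is a barcode/UMI/feature combination that was recorded more than once.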

b) I also noticed that cellranger throws an error

Duplicate FASTQs found between Sample XXX and Sample XXX

when the duplicated reads are supplied in two different FASTQ files (rather than in a single file, as above).

In conclusion, I wonder whether there are two issues: a) is UMI counting incorrect when reads are duplicated? b) is the case of duplicated reads within a single file not handled?

Thank you

evolvedmicrobe commented 2 years ago

??

ShuyangXu commented 2 years ago

??

Sorry, there was a typo earlier.

I have edited the issue again.

evolvedmicrobe commented 2 years ago

Hi @ShuyangXu, based on your command, I think you might be using an outdated version of Cell Ranger, but in any event I suspect this is due to duplicate QNAMEs in your FASTQ file.

Earlier versions of Cell Ranger assumed that the input FASTQ files were produced by Illumina and so would not have duplicated QNAMEs. Given a set of reads with a given barcode/UMI combination, we previously marked the read with the "lowest" QNAME as the one that should be counted. However, we discovered (thanks to a report like this one) that customers would accidentally create FASTQ files with duplicated QNAMEs. This duplication led to a given barcode/UMI being double counted, because the assumption that each read name is unique is violated in data that has multiple copies of the same read name. Does your FASTQ file contain non-unique QNAMEs? For example, below are two occurrences of the same QNAME, which would cause this issue if they appeared in one FASTQ file:

@A01182:88:HCWMWDSX3:1:2626:20365:26052 1:N:0:TCCAACAACG+AAACCCGGAC
@A01182:88:HCWMWDSX3:1:2626:20365:26052 1:N:0:TCCAACAACG+AAACCCGGAC

Are all the QNAMEs in your FASTQ file unique?
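The failure mode described above can be illustrated with a toy model. It is only a sketch of the stated idea (count the read carrying the lowest QNAME in each barcode/UMI group), not Cell Ranger's actual code; the function name, the second QNAME, and the barcode and UMI sequences are hypothetical.

```python
from collections import defaultdict


def count_by_lowest_qname(reads):
    """Toy model: within each (barcode, umi) group, count every read whose
    QNAME equals the group's lowest QNAME. With unique QNAMEs this is 1."""
    groups = defaultdict(list)
    for qname, barcode, umi in reads:
        groups[(barcode, umi)].append(qname)
    return {key: qnames.count(min(qnames)) for key, qnames in groups.items()}


# Two reads from the same molecule, each with its own QNAME.
reads = [
    ("A01182:88:HCWMWDSX3:1:2626:20365:26052", "AAACCTGA", "TTGACC"),
    ("A01182:88:HCWMWDSX3:1:2626:20365:26053", "AAACCTGA", "TTGACC"),
]

unique = count_by_lowest_qname(reads)
doubled = count_by_lowest_qname(reads + reads)  # every QNAME appears twice
print(unique, doubled)
```

With unique QNAMEs the group contributes one count; once every read appears twice, two reads tie for the lowest QNAME and the same barcode/UMI is counted twice.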

ShuyangXu commented 2 years ago

Thanks for your reply.

As I mentioned at the beginning, not all the QNAMEs in my data are unique. A colleague combined the data by mistake.
This problem can also be reproduced with Cell Ranger v7.0.0 using the example data that ships with it (./external/cellranger_tiny_fastq/).
Considering speed and performance, I can understand why you use such a trick to count UMIs, but I think it needs to be made more robust.
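For anyone hitting the same problem, a user-side workaround is to drop repeated QNAMEs before running cellranger. A minimal sketch, assuming gzipped four-line FASTQ records and enough memory to hold every QNAME; the function names and the tiny demo file are hypothetical, and this is not part of Cell Ranger.

```python
import gzip
import os
import tempfile


def iter_fastq(path):
    """Yield (header, sequence, plus, quality) records from a gzipped FASTQ."""
    with gzip.open(path, "rt") as handle:
        while True:
            record = [handle.readline().rstrip("\n") for _ in range(4)]
            if not record[0]:  # end of file
                return
            yield tuple(record)


def drop_repeated_qnames(src, dst):
    """Copy src to dst, keeping only the first record for each QNAME.

    Returns the number of records dropped. QNAMEs are held in memory.
    """
    seen, dropped = set(), 0
    with gzip.open(dst, "wt") as out:
        for record in iter_fastq(src):
            qname = record[0].split()[0]  # header text before the first space
            if qname in seen:
                dropped += 1
                continue
            seen.add(qname)
            out.write("\n".join(record) + "\n")
    return dropped


# Demo on a tiny made-up file in which one record appears twice.
first = "@read1 1:N:0:ACGT\nACGT\n+\nFFFF\n"
second = "@read2 1:N:0:ACGT\nTTGA\n+\nFFFF\n"
workdir = tempfile.mkdtemp()
src = os.path.join(workdir, "dup.fastq.gz")
dst = os.path.join(workdir, "dedup.fastq.gz")
with gzip.open(src, "wt") as handle:
    handle.write(first + second + first)

dropped = drop_repeated_qnames(src, dst)
kept = [record[0] for record in iter_fastq(dst)]
print(dropped, kept)
```

Python's gzip module reads concatenated gzip members, so a file built with cat as in the reproduction steps is read in full. Running the filter separately on R1 and R2 keeps the pairs in sync only if both files contain the same records in the same order, which I am assuming here.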
