Customized QC: use the majority barcode to name the reads and sample

Original Question

In a fastq file, the barcode of some reads differs from that of the other reads.
Sometimes the barcode of the first read is a minority barcode (different from the majority barcode recorded in the current fastq file).

Wen's comments

It seems to me that the small proportion of reads with a single-base difference in the barcode section might be related to the demultiplexing tool used to create these fastq files? If I remember correctly, we allow about 1 base mismatch for every 10 bases of index when we demultiplex. For this dual index of 16 bases, if they use a similar threshold, they might end up with some barcodes that do not match 100%? Since our barcode sequences are from the lab, they are surely the correct sequences. Here I don’t think Clifton passed that info to us, so we have to infer it from the fastq files.

Solution

Collect the first five reads and use the majority barcode among them as the barcode for the current fastq file (used to name the reads and the sample).
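
A minimal Python sketch of this majority vote is shown below. It is not the pipeline's actual code. It assumes Illumina-style headers in which the barcode is the last colon-separated field (for example "... 1:N:0:ACGTACGT+TGCATGCA") and four lines per read; the function names barcode_from_header and majority_barcode are made up for illustration.

```python
from collections import Counter
from itertools import islice


def barcode_from_header(header):
    """Return the barcode from a fastq header line.

    Assumption: Illumina-style header whose last colon-separated
    field is the index sequence, e.g. '... 1:N:0:ACGTACGT+TGCATGCA'.
    """
    return header.strip().rsplit(":", 1)[-1]


def majority_barcode(fastq_lines, n_reads=5):
    """Return the most common barcode among the first n_reads reads.

    fastq_lines is any iterable of lines (an open file or a list).
    Assumption: each read occupies exactly four lines, so header
    lines sit at positions 0, 4, 8, ...
    """
    headers = islice(fastq_lines, 0, n_reads * 4, 4)
    counts = Counter(barcode_from_header(h) for h in headers)
    if not counts:
        raise ValueError("no reads found")
    # On a tie, most_common keeps the barcode that was seen first.
    return counts.most_common(1)[0][0]
```

With an uncompressed fastq file this could be called on the open file handle; a gzipped file would need to be opened in text mode with gzip.open first. With five reads and typically only one or two barcode variants, a tie is unlikely, but if one occurs the barcode seen first wins.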

Most Valuable Comments from Kristie

Hi Xin- I’m assuming this is externally generated data, since I don’t see the flowcell id in our lims?

So a few things could be causing the single-base mismatches you’re seeing. One is PCR error, either during library prep or during clustering on the flowcell. The other is sequencing error (i.e., calling the wrong base during sequencing). Most barcode sets are actually designed so that even if your barcode read(s) have an error, they are unique enough that they won’t be confused with another barcode in the set. And most demultiplexing defaults (including ours, I believe) allow for 1 base mismatch in the barcode compared to the expected sequence.

So I think the answer is that yes, this happens. It’s caused by errors that are a natural part of the process (so, unintentional, but expected). And as long as there aren’t >1 mismatches per barcode, it is generally “allowable” and the reads do not need to be discarded.

For reference, if you look at the html demux reports in one of our recent flowcell directories, you can see that about 3 percent of our binned reads have one mismatch in the barcode (though it varies a bit from sample to sample):

T:\DCEG\CGF\Sequencing\Illumina\HiSeq\PostRun_Analysis\Data\210615_A00423_0127_AHCJYGDSX2\CASAVA\L1\Reports\html\HCJYGDSX2\all\all\all

Hope this helps!
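
The one-mismatch rule described in the comments above can be sketched in Python as follows. This is only an illustration, not the logic of any particular demultiplexing tool. It assumes that a dual index is written as two sequences joined by "+" and that the mismatch limit applies to each index separately; the names count_mismatches and is_allowable are hypothetical.

```python
def count_mismatches(observed, expected):
    """Count positions where two equal-length sequences differ."""
    if len(observed) != len(expected):
        raise ValueError("sequences must have the same length")
    return sum(a != b for a, b in zip(observed, expected))


def is_allowable(observed, expected, max_mismatches=1):
    """Check an observed barcode against the expected barcode.

    Assumption: a dual index is written as 'INDEX1+INDEX2' and the
    limit of max_mismatches applies to each index on its own.
    An 'N' base call counts as a mismatch.
    """
    obs_parts = observed.split("+")
    exp_parts = expected.split("+")
    if len(obs_parts) != len(exp_parts):
        raise ValueError("barcodes must have the same number of indexes")
    return all(
        count_mismatches(o, e) <= max_mismatches
        for o, e in zip(obs_parts, exp_parts)
    )
```

Comparing each read's barcode against the majority barcode with a check like this separates reads that carry an expected single-base error from reads whose barcode differs by more than that.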

NCI-CGR / IlluminaSequencingAnalysis