Teichlab / scg_lib_structs

Collections of library structure and sequence of popular single cell genomic methods

Truseq Read1 and mixed single/dual demultiplexing #12

Open plijnzaad opened 2 years ago

plijnzaad commented 2 years ago

Dear Xi,

many many thanks for this brilliant resource! I had one question: on https://teichlab.github.io/scg_lib_structs/methods_html/Illumina.html, the 'Truseq Read1' sequences under 'Truseq Single Index Library' and 'Truseq Dual Index Library' differ slightly. The first one has AGAAAGGGATGTGCTGCGAGAAGGCTAGA whereas the second one is ACACTCTTTCCCTACACGACGCTCTTCCGATCT (so has an extra 'ACAC' in front). Do you know which is which?

I'm asking because we often pool single and dual index libraries into one flowcell. If I then demultiplex (using bcl2fastq v2.20) the TruSeq Single Index library as if it were a dual index library, I would expect the 'fake' second index (i5) to be GTGTAGATCT, i.e. the first 10 nucleotides of the (reverse complement of the) Illumina P5 adapter. Instead I get GGGGGGGGGG. Is this because there is no proper primer for the 'fake' i5 index read?

Many thanks! Philip

dbrg77 commented 2 years ago

Hi Philip,

Thanks. First, note that the P5 always ends with ACAC.

For TruSeq single, Read 1 is: 5'- TCTTTCCCTACACGACGCTCTTCCGATCT -3'

For TruSeq dual, Read 1 is: 5'- ACACTCTTTCCCTACACGACGCTCTTCCGATCT -3', with an extra ACAC at the 5' end, compared to the single one.
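
The relationship between the sequences quoted so far can be checked with a few lines of Python. This is only a minimal sketch: the constant and function names are made up for illustration, and the sequences are copied from this thread. It also suggests that the first sequence quoted in the question is the base-by-base complement of the single-index Read 1 (possibly the bottom strand as drawn on the page, which is an assumption).

```python
# Sequences copied from this thread (written 5' -> 3' unless noted).
READ1_SINGLE = "TCTTTCCCTACACGACGCTCTTCCGATCT"
READ1_DUAL = "ACACTCTTTCCCTACACGACGCTCTTCCGATCT"
# First sequence quoted in the original question.
QUOTED_IN_QUESTION = "AGAAAGGGATGTGCTGCGAGAAGGCTAGA"

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def complement(seq: str) -> str:
    """Base-by-base complement, without reversing the sequence."""
    return seq.translate(COMPLEMENT)


# True if the dual-index Read 1 is the single-index Read 1 with ACAC prepended.
dual_is_single_plus_acac = READ1_DUAL == "ACAC" + READ1_SINGLE

# True if the sequence quoted in the question is the complement of Read 1 (single).
quoted_is_complement = complement(QUOTED_IN_QUESTION) == READ1_SINGLE

print(dual_is_single_plus_acac, quoted_is_complement)
```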

If i5 is sequenced using the top strand, you need an i5 sequencing primer. The Illumina kit uses this primer to sequence i5: 5'- AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT -3'. This works for both single and dual libraries.

Note the GTGT at the 3' end of the sequencing primer will anneal to the 3' end of P5, which is ACAC. Therefore, for your single index library, the 'fake' i5 should be AGATCTCGGT. This is my understanding.

The GGGGGGGGGG seems strange, and it seems as if the polymerase hits the flow cell and gives no signal. What is the percentage of the polyG in your single index sample?

Xi

plijnzaad commented 2 years ago

Hi Xi,

thank you for your quick reply! The large majority of single indexed reads have GGGGGGGGGG as the fake second index read. Demultiplexing with this as Index2 gives yields per library that are close to what we expected based on the lab measurements. The downstream analyses of these libraries are also fine (this is on a Novaseq6000, BTW). The number of unassigned reads is 5%, which is partly due to PhiX spiking. Amongst the Unassigned are a few AGATCTCGGT index2 reads, but their index1 is nearly always GGGGGGGGGG (and a small minority is one of the bona fide single-index barcodes). In other words, I think I can trust the data, it's just that the GGGGGGGGGG index2 for the single-indexed libraries is puzzling. BTW the numbers of dual indexed reads also correspond to what we expected based on the lab measurements.

I was thinking it may be the result of some implicit adapter trimming that bcl2fastq does, but the logfiles say nothing has been trimmed, and running bcl2fastq with trimadapter or maskadapter options doesn't change anything. If the Illumina kit did not have the proper primer for reading index2 in the case of our single-indexed libraries, it would explain the absence of signal, hence my question. But I guess I can live with it for now :-) Thanks, Philip

dbrg77 commented 2 years ago

I see. That is strange, and I will leave this open to see if other people have some explanations.

plijnzaad commented 2 years ago

I have pointed Illumina Tech Support to this discussion, maybe they can shed light on it (hopefully in this thread :-)

edit: I have been in contact with Illumina support, and they say the GGGGGGGGGG is exactly as expected because the index 2 primer is not able to anneal in the single indexed libraries. They could not provide full details because some of their primer sequences are proprietary. This is fine by me, so I guess this issue can be considered resolved :-)

plijnzaad commented 2 years ago

We're not done yet ... I had a different run today, where I used GGGGGGGGGG as the 'fake' i5 in a flowcell of otherwise dual-indexed samples, as we (and Illumina) concluded was correct. This time the single-indexed reads did not come back under that index. Looking at the top Undetermined reads, they had indexes like TATGATTCAT+GTGTAGATCT. The first of the pair is the first (of four) SI-GA-A4 oligos (Chromium shared indexes, but extended with AT to length 10), the second of the pair is what my original expectation was (see https://github.com/Teichlab/scg_lib_structs/issues/12#issue-1016040659). The numbers of reads for the four shared sample indexes add up to exactly what we expect, so for now I will just assume they are correct. In the future, I will use both GGGGGGGGGG and GTGTAGATCT to demultiplex such samples ...
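
For anyone wanting to run the same check on their own Undetermined reads, the tally described above could be sketched roughly as follows. It assumes the usual bcl2fastq header layout, in which the last colon-separated field of the read header holds index1+index2. All function names and the example header are hypothetical.

```python
import gzip
from collections import Counter
from typing import Iterable, Iterator, Tuple

POLY_G = "G" * 10
# The three 'fake' i5 readouts discussed in this thread.
CANDIDATE_FAKE_I5 = {POLY_G, "GTGTAGATCT", "AGATCTCGGT"}


def index_pair(header: str) -> Tuple[str, str]:
    """Return (index1, index2) from a header ending in ':INDEX1+INDEX2'."""
    barcode = header.rstrip().rsplit(":", 1)[-1]
    i7, _, i5 = barcode.partition("+")
    return i7, i5


def tally_index_pairs(headers: Iterable[str]) -> Counter:
    """Count how often each (index1, index2) combination occurs."""
    return Counter(index_pair(h) for h in headers)


def fake_i5_totals(pair_counts: Counter) -> Counter:
    """Sum read counts per candidate fake i5, ignoring all other index2 values."""
    totals = Counter()
    for (_, i5), n in pair_counts.items():
        if i5 in CANDIDATE_FAKE_I5:
            totals[i5] += n
    return totals


def fastq_headers(path: str) -> Iterator[str]:
    """Yield every fourth line (the header) of a gzipped FASTQ file."""
    with gzip.open(path, "rt") as handle:
        for line_number, line in enumerate(handle):
            if line_number % 4 == 0:
                yield line
```

Calling tally_index_pairs(fastq_headers(path)).most_common(20) on an Undetermined FASTQ file would then list the most frequent index combinations, and fake_i5_totals shows which of the three candidates actually occurs in a given run.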

dbrg77 commented 2 years ago

Hi Philip,

Sorry for the late reply. I guess the i5 sequence readout on the TruSeq Single Index library really depends on the primer sequence at the P5 side. From the latest Illumina Adapter Sequence guide (Document # 1000000002694 v16), the P5 side of the sequence is (see Page 51, TruSeq Universal Adapter):

5'- AATGATACGGCGACCACCGAGATCTACACTCTTTCCCTACACGACGCTCTTCCGATCT -3'

I think the primer used to sequence i5 should be 5'- AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT -3', meaning the fake i5 index should be AGATCTCGGT. However, based on your data, either the sequencing primer is 5'- AGATCGGAAGAGCGTCGTGTAGGGAAAGA -3' or you have an extra ACAC in the primer. I'm not sure ...
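
The two possibilities can be compared with a small simulation. This is a simplified sketch, not a description of the actual chemistry: it assumes the primer anneals by exact match to the top strand of the TruSeq Universal Adapter quoted above and is extended towards the 5' end of that strand. The function names are invented for illustration.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")

# TruSeq Universal Adapter as quoted above, 5' -> 3'.
TRUSEQ_UNIVERSAL = "AATGATACGGCGACCACCGAGATCTACACTCTTTCCCTACACGACGCTCTTCCGATCT"

PRIMER_WITH_GTGT = "AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT"
PRIMER_WITHOUT_GTGT = "AGATCGGAAGAGCGTCGTGTAGGGAAAGA"


def revcomp(seq: str) -> str:
    """Reverse complement of a DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def index2_readout(template: str, primer: str, cycles: int = 10) -> str:
    """Bases read when `primer` is extended along `template`.

    The primer binds where its reverse complement occurs in the template;
    extension then copies the template bases upstream of that site.
    """
    site = template.find(revcomp(primer))
    if site < 0:
        raise ValueError("primer does not anneal to this template")
    upstream = template[:site]
    return revcomp(upstream)[:cycles]


for name, primer in [("with GTGT", PRIMER_WITH_GTGT),
                     ("without GTGT", PRIMER_WITHOUT_GTGT)]:
    print(name, index2_readout(TRUSEQ_UNIVERSAL, primer))
```

Under these assumptions the primer ending in GTGT gives AGATCTCGGT and the shorter one gives GTGTAGATCT, matching the two readouts discussed in this thread. The model says nothing about why GGGGGGGGGG appears.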

No matter what the real situation is, I still don't understand why it leads to GGGGGGGGGG sometimes ...

Anyway, I will leave this open in case other people have explanations :-)

Xi

shengzha commented 2 years ago

Many thanks for the great work! Just a minor question: isn't the TruSeq Index 1 sequencing primer GATCGGAAGAGCACACGTCTGAACTCCAGTCAC? Your current version has an extra leading A.

dbrg77 commented 2 years ago

Hi @shengzha , thanks for the comment.

I did not notice that. I simply thought it would be the whole thing before the index. The extra A at the 5' end should not matter though. Having checked the document from Illumina, it seems the exact sequence of TruSeq Index 1 sequencing primer is not provided. However, based on the information from other libraries on Page 73 from the Illumina Adapter Sequence guide (Document # 1000000002694 v16), you are correct. The exact sequence probably should be GATCGGAAGAGCACACGTCTGAACTCCAGTCAC.

I will correct this when I have time.

Thanks again.

Xi

dbrg77 commented 2 years ago

The Read 1 sequence has been corrected in all methods.