biosails / pheniqs

Fast and accurate sequence demultiplexing

Quadruple indexing, variable index length #35

Closed by slhogle 2 years ago

slhogle commented 2 years ago

Hi,

We are running a variant of the Adapterama II protocol (http://doi.org/10.7717/peerj.7786), which uses a quadruple index approach (two TruSeq indexes, two custom indexes). The custom indexes have different lengths to increase the sequence diversity at each base position in the amplicon pools.

I am pretty sure I could adapt the tutorial for the quadruple approach if the custom indexes were the same length. However, I am unsure how to do this with variable custom index lengths.

Any advice/help on how to do this would be appreciated! Thanks!

-shane

moonwatcher commented 2 years ago

Hi Shane

The decoder, as it is coded today, requires all indexes within a group to be the same length. You can still have 4 different sets of barcodes in different locations, and each set can have its own length.

The issue is mostly that if some barcodes have a different length, it is not well defined how to compute the match. I assume that means some have a few extra bases. Are those bases required to tell the barcodes apart? If yes, won't it be possible that the sequence flanking a shorter barcode happens to be similar to a longer one?

Is the set of 20 barcodes described in the paper, with lengths of 5-8, what you are after? Are they anchored at the same 5' position?

You can think of only the first 5 bases as being the barcode. Since you can't tell what will be downstream of the short ones, extra bases on the long ones don't actually contribute that much.
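As a rough check of this idea, the sketch below takes the forward spacer and index pairs from the table posted later in this thread, keeps only the first 5 bases of each tag, and tests whether the truncated tags are still distinct. Counting the spacer as part of the tag is an assumption, and `hamming` is a helper written for this illustration, not part of Pheniqs.

```python
from itertools import combinations

# (spacer, index) pairs copied from the forward table in this thread.
FORWARD_TAGS = [
    ("", "GGTAC"),
    ("c", "AACAC"),
    ("at", "CGGTT"),
    ("tcg", "GTCAA"),
    ("", "AAGCG"),
    ("g", "CCACA"),
]


def hamming(a: str, b: str) -> int:
    """Count mismatching positions between two equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    return sum(x != y for x, y in zip(a, b))


# Assumption: the spacer counts as part of the tag, and only the
# first 5 bases of spacer + index are treated as the barcode.
prefixes = [(spacer + index).upper()[:5] for spacer, index in FORWARD_TAGS]
all_distinct = len(set(prefixes)) == len(prefixes)
min_distance = min(hamming(a, b) for a, b in combinations(prefixes, 2))

print(prefixes)
print("all distinct:", all_distinct)
print("minimum pairwise Hamming distance:", min_distance)
```

A small minimum distance would mean that a single sequencing error can bring one truncated tag close to another, which is where a quality-aware decoder helps.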

I might be able to allow Ns as fillers to make the barcodes the same length and adjust the posterior calculations accordingly.

slhogle commented 2 years ago

Hi,

Thanks for the quick reply.

In practice the sequences will look something like this:

Forward

p5   i5      Sequencing primer                  spacer  index  16S locus primers
---- ----    ---------------------------------    ---   -----  -----------------
             ACACTCTTTCCCTACACGACGCTCTTCCGATCT          GGTAC  CCTACGGGNGGCWGCAG
             ACACTCTTTCCCTACACGACGCTCTTCCGATCT      c   AACAC  CCTACGGGNGGCWGCAG
             ACACTCTTTCCCTACACGACGCTCTTCCGATCT     at   CGGTT  CCTACGGGNGGCWGCAG
             ACACTCTTTCCCTACACGACGCTCTTCCGATCT    tcg   GTCAA  CCTACGGGNGGCWGCAG
             ACACTCTTTCCCTACACGACGCTCTTCCGATCT          AAGCG  CCTACGGGNGGCWGCAG
             ACACTCTTTCCCTACACGACGCTCTTCCGATCT      g   CCACA  CCTACGGGNGGCWGCAG

Reverse

p7   i7      Sequencing primer                  spacer  index  16S locus primers
---- ----    ----------------------------------   ---   -----  ---------------------
             GTGACTGGAGTTCAGACGTGTGCTCTTCCGATCT         AGGAA  GACTACHVGGGTATCTAATCC
             GTGACTGGAGTTCAGACGTGTGCTCTTCCGATCT     g   AGTGG  GACTACHVGGGTATCTAATCC
             GTGACTGGAGTTCAGACGTGTGCTCTTCCGATCT    cc   ACGTC  GACTACHVGGGTATCTAATCC
             GTGACTGGAGTTCAGACGTGTGCTCTTCCGATCT   ttc   TCAGC  GACTACHVGGGTATCTAATCC
             GTGACTGGAGTTCAGACGTGTGCTCTTCCGATCT         CTAGG  GACTACHVGGGTATCTAATCC
             GTGACTGGAGTTCAGACGTGTGCTCTTCCGATCT     t   GCTTA  GACTACHVGGGTATCTAATCC
             GTGACTGGAGTTCAGACGTGTGCTCTTCCGATCT    gc   GAAGT  GACTACHVGGGTATCTAATCC
             GTGACTGGAGTTCAGACGTGTGCTCTTCCGATCT   aat   CCTAT  GACTACHVGGGTATCTAATCC
             GTGACTGGAGTTCAGACGTGTGCTCTTCCGATCT         ATCTG  GACTACHVGGGTATCTAATCC
             GTGACTGGAGTTCAGACGTGTGCTCTTCCGATCT     g   AGACT  GACTACHVGGGTATCTAATCC
             GTGACTGGAGTTCAGACGTGTGCTCTTCCGATCT    cg   ATTCC  GACTACHVGGGTATCTAATCC
             GTGACTGGAGTTCAGACGTGTGCTCTTCCGATCT   tct   CAATC  GACTACHVGGGTATCTAATCC

So the actual index used to multiplex is always the same length (5 bp); however, some indexes are preceded by a 1-3 bp spacer to create extra diversity at each position in the amplicon pool. The custom indexes will be anchored at the same 3' position after the 16S locus primer.

The p5|7 sequence is the same for everything; however, we will also need to demultiplex on the i5|7 Illumina indexes. Preferably we could do this in one pass using pheniqs?

> Is the set of 20 barcodes described in the paper, with lengths of 5-8, what you are after?

Yes, that is exactly what I want to use to demultiplex.

Thanks again, -shane

moonwatcher commented 2 years ago

Oh,

Pheniqs tokens follow the Python array slicing syntax. They can take either positive or negative offsets. In your case, just ignore the 0-3 bp used for spacing and anchor the custom index to the 3' end. I assume those 5 bases have sufficient entropy for you to tell them apart. Either way, PAMLD, which takes quality into account and computes the posterior, will be able to make the most of whatever entropy you have.
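To give a feel for what "taking quality into account and computing the posterior" means, here is a deliberately simplified sketch of quality-aware barcode decoding. It is not Pheniqs's implementation of PAMLD: the uniform priors, the `noise_prior` value, and the model of noise as a uniformly random sequence are all assumptions made for illustration. The candidate barcodes are the six forward indexes from the table above.

```python
def phred_to_error(q: int) -> float:
    """Convert a Phred quality score to an error probability."""
    return 10 ** (-q / 10)


def likelihood(observed: str, qualities: list[int], barcode: str) -> float:
    """P(observed | barcode), assuming independent errors per base."""
    p = 1.0
    for base, q, expected in zip(observed, qualities, barcode):
        e = phred_to_error(q)
        # A miscalled base is assumed equally likely to be any of the other 3.
        p *= (1 - e) if base == expected else e / 3
    return p


def posteriors(observed, qualities, barcodes, noise_prior=0.01):
    """Posterior per barcode under uniform priors plus a noise class (assumed model)."""
    prior = (1 - noise_prior) / len(barcodes)
    joint = {b: prior * likelihood(observed, qualities, b) for b in barcodes}
    # Assumption: noise reads are uniformly random sequences.
    noise = noise_prior * 0.25 ** len(observed)
    total = sum(joint.values()) + noise
    return {b: v / total for b, v in joint.items()}


BARCODES = ["GGTAC", "AACAC", "CGGTT", "GTCAA", "AAGCG", "CCACA"]

# One mismatch against GGTAC in the last position, which has low quality.
result = posteriors("GGTAA", [30, 30, 30, 30, 10], BARCODES)
best = max(result, key=result.get)
print(best, round(result[best], 4))
```

Because the mismatching base has a low quality score, the mismatch costs little and the read is still assigned to the nearest barcode with high posterior probability.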

Please let me know if you need help defining those tokens. Example 2.7 on the configuration page has examples of negative offsets and how they select bases anchored to the 3' end of a read segment.
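The slicing behaviour can be tried out in plain Python. The sketch below uses a hypothetical helper, `apply_token`, that mimics a `segment:start:end` token with Python slice semantics; the three-field layout is an assumption here and should be checked against the configuration page. The toy segments are spacer plus index from the forward table, under the assumption that each segment ends exactly at the last base of the index.

```python
def apply_token(segments: list[str], token: str) -> str:
    """Extract bases from one segment using Python slice semantics.

    Hypothetical helper for illustration: `token` is "segment:start:end",
    where empty start or end means "from the beginning" or "to the end".
    """
    segment_index, start, end = token.split(":")
    sequence = segments[int(segment_index)]
    return sequence[int(start) if start else None:int(end) if end else None]


# Toy segments: 0-3 bp spacer (lowercase) followed by the 5 bp index.
for segment in ["GGTAC", "cAACAC", "atCGGTT", "tcgGTCAA"]:
    # A negative start counts from the 3' end, so the spacer is skipped.
    print(segment, "->", apply_token([segment], "0:-5:"))

# A positive end counts from the 5' end instead.
print(apply_token(["GGTACCCT"], "0::5"))
```

With the negative offset, the same token selects the 5 bp index regardless of how long the spacer in front of it is.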

To answer your questions: Pheniqs can decode all 4 barcodes in one run. You simply define 4 different decoders. The only limitation is that only one of the decoders can be used to split the reads into different files.

slhogle commented 2 years ago

Hi,

Sorry for the late reply. I was incorrect about the indexes being at a fixed position from the 3' end. What I ultimately did was follow the suggestion from your first reply. In the first step, I demultiplexed on the Illumina index pairs following your Illumina tutorial. In the second step, I demultiplexed on an 8 bp index at the first 8 positions of the read. For the shorter 5-7 bp indexes, I used the first 1-3 bp from the adjacent amplicon primer sequence to ensure that all indexes were the same length. Using this approach, everything works as expected.
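For readers who want to reproduce the padding step, here is a minimal sketch that builds 8 bp barcodes from the forward table earlier in this thread. It assumes the spacer is counted as part of the tag and that the missing bases are borrowed from the start of the forward 16S primer; `pad_with_primer` is a hypothetical helper, and the actual barcode lists are in the repository linked later in the thread.

```python
# Forward 16S locus primer and (spacer, index) pairs from the table above.
FORWARD_PRIMER = "CCTACGGGNGGCWGCAG"
FORWARD_TAGS = [
    ("", "GGTAC"),
    ("c", "AACAC"),
    ("at", "CGGTT"),
    ("tcg", "GTCAA"),
    ("", "AAGCG"),
    ("g", "CCACA"),
]
TARGET_LENGTH = 8


def pad_with_primer(spacer: str, index: str, primer: str,
                    length: int = TARGET_LENGTH) -> str:
    """Extend spacer + index to `length` bases using the primer's first bases."""
    tag = (spacer + index).upper()
    missing = length - len(tag)
    if missing < 0:
        raise ValueError(f"tag {tag} is longer than {length} bases")
    return tag + primer[:missing]


padded = [pad_with_primer(s, i, FORWARD_PRIMER) for s, i in FORWARD_TAGS]
for (spacer, index), barcode in zip(FORWARD_TAGS, padded):
    print(f"{spacer + index:<8} -> {barcode}")

# Every barcode must be 8 bp and unique for the decoder to use them as one set.
assert all(len(b) == TARGET_LENGTH for b in padded)
assert len(set(padded)) == len(padded)
```

At most three primer bases (CCT) are borrowed here, and none of them is a degenerate code such as N or W, so the padded barcodes contain only A, C, G, and T.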

It took me a while to fully understand the workflow from the online documentation, which I think was likely the cause of my first question. But now that I have it mostly figured out it is pretty straightforward... One source of confusion was that I initially installed this from conda (bioconda channel) and the installed version was 2.0.4. As a result, the options for the software didn't actually correspond to the online documentation, which I am assuming is for the most recent version. After building from source the documentation fully corresponded to the software version I had built and everything made a lot more sense.

thanks, -shane

moonwatcher commented 2 years ago

Hi Shane

Great that you figured it out! If you want I can post your configuration on the site, it sounds like quite an interesting use case :)

I also think you can do all steps in one run. If you show me your configs, I can help with that.

Lior.

slhogle commented 2 years ago

Hi Lior,

Sure - I made a simple git repo to house scripts and config files to help the bioinformatics students in the group: https://github.com/slhogle/hambiDemultiplex.

Please have a look there, and yes, if you have a more efficient way to do this (like doing it all in one run) then please let me know! I am basically following the Fluidigm vignette on the webpage and doing the demultiplexing in two separate steps. In each step, I am estimating priors.

Thanks much, -shane