Closed by slhogle 2 years ago
Hi Shane
The decoder, as it is coded today, requires all indexes within a group to be the same length. You can still have 4 different sets of barcodes in different locations, each set with its own length.
The issue is mostly that if some are a different length, it is not well defined how to compute the match. I assume that means some have a few extra bases. Are those bases required to tell the barcodes apart? If yes, won't it be possible that the sequence flanking the shorter barcodes happens to be similar to a longer one?
Is the set of 20 barcodes described in the paper, with lengths of 5-8, what you are after? Are they anchored at the same 5' position?
You can think of only the first 5 bases as being the barcode. Since you can't tell what will be downstream of the short ones, extra bases on the long ones don't actually contribute that much.
I might be able to allow Ns as fillers to make the barcodes the same length and adjust the posterior calculations accordingly.
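To make the N-filler idea concrete, here is a minimal Python sketch. It is not Pheniqs code: shorter barcodes are right-padded with N, and padded positions are skipped when counting mismatches. The function names and the example barcodes are made up for illustration, and a real implementation would adjust the posterior calculation rather than count mismatches.

```python
def pad_barcodes(barcodes):
    """Right-pad every barcode with N so all share the longest length."""
    width = max(len(b) for b in barcodes)
    return [b.ljust(width, "N") for b in barcodes]


def masked_distance(barcode, observed):
    """Count mismatches, skipping positions where the barcode is N."""
    return sum(1 for b, o in zip(barcode, observed) if b != "N" and b != o)


# Example barcodes of length 5-8, for illustration only.
padded = pad_barcodes(["GGTAC", "CAACAC", "ATCGGTT", "TCGGTCAA"])
print(padded)  # ['GGTACNNN', 'CAACACNN', 'ATCGGTTN', 'TCGGTCAA']
```

Because the N positions carry no information, a short barcode matches any read that agrees on its first few bases, which is exactly the ambiguity raised above.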
Hi,
Thanks for the quick reply.
In practice the sequences will look something like this:
Forward
p5   i5   Sequencing primer                 spacer index 16S locus primers
---- ---- --------------------------------- ------ ----- -----------------
          ACACTCTTTCCCTACACGACGCTCTTCCGATCT        GGTAC CCTACGGGNGGCWGCAG
          ACACTCTTTCCCTACACGACGCTCTTCCGATCT c      AACAC CCTACGGGNGGCWGCAG
          ACACTCTTTCCCTACACGACGCTCTTCCGATCT at     CGGTT CCTACGGGNGGCWGCAG
          ACACTCTTTCCCTACACGACGCTCTTCCGATCT tcg    GTCAA CCTACGGGNGGCWGCAG
          ACACTCTTTCCCTACACGACGCTCTTCCGATCT        AAGCG CCTACGGGNGGCWGCAG
          ACACTCTTTCCCTACACGACGCTCTTCCGATCT g      CCACA CCTACGGGNGGCWGCAG
Reverse
p7   i7   Sequencing primer                  spacer index 16S locus primers
---- ---- ---------------------------------- ------ ----- ---------------------
          GTGACTGGAGTTCAGACGTGTGCTCTTCCGATCT        AGGAA GACTACHVGGGTATCTAATCC
          GTGACTGGAGTTCAGACGTGTGCTCTTCCGATCT g      AGTGG GACTACHVGGGTATCTAATCC
          GTGACTGGAGTTCAGACGTGTGCTCTTCCGATCT cc     ACGTC GACTACHVGGGTATCTAATCC
          GTGACTGGAGTTCAGACGTGTGCTCTTCCGATCT ttc    TCAGC GACTACHVGGGTATCTAATCC
          GTGACTGGAGTTCAGACGTGTGCTCTTCCGATCT        CTAGG GACTACHVGGGTATCTAATCC
          GTGACTGGAGTTCAGACGTGTGCTCTTCCGATCT t      GCTTA GACTACHVGGGTATCTAATCC
          GTGACTGGAGTTCAGACGTGTGCTCTTCCGATCT gc     GAAGT GACTACHVGGGTATCTAATCC
          GTGACTGGAGTTCAGACGTGTGCTCTTCCGATCT aat    CCTAT GACTACHVGGGTATCTAATCC
          GTGACTGGAGTTCAGACGTGTGCTCTTCCGATCT        ATCTG GACTACHVGGGTATCTAATCC
          GTGACTGGAGTTCAGACGTGTGCTCTTCCGATCT g      AGACT GACTACHVGGGTATCTAATCC
          GTGACTGGAGTTCAGACGTGTGCTCTTCCGATCT cg     ATTCC GACTACHVGGGTATCTAATCC
          GTGACTGGAGTTCAGACGTGTGCTCTTCCGATCT tct    CAATC GACTACHVGGGTATCTAATCC
So the actual index used to multiplex is always the same length (5 bp); however, some constructs carry a 1-3 bp spacer between the sequencing primer and the index to create extra diversity at each position in the amplicon pool. The custom indexes will be anchored at the same 3' position after the 16S locus primer.
The p5|7 sequence is the same for everything; however, we will also need to demultiplex on the i5|7 Illumina indexes. Preferably, could we do this in one pass using Pheniqs?
> Is the set of 20 barcodes described in the paper, with lengths of 5-8, what you are after?
Yes, that is exactly what I want to use to demultiplex.
Thanks again, -shane
Oh,
Pheniqs tokens follow the Python array slicing syntax. They can take either positive or negative offsets. In your case, just ignore the 0-3 bp used for spacing and anchor the custom index to the 3' end. I assume those 5 bases have sufficient entropy for you to tell them apart. Either way, PAMLD, taking quality into account and computing the posterior, will be able to make the most of whatever entropy you have.
Please let me know if you need help defining those tokens. Example 2.7 on the configuration page has examples of negative offsets and how they select bases anchored to the 3' end of a read segment.
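As a plain Python illustration of the slicing semantics (not a Pheniqs configuration), a negative offset counts from the 3' end of the segment, so the length of the spacer in front does not matter. The segments below are assembled from the forward constructs posted above, under the assumption that the segment ends exactly at the last base of the index; the function name is made up for illustration.

```python
def last_bases(segment, n):
    """Return the n bases anchored at the 3' end, like segment[-n:]."""
    return segment[-n:]


# Spacer (lowercase, 0-3 bp) followed by the 5 bp index.
segments = ["GGTAC", "cAACAC", "atCGGTT", "tcgGTCAA"]
indexes = [last_bases(s, 5) for s in segments]
print(indexes)  # ['GGTAC', 'AACAC', 'CGGTT', 'GTCAA']
```

If the segment continues past the index, for example into the 16S primer, the same trick only works when the distance from the index to the 3' end is fixed.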
To answer your questions: Pheniqs can decode all 4 barcodes in one run. You simply define 4 different decoders. The only limitation is that only one of the decoders can be used to split the reads into different files.
Hi,
Sorry for the late reply. I was incorrect about the indexes being at a fixed position from the 3' end. What I ultimately did was follow the suggestion from your first reply. In the first step, I demultiplexed on the Illumina index pairs following your Illumina tutorial. In the second step, I demultiplexed on an 8 bp index at the first 8 positions of the read. For the shorter 5-7 bp indexes, I used the first 1-3 bp from the adjacent amplicon primer sequence to ensure that all indexes were the same length. Using this approach everything works as expected.
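The padding described above can be sketched in Python as follows. The spacer, index, and primer values are taken from the forward constructs posted earlier in this thread. The function name is made up for illustration, and this is only one reading of the approach, not the contents of the actual configuration files.

```python
FORWARD_PRIMER = "CCTACGGGNGGCWGCAG"  # 16S locus primer from the forward table


def pad_to_length(spacer, index, primer, length=8):
    """Join spacer and index, then borrow primer bases to reach `length`."""
    barcode = (spacer + index).upper()
    missing = length - len(barcode)
    if missing < 0:
        raise ValueError(f"{barcode} is longer than {length} bp")
    return barcode + primer[:missing]


# (spacer, index) pairs from the first four forward constructs.
constructs = [("", "GGTAC"), ("c", "AACAC"), ("at", "CGGTT"), ("tcg", "GTCAA")]
barcodes = [pad_to_length(s, i, FORWARD_PRIMER) for s, i in constructs]
print(barcodes)  # ['GGTACCCT', 'CAACACCC', 'ATCGGTTC', 'TCGGTCAA']
```

This works here because the first three bases of both 16S primers (CCT and GAC) are not degenerate, so the borrowed bases are the same in every read.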
It took me a while to fully understand the workflow from the online documentation, which I think was likely the cause of my first question. But now that I have it mostly figured out it is pretty straightforward... One source of confusion was that I initially installed this from conda (bioconda channel) and the installed version was 2.0.4. As a result, the options for the software didn't actually correspond to the online documentation, which I am assuming is for the most recent version. After building from source the documentation fully corresponded to the software version I had built and everything made a lot more sense.
thanks, -shane
Hi Shane
Great that you figured it out! If you want I can post your configuration on the site, it sounds like quite an interesting use case :)
I also think you can do all steps in one run. If you show me your configs I can help with that.
Lior.
Hi Lior,
Sure - I made a simple git repo to house scripts and config files to help the bioinformatics students in the group: https://github.com/slhogle/hambiDemultiplex.
Please have a look there, and yes, if you have a more efficient way to do this (like doing it all in one run) then please let me know! I am basically following the Fluidigm vignette on the webpage and doing the demultiplexing in two separate steps. In each step, I am estimating priors.
Thanks much, -shane
Hi,
We are running a variant of the Adapterama II protocol (http://doi.org/10.7717/peerj.7786), which uses a quadruple index approach (two TruSeq indexes, two custom indexes). The custom indexes have different lengths to increase the sequence diversity at each base position in the amplicon pools.
I am pretty sure I could adapt the tutorial for the quadruple approach if the custom indexes were the same length. However, I am unsure how to do this with variable custom index lengths.
Any advice/help on how to do this would be appreciated! Thanks!
-shane