biosails / pheniqs

Fast and accurate sequence demultiplexing
Other
24 stars 4 forks source link

demultiplexing based on primer #31

Open FabianRoger opened 2 years ago

FabianRoger commented 2 years ago

Hi,

I just discovered the tool and used it for de multiplexing of an Illumina MiSeq run, it seemed to work well! (only 6% of reads with unclear barcodes for 91 pooled samples).

On the same run however, I also pooled to differnt primer pairs. The Primer sequences are part of the biological read.

I was wondering if pheniqs could be used for that? I use cutadapt but the PHRED error model would be a nicer way than just allowing for a certain number of mismatches.

edit: for primer search, IUPAC wildcard characters would ideally need to be supported

moonwatcher commented 2 years ago

Hi Fabian

I am not sure I understand what you are trying to achieve. Are you trying to remove the primer from the biological read or classify by it?

Maybe you can provide an example?

FabianRoger commented 2 years ago

classify by primer (removing would be the next step although it could be done in one). It is quite common that we pool amplicons generated with different primer pairs onto one run. We usually do not index these separately as the primer sequences serve as indicies.

So after demultiplexing, my files look like this

file_index1_R1:

PRIMER_F_1_READ PRIMER_F_2_READ PRIMER_F_1_READ PRIMER_F_1_READ PRIMER_F_2_READ

file_index1_R2:

PRIMER_R_1_READ PRIMER_R_2_READ PRIMER_R_1_READ PRIMER_R_1_READ PRIMER_R_2_READ

what I want is file_index1_PRIMER1_R1, file_index1_PRIMER1_R2 and file_index1_PRIMER2_R1 file_index1_PRIMER2_R2

Right now, we (and most people I assume), use cutadapt for this and it works well. However it doesn't take into account the quality scores but just allows for a certain number of miss-matches (10 % be default). I think that the PAMLD decoder could be a brilliant alternative.

moonwatcher commented 2 years ago

So, if I understand this correctly: you have two possible primers and 91 sample barcodes. The "libraries" are characterized by the dot product, so 2x91 libraries.

Pheniqs only allows one sample barcode decoder (since sample barcodes represent read groups and those have their own rules and life cycle) BUT every decoder in pheniqs (sample, cellular or molecular) can have an unlimited number of segments. That means you can treat your sample barcode as a barcode with two segments and encode the full dot product (so you will declare 91x2 barcodes).

for instance

"primer_1_library_1": {
    "barcode": [
        "<your barcode sequence for library 1>",
        "<your first primer sequence>"
    ]
},
"primer_2_library_1": {
    "barcode": [
        "<your barcode sequence for library 1>",
        "<your second primer sequence>"
    ]
},

You will, of course, need to define tokens for locating the primer sequence. If this sounds confusing just upload a sample config file you made and an example read with the location of the primers and the two sequences and I will help you with the config but it sounds doable.

P.S. Since primers are usually longer than barcodes, you can expect posterior decoding to have significantly less false negatives (loosing reads) without risking more false positives (misclassifying reads). From a probabilistic approach telling two long sequences apart with high confidence is possible even with a lot of mismatches. would be interesting though to see what the noise filter thinks about it, but I suspect that won't be an issue either.

P.S.2 Of course pheniqs can easily also remove it from the output, you have absolute control of what comes out in the output.

moonwatcher commented 2 years ago

Fabian

Could you please also post your two primers? do they have ambiguous nucleotides? the statistical core that computes the posterior will have to be extended to support those.

FabianRoger commented 2 years ago

in my case I use the following primer pairs:

PRIMER 1 Fwrd: GGWTGAACWGTWTAYCCYCC Rev: TAAACTTCAGGGTGACCAAAAAATCA

PRIMER 2 Fwrd: CGAGAAGACCCTATGGAGCTTA Rev: AATCGTTGAACAAACGAACC

Note that the first forward primer does indeed contain ambiguous nucleotides (but in my case no Iosine - which could be teh case for other primers)

moonwatcher commented 2 years ago

So those primers show up on a fixed offset from some edge of the read segments? Is the only thing you are missing is support for ambiguity codes or do they also shift in position? Do they have indels? in which case this becomes more of an alignment issue...

I think adding support for ambiguity codes in the barcode declaration is possible, the statistical model can definitely apply to it. Basically a W means that both A and T should be considered a positive match.

Have you tried what I suggested? does it work for you other than the ambiguity code? (you can try and do a run with just an A or just a T to see that it otherwise works).

What are you r thoughts?

FabianRoger commented 2 years ago

sorry I just saw you suggestion, I had only seen your question about the barcode sequence. I will try it out later this week and come back to you. I am not sure there can be indels in the primer binding site but primer slippage might lead to that (see this paper ).

I'll get back to you!