Implement pheniqs-based demultiplexing solution

adamcantor22 commented 3 years ago

When working with dual combined barcode reads, pheniqs should be used for demultiplexing instead of QIIME. Implement this in the pipeline.

cleme commented 3 years ago

Note that you will need specific tests confirming that pheniqs works properly and that the pipeline handles any warnings/errors pheniqs could output.

mstapylton commented 3 years ago

After looking into pheniqs more, I have some questions. I know we've discussed some of this before, but I still feel like I personally don't have a good answer. @cleme @adamcantor22

1. What cases do we want to cover with pheniqs? Dual barcodes always, dual barcodes only if there are duplicates, etc.?
2. Is cutadapt working as expected in other scenarios? QIIME's cutadapt was finding lots of extra reads when the barcodes were at the beginning of the read and cutadapt was looking for the barcode anywhere in the read.
3. What is the expected behavior? Exact barcode matches plus some close matches in the resulting fastq files?
4. Is the pheniqs setting --platform something we'll need to parameterize in mmeds, and either request from the user or determine in code (i.e. illumina)?
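
To make the cutadapt question concrete, here is a minimal pure-Python sketch of the difference between matching a barcode only at the start of a read and matching it anywhere in the read. It does not call cutadapt or QIIME; the function names, the barcode, and the reads are all made up for illustration.

```python
def matches_at_start(read: str, barcode: str) -> bool:
    """Anchored match: the barcode must be the first bases of the read."""
    return read.startswith(barcode)


def matches_anywhere(read: str, barcode: str) -> bool:
    """Unanchored match: the barcode may occur at any position in the read."""
    return barcode in read


# Hypothetical data: only the first read actually starts with the barcode.
barcode = "ACGTAC"
reads = [
    "ACGTACTTGGCCAATT",  # barcode at the start of the read
    "TTGGACGTACCCAATT",  # same bases occur mid-read
    "TTGGCCAATTGGCCAA",  # barcode absent
]

anchored = [r for r in reads if matches_at_start(r, barcode)]
unanchored = [r for r in reads if matches_anywhere(r, barcode)]
```

With these toy reads the unanchored search keeps one more read than the anchored one, which is the kind of "extra reads" effect I'm asking about.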

adamcantor22 commented 3 years ago

From @cleme: we are going to have one more study coming up that will use dual barcodes, but we don't know exactly what protocol they use, i.e. whether it is similar to what the MTC is using here or some other variation. So I think we should have an option to indicate which barcode protocol we are using. Something like:

single barcode, EMP
dual barcode MTC
dual barcode XXX

Right now it is just single vs. dual barcode, but I think we might eventually need this more flexible type of approach. So for now we'd have three options: single barcode EMP, dual barcode MTC, and dual barcode cutadapt, where the last one (dual cutadapt) is the current version of the code.
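
A rough sketch of what such an option could look like, assuming an enum plus a lookup table. `BarcodeProtocol`, `DEMUX_TOOL`, `select_demux_tool`, and the option strings are hypothetical names, not existing mmeds code, and the tool assigned to each protocol is my reading of this thread rather than a settled decision.

```python
from enum import Enum


class BarcodeProtocol(Enum):
    """Hypothetical option names; not taken from the mmeds code base."""
    SINGLE_EMP = "single_barcode_emp"
    DUAL_MTC = "dual_barcode_mtc"
    DUAL_CUTADAPT = "dual_barcode_cutadapt"


# Assumed mapping from protocol to demultiplexing tool, based on this thread.
DEMUX_TOOL = {
    BarcodeProtocol.SINGLE_EMP: "qiime",
    BarcodeProtocol.DUAL_MTC: "pheniqs",
    BarcodeProtocol.DUAL_CUTADAPT: "cutadapt",
}


def select_demux_tool(protocol_name: str) -> str:
    """Return the tool name for a protocol string, raising on unknown values."""
    try:
        protocol = BarcodeProtocol(protocol_name)
    except ValueError:
        valid = ", ".join(p.value for p in BarcodeProtocol)
        raise ValueError(
            f"Unknown barcode protocol {protocol_name!r}; expected one of: {valid}"
        ) from None
    return DEMUX_TOOL[protocol]
```

Adding a future "dual barcode XXX" protocol would then mean one new enum member and one new table entry.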

cleme commented 3 years ago

To get to your questions specifically @mstapylton

1. What cases are we wanting to cover with pheniqs? dual barcodes, dual barcodes but only if there are duplicates, etc
   Dual barcodes when the study comes from MTC.

2. Is cutadapt working as expected in other scenarios? Qiime's cutadapt was finding lots of extra reads when the barcodes were at the beginning of the read and cutadapt was looking for the barcode anywhere in the read.
   QIIME cutadapt does not work for dual barcodes. We expect that it does work for single barcodes, but this has not been formally tested, afaik. @DSWallach ?

3. What is the expected behavior? exact barcode matches plus some close matches in the resulting fastq files?
   Default behavior will be exact barcodes plus 1 mismatch. We should parameterize this to allow for more than a single mismatch or for exact barcodes only, which is trivial: make it a function argument with a default value of 1.

4. Is the pheniqs setting --platform something we'll need to parameterize in mmeds? and request from the user or determine in code i.e. illumina
   Not sure what that parameter is used for. @adamcantor22 ?
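
For the mismatch default in answer 3, a minimal sketch of what the threshold means, assuming a plain Hamming-distance comparison. This illustrates the parameter only; it is not how pheniqs decodes barcodes internally, and `hamming_distance`, `assign_sample`, and the sample barcodes are hypothetical.

```python
from typing import Dict, Optional


def hamming_distance(a: str, b: str) -> int:
    """Number of positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    return sum(x != y for x, y in zip(a, b))


def assign_sample(
    observed: str, barcodes: Dict[str, str], max_mismatches: int = 1
) -> Optional[str]:
    """Return the sample whose barcode is within max_mismatches of the observed
    barcode, or None when nothing matches or the match is ambiguous."""
    hits = [
        sample
        for sample, bc in barcodes.items()
        if len(bc) == len(observed)
        and hamming_distance(observed, bc) <= max_mismatches
    ]
    return hits[0] if len(hits) == 1 else None


# Hypothetical sample sheet.
barcodes = {"sample_a": "ACGTACGT", "sample_b": "TTGGCCAA"}
```

Passing `max_mismatches=0` gives exact matching only, and a larger value loosens it, so the default of 1 covers the behavior described above.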

adamcantor22 commented 3 years ago

Regarding #4: no, that should default to Illumina; it shouldn't be necessary for us to change it, I think. The docs don't say much about it and only have examples using Illumina. However, the options for platform are CAPILLARY, LS454, ILLUMINA, SOLID, HELICOS, IONTORRENT, ONT, PACBIO.
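
If we did expose it, a small sketch of defaulting and validating the value could look like this. The platform names are the ones listed above; `resolve_platform` and the constants are hypothetical helpers, not part of mmeds or pheniqs.

```python
from typing import Optional

# Platform values as listed in this comment.
PHENIQS_PLATFORMS = (
    "CAPILLARY", "LS454", "ILLUMINA", "SOLID",
    "HELICOS", "IONTORRENT", "ONT", "PACBIO",
)
DEFAULT_PLATFORM = "ILLUMINA"


def resolve_platform(platform: Optional[str] = None) -> str:
    """Return a validated, upper-cased platform name, defaulting to ILLUMINA."""
    if platform is None:
        return DEFAULT_PLATFORM
    normalized = platform.strip().upper()
    if normalized not in PHENIQS_PLATFORMS:
        valid = ", ".join(PHENIQS_PLATFORMS)
        raise ValueError(f"Unknown platform {platform!r}; expected one of: {valid}")
    return normalized
```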

mstapylton commented 3 years ago

It seems to me that it'd be useful to have tests for all the barcode situations, but I know we also want to get the code merged sooner rather than later. I guess we can discuss this during the MMEDS meeting.

cleme commented 3 years ago

Capillary, 454, Solid, and Helicos are outdated: we'd only want those if we go back and import older studies into MMEDS, as we discussed earlier today. IonTorrent and PacBio are rarely used for microbiome work, so they are probably not worth the effort.

clemente-lab / mmeds-meta

Implement pheniqs-based demultiplexing solution #319