Closed by adamcantor22 3 years ago
Note that you will need specific tests that pheniqs works properly, and that the pipeline handles any warnings or errors pheniqs could output.
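As a starting point for the error-handling tests, here is a minimal sketch. `run_demux` is a hypothetical wrapper, not an existing MMEDS function, and the test uses a stand-in Python subprocess rather than a real pheniqs call, since the exact pheniqs error output is not assumed here.

```python
import subprocess
import sys


def run_demux(cmd):
    """Run a demultiplexing command (hypothetical wrapper).

    Returns (stdout, stderr) so callers can inspect warnings, and raises
    RuntimeError with the captured stderr if the command exits non-zero.
    """
    result = subprocess.run(cmd, capture_output=True, text=True)
    if result.returncode != 0:
        raise RuntimeError(
            f"{cmd[0]} exited with {result.returncode}: {result.stderr.strip()}"
        )
    return result.stdout, result.stderr


def test_failure_is_reported():
    # Stand-in for a failing tool: writes to stderr and exits with code 2.
    fake = [sys.executable, "-c",
            "import sys; sys.stderr.write('bad config'); sys.exit(2)"]
    try:
        run_demux(fake)
    except RuntimeError as err:
        assert "bad config" in str(err)
    else:
        raise AssertionError("expected RuntimeError")


def test_warning_is_captured():
    # Stand-in for a tool that succeeds but prints a warning to stderr.
    fake = [sys.executable, "-c",
            "import sys; sys.stderr.write('warning: something'); print('done')"]
    stdout, stderr = run_demux(fake)
    assert stdout.strip() == "done"
    assert "warning" in stderr
```

The real tests would swap the stand-in command for an actual pheniqs invocation on small fixture files.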
After looking into pheniqs more, I have some questions. I know we've discussed some of this before, but I still feel like I personally don't have a good answer. @cleme @adamcantor22
From @cleme: we have one more study coming up that will use dual barcodes, but we don't know exactly what protocol they use, i.e. whether it is similar to what the MTC is using here or some other variation. So I think we should have an option to indicate which barcode protocol we are using. Something like:
Right now it is just single vs. dual barcode, but I think we might eventually need this more flexible type of approach. So for now we'd have three options: single barcode EMP, dual barcode MTC, and dual barcode cutadapt, where the last one (dual cutadapt) is the current version of the code.
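A rough sketch of what the three-option setting could look like in code. The names `BarcodeProtocol` and `demux_tool` are hypothetical, and mapping the single-barcode EMP case to QIIME is an assumption; only the MTC-to-pheniqs and dual-cutadapt mappings come from this thread.

```python
from enum import Enum


class BarcodeProtocol(Enum):
    """Hypothetical barcode protocol option for a study."""
    SINGLE_EMP = "single_emp"        # single barcode, EMP protocol
    DUAL_MTC = "dual_mtc"            # dual barcodes, MTC protocol
    DUAL_CUTADAPT = "dual_cutadapt"  # dual barcodes, current code path


def demux_tool(protocol):
    """Return the name of the demultiplexer to use for a protocol."""
    if protocol is BarcodeProtocol.DUAL_MTC:
        return "pheniqs"
    if protocol is BarcodeProtocol.DUAL_CUTADAPT:
        return "cutadapt"
    # Assumption: single-barcode EMP studies stay on the QIIME path.
    return "qiime"
```

Adding a protocol for the upcoming study would then mean one new enum member and one new branch, rather than another boolean flag.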
To get to your questions specifically @mstapylton
1. What cases do we want to cover with pheniqs? Dual barcodes, dual barcodes but only if there are duplicates, etc.
Dual barcodes when the study comes from MTC.
2. Is cutadapt working as expected in other scenarios? QIIME's cutadapt was finding lots of extra reads when the barcodes were at the beginning of the read and cutadapt was looking for the barcode anywhere in the read.
QIIME cutadapt does not work for dual barcodes. We expect that it does work for single barcodes, but this has not been formally tested, afaik. @DSWallach ?
3. What is the expected behavior? Exact barcode matches plus some close matches in the resulting fastq files?
Default behavior will be exact barcodes plus 1 mismatch. We should parametrize this to allow for more than a single mismatch, or to get only exact barcodes. This is trivial: just set the default value to 1 in the function.
4. Is the pheniqs setting --platform something we'll need to parameterize in MMEDS and request from the user, or can we determine it in code (i.e. Illumina)?
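To illustrate the parameter only: a small sketch of barcode matching with a configurable mismatch count defaulting to 1. This is not how pheniqs or cutadapt decide matches internally; `hamming` and `match_barcode` are hypothetical helpers, and treating ambiguous hits as unassigned is an assumption.

```python
def hamming(a, b):
    """Count positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    return sum(x != y for x, y in zip(a, b))


def match_barcode(observed, known_barcodes, max_mismatches=1):
    """Return the single known barcode within max_mismatches of observed.

    Returns None when nothing is close enough, or when more than one
    barcode qualifies (ambiguous reads are left unassigned).
    """
    hits = [
        bc for bc in known_barcodes
        if len(bc) == len(observed) and hamming(observed, bc) <= max_mismatches
    ]
    return hits[0] if len(hits) == 1 else None
```

Passing `max_mismatches=0` gives exact-only matching, and larger values loosen it.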
Not sure what that parameter is used for. @adamcantor22 ?
Regarding #4, no, that should default to Illumina; it shouldn't be necessary for us to change it, I think. The docs don't say much about it, and only have examples using Illumina. However, the options for platform are CAPILLARY, LS454, ILLUMINA, SOLID, HELICOS, IONTORRENT, ONT, PACBIO.
It seems to me that it'd be useful to have tests for all the barcode situations, but I know we also want to get the code merged sooner rather than later. Guess we can discuss during the MMEDS meeting.
Capillary, 454, SOLiD, and Helicos are outdated: we'd only want those if we go back and import older studies into MMEDS, as we discussed earlier today. IonTorrent and PacBio are rarely used for microbiome work, so probably not worth the effort.
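If we do end up exposing the platform later, a sketch of how the default could be handled. `resolve_platform` is a hypothetical helper; the list of values is the one quoted above, and how the result gets passed to pheniqs is left out on purpose.

```python
# Platform values as listed earlier in this thread.
PHENIQS_PLATFORMS = (
    "CAPILLARY", "LS454", "ILLUMINA", "SOLID",
    "HELICOS", "IONTORRENT", "ONT", "PACBIO",
)


def resolve_platform(platform=None):
    """Return a validated platform name, defaulting to ILLUMINA."""
    if platform is None:
        return "ILLUMINA"
    name = platform.strip().upper()
    if name not in PHENIQS_PLATFORMS:
        raise ValueError(f"unknown platform: {platform!r}")
    return name
```

With this, existing callers that never mention a platform keep getting Illumina, and a typo fails early instead of reaching the demultiplexer.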
When working with dual combined barcode reads, pheniqs should be used for demultiplexing instead of qiime. Implement this into the pipeline.