brwnj / fastq-multx

Demultiplexes a fastq.

headers in unmatched file #5

Open willowsblade opened 7 years ago

willowsblade commented 7 years ago

Hi,

I have paired-end Illumina reads where the barcode is at the start of the read in either the R1 file or the R2 file. To demultiplex, my plan was to run fastq-multx looking first for the barcode in the R1 file, then repeat with the unmatched reads, looking for the barcode in the R2 file. Unfortunately, fastq-multx appears to be adding the full sequence of the read to the header of each read in the unmatched files. Is there any way to prevent this? There does not seem to be an issue with the headers in the successfully demultiplexed reads.

I am running the command as follows:

fastq-multx -B barcode_file_plate1.txt Europe_R1_001.fastq Europe_R2_001.fastq -m 1 -o R1_has_barcode/R1.%.fastq -o R1_has_barcode/R2.%.fastq

Thanks!

brwnj commented 7 years ago

If you're not using normal library prep with a standard index read, then you ought to be prepared to write the informatics tool to deal with the consequences of those upstream decisions. That said, what do you get if you also use -x so that any identified barcode is not trimmed?

willowsblade commented 7 years ago

This is an older dataset, so I didn't get to make any of the upstream decisions. I have trimmed out the sequence from the header and continued, but I didn't think adding the sequence to the header was intentional.

The problem gets worse when I use the -x option. Now, instead of having the sequence only in the headers of the unmatched reads files, I have the sequence in the headers of all the files I checked.

Here's an example of what the data looks like after running:

@M00384:73:000000000-A7D5C:1:1101:18170:2200 1:N:0:1 GTACCAACGTGTGCCAGCAGCCGCGGTAATACGTAGGGCGCAAGCGTTATCCGGATTTATTGGGCGTAAAGGGCTCGTAGGCGGCTCGTCGCGTCCGGTGTGAAAGTCCATCGCTTAACGGTGGATCTGCGCCGGGTCCGGGCGGGCTGGAGTGCGGTAGGGGAGACTGGAATTCCCGGTGTAACGGTGGAATGTGTAGATATCGGGAAGAACACCGATGGCGAAGGCAGGTCTCTGGGCCGTCACTGACG
GTACCAACGTGTGCCAGCAGCCGCGGTAATACGTAGGGCGCAAGCGTTATCCGGATTTATTGGGCGTAAAGGGCTCGTAGGCGGCTCGTCGCGTCCGGTGTGAAAGTCCATCGCTTAACGGTGGATCTGCGCCGGGTCCGGGCGGGCTGGAGTGCGGTAGGGGAGACTGGAATTCCCGGTGTAACGGTGGAATGTGTAGATATCGGGAAGAACACCGATGGCGAAGGCAGGTCTCTGGGCCGTCACTGACG
+
AAAA?FBAAAA1AFF1FFGA0AEAGA00/DFG/EAEEGHGGGAEGEEECCGF?@E?EH12FE1/0>EE>/FF?>E?/>/>/1>////B/CBBC@<@<@-<.>111<1=F1DB?.<CCBGC-:EGA0C000@@.@@GB-.;A@@@;-9-:A-F9FF?;---AA-A--A/BEFF/;/B/A@-@BFFFB-AAF-BFF9B/F/F//BFF@@@-AF/B/BF<?<A--;-9=B--9--9B/F//BE-9--9;B9F/9

The barcode for this sample was GTACCAAC, if that matters.
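
For anyone who needs the same workaround, here is a minimal Python sketch of the header cleanup described above. It is not part of fastq-multx. It assumes strict four-line FASTQ records and assumes the unwanted text is always the last space-separated field of the header and matches the start of the read (the full read, as in the example above, or just the barcode). The function names are hypothetical.

```python
def clean_header(header: str, sequence: str) -> str:
    """Drop a trailing copy of the read (or of its leading bases) from a header."""
    fields = header.rstrip("\n").split(" ")
    # Only remove the last field when it matches the start of the read itself,
    # so normal fields such as "1:N:0:1" are left untouched.
    if len(fields) > 1 and fields[-1] and sequence.startswith(fields[-1]):
        fields.pop()
    return " ".join(fields)


def clean_fastq(lines):
    """Yield FASTQ lines (without newlines) with cleaned header lines.

    Assumes four lines per record: header, sequence, '+', quality.
    """
    record = []
    for line in lines:
        record.append(line.rstrip("\n"))
        if len(record) == 4:
            header, seq, plus, qual = record
            yield clean_header(header, seq)
            yield seq
            yield plus
            yield qual
            record = []
```

To use it, open the affected FASTQ file, pass the file object to `clean_fastq`, and write each yielded line followed by a newline to a new file.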

brwnj commented 7 years ago

That's not good! It's adding what has been determined to be the barcode. You're still using -B barcode_file_plate1.txt, right? I would also probably add -b.

Is this a public dataset that I could access?

willowsblade commented 7 years ago

Yes, I'm still including -B barcode_file_plate1.txt. I added the -x in front of the -m 1 in the earlier command. Does the location of the flags matter?

Unfortunately, the dataset is not published :( I can test this with one of the datasets we have that is public, and see if I can replicate the issue.

brwnj commented 7 years ago

-x is definitely what you want, though you will want to clean up that awful header at some point. Most aligners will strip at the first whitespace anyways. You will also need to trim the barcode length from the start of each read to remove the remaining barcode.
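
The trimming step mentioned above could be sketched in Python roughly as follows. This is an illustration, not something fastq-multx provides: it assumes the barcode sits at the very start of the read and has a fixed, known length (8 for GTACCAAC), and that only the mate carrying the barcode is trimmed. `trim_record` is a hypothetical helper name.

```python
def trim_record(seq: str, qual: str, barcode_len: int):
    """Remove the first barcode_len bases and their quality scores from a read."""
    if len(seq) != len(qual):
        raise ValueError("sequence and quality strings differ in length")
    if barcode_len < 0:
        raise ValueError("barcode_len must not be negative")
    # Sequence and quality must be cut identically so they stay aligned.
    return seq[barcode_len:], qual[barcode_len:]
```

For example, `trim_record("GTACCAACGTGT", "AAAA?FBAAAA1", 8)` returns `("GTGT", "AAA1")`.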

Is your barcode file made up of rows of sample and barcode, with proper line endings? Sometimes exporting from Excel does not give the desired line endings, which can be fixed in place with:

perl -pi -e "s/\r/\n/g" barcode_file_plate1.txt
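
Before rewriting the file, it may help to check whether it actually has the problem. The Python sketch below is a rough check, assuming each row holds a sample name and a barcode separated by whitespace (an assumption about the file layout, not taken from the fastq-multx documentation). `check_barcode_text` is a hypothetical helper.

```python
def check_barcode_text(text: str):
    """Return (rows, problems) for the contents of a barcode file.

    rows is a list of (sample, barcode) tuples; problems is a list of messages.
    """
    problems = []
    if "\r" in text:
        problems.append("carriage returns found (Windows or old Mac line endings)")
    rows = []
    # splitlines() splits on \n, \r\n and bare \r, so every row is still seen.
    for number, line in enumerate(text.splitlines(), start=1):
        if not line.strip():
            continue  # ignore blank lines
        fields = line.split()
        if len(fields) < 2:
            problems.append(f"line {number}: expected sample and barcode, got {line!r}")
            continue
        rows.append((fields[0], fields[1]))
    return rows, problems
```

Read the file with `open(path, newline="")` so that Python does not translate the line endings before the check sees them.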