bailey-lab / SeekDeep

Bioinformatic Tools for analyzing targeted amplicon sequencing developed by Nicholas Hathaway of Bailey Lab
http://seekdeep.brown.edu/
GNU Lesser General Public License v3.0
13 stars 5 forks source link

Extracting paired-end reads with primer pairs and no multiplexing #7

Closed embatty closed 6 years ago

embatty commented 6 years ago

I am using SeekDeep to extract the reads from some amplicon sequencing. The sequences have already been demultiplexed into a pair of files per sample, and I want to split these into files for each of my primer pairs. I'm running the following command: SeekDeep extractorPairedEnd --id barcodes.txt --fastq1gz sample.1.fastq.gz --fastq2gz sample.2.fastq.gz --overlapStatusFnp overlapStatus.txt --dout sample

with the barcodes.txt looking like this:

gene    forward reverse
Pf-1    TTTCTATGACGTATGATAGGG   TTATATTATTTGCTGTCAGATTATT
Pf-2    ATGTCTAAAGATAATATAGGAAATAAA ATAAAATTCATTTGTGTCTTTTT
Pf-3    GAAAATATGGTAGGTGATTTAAGAA   AGTAGCAATATTTGCATCAACA
Pf-4    AGAATATAAAAATTTTGAGAATGATAAA    TTAAATATTCTACACCATCAAATCC

and overlapStatus.txt:

target status
Pf-1 noOverLap
Pf-2 noOverLap
Pf-3 noOverLap
Pf-4 noOverLap

I get the following error:

void bibseq::SeqInput::openIn(): Error file: sample/unfilteredReads/byPrimers/Pf-1all_R1.fastq.gz doesn't exist
and file: sample/unfilteredReads/byPrimers/Pf-1all_R2.fastq.gz doesn't exist

I think it's correctly checking that I have no MIDs and trying to assign one called "all" but then can't find the file it needs. Do you have any idea where I'm going wrong?

Owuorgpo commented 6 years ago

The main purpose of the extractor is to take raw data and demultiplex by sample barcodes if present and by primer pairs. In your case, the sequences have already been demultiplexed into a pair of file per sample and your interest is to simply split based on the primer pair. But if you have MIDs, you should include them in the barcodes.txt after the primers with a header line starting with id and barcode: id barcode MID01 ACGAGTGCGT MID02 ACGCTCGACA MID03 AGACGCACTC MID04 AGCACTGTAG If there are no MIDs, please follow this tutorial http://baileylab.umassmed.edu/SeekDeep/tutorial_PairedEnd_noMIDs.html

nickjhathaway commented 6 years ago

Hi @embatty, Sorry for the delay, I'm currently finishing up Med school and haven't been able to respond right away. I can't see any obvious reason for why the command would have failed. I would need to see what managed to be created in that output directory to get a better idea of what was happening, or if possible if you could send me the sequence files where the command fails, I could likely see right away where things are going wrong. Feel free to share an example file or something over dropbox for me to take a look.

Best, Nick

nickjhathaway commented 6 years ago

Actually @embatty, I figured it out, I almost never use SeekDeep extractorPairedEnd without using the flag --sampleName to give a name to the files and when I had changed something in the last version I never checked to see if extractor worked without having the --sampleName flag on. I've just released a new version where I have fixed this issue. Let me know if you have any other problems.

Best, Nick