brwnj / fastq-multx

Demultiplexes a fastq.

Demultiplexing from barcodes in headers #7

Open jenzopr opened 6 years ago

jenzopr commented 6 years ago

Hi Joe,

I just ran into a Segmentation fault error when demultiplexing from barcodes in the header. All %.fastq.gz files are created, but they are empty, so the error must occur after the output files have been created. My call is fastq-multx -H -m1 -B barcodes.txt input.fastq.gz -o %.fastq.gz and a fastq record looks like:

@NS500475:199:HHML2BGX2:1:11101:21358:1116 2:N:0:1 AACCAATCGT
GCGGTTAAGAGTACTGANNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
+
AAAA/EEEAEEEEEEEE###############################################################

Maybe you can provide me with some directions on how demultiplexing from a barcode in the header is possible. Best and many thanks! Jens
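For readers who only need to see where the barcode sits in such a header, here is a minimal Python sketch. It is not fastq-multx's parsing code; the function name header_barcode is hypothetical, and the rule it applies (take the last whitespace-separated field, then whatever follows its last colon) is an assumption based on the header shown above and on the common Illumina layout.

```python
from typing import Optional


def header_barcode(header: str) -> Optional[str]:
    """Return the trailing barcode of a FASTQ header line, or None if absent.

    Assumption: the barcode is the last whitespace-separated field, or the
    part of that field after its last colon (e.g. "1:N:0:ACGT").
    """
    fields = header.strip().split()
    if len(fields) < 2:
        return None
    candidate = fields[-1].rsplit(":", 1)[-1].upper()
    # Accept only nucleotide characters; "+" separates dual indexes.
    if candidate and set(candidate) <= set("ACGTN+"):
        return candidate
    return None


if __name__ == "__main__":
    line = "@NS500475:199:HHML2BGX2:1:11101:21358:1116 2:N:0:1 AACCAATCGT"
    print(header_barcode(line))
```

A header without a trailing barcode, such as the ones in the single-end reproduction further down, yields None under this rule.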

jenzopr commented 6 years ago

I investigated a bit and have an example that reproduces the error. It is not primarily related to barcodes in headers, but to the handling of unmatched barcodes in single-end mode:

The content of input.fastq:

@NS500475:199:HHML2BGX2:1:11101:12492:1053 1:N:0:1
TTGCAAGATCC
+
AAAAA#EEEEE
@NS500475:199:HHML2BGX2:1:11101:26088:1053 1:N:0:1
TCCAANCATCT
+
AAAAA#EEEEE
@NS500475:199:HHML2BGX2:1:11101:17308:1053 1:N:0:1
ATGCCNTAATT
+
6AAAA#EEAAA
@NS500475:199:HHML2BGX2:1:11101:23038:1053 1:N:0:1
TGCGTNGGCCG
+
AAAAA#EEEEE
@NS500475:199:HHML2BGX2:1:11101:6451:1053 1:N:0:1
TTGCANGCAT
+
AAAAA#AAEE

The barcodes.txt file:

cell1   TTGCAGTCTAC
cell2   TTGCAGTTATG
cell3   TTGCCTATGGC
cell4   TTGCAGCGTCC
cell5   TTGCAGGCATC
cell6   TTGATTGCTCG
cell7   TTGATGCAATC
cell8   TTGATTCTTAA
cell9   TTGATTCAGAT
cell10  TTGCAAGATCC

The call gives me:

fastq-multx -D -m 1 -B barcodes.txt input.fastq -o %.fastq.gz
BC: 0 bc:TTGCAGTCTAC n:11
BC: 1 bc:TTGCAGTTATG n:11
BC: 2 bc:TTGCCTATGGC n:11
BC: 3 bc:TTGCAGCGTCC n:11
BC: 4 bc:TTGCAGGCATC n:11
BC: 5 bc:TTGATTGCTCG n:11
BC: 6 bc:TTGATGCAATC n:11
BC: 7 bc:TTGATTCTTAA n:11
BC: 8 bc:TTGATTCAGAT n:11
BC: 9 bc:TTGCAAGATCC n:11
Using Barcode File: barcodes.txt
End used: start
id: @NS500475:199:HHML2BGX2:1:11101:12492:1053 1:N:0:1, seq: TTGCAAGATCC 11, found bc: 9 bc:TTGCAAGATCC n:11, bestd: 0, next_best: 3, best: 9 cell10
id: @NS500475:199:HHML2BGX2:1:11101:26088:1053 1:N:0:1, seq: TCCAANCATCT 11, best: 10 unmatched
Segmentation fault

The same error occurs with paired-end sequences when -o %_R1.fastq -o %_R2.fastq is used instead of -o n/a -o %.fastq. HTH, Jens
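To make the debug output above easier to follow, here is a rough Python sketch of mismatch-based barcode assignment. It is an illustration only, not fastq-multx's actual algorithm: the names hamming and assign are hypothetical, N is assumed to count as an ordinary mismatch, and positions missing from a short read are assumed to count as mismatches. With the barcodes from this report it gives best distance 0 and next-best distance 3 for the first read, and leaves the second read unmatched at one allowed mismatch.

```python
from typing import Dict, Optional, Tuple

# Barcodes copied from the barcodes.txt shown in this report.
BARCODES: Dict[str, str] = {
    "cell1": "TTGCAGTCTAC", "cell2": "TTGCAGTTATG", "cell3": "TTGCCTATGGC",
    "cell4": "TTGCAGCGTCC", "cell5": "TTGCAGGCATC", "cell6": "TTGATTGCTCG",
    "cell7": "TTGATGCAATC", "cell8": "TTGATTCTTAA", "cell9": "TTGATTCAGAT",
    "cell10": "TTGCAAGATCC",
}


def hamming(a: str, b: str) -> int:
    """Count differing positions; a length difference adds to the count."""
    return sum(x != y for x, y in zip(a, b)) + abs(len(a) - len(b))


def assign(seq: str, barcodes: Dict[str, str],
           max_mismatches: int = 1) -> Tuple[Optional[str], int, int]:
    """Return (barcode id or None, best distance, next-best distance).

    Compares the start of the read against every barcode. Needs at least
    two barcodes so that a next-best distance exists.
    """
    scored = sorted(
        (hamming(seq[:len(bc)], bc), name) for name, bc in barcodes.items()
    )
    (best_d, best_name), (next_d, _) = scored[0], scored[1]
    if best_d > max_mismatches:
        return None, best_d, next_d  # read goes to "unmatched"
    return best_name, best_d, next_d


if __name__ == "__main__":
    for read in ("TTGCAAGATCC", "TCCAANCATCT"):
        print(read, assign(read, BARCODES))
```

The crash reported above happens right after the first read is classified as unmatched, which is the None branch in this sketch.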

jenzopr commented 6 years ago

Hi Joe,

do you think you'll be able to fix the bug in the next few weeks? 😃

Best, Jens

brwnj commented 6 years ago

No, this would be a weekend/off-hours project and those are pretty booked these days.

jenzopr commented 6 years ago

Uh, that's bad news, but understandable. I haven't programmed C++ in a while, but I will try to have a look and dig around if you don't mind.

brwnj commented 6 years ago

I'll happily review pull requests!

dlebron12 commented 2 years ago

Was this ever fixed?

brwnj commented 2 years ago

I don't believe anyone ever looked into this further.

rikrdo89 commented 1 year ago

I am also experiencing a similar issue when using the "-H" parameter for dual indexes in the header. I always get Segmentation fault (core dumped).

rikrdo89 commented 1 year ago

I looked into the issue further and, as was said before, it has nothing to do with the headers. The program cannot handle single-end reads. A workaround is to provide the input fastq file twice and set n/a for one of the outputs, as follows:

fastq-multx -H -B indexes.txt mxtest-h_1.fastq mxtest-h_1.fastq -o %_1.fastq -o n/a

Hopefully someone will fix this issue at some point.
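For anyone scripting the workaround above, the command line can be assembled in Python as in the sketch below. The helper name workaround_cmd is hypothetical. It uses only the flags and the n/a placeholder that appear in this thread, and it builds the argument list without running anything.

```python
from typing import List


def workaround_cmd(barcodes: str, fastq: str, out_pattern: str,
                   use_header: bool = True) -> List[str]:
    """Build the fastq-multx call that passes a single-end file twice.

    The second copy of the input is discarded by sending its output to n/a,
    as described in the comment above.
    """
    cmd = ["fastq-multx"]
    if use_header:
        cmd.append("-H")  # take barcodes from the header, as in the thread
    cmd += ["-B", barcodes, fastq, fastq, "-o", out_pattern, "-o", "n/a"]
    return cmd


if __name__ == "__main__":
    print(" ".join(workaround_cmd("indexes.txt", "mxtest-h_1.fastq", "%_1.fastq")))
```

The returned list can be handed to subprocess.run(cmd, check=True) on a machine where fastq-multx is installed. Whether the workaround avoids the crash in every version has not been verified here.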