brwnj / fastq-multx

Demultiplexes a fastq.

how to demultiplex dual index on paired end reads #3

Closed y9c closed 7 years ago

y9c commented 7 years ago

index1-read1 --- read2-index2

sample1  index1-a  index2-b
sample2  index1-a  index2-c
sample3  index1-d  index2-c

brwnj commented 7 years ago

I don't have any dual index data to test this on, but I believe for your barcode in the barcodes file you use something like:

sample_name  TAATGCGC-GTACTGAC

The second barcode is likely reverse complemented.
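In case the second barcode does need to be reverse complemented, here is a minimal sketch of how that could be done in the shell. The helper name `revcomp` is made up for illustration and is not part of fastq-multx; whether reverse complementing is needed at all depends on the library prep and sequencer, so treat both printed lines as candidates to try.

```shell
# Hypothetical helper (not part of fastq-multx): reverse complement a DNA string.
revcomp() {
  # Reverse the string with awk, then swap complementary bases with tr.
  # Characters other than A/C/G/T (e.g. N) are left unchanged.
  printf '%s\n' "$1" | awk '{
    out = ""
    for (i = length($0); i >= 1; i--) out = out substr($0, i, 1)
    print out
  }' | tr 'ACGTacgt' 'TGCAtgca'
}

# Two candidate barcode-file lines for the example above:
# one with the second barcode as given, one with it reverse complemented.
printf 'sample_name\t%s-%s\n' TAATGCGC GTACTGAC
printf 'sample_name\t%s-%s\n' TAATGCGC "$(revcomp GTACTGAC)"
```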

y9c commented 7 years ago

Thank you @brwnj.
If the reads are in separate files, such as seq_R1.fq and seq_R2.fq, how should I set up the command?

BTW, I wonder what is the relationship between this repo and ea-utils? Is the fastq-multx in ea-utils up to date?

brwnj commented 7 years ago

I don't know the command for sure because, as I said above, "I don't have any dual index data to test this on."

This code is taken directly from ea-utils, with slightly different versioning. The only changes are typo fixes in the help message.

y9c commented 7 years ago

Hi @brwnj, here is some test data. Could you please show me the code? Thank you very much.

barcode.txt test.1.fq.gz test.2.fq.gz

brwnj commented 7 years ago

Fix the barcodes as stated above:

awk 'BEGIN{FS=" ";OFS="\t"}!/^#/{print $1,$2"-"$3}' barcode.txt > fixed_barcodes.txt

Then:

fastq-multx -B fixed_barcodes.txt test.1.fq.gz test.2.fq.gz -o %_R1.fastq -o %_R2.fastq

The top bit of the output includes counts of:

Id    Count  File(s)
F111  36     F111_R1.fastq  F111_R2.fastq
F114  9      F114_R1.fastq  F114_R2.fastq
F121  10     F121_R1.fastq  F121_R2.fastq
F124  16     F124_R1.fastq  F124_R2.fastq
F131  14     F131_R1.fastq  F131_R2.fastq
F134  21     F134_R1.fastq  F134_R2.fastq
F141  31     F141_R1.fastq  F141_R2.fastq
F144  16     F144_R1.fastq  F144_R2.fastq

y9c commented 7 years ago

The second barcode is not reverse complemented.

brwnj commented 7 years ago

There is a problem, but that's not it. fastq-multx is matching barcodes in the sequence line only and not the header. Using -H, which should use the header, causes a seg fault.

I would recommend trying out Brian Bushnell's demuxbyname.sh method outlined here: https://www.biostars.org/p/139395/.

y9c commented 7 years ago

Some notes:

If the sequence orientation is unknown, use a barcode list that includes both barcode orders to demultiplex the file:

awk '!/^#/{print $1"\t"$2"-"$3"\n"$1"\t"$3"-"$2}' barcode.txt > fixed_barcodes.txt

Dual barcodes should be written in the format barcode1-barcode2.

Write each barcode sequence in its original orientation; barcode2 should not be reverse complemented.
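To make the format concrete, here is a small sketch of the both-orders list on a made-up barcode file. The sample IDs and barcode sequences are hypothetical, the three-column layout (ID, barcode1, barcode2, with `#` comment lines) is assumed from the awk command above, and the `NF >= 3` guard is an addition that skips short or blank lines. Whether fastq-multx handles repeated IDs in the way intended has not been checked here.

```shell
# Hypothetical helper: emit "ID<TAB>bc1-bc2" and "ID<TAB>bc2-bc1" for each sample.
# Reads the file(s) given as arguments, or stdin when called without arguments.
make_barcodes() {
  awk '!/^#/ && NF >= 3 {
    print $1 "\t" $2 "-" $3   # barcodes in the listed order
    print $1 "\t" $3 "-" $2   # barcodes swapped, for reads in the other orientation
  }' "$@"
}

# Made-up input for illustration only.
tmpdir=$(mktemp -d)
cat > "$tmpdir/barcode.txt" <<'EOF'
#id bc1 bc2
sampleA AAGGTT CCTTAA
sampleB AAGGTT GGAACC
EOF

make_barcodes "$tmpdir/barcode.txt" > "$tmpdir/fixed_barcodes.txt"
cat "$tmpdir/fixed_barcodes.txt"
```

Each sample ends up with two tab-separated lines, one per barcode order.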

y9c commented 7 years ago

@brwnj
The second read is not trimmed.

y9c commented 6 years ago

@brwnj

Any progress on this?

brwnj commented 6 years ago

Progress? Prove to me that these reads are dual-indexed.

You can clearly see the reads coming off the sequencer have the same index per sequence:

@HWI-D00523:240:HF3WGBCXX:1:1116:1699:4861 1:N:0:CCTCCT
@HWI-D00523:240:HF3WGBCXX:2:2212:6141:20342 1:N:0:CCGTGA
@HWI-D00523:240:HF3WGBCXX:1:2101:18265:67898 1:N:0:CCTCCT

@HWI-D00523:240:HF3WGBCXX:1:1116:1699:4861 2:N:0:CCTCCT
@HWI-D00523:240:HF3WGBCXX:2:2212:6141:20342 2:N:0:CCGTGA
@HWI-D00523:240:HF3WGBCXX:1:2101:18265:67898 2:N:0:CCTCCT
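
A quick way to check this across a whole file is to tally the index tag at the end of each header line. The sketch below assumes standard four-line FASTQ records and takes the text after the last colon in the header as the index tag; the function name `count_index_tags` is made up, and the sequence and quality lines in the demo are placeholders, with only the headers taken from the reads above. Headers from dual-indexed runs commonly carry both indexes in that field joined by `+`, but that depends on how the FASTQ files were generated.

```shell
# Hypothetical helper: count reads per index tag found in FASTQ header lines.
# Reads the file(s) given as arguments, or stdin when called without arguments.
count_index_tags() {
  awk 'NR % 4 == 1 {               # header line of each 4-line record
         n = split($0, f, ":")     # index tag = text after the last colon
         count[f[n]]++
       }
       END { for (tag in count) print count[tag] "\t" tag }' "$@" |
    sort -k1,1nr -k2,2             # most frequent tag first
}

# Demo: headers from this thread, with placeholder sequence/quality lines.
printf '%s\nNNNN\n+\n####\n' \
  '@HWI-D00523:240:HF3WGBCXX:1:1116:1699:4861 1:N:0:CCTCCT' \
  '@HWI-D00523:240:HF3WGBCXX:2:2212:6141:20342 1:N:0:CCGTGA' \
  '@HWI-D00523:240:HF3WGBCXX:1:2101:18265:67898 1:N:0:CCTCCT' |
  count_index_tags
```

For the gzipped test files, something like `gzip -dc test.1.fq.gz | count_index_tags` should work the same way.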

y9c commented 6 years ago

@brwnj I mean the bug where the barcode in read 2 is not trimmed.

hepcat72 commented 5 years ago

So let me see if I'm inferring correctly from this issue thread... Dual barcodes in separate index files can be demuxed by concatenating the sequences in the 2 index files and then supplying the barcodes in the barcode file as "ID\tBC1-BC2\n"?