DReichLab / adna

Processing WGS aDNA data using the ReichLab protocol
12 stars 3 forks source link

aDNA-trim produces invalid merged FASTQ files in some cases #3

Open shyama-mama opened 6 months ago

shyama-mama commented 6 months ago

Hi Guys,

I am using aDNA trim as follows: seqtk mergepe R1.fastq R2.fastq | adna-trim -p aDNA_trim_pe - > aDNA_trim_merged.fastq The data is a NovaSeq sample pre-processed with FasP to trim polyG tails.

This is the original read R1

@A00488:28:HJ3THDSXX:2:1146:24189:25441 1:N:0:ACCAACT
TCCAGAGTTATTGCTGTGATACAGGCAGAGATGCTATAACTGAGTTTGTATTCTAGGGGGGGGGGGCCGATGTTAACGGGGAAAAGATAAAAATTTAACTTAATTGATACAGTGATATTAAATACGGACGAGCACACGACTAAC
+
FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFFFFFFFFFFFF:FFFFFFFFFFFF:,,,,:F,F,:,FF,:FF,,:,F,,FF:,,F,,FF,,:F,:,,,,F,,::,,,F:,,:FF,F,,F,F,F,F,F,::F,

R2

@A00488:28:HJ3THDSXX:2:1146:24189:25441 2:N:0:ACCAACT
CCTGTATCACTAAAGTTACATTATTATCTTTTCCCTGTTAACGTCGGGGGGGGCGGGGGGGGTGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGAGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGG
+
FFFFFFFFFFFFFFFF,,FFFFFFF,FFFFFFFFF,F,,,,:,,,,F,F,:,,,,,,::FFF,::,,,:FF::,:F:FFF:,:FF:FF,FFF:,F:F,F,,::,,,FFF,FF,::,:,:,F:,FFFF,FF,,:F:FFF:F:FF:

Using aDNA trim on the fastq directly does not merge the reads. So I ran FastP to trim the PolyG tail. This is the modified fastq R1

@A00488:28:HJ3THDSXX:2:1146:24189:25441 1:N:0:ACCAACT
TCCAGAGTTATTGCTGTGATACAGGCAGAGATGCTATAACTGAGTTTGTATTCTAGGGGGGGGGGGCCGATGTTAACGGGGAAAAGATAAAAATTTAACTTAATTGATACAGTGATATTAAATACGGACGAGCACACGACTAAC
+
FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFFFFFFFFFFFF:FFFFFFFFFFFF:,,,,:F,F,:,FF,:FF,,:,F,,FF:,,F,,FF,,:F,:,,,,F,,::,,,F:,,:FF,F,,F,F,F,F,F,::F,

R2

@A00488:28:HJ3THDSXX:2:1146:24189:25441 2:N:0:ACCAACT
CCTGTATCACTAAAGTTACATTATTATCTTTTCCCTGTTAAC
+
FFFFFFFFFFFFFFFF,,FFFFFFF,FFFFFFFFF,F,,,,:

Using aDNA-trim produces the following invalid read.

@A00488:28:HJ3THDSXX:2:1146:24189:25441_22:*
,F,::F,
+
FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFFFFFFFFFFFF:FFFFFFFFFFFF:,,,,:F,F,F!FFFFFFFFF,FF;FFF;,FFFFF;F-FFFFFFFF;FFFFFFFFFFF:FFFFFFFFFFFF:,,,,:F,F,:,FF,:FF,,:,F,,FF:,,F,,FF,,:F,:,,,,F,,::,,,F:,,:FF,F,,F,F,F,F,F,::F,

The read should be able to be merged to produce a valid read. See FastP read below:

@A00488:28:HJ3THDSXX:2:1146:24189:25441 1:N:0:ACCAACT merged_113_0
TCCAGAGTTATTGCTGTGATACAGGCAGAGATGCTATAACTGAGTTTGTATTCTAGGGGGGGGGGGCCGATGTTAACGGGGAAAAGATAATAATGTAACTTTATTGATACAGG
+
FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFFFFFFFFFFFF:FFFFFFFFFFFF:,,,,:F,F,:,FF,:FF,,:,F,FFF:F,F,,FFF,:F,:,,,,FF
MatthewMah commented 6 months ago

This repository is unsupported. The tool currently performing this merging process for the Reich lab is in https://github.com/DReichLab/ADNA-Tools

It's currently more cumbersome to perform the merge alone because demultiplexing is built into the merge step, but you can run it similar to this:

java -jar adnatools-1.11.6-SNAPSHOT.jar IndexAndBarcodeScreener --i5-indices indices --i7-indices indices --fixed-i5 AAAAAAA --fixed-i7 AAAAAAA -b barcodes -n 1 r1.gz r2.gz output_filename_stem

where indices is a file containing:

AAAAAAA AAAAAAA

and barcodes is an empty file.

Additionally, you may have to tweak the merge parameters because your sequence has 6 mismatches in the overlap region.

shyama-mama commented 6 months ago

Thanks @MatthewMah! Can you please advice me on how the mismatch-penalty-[max, high and low] are meant to be used. What are some parameters that give sensible outputs.

MatthewMah commented 5 months ago

The defaults are: maxPenalty = 3; mismatchPenaltyHigh = 3; mismatchPenaltyLow= 1; mismatchBaseQualityThreshold = 20; minOverlap = 15; minMergedLength = 30;