GregoryFaust / samblaster

samblaster: a tool to mark duplicates and extract discordant and split reads from sam files.
MIT License
225 stars 30 forks source link

Read pair detection failure from biscuit-aligned reads #30

Closed tantrev closed 4 years ago

tantrev commented 7 years ago

I was just trying to run Samblaster on some output from Biscuit, and was given the following error:

samblaster: Version 0.1.24
samblaster: Inputting from stdin
samblaster: Outputting to stdout
samblaster: Loaded 66 header sequence entries.
samblaster: Can't find first and/or second of pair in sam block of length 1 for id: SRR892990.1.1
samblaster:    At location: 5:46377045
samblaster:    Are you sure the input is sorted by read ids?

A small test dataset to replicate the issue may be found here.

GregoryFaust commented 7 years ago

Your read ids each have a ".1" or ".2" at the end, presumably to indicate the first and second read in a read-pair. I don't know if these are being added by Biscuit or not (I don't have samples of your input fastq files). samblaster expects the first and second read in each pair to have the identical readid. Quoting from the SAM format specification document:

"Reads/segments having identical QNAME are regarded to come from the same template.

tantrev commented 7 years ago

Thank you for pointing that out! Yes, I generated the .fastq using the fastq-dump program from .sra files with the "--readids" option, as suggested by the Edwards Lab here :

This is pretty much what every program that accepts paired-end reads expects for input, so if you use –split-spot or –split-files you should use –readids. You should.

I'll make sure to disable it from now on though.

GregoryFaust commented 7 years ago

Let me look into this further. In my experience, the fastq files that come directly from the standard Illumina sequencing post-processing software never have these type read-ids. Instead, the first and second read of each pair have identical read-ids but are in different files (as you created using the split-files option). Then BWA (for example) leaves the read ids (QNAMEs) identical from the first and second in the pair in their output with the SAM/BAM flags set to distinguish which read is which in the pair. All the downstream software that I am aware of can handle such SAM/BAM files just fine including GATK, Picard, sambamba, samtools, IGV, etc. etc. In particular, I will check to see if BWA will accept such readids during alignment from split files, and what it does with them in the SAM output.

ronin-gw commented 5 years ago

Hi. I got the same issue when I tried to use samblaster with bowite2 and I noticed that BWA truncates read' names to match their QNAMEs.

For example, hare are the heads of the fastq files I got from the ENCODE.

$ cat 1.fastq
@BI:SL-HEA:C3BFGACXX:2:2316:10503:45861/1 1:X:0:CATAGCGA
TAGGGTTAGGGTTAGGGTTAGGGTT
+
@@@FFADDFHHCFHJJJCGDHIJGG
$ cat 2.fastq
@BI:SL-HEA:C3BFGACXX:2:2316:10503:45861/2 1:X:0:CATAGCGA
TAACCCTAACCCTAACCCTAACCCT
+
CCCFFFFFHHHHHJJIJIIJJJJII

And the mapping results with BWA and bowtie2 are:

$ bwa mem path/to/index 1.fastq 2.fastq | tail -n 2
BI:SL-HEA:C3BFGACXX:2:2316:10503:45861  77  *   0   0   *   *   0   0   TAGGGTTAGGGTTAGGGTTAGGGTT   @@@FFADDFHHCFHJJJCGDHIJGG   AS:i:0  XS:i:0
BI:SL-HEA:C3BFGACXX:2:2316:10503:45861  141 *   0   0   *   *   0   0   TAACCCTAACCCTAACCCTAACCCT   CCCFFFFFHHHHHJJIJIIJJJJII   AS:i:0  XS:i:0

$ bowtie2 --no-head -x path/to/index -1 1.fastq -2 2.fastq
BI:SL-HEA:C3BFGACXX:2:2316:10503:45861/1    83  chr18   10112   1   25M =   10041   -96 AACCCTAACCCTAACCCTAACCCTA   GGJIHDGCJJJHFCHHFDDAFF@@@   AS:i:0  XS:i:0  XN:i:0  XM:i:0  XO:i:0  XG:i:0  NM:i:0  MD:Z:25 YS:i:0  YT:Z:CP
BI:SL-HEA:C3BFGACXX:2:2316:10503:45861/2    163 chr18   10041   1   25M =   10112   96  TAACCCTAACCCTAACCCTAACCCT   CCCFFFFFHHHHHJJIJIIJJJJII   AS:i:0  XS:i:0  XN:i:0  XM:i:0  XO:i:0  XG:i:0  NM:i:0  MD:Z:25 YS:i:0  YT:Z:CP

I couldn't find how BWA fixes QNAMEs but other aligners might not fix QNAMEs against the specification.

GregoryFaust commented 5 years ago

I don't understand why bowtie2 is appending the "/1" and "/2" to the readids. To my knowledge, this is not in acccordance with the SAM format spec as indicated above. samblaster does not do anything special with readids in the input SAM file except look for both reads in a pair (with default parameters) by checking for identical QNAMEs and the appropriate tag field bits set. In particular, the QNAME is allowed to have spaces in it, just not tab characters. samblaster and all other tools I know treat SAM file fields as tab delimited as per the spec. Therefore, it seems like more tools than samblaster will have problems with any appended extensions to the QNAME, whether it is ".1" and ".2" or "/1" and "/2". Do you not find this to be the case?

ronin-gw commented 5 years ago

I finally realized my problem was caused by bowtie2 and the unmatched read ids may be an unexpected result. I also agree with your opinion. Different QNAMEs for a pair must cause problems in the following analysis. The reason I had not noticed the problem may some tools used for read abundance based analyses (like ChIP-seq, etc.) look only the first read in each pair because they just use the positions of reads and both two lines for a pair contain identical information except their sequences. Anyway, thanks for your reply.