Closed tantrev closed 4 years ago
Your read ids each have a ".1" or ".2" at the end, presumably to indicate the first and second read in a read-pair. I don't know if these are being added by Biscuit or not (I don't have samples of your input fastq files). samblaster expects the first and second read in each pair to have the identical readid. Quoting from the SAM format specification document:
"Reads/segments having identical QNAME are regarded to come from the same template.
Thank you for pointing that out! Yes, I generated the .fastq using the fastq-dump program from .sra files with the "--readids" option, as suggested by the Edwards Lab here :
This is pretty much what every program that accepts paired-end reads expects for input, so if you use –split-spot or –split-files you should use –readids. You should.
I'll make sure to disable it from now on though.
Let me look into this further. In my experience, the fastq files that come directly from the standard Illumina sequencing post-processing software never have these type read-ids. Instead, the first and second read of each pair have identical read-ids but are in different files (as you created using the split-files option). Then BWA (for example) leaves the read ids (QNAMEs) identical from the first and second in the pair in their output with the SAM/BAM flags set to distinguish which read is which in the pair. All the downstream software that I am aware of can handle such SAM/BAM files just fine including GATK, Picard, sambamba, samtools, IGV, etc. etc. In particular, I will check to see if BWA will accept such readids during alignment from split files, and what it does with them in the SAM output.
Hi. I got the same issue when I tried to use samblaster with bowite2 and I noticed that BWA truncates read' names to match their QNAMEs.
For example, hare are the heads of the fastq files I got from the ENCODE.
$ cat 1.fastq
@BI:SL-HEA:C3BFGACXX:2:2316:10503:45861/1 1:X:0:CATAGCGA
TAGGGTTAGGGTTAGGGTTAGGGTT
+
@@@FFADDFHHCFHJJJCGDHIJGG
$ cat 2.fastq
@BI:SL-HEA:C3BFGACXX:2:2316:10503:45861/2 1:X:0:CATAGCGA
TAACCCTAACCCTAACCCTAACCCT
+
CCCFFFFFHHHHHJJIJIIJJJJII
And the mapping results with BWA and bowtie2 are:
$ bwa mem path/to/index 1.fastq 2.fastq | tail -n 2
BI:SL-HEA:C3BFGACXX:2:2316:10503:45861 77 * 0 0 * * 0 0 TAGGGTTAGGGTTAGGGTTAGGGTT @@@FFADDFHHCFHJJJCGDHIJGG AS:i:0 XS:i:0
BI:SL-HEA:C3BFGACXX:2:2316:10503:45861 141 * 0 0 * * 0 0 TAACCCTAACCCTAACCCTAACCCT CCCFFFFFHHHHHJJIJIIJJJJII AS:i:0 XS:i:0
$ bowtie2 --no-head -x path/to/index -1 1.fastq -2 2.fastq
BI:SL-HEA:C3BFGACXX:2:2316:10503:45861/1 83 chr18 10112 1 25M = 10041 -96 AACCCTAACCCTAACCCTAACCCTA GGJIHDGCJJJHFCHHFDDAFF@@@ AS:i:0 XS:i:0 XN:i:0 XM:i:0 XO:i:0 XG:i:0 NM:i:0 MD:Z:25 YS:i:0 YT:Z:CP
BI:SL-HEA:C3BFGACXX:2:2316:10503:45861/2 163 chr18 10041 1 25M = 10112 96 TAACCCTAACCCTAACCCTAACCCT CCCFFFFFHHHHHJJIJIIJJJJII AS:i:0 XS:i:0 XN:i:0 XM:i:0 XO:i:0 XG:i:0 NM:i:0 MD:Z:25 YS:i:0 YT:Z:CP
I couldn't find how BWA fixes QNAMEs but other aligners might not fix QNAMEs against the specification.
I don't understand why bowtie2 is appending the "/1" and "/2" to the readids. To my knowledge, this is not in acccordance with the SAM format spec as indicated above. samblaster does not do anything special with readids in the input SAM file except look for both reads in a pair (with default parameters) by checking for identical QNAMEs and the appropriate tag field bits set. In particular, the QNAME is allowed to have spaces in it, just not tab characters. samblaster and all other tools I know treat SAM file fields as tab delimited as per the spec. Therefore, it seems like more tools than samblaster will have problems with any appended extensions to the QNAME, whether it is ".1" and ".2" or "/1" and "/2". Do you not find this to be the case?
I finally realized my problem was caused by bowtie2 and the unmatched read ids may be an unexpected result. I also agree with your opinion. Different QNAMEs for a pair must cause problems in the following analysis. The reason I had not noticed the problem may some tools used for read abundance based analyses (like ChIP-seq, etc.) look only the first read in each pair because they just use the positions of reads and both two lines for a pair contain identical information except their sequences. Anyway, thanks for your reply.
I was just trying to run Samblaster on some output from Biscuit, and was given the following error:
A small test dataset to replicate the issue may be found here.