GregoryFaust / samblaster

samblaster: a tool to mark duplicates and extract discordant and split reads from sam files.
MIT License

Can't find first and/or second of pair error #45

Open dheerajbobbili1988 opened 4 years ago

dheerajbobbili1988 commented 4 years ago

Hi,

I am trying to realign a bam file to a new reference and in the process I would like to use samblaster. When I use the command below, I am running into this error.

samblaster: Loaded 84 header sequence entries.
samblaster: Can't find first and/or second of pair in sam block of length 1 for id: SRR622461.3665340
samblaster:    At location: *:0
samblaster:    Are you sure the input is sorted by read ids?
samblaster: Exiting early, the following stats are for processing preceeding the error
samblaster: Marked           0 of        297 (0.000%) total read ids as duplicates using 1620k memory in 0.001S CPU seconds and 13M33S(813S) wall time.
samblaster: Premature exit (return code 1).

Here is my command:

samtools collate -uOn128 NA12878.chrom20.ILLUMINA.bwa.CEU.low_coverage.20121211.bam NA12878.chrom20.ILLUMINA.bwa.CEU.low_coverage.20121211.collate | samtools fastq - | bwa mem -pt20 -R  '@RG\tID:NA12878.chrom20.ILLUMINA.bwa.CEU.low_coverage.20121211\tLB:NA12878.chrom20.ILLUMINA.bwa.CEU.low_coverage.20121211\tSM:NA12878.chrom20.ILLUMINA.bwa.CEU.low_coverage.20121211\tPL:ILLUMINA' -M human_g1k_v37_Ensembl_MT_66.fasta - | samtools sort --threads=20 -m4G -n -O sam | samblaster -M | samtools view -Sb - > NA12878.chrom20.ILLUMINA.bwa.CEU.low_coverage.20121211.marked.bam

My question is: in this scenario, can I safely add the "--ignoreUnmated" flag, or is samblaster not suited for this purpose? Please let me know.

GregoryFaust commented 4 years ago

I think it is fine to try samblaster here with the --ignoreUnmated option. I see two issues:

1) Since it appears you have pulled reads only from chrom20, there are bound to be reads in the input whose mate was aligned to a different chromosome in the original reference, so the mate is missing from your file. This is probably what is causing your unmated reads. I guess the fact that the first unmated read shown appears to be unaligned is due to the change in reference. Also, I hope the samtools fastq command will work on such input with unmated pairs. In the samblaster output stats, you should see the number of unmated pairs as a percentage of all pairs, and you can then judge whether the chrom20 selection is indeed the issue. I suggest redirecting stderr in the samblaster command to capture these stats.

2) You don't need to sort into name order before using samblaster to mark duplicates. The samtools collate command will already make the input "read-id grouped" which is all samblaster requires, and BWA will not change the order of the reads in the output.
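
Putting the two points together, here is a rough sketch of how the pipeline could look; it is not a tested recipe. The function names `count_unmated_blocks` and `realign_and_mark`, the positional arguments, and the log file name are hypothetical and chosen only for illustration. The awk helper is an independent sanity check rather than samblaster's own logic: it counts runs of consecutive records with the same read id that do not hold exactly one primary first-of-pair and one primary second-of-pair record.

```shell
# Count read-id blocks (runs of consecutive records sharing a QNAME) that do
# not contain exactly one primary first-of-pair and one primary second-of-pair
# record. Reads SAM text on stdin and prints "<unmated blocks> <total blocks>".
count_unmated_blocks() {
  awk -F'\t' '
    function close_block() {
      if (id != "") { blocks++; if (first != 1 || second != 1) unmated++ }
      first = 0; second = 0
    }
    /^@/ { next }                      # skip SAM header lines
    {
      if ($1 != id) { close_block(); id = $1 }
      flag = $2
      # Skip secondary (0x100) and supplementary (0x800) alignments.
      if (int(flag / 256) % 2 == 1 || int(flag / 2048) % 2 == 1) next
      if (int(flag / 64) % 2 == 1) first++     # 0x40: first in pair
      if (int(flag / 128) % 2 == 1) second++   # 0x80: second in pair
    }
    END { close_block(); printf "%d %d\n", unmated, blocks }
  '
}

# Hypothetical wrapper around the command from the question, with the two
# changes suggested above: the name sort is dropped and --ignoreUnmated is
# added. samblaster stats on stderr go to a log file next to the output.
#   $1 = input BAM   $2 = reference FASTA   $3 = @RG string   $4 = output BAM
realign_and_mark() {
  samtools collate -uO -n 128 "$1" "${1%.bam}.collate" \
    | samtools fastq - \
    | bwa mem -p -t 20 -R "$3" -M "$2" - \
    | samblaster -M --ignoreUnmated 2> "${4%.bam}.samblaster.log" \
    | samtools view -Sb - > "$4"
}
```

To get a feel for how many reads lost their mate in the chrom20 extraction before realigning, something like `samtools collate -uO in.bam | samtools view - | count_unmated_blocks` should work, where `in.bam` stands for your input file. The helper assumes its input is already grouped by read id, which is what samtools collate provides.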