Closed jayoung closed 4 years ago
Hi Janet,
Ouch! That's an ugly one.
Compared to Linux your Mac is such a whiner ;-)
The workhorse behind readGalignmentPairs()
is the scanBam()
function from the Rsamtools package. And the workhorse behind scanBam()
is a complex piece of C and C++ code in Rsamtools that is built on top of the HTSlib C library. The fact that you don't get the crash when you do readGAlignments()
suggests that the problem is in the C++ code that implements the pairing algo (samread_mate()
and bam_fetch_mate()
functions in Rsamtools/src/bam_mate_iter.cpp
).
This is how readGalignmentPairs()
uses scanBam()
to load the paired data:
library(Rsamtools)
path <- "mapToCombinedFlyYeastHumanGenomes_ATfiltReads/just_dm6_reads/zz_testR/Nnk_chip_rep3.ATfilt.combined.dm6.chr4.bam"
flags <- scanBamFlag(isPaired=TRUE, isUnmappedQuery=FALSE, hasUnmappedMate=FALSE)
what <- c("flag", "rname", "strand", "pos", "cigar", "groupid", "mate_status")
param <- ScanBamParam(flag=flags, what=what)
res1 <- scanBam(BamFile(path), param=param) # should work
res2 <- scanBam(BamFile(path, asMates=TRUE), param=param) # should crash
Can you reproduce the crash with the above code? This would confirm that the crash happens in Rsamtools/src/bam_mate_iter.cpp
.
Thanks!
hi Herve,
thanks for looking into this. yes, it's ugly!
On linux: yes, your prediction is right:
res1 <- scanBam(BamFile(path), param=param) # works
# (although I didn't get all 7 expected fields from 'what' in the output - just got 5: not sure if that's relevant):
names(res1[[1]])
# [1] "flag" "rname" "strand" "pos" "cigar"
res2 <- scanBam(BamFile(path, asMates=TRUE), param=param) # crashes ("Floating point exception (core dumped)" )
Interestingly, on the Mac, the second command sometimes works fine but other times crashes R:
res2 <- scanBam(BamFile(path, asMates=TRUE), param=param)
when it does work it gives me all 7 fields from the 'what' list. But repeating the same command maybe 10 seconds later makes R crash. Weird.
I'll send you a direct message with a path to the bam file (and index) that you should be able to access if you want to see if it reproduces for you.
Janet
Thanks Janet. I'm not familiar with the code in Rsamtools/src/bam_mate_iter.cpp
so I'm hoping that @mtmorgan can help with this. Yes Martin will probably need access to the file.
hi Martin,
I just emailed Herve (at your fhcrc address - do you still use that?) with a way to access the bam file. Herve - would you mind forwarding that to Martin? thanks. The only email address I have for you Martin is the old Hutch address - it's been a while now!
Janet
Yes I got your email and the link works. Thanks for sending that. The file is relatively small (21M) which is good. I'll forward to Martin.
Yep, and I also get the Floating point exception (core dumped) crash on my laptop (Ubuntu 18.04) with BioC release and devel.
Thinking about it, I realize that I derived this bam file in an unconventional way (long story). I wonder if I got something corrupted? this approach hasn't caused me any issues on ~50 other bam files (although the others were smaller - maybe now I have more reads, I've hit some unusual problem).
I'll see if I can figure out where the problem comes by subsetting the bam file.
I narrowed the problem down to a single pair of reads (see problemPair.bam in the same location I put the bigger file). Currently thinking about what's wrong with that pair.
OK, picard's ValidateSamFile is telling me problemPair.bam is messed up (in a couple of ways that make sense, given how I got that bam file):
ERROR: Read groups is empty
ERROR: Record 2, Read name A00887:10:HCKWTDRXX:2:2138:25970:20509, Mate Alignment start should be 0 because reference name = *.
ERROR: Record 2, Read name A00887:10:HCKWTDRXX:2:2138:25970:20509, Mapped mate should have mate reference name
ERROR: Record 2, Read name A00887:10:HCKWTDRXX:2:2138:25970:20509, The unaligned mate start position is 768672, should be 0
ERROR: Read name A00887:10:HCKWTDRXX:2:2138:25970:20509, Mate not found for paired read
I'll rethink how my upstream pipeline works so I can fix this. I wonder if there's a way to have R complain about the bam file, rather than totally crash the whole session? This is probably a rare case, so please don't spend a lot of time on it.
For context, if you're interested, here's why I have a weird pipeline. I'm working with ChIP-seq data from Drosophila cells. For some other datasets, spike-in DNA from human and/or yeast was added. So I initial mapped reads to a weird combined reference genome (fly+yeast+human), with each reference chromosome name tagged so I know which genome it comes from - e.g. hg19_chr1 is my name for human chr1, dm6_chr4 for fly chr4 etc).
I use those initial bam files to count how many reads map to the spike-in genomes, but after that I want "normal" bam files that contain only reads on the fly chromosomes, and I want those chromosomes renamed to be chr4 instead of dm6_chr4. I did that on the command line (combination of samtools view and sed to change chromosome names, and manually getting the header) - I think that's where the problem crept in.
Perhaps I'm better off mapping to each of the three genomes separately - it just seemed like a good idea so that I should get less cross-mapping to the wrong genome (each read gets a chance to map to the true correct location when I map to combined genomes, but not if I map separately).
Great! This is VERY useful, thanks! What causes the crash is an invalid RNEXT value in the file. I forged a very simple SAM file that reproduces the problem.
I think that what could happen with your combined reference genome strategy is that some records loose their mate during the split. Did you get the following message when you generated your "normal" BAM files?
[W::sam_parse1] urecognized mate reference name; treated as unmapped
More precisely: For some rare reads it could be that the 2 ends of the read each get mapped to a different genome. This would give you SAM records where e.g. RNAME and RNEXT are dm6_chr4
and hg19_chr22
. Then when you split the file into genome-specific SAM files (based on the value of RNAME I guess) and rename the chromosomes, you end up with a record where RNAME and RNEXT are now chr4
and hg19_chr22
in the SAM file for Fly (or, depending on how you do the renaming, RNEXT is now chr22
, but it doesn't really matter). The problem is that the RNEXT value is no longer a valid chromosome name (i.e. it's not listed in the SAM header) so when you convert the file to BAM it gets replaced with a *
and the above message is displayed.
Note that conversion from SAM to BAM is handled by the Rhtslib code so the resulting BAM should be considered valid. It's just that a few records lost their mate.
2 ways to avoid this:
Yes, you're exactly right. The RNEXT fields of chimeric pairs (is that the right word?) get changed in the sam->bam conversion in a way that makes the sam flags not make sense, and in a way that the mate can now have the wrong information. I do see that sam_parse1 error - it was foolish of me to go forwards ignoring that error, but I've grown a bit numb to the fussiness of some of the command-line tools (especially picard/GATK): they often don't seem to have any downstream consequences.
It turns out if I add isProperPair=TRUE
to the param I can avoid the crash.
flagsGOOD <- scanBamFlag(isPaired=TRUE, isProperPair=TRUE, isUnmappedQuery=FALSE, hasUnmappedMate=FALSE)
paramGOOD <- ScanBamParam(flag=flagsGOOD, what=what)
res1good <- scanBam(BamFile(path), param=paramGOOD) # works (albeit result is empty with this bam)
res2good <- scanBam(BamFile(path, asMates=TRUE), param=paramGOOD) ## works
flagsBAD <- scanBamFlag(isPaired=TRUE, isUnmappedQuery=FALSE, hasUnmappedMate=FALSE)
paramBAD <- ScanBamParam(flag=flagsBAD, what=what)
res1bad <- scanBam(BamFile(path), param=paramBAD) # works
res2bad <- scanBam(BamFile(path, asMates=TRUE), param=paramBAD) ## crashes
and for this purpose I didn't want to read in the read pairs that were not properly paired, so I'm good to go. Interesting that it totally crashed R, though! So unusual to see that.
Thought I'd got a solution but I'm not out of the woods yet. Even though adding isProperPair=TRUE
worked on problemPair.bam and Nnk_chip_rep3.ATfilt.combined.dm6.chr4.bam, it did not work with my full bam file containing reads on all chrs, or on another bam file where I'd used command-line samtools to select properly paired reads. I must have an additional problem in the full file.
Right, chimeric pairs.
I've never really trusted isProperPair
. It's not even a well defined concept. There is no clear definition of it in the Samtools specs so it seems that it's completely up to the aligners how they want to handle it.
Another solution would be that you remove the chimeric reads. You already have your own sed
-based script to rename chromosomes. Maybe you could add a preliminary step where you filter out reads where RNAME and RNEXT don't match using grep
or something like that. Most aligners should set RNEXT to =
when the 2 mates are on the same chromosome so you could filter just based on that (i.e. no need to actually compare RNAME and RNEXT). Sounds pretty straightforward.
Also it sounds like Rsamtools::filterBam
should be the tool to do this in Bioconductor but I'm not familiar with it so I don't know how easy or hard it would be.
I've now used command-line samtools to filter my bam files to retain only proper pairs, and to get rid of supplementary/secondary alignments. Looks like after that filtering, I can read in the full-genome bam file without R exploding. Now testing it on the rest of my files.
It sounds like I could mimic that behavior withRsamtools::filterBam
but the command-line filtering seems most straightforward for now.
Sounds good Janet. Note that this works because in your files the records where RNEXT is set to *
(after split, renaming, and sam-to-bam conversion) also happen to be "improperly paired" and/or part of a secondary alignment. Not sure that would be true in general. Anyway, if your code is for a one-time use and not intended to be deployed as part of a production pipeline, no need to worry. Otherwise, you might want to also consider filtering out based on what's in RNEXT. This would completely eliminate the risk of seeing this Rsamtools' bug crash your pipeline ;-)
Right, it's not a production pipeline. It did work on the rest of my files, so I'm set.
I think though, that requiring 'proper pairing' and no secondary mappings should already ensure that RNEXT is always the same as RNAME? That's my understanding of proper pairing, but I'm not sure all mappers follow that.
Thanks again for your help,
Janet
Exactly. The SAM specs don't say what a "proper pairing" is so you can only assume. There is no guarantee and mappers could set the "proper pairing" flag based on whatever criteria they choose.
Is this closed now since the https://github.com/Bioconductor/Rsamtools/issues/16 related issue was taken care of?
I'm closing this now.
@jayoung: Feel free to reopen if you're still seeing this problem after upgrading to Rsamtools 2.2.2 (release) / 2.3.4 (devel).
hi there,
I am getting a core dump trying to use readGAlignmentPairs to read in a bam file (actually a set of bam files). I can use readGAlignments on the same bam file without any trouble.
I get a similar issue whether I'm running R on a node of our linux cluster or on my mac. On the linux cluster it seems to happen every time I try to read the file, and on the mac it's a bit more hit and miss - sometimes it works fine, but repeating the same command a few seconds later causes R to crash. Weird.
Some code is below. I think I'm using the latest release versions of everything. I'm happy to upload a bam file somewhere if that's useful.
thanks!
Janet
on linux:
and on the mac
here's the crazy long R-crash message my mac shows (it's a lot more verbose than the linux crash was....)