Paired-end not correctly identified

Caninus23 commented 7 years ago

Hi,

I've just started to use copywritR and I encountered a really wierd note:

INFO [2016-10-25 11:42:00] Paired-end sequencing for sample NG-9739_1_lib127215_4623.sorted: TRUE INFO [2016-10-25 11:42:00] Paired-end sequencing for sample NG-9739_3_lib127217_4623.sorted: FALSE INFO [2016-10-25 11:42:00] Paired-end sequencing for sample NG-9739_4_lib127218_4623.sorted: TRUE

So... it says my second sample is not paired end, but of course it is. I've taken a look at the flags for all three samples, and actually have no idea why it identifies the second sample as not-paired. Here are the flags of the second sample:

44474716 + 0 in total (QC-passed reads + QC-failed reads) 0 + 0 secondary 0 + 0 supplementary 0 + 0 duplicates 44474128 + 0 mapped (100.00% : N/A) 42102178 + 0 paired in sequencing 21051089 + 0 read1 21051089 + 0 read2 41254896 + 0 properly paired (97.99% : N/A) 42101634 + 0 with itself and mate mapped 0 + 0 singletons (0.00% : N/A) 0 + 0 with mate mapped to a different chr 0 + 0 with mate mapped to a different chr (mapQ>=5)

Maybe you have an idea why it misidentifies? (I think for the actual run this should not be a problem, or am I wrong?)

thomasKuilman commented 7 years ago

Hi,

That is weird indeed. The function that checks whether a bam-file is paired-end or not is below. It takes the first read and based on the FLAG determines the type of sequencing. However, I presume you used samtools flagstat, which also uses FLAG info to retrieve its stats. Could you try to run the NumberPairedEndReads function on your single bam-file?

library("GenomicAlignments")
NumberPairedEndReads <- function(sample.paths) {
    bam <- open(BamFile(sample.paths, yieldSize = 1))
    close(bam)
    what <- c("flag")
    param <- ScanBamParam(what = what)
    bam <- readGAlignments(bam, param = param)
    intToBits(mcols(bam)$flag)[1] == 01
}
NumberPairedEndReads("/PATH/TO/YOUR/BAM")

Caninus23 commented 7 years ago

I used flagstat, yes. So.. I used NumberPairedEndReads for sample 1 and 2 and get the following:

GAlignments object with 1 alignment and 1 metadata column: seqnames strand cigar qwidth start end width

[1] 1 + 126M 126 12814 12939 126 njunc | flag | [1] 0 | 163 --- seqinfo: 25 sequences from an unspecified genome [1] TRUE GAlignments object with 1 alignment and 1 metadata column: seqnames strand cigar qwidth start end width [1] 1 + 95M1I29M1S 126 10015 10138 124 njunc | flag | [1] 0 | 0 --- seqinfo: 25 sequences from an unspecified genome [1] FALSE So.. the function really sees a 0 there, but why? If this is due to a malformed bam flagstat would this this also, or?

thomasKuilman commented 7 years ago

Can you manually check the FLAG from your second bam-file using

$ samtools view your.bam | less

? Is the FLAG indeed 0? In that case, something would go wrong with GenomicAlignments. Alternatively, samtools flagstat does not use the FLAG field for deriving its stats (which would be highly confusing). If you could share a minimal bam with let's say 2 reads that you can share and that recapitulates the issue, I could try and see where the problem lies. You can find my (Thomas Kuilman) email address at the end of the code tab.

BTW, I did not respond to your initial question, which is whether misclassification as single-end sequencing sample would affect its analysis. Whether a sample is paired-end determines how the high-quality reads are being filtered; for paired-end data there is an additional requirement to be in 'proper pairs' (FLAG 2). I expect this to have a negligible impact on the outcome.

Caninus23 commented 7 years ago

Hmm... if done what you asked and get for all three samples all kinds of flags in the first few reads, and in all cases also zeros. In the second sample the first read is zero, in the first sample, it is the third and fourth. I have no idea why this is the case.. maybe the tool with which I generated the bam files has some errors (but then it is strange that flagstat gets it right). I will send you the first few reads of the first and second sample.

thomasKuilman commented 7 years ago

Hi,

I received a small sample of your bams, and it is indeed the bam-file itself that seems to be malformed. That is, even though there is a mate for every read, the flag 1 is not set. It seems that this is the case for pairs where one read is mapped, and the other one not. I would suggest to use Picard FixMateInformation (samtools has a function fixmate too, but that I have never used). Please let me know whether that fixes the issue.

Caninus23 commented 7 years ago

Hi back,

I've used Picard (and found some bad Cigars) but the flag field does not change and I assume this function only fills in missing information but does not correct wrong fields (this is what fixmate seems to do). So.. no chance in correcting this with FixMateInformation. Besides what you mentioned above, is this information important later on in your tool? Or could I just remove the first wrong pair and all is well (or hardcode a yes into your tool)? Either way, this is no error of your package (instead of the sometimes stupid CLC Workbench) so I think this can be closed here.

Many thanks!

thomasKuilman commented 7 years ago

No, the info is only used for filtering high-quality reads, which (as I said) includes filtering for the flag 'proper pair'. Since it seems (but you could check that) that the reads with flag 0 have an unmapped mate, these would anyway be excluded for further analysis. In that case you can safely filter reads with flag 0 (and their mates).

PeeperLab / CopywriteR

Paired-end not correctly identified #10