genome / bam-readcount

Count bases in BAM/CRAM files
MIT License
305 stars 95 forks source link

Non -ACGTN counts #36

Open iranmdl opened 7 years ago

iranmdl commented 7 years ago

Hi there, I'm trying to extract relevant coverage information using bam-readcount tool. And I have realised that sometimes there are this kind of rows:

1 267356 A 1 =:0:0.00:0.00:0.00:0:0:0.00:0.00:0.00:0:0.00:0.00:0.00 A:0:0.00:0.00:0.00:0:0:0.00:0.00:0.00:0:0.00:0.00:0.00 C:0:0.00:0.00:0.00:0:0:0.00:0.00:0.00:0:0.00:0.00:0.00 G:0:0.00:0.00:0.00:0:0:0.00:0.00:0.00:0:0.00:0.00:0.00 T:0:0.00:0.00:0.00:0:0:0.00:0.00:0.00:0:0.00:0.00:0.00 N:0:0.00:0.00:0.00:0:0:0.00:0.00:0.00:0:0.00:0.00:0.00 -ATA:1:27.00:0.00:0.00:0:1:0.22:0.03:0.00:0:0.00:101.00:0.11

What does it mean? In 267356 position I have one read with TA insertion? In this case, why is not simply the A counter increased by 1 and that's all? I hope the question is more or less clear. Thank you in advance,

ernfrid commented 7 years ago

This means that at position 267356 you have one read with a deletion of ATA (hence the minus sign prefix). Deletions are reported (only) at the location of the first deleted base.

Insertions are a little more complicated.

  1. Insertions are reported at the base before the inserted sequence and reported like +AT for an insertion of an AT after the reported location.
  2. By default, insertions do not count towards the overall depth at that location. Reads containing an insertion are reported for both the base existing at the reported location AND the values reported with the insertion allele.
  3. If you run bam-readcount in "insertion-centric" mode then reads supporting the insertion are not counted in the values for the base at the reference position and ONLY counted for the values associated with the insertion.

Hope that makes sense.

sheenamt commented 7 years ago

So, if i was looking for a single base pair insertion at a site (say chr17:41209079), would I look for the insertion at chr17:41209079 or chr17:41209078? If my 'region' file had only chr17:41209079 listed, would I be able to recover the insertion or would I need to adjust my 'region' file to include 17:41209078?

iranmdl commented 7 years ago

Thank you for the clarification @ernfrid . So, if I have understood correctly, in position 267356 in the reference there is an A. And there is one read mapping at that specific position saying that the A has been deleted? Is this correct?

ernfrid commented 7 years ago

@sheenams - I realize I'm being a bit pedantic, but the devil is in the details here. The answer would depend on what you mean when you say a

single base pair insertion at a site (say chr17:41209079)

Insertions are events that happen between coordinates and you need to choose a convention to report them relative to reference sequence coordinates.

I'll try an example below (let's continue to pretend it's on chr17):

      1 2 3   4 5 6 
Ref   A C G - T G A
      | | |   | | |
Var   A C G A T G A

Here there is an insertion of an A in the variant sequence between bases 3 and 4 of the reference sequence (1-based coordinates). To return counts from bam-readcount for this variant you would need to include the following region in your region file: chr17 3 3 or specify chr17:3-3 on the command line. Larger regions encompassing that base should also return the insertion.

ernfrid commented 7 years ago

@iranmdl - You understand correctly.

iranmdl commented 6 years ago

Hello again @ernfrid , I have another question. Does bam-readcount output "complex" variants? Example:

chr pos ref alt
10  94005   GC  AA

Should i look the position 94005 and 94006, for each change, or can I have in the position 94005 both changes included? Hope it is clear... Or here we have an even more complicated case, SNP+SNP+Deletion. Let's say that there is a true variation... how would bam-readcount show this?

1 236275238 AACATTGAAAA GT

Thank you again, :)

ernfrid commented 6 years ago

bam-readcount is pretty simplistic in its operation. It's only aware of the read alignments and has no concept of linkage between positions other than gaps in the alignments.

This means that you'll have to look at both positions for your first example. For the second example, it would depend on how the alignments were reported in the BAM file.

Ideally, there would be a haplotype-aware counter that would realign the reads to candidate haplotypes and report counts. I've wondered if this could be done by leveraging vg, but I've not investigated.