genome / bam-readcount

Count bases in BAM/CRAM files
MIT License
298 stars 95 forks source link

Reason for low coverage output for deepseq data #85

Open YashaNazirB opened 2 years ago

YashaNazirB commented 2 years ago

Hi

I am running the following code for my deepseq data analysis to extract BAM readcounts. Although my expected coverage is somewhat between 100,000 to 350,000 however, after I run the following code, my output shows very low coverage for deepseq genomic positions which is around 9000 max. Why am I getting this low coverage? I even defined the -d to be 1000000000.

bam-readcount -w0 -f mm9.fa -d1000000000 aligned_sorted_UNGKO_stim.bam |awk -F ":|\t|=" 'BEGIN {OFS = "\t"}; {print $1, $2, $3 , $4, $21 , $35, $49 , $63}' > BRC_UNGKO_stim.txt

chrisamiller commented 2 years ago

According to this issue, specifying a region (and reference) is a workaround for this problem: https://github.com/genome/bam-readcount/issues/3 (Counting every base in the entire reference is going to be somewhat slow anyway, so splitting may sometimes be a good idea)

What happens if you run bam-readcount -w0 -f mm9.fa -d1000000000 aligned_sorted_UNGKO_stim.bam chr1:1-999999999 (or actually substitute the length of the chr)

YashaNazirB commented 2 years ago

Hey Chris,

I tried selecting a defined region but it did not generate any output. Output was 0 bytes.

How should I proceed?

-Yasha


From: Chris Miller @.> Sent: Saturday, September 25, 2021 12:44 AM To: genome/bam-readcount @.> Cc: Yasha Nazir Butt @.>; Author @.> Subject: Re: [genome/bam-readcount] Reason for low coverage output for deepseq data (#85)

[EXTERNAL]

According to this issue, specifying a region (and reference) is a workaround for this problem: #3https://github.com/genome/bam-readcount/issues/3 (Counting every base in the entire reference is going to be somewhat slow anyway, so splitting may sometimes be a good idea)

What happens if you run bam-readcount -w0 -f mm9.fa -d1000000000 aligned_sorted_UNGKO_stim.bam chr1:1-999999999 (or actually substitute the length of the chr)

— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHubhttps://github.com/genome/bam-readcount/issues/85#issuecomment-927009919, or unsubscribehttps://github.com/notifications/unsubscribe-auth/AM3AU2XQP2OXQZ5WH5KROI3UDVHUPANCNFSM5EW7UH7A. Triage notifications on the go with GitHub Mobile for iOShttps://apps.apple.com/app/apple-store/id1477376905?ct=notification-email&mt=8&pt=524675 or Androidhttps://play.google.com/store/apps/details?id=com.github.android&referrer=utm_campaign%3Dnotification-email%26utm_medium%3Demail%26utm_source%3Dgithub.

chrisamiller commented 2 years ago

Do you have a small example bam file that you can post, along with a command that reproduces the problem?

YashaNazirB commented 2 years ago

Commands that I tried are as follows:

bam-readcount -w0 -f mm9.fa -d1000000000 aligned_sorted_UNGKO_stim.bam chr12:114663740-114595637 |awk -F ":|\t|=" 'BEGIN {OFS = "\t"}; {print $1, $2, $3 , $4, $21 , $35, $49 , $63}' > BRC_UNGKO_stimtest.txt

bam-readcount -w0 -f mm9.fa -d1000000000 aligned_sorted_UNGKO_stim.bam chr12:114663740-114595637 > BRC_UNGKO_stimtest1.txt

Following is the link to the bam file: https://drive.google.com/file/d/1X4qbqChdAIXDXtcKnbUOHJ44sGmBdePP/view?usp=sharing

I am using mouse mm9 genome as the reference and the coordinates of interest for deepseq analysis are chr12:114663740-114595637.

Best regards, Yasha

chrisamiller commented 2 years ago

Okay, I can reproduce this error. Thanks for the bug report - we'll look into it.

$ samtools index aligned_sorted_UNGKO_stim.bam

$ docker run -v /tmp:/tmp -v /Users/cmiller/annotations:/annotations --entrypoint /bin/bash -it mgibio/bam-readcount

$ bam-readcount -w 0 -f /annotations/tmp.fa aligned_sorted_UNGKO_stim.bam chr1:3603112-3603112
Minimum mapping quality is set to 0
chr1    3603112 c   10  =:0:0.00:0.00:0.00:0:0:0.00:0.00:0.00:0:0.00:0.00:0.00  A:0:0.00:0.00:0.00:0:0:0.00:0.00:0.00:0:0.00:0.00:0.00  C:9:0.00:35.33:0.00:9:0:0.00:0.01:40.67:9:0.99:150.00:0.99  G:1:0.00:32.00:0.00:1:0:0.00:0.02:106.00:1:0.99:150.00:0.99 T:0:0.00:0.00:0.00:0:0:0.00:0.00:0.00:0:0.00:0.00:0.00  N:0:0.00:0.00:0.00:0:0:0.00:0.00:0.00:0:0.00:0.00:0.00

# bam-readcount -w 0 -f /annotations/tmp.fa aligned_sorted_UNGKO_stim.bam chr12:114663740-114595637  >out.rct
Minimum mapping quality is set to 0
[E::sam_itr_next] Null iterator

@apldx - can you take a look and see what's up? FWIW, the tmp.fa in there is just mm9 with "chr" prefixes added. I dropped bam and fasta to use on the local filesystem here for debugging purposes: /storage1/fs1/timley/Active/aml_ppg/analysis/tmp/bam_readcount/issue_85/