cbuhay / ExCID

The Exome Coverage and Identification Report displays the coverage of every target region in your capture design. It also displays regions below an adjustable coverage threshold.
14 stars 5 forks source link

Coverages calcuated by WGS_Stats_v1.java are all off-by-one position #3

Open bpow opened 9 years ago

bpow commented 9 years ago

Using NA12878.ILLUMINA.SRP012400.Xprize_HiseqSRR636604.bam (ftp://ftp-trace.ncbi.nih.gov/giab/ftp/technical/NA12878_data_other_projects/alignment/NA12878.ILLUMINA.SRP012400.Xprize_HiseqSRR636604.bam) as an example, WGS_Stats_v1.java produces NA12878.ILLUMINA.SRP012400.Xprize_HiseqSRR636604.wholeGenomeCov.fasta with the coverages all displayed in the wrong position.

The first clue is the FASTA sequence lines:

>1 1 249250622
>2 1 243199374
>3 1 198022431
>4 1 191154277
>5 1 180915261

(etc...)

These imply one-based coordinates (since the first coordinate is 1, the convention would usually be to have closed intervals on both sides), but the end coordinates are all one more then the number of bases in the reference genome.

The first non-zero coverage site in this file should be 10245 (I verified this by looking in IGV). but the output of WGS_Stats_v1.java has 102 rows of zeros, followed by 45 zeros, then a 1, so that would place the first nonzero coverage at 10246-- one too high.

So, it seems to me that an extra 0 is added at the beginning of every genomic coverage file.

Rashesh7 commented 9 years ago

Hi,

The Size of each Chromosome reported in the Header is off by 1 and can certainly should be changed. Also you are correct about the extra 0 in the beginning of Genomic coverage of every Chromosome, the ExCID code that parses the Whole Genome Coverage File is aware of it and Always skips the first 0. The Outputs will report 10245 to have 1X coverage. I agree that the File format should be changed, but will not affect the final output.

Your feedback is much appreciated.