MelbourneGenomics / cpipe

The open source version of the Melbourne Genomics Health Alliance Exome Sequencing Pipeline
Other
33 stars 13 forks source link

gap annotator script #139

Closed Ox5f3759df closed 8 years ago

Ox5f3759df commented 8 years ago

Forgive the lengthiness. I'm new.

I’ve been working with the gap annotator to include it into cpipe and have added it to the pipeline on meerkat: /misc/vcgs/exome/dev. I have noticed some interesting behaviour with the output.

The command that cpipe used to generate the results was: python /misc/vcgs/exome/dev/tools/gap_annotator/1.0/gap_annotator.py --min_coverage_ok 15 --coverage qc/NA12878_CARDIACM_MUTATED.merge.dedup.realign.recal.CARDIOM.bed.cov.gz --db /misc/vcgs/exome/prod/tools/annovar/humandb/hg19_refGene.txt --beds ../design/CARDIOM.bed >> qc/NA12878_CARDIACM_MUTATED.merge.dedup.realign.recal.CARDIOM.bed.cov.gap_annotator.csv

The files: gap-annotator-analysis.zip

If you load the above 3 into IGV you will be able to see some of the interesting output that I will describe below.

(min coverage used 15) Example 1. Chromosome Start End chr10 88441499 88441508

If you look at: gap-annotator-analysis/snapshots/example-1-0-start-chr10-88441499-88441508.png

You can see that at that base the total count is 16 which is above the min coverage of 15.

The next snapshots for this example show that the end was below the min coverage. However the next base was also below the min coverage and should have been included. gap-annotator-analysissnapshots/example-1-1-end-chr10-88441499-88441508.png gap-annotator-analysis/snapshots/example-1-2-end-chr10-88441499-88441508.png

Example 2. Chromosome Start End chr1 156096673 156096705

In this case you can see from: gap-annotator-analysis/snapshots/example-2-0-start-chr1-156096673-156096705.png That the start 156096673 has a total read count of: 17 and the next snapshot: gap-annotator-analysis/snapshots/example-2-1-start-chr1-156096673-156096705 shows that the min coverage of 15 starts at the next index.

The end snapshot: gap-annotator-analysis/snapshots/example-2-2-end-chr1-156096673-156096705 Shows that while the end index for the gap annotator output is 156096705 which indeed is below the min coverage, the next index 156096706 is also below the min coverage. It also is captured within the CARDIOM.bed file which the annotator used.

There are many examples of the above and can be seen by viewing the above 3 files in IGV. It seems to me like there is an issue here as it's not capturing the correct index ranges.

supernifty commented 8 years ago

Thanks for the in-depth analysis!

To retain compatibility with the previous gap annotation tool, the co-ordinates in the CSV file are inclusive. i.e. If the CSV says 1,2 then the equivalent BED would be 1,3, so to convert to a bed file, you need to add 1 to the end co-ordinate. Does this explain the difference?

Potentially we could move to BED compatible co-ordinates but need to confirm with existing users that this won't cause issues.

I hope this explains the unexpected behaviour - let me know if not.

ssadedin commented 8 years ago

That might explain the issues with the end of the ranges. However we also have some issues with the start of the range as well - here's an example of output from the gap annotator - note the range selected:

image

Here we asked it for all regions below 15x. However here's the BAM file in IGV:

image

So at the start position of that range (156096673) the gap annotator says it is below 15x but IGV says we have 17 reads.

supernifty commented 8 years ago

I believe your issue with the start of the ranges is because BED format uses zero based co-ordinates whereas IGV uses 1-based co-ordinates. So 156,096,673 from the generated CSV corresponds to 156,096,674 in IGV. I'm basing this on https://www.broadinstitute.org/igv/BED. Let me know if you don't think this is correct.

ssadedin commented 8 years ago

Thanks Peter, we think that makes sense - sorry for the confusion!