HuiyangYu / PanDepth

An ultra-fast and efficient genomic tool for coverage calculation
https://github.com/HuiyangYu/PanDepth
MIT License
133 stars 4 forks source link

Some results have Coverage>100% #6

Closed HongyuanChu closed 5 months ago

HongyuanChu commented 5 months ago

Command line: pandepth -i /data/chuhy/project/Hg/HG002_GRCh38_ONT-UL_GIAB_20200122.phased.bam -o /data/chuhy/project/Hg/pandepth/Hg002 -t 5 20fbc592b4d0c6fbd6410dcf6d60689 Some results have Coverage>100%.

HuiyangYu commented 5 months ago

Which version are you using?

HongyuanChu commented 5 months ago

I am using PanDepth-2.24-Linux-x86_64; If adding a bed file pandepth -i /data/chuhy/project/Hg/HG002_GRCh38_ONT-UL_GIAB_20200122.phased.bam -o /data/chuhy/project/Hg/pandepth/Hg002 -b HG002_GRCh38_1_22_v4.2.1_benchmark_noinconsistent.bed It would not >100% anymore

HuiyangYu commented 5 months ago

Can you provide a download link for the UL ont of HG002 as well as the reference genome? Also the commands you alignment are needed. I would try to reproduce your results as closely as possible and then look for the exact bug.

HongyuanChu commented 5 months ago

Thank you very much!

bam:

http://ftp-trace.ncbi.nlm.nih.gov/ReferenceSamples/giab/data/AshkenazimTrio/HG002_NA24385_son/Ultralong_OxfordNanopore/guppy-V3.2.4_2020-01-22/HG002_GRCh38_ONT-UL_GIAB_20200122.phased.bam

bed

https://ftp-trace.ncbi.nlm.nih.gov/ReferenceSamples/giab/release/AshkenazimTrio/HG002_NA24385_son/latest/GRCh38/HG002_GRCh38_1_22_v4.2.1_benchmark_noinconsistent.bed

HongyuanChu commented 5 months ago

The bam files were standard samples from GIAB. Reads were separately aligned to both hs37d5 and hg38 human genome reference assemblies using Heng Li's minimap2 (https://github.com/lh3/minimap2) with the following options:

-a -z 600,200 -x map-ont

As shown in https://ftp-trace.ncbi.nlm.nih.gov/ReferenceSamples/giab/data/AshkenazimTrio/HG002_NA24385_son/Ultralong_OxfordNanopore/guppy-V3.2.4_2020-01-22/README

HuiyangYu commented 5 months ago

I apologize for the delayed response. We have identified the cause of the bug. In order to conserve memory usage, PanDepth employs a sliding window approach for calculation when detecting the index of bam/cram files. However, this method presents an issue: if the window size is set too small, certain reads may span multiple windows, particularly when a read contains numerous or very long insertions or deletions. As a result, some genomic positions are likely to be redundantly calculated, leading to coverage values exceeding 100%. To comprehensively address this issue and further enhance the multi-threading efficiency of PanDepth, we have decided to perform a complete code refactoring. However, this process may require a substantial amount of time to complete.
As a temporary solution to your issue, you can rename the index file suffix of the bam from 'bai' to another name, such as 'baix', and then recalculate. This will yield the correct results.
It should be noted that when performing calculations without using the index, you can still enable multi-threading. However, setting '-t' to a value exceeding 5 will not further improve computational efficiency.

HuiyangYu commented 5 months ago

This is the result of the calculation without using the index for the ONT BAM file of HG002(HG002_GRCh38_ONT-UL_GIAB_20200122.phased.bam).

image

HongyuanChu commented 5 months ago

Thank you so much for your timely and useful reply!

HuiyangYu commented 5 months ago

This is the result of the coverage calculation using Samtools, which took 160 minutes to complete. (If feasible, we can temporarily keep this issue open so that I can inform you promptly upon completing the code refactoring.) image

hewm2008 commented 5 months ago

We have found a solution to the problem and will release version 2.25 today (or tomorrow ) after the test has no problems

HuiyangYu commented 5 months ago

This is the result of the calculation by version 2.25 using the index for the ONT BAM file.

1716356676831