HuiyangYu / PanDepth

An ultra-fast and efficient genomic tool for coverage calculation
https://github.com/HuiyangYu/PanDepth
MIT License
128 stars 4 forks source link

Support for bed tags, bedgraph output and coverage thresholds #9

Open jamigo opened 1 month ago

jamigo commented 1 month ago

First of all, let me say that I was shocked to see how fast this tool works. Being able to obtain all RefSeq's gene coverage information from a 30X human WGS sample in less that 1 minute was really impressive.

As many other bioinformaticians, we are currently using mosdepth to obtain global and/or genic coverage from our samples, so being able to replicate its output as much as possible would be really helpful. Let me ask for 3 (I'm guessing minor) modifications that they would be really really useful for all mosdepth users out there.

BED TAGS

PanDepth is able to use a bed file to generate coverage statistics for each region. I was surprised that you create a new column RegionID based on the first 3 bed file columns (chr, start, end). In case the bed file contained a 4th column with tags, wouldn't it be possible to keep that tag in the output? Is it a problem that these IDs you work with must be unique? Maybe outputting another column with the tag next to RegionID, or appending the tag to the RegionID content?

Bed tags example:

chr13   32316460        32316527        BRCA2;NM_000059.4;2     67      +
chr13   32316460        32316527        BRCA2;NM_001406719.1;2  67      +
chr13   32316460        32316527        BRCA2;NM_001406720.1;2  67      +
chr13   32316460        32316527        BRCA2;NM_001406721.1;2  67      +
chr13   32319076        32319325        BRCA2;NM_000059.4;3     249     +
chr13   32319076        32319325        BRCA2;NM_001406719.1;3  249     +
chr13   32319076        32319325        BRCA2;NM_001406720.1;3  249     +
chr13   32319076        32319325        BRCA2;NM_001406721.1;3  249     +
chr13   32325075        32325184        BRCA2;NM_000059.4;4     109     +
chr13   32325075        32325184        BRCA2;NM_001406719.1;4  109     +
chr13   32325075        32325184        BRCA2;NM_001406720.1;4  109     +
chr13   32325075        32325184        BRCA2;NM_001406721.1;4  109     +

See how useful it would be to be able to parse PanDepth output straight away having the gene, the transcript, or even the exon number in the same file, instead of having to merge both files and parse them.

May it be possible to implement a configurable option, defaulted to 0, to choose which input column the user would like to see in the output?

BEDGRAPH OUTPUT

In the same way I was surprised to see how fast PanDepth is, I don't understand why the "output all the site depth" option for each single genome position. I understand that this may be useful for some users, but writing this information in this chr,pos,depth format takes ages compared with the rest of PanDepth capabilities. Would it be possible to group these coverages in to a chr,start,end,depth bedgraph format using some light transformation like @nathanhaigh's awk solution? This would even help to save I/O time.

May it be possible to implement a configurable option to choose from all positions or bedgraph coverage?

THRESHOLDS

This may not be as simple as the other 2 options mentioned, but I hope it isn't that complicated. PanDepth's bed.stat output include 2 columns, CoveredSite and Coverage(%) that, although it's not mentioned in the documentation, I'm guessing that they refer to >1X positions. I would suggest to allow entering different thresholds and outputting them in corresponding columns.

So, instead of having the current output #Chr Start End RegionID Length CoveredSite TotalDepth Coverage(%) MeanDepth I would suggest to indicate somehow the threshold in the coverage column, e.g. #Chr Start End RegionID Length TotalDepth MeanDepth CoveredSite_1X Coverage(%)_1X CoveredSite_10X Coverage(%)_10X CoveredSite_20X Coverage(%)_20X CoveredSite_30X Coverage(%)_30X or, as mosdepth does, to generate 2 different files, one for CoveredSites #Chr Start End RegionID Length TotalDepth MeanDepth CoveredSite_1X CoveredSite_10X CoveredSite_20X CoveredSite_30X and one for Coverage(%) #Chr Start End RegionID Length TotalDepth MeanDepth Coverage(%)_1X Coverage(%)_10X Coverage(%)_20X Coverage(%)_30X

May it be possible to implement a configurable option, defaulted to 1, to allow entering different thresholds, such as "1,10,20,30", for instance?

Congratulations for such an impressive work, and thank you very much for it.

HuiyangYu commented 1 month ago

Thank you very much for your attention to PanDepth. Your suggestions are both valuable and effective. I am currently updating to a version that supports multi-threaded computation and no longer requires the index file. All of your recommendations will be incorporated into the latest version. Thank you once again for your input.

jamigo commented 1 month ago

Fantastic news! That sounds ever better and faster.

I was mainly focusing on mosdepth comparison, but I have other ideas in mind, such as adding some padding to the bedfile (i.e. adding for instance 10bp to the bedfile regions instead of having to create a temporal bedfile if that 10bp padding is desired), generating bigwig output along with bedgraph, generating the chr.stat.gz output always even if a bed or gene file is used (forced or optional), allowing selection of other GTF features instead of exon and CDS, ...

Thank you very much anyway for paying attention to the users' feedback.