38 / d4-format

The D4 Quantitative Data Format
MIT License
150 stars 20 forks source link

[enhancement] stat for a region/list of regions #55

Closed darked89 closed 1 year ago

darked89 commented 2 years ago

Hello,

Would it be possible to implement simple median coverage stats for a selected region? While one can get the data using i.e. d4tools view input.bam 22:39349000-39349100 there is "some scripting" step required.

Having such feature would make looping around some putative ChIP-Seq peak much easier.

Best wishes,

Darek Kedra

arq5x commented 2 years ago

Hi Darek,

This is possible with the stat command:

d4tools stat -s median -r your.bed your.d4

If you want just one region, you can use UNIX echo -e to create a BED file "on the fly":

d4tools stat -s median -r <(echo -e "chr22\t39349000\t39349100") your.d4
darked89 commented 2 years ago

Hello,

Sorry for the delay. I have found some non-intuitive output of d4tools and pyd4, see below. In short: how comes that the mean coverage value for 100bp interval is reported as 0 with 39 non-zero values?

Many thanks for your help

Darek Kedra

The snippet of Python code:

chrom_interval = (chrom, bin_start, bin_end)

mean_val_bin = chip_fh.mean([chrom_interval])

print(f"mean_val_bin: {mean_val_bin}")

bin_values = [x[2] for x in pyd4.enumerate_values(chip_fh, chrom, bin_start, bin_end)]
bin_sum = sum(bin_values)

print(bin_values) # (chrom, pos, value) chrom_vals, pos_vals, coverage_vals
print(bin_sum)

Same region from the original BAM:

samtools view ../data_timepoint_2/12_2_C_04873AAD_AACAGGTT-CTTGGTAT_R1_001_all.sorted.rmdup.bam  chr1:762650-762750 

VH00658:3:AAALLTMHV:1:2404:63279:6774   16      chr1    762643  1       50M     *       0       0       GACAGGGGCGACCTCAGTGACGGAACCGGACACAGACGCAGATCTGGCAG      CCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCAS:i:100 XS:i:100        XN:i:0  XM:i:0  XO:i:0  XG:i:0  NM:i:0  MD:Z:50 YT:Z:UU                                                                                                                                    
VH00658:3:AAALLTMHV:1:1611:27282:18531  0       chr1    762698  1       50M     *       0       0       CGACAGGCTTCGGAGCATTTCCGGGCGTCGCGGGACTCCCCGCCGACAGG      CCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCAS:i:100 XS:i:100        XN:i:0  XM:i:0  XO:i:0  XG:i:0  NM:i:0  MD:Z:50 YT:Z:UU
arq5x commented 2 years ago

@38 can you look into why mean above is 0 and not 0.5 and explain the differences in behavior?

darked89 commented 2 years ago

Hello,

In case it is needed I can share the d4 file/minimal example giving that result. Or a mini-bam file.

Hope it may help,

Darek Kedra

38 commented 2 years ago

Hi @darked89 , I believe this is dup to issue #54 and is already fixed in the latest release.

I believe d4tools on bioconda is updated already, please check if the latest upgrade resolves this issue.