38 / d4-format

The D4 Quantitative Data Format
MIT License
150 stars 20 forks source link

Bug in view? different values when including the first position in a chromosome #59

Closed mrvollger closed 1 year ago

mrvollger commented 1 year ago

Hello,

I am running into a weird issue where if I include the "0" base of a chromosome in my view command I get a different result than if I start with the 1 position. E.g.

$ d4tools view fdr.peaks.d4 chr1:0-100000   
chr1    0       100000  0       0       0       0

$  d4tools view fdr.peaks.d4 chr1:1-100000 | head
chr1    1       60258   0       0       0       0
chr1    60258   60310   0       0       1       0
chr1    60310   60427   0       0       0       1
chr1    60427   60481   0       0       1       0
chr1    60481   60637   0       0       0       1
chr1    60637   60657   0       0       1       0
chr1    60657   60810   0       0       0       1
chr1    60810   60869   0       0       1       0
chr1    60869   61019   0       0       0       1
chr1    61019   61025   0       0       1       0

I thought this might be because 0 is invalid with the region format, but I also get zeros when I don't select any region at all:

$ d4tools view fdr.peaks.d4 | head -n 2 
chr1    0       248956422       0       0       0       0
chr2    0       242193529       0       0       0       0

I have my file uploaded here in case that is helpful: https://eichlerlab.gs.washington.edu/help/mvollger/tracks/fiberseq/trackHub_GM12878_2022-08-18/fdr.peaks.d4 And this is my version of d4tools

d4tools --version
D4 Utilities Program 0.3.8(D4 library version: 0.3.7)
Usage: d4tools <subcommand> <args>
Possible subcommands are:
        create          Create a new D4 depth profile
        framedump       Dump The container data
        index           Index related operations
        ls-track        List all available tracks in the D4 file
        merge           Merge existing D4 file as a multi-track D4 file
        plot            Plot the specified region
        show            Print the underlying depth profile
        stat            Run statistics on the given file
        view            Same as show

Type 'd4tools <subcommand> --help' to learn more about each subcommands.

Thanks in advance, Mitchell

arq5x commented 1 year ago

Thanks Mitchell, we are looking into this.

38 commented 1 year ago

Hi @mrvollger, thanks for reporting this bug. Please don't mind my delayed reply for some personal reasons.

I confirmed this is a d4 bug and should be fixed. After investigating the bug, I realized that the output

chr1    0       100000  0       0       0       0

is expected, while

chr1    1       60258   0       0       0       0
chr1    60258   60310   0       0       1       0
chr1    60310   60427   0       0       0       1
chr1    60427   60481   0       0       1       0
chr1    60481   60637   0       0       0       1
chr1    60637   60657   0       0       1       0
chr1    60657   60810   0       0       0       1
chr1    60810   60869   0       0       1       0
chr1    60869   61019   0       0       0       1
chr1    61019   61025   0       0       1       0

is wrong.

The d4 file you provided actually have all none-zero value in chr11.

This bug actually caused by how d4 uses the frame index - the binary search inside the index doesn't actually check if the item is on the same chromosome and this causes values defined for chr11 mistakenly returned for the query of chr1.

And I have pushed the fix

$ target/debug/d4tools view /tmp/repo.d4 chr1:0-100000
chr1    0       100000  0
$ target/debug/d4tools view /tmp/repo.d4 chr1:1-100000
chr1    1       100000  0

Please let me know if you have any further question.

Thanks! Hao

mrvollger commented 1 year ago

@38 thanks so much! I appreciate the fix :)

arq5x commented 1 year ago

Excellent, thanks @38. Might be time for a new formal release!