Closed pamelarussell closed 2 years ago
Seems related to #12. Does it evaluate successfully if reading from a d4 made with -S?
Thanks @snystrom, unfortunately rerunning with a file generated with -S did not solve the issue.
Hi @pamelarussell,
Thanks for reporting the issue. After digging into it, I found this is due to the way sparse D4 files are handled. For a sparse D4 file, all the stats are computed per-interval instead of in the normal per-base mode.
However, we previously did not take into account the values that are not defined by the secondary table. I've added code that handles the values that are defined in the primary table. The issue seems to be fixed after this change; please let me know if this is the case on your side.
Cheers, Hao
> Seems related to #12. Does it evaluate successfully if reading from a d4 made with -S?
This should be related to enabling -S. The recent change to d4tools makes the tool smarter at detecting the optimal parameters, so it will automatically enable -S. But in this case, the -S option actually makes the stat task run in per-interval mode, and there is a bug: the per-interval code drops all the values that are not defined in the secondary table.
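To illustrate the bug described above, here is a minimal sketch in plain Rust. It is not the d4 crate's code: the `Interval` type, the `interval_histogram` function, and the `count_default` flag are all hypothetical names, and the model is simplified to "explicitly stored intervals plus a default value for every other position". With `count_default` set to `false`, the sketch mimics the reported behavior (uncovered positions are dropped); with `true`, the gaps between stored intervals are counted under the default value.

```rust
use std::collections::BTreeMap;

/// Hypothetical type: a run of positions [start, end) sharing one explicitly stored value.
struct Interval {
    start: u32,
    end: u32,
    value: i32,
}

/// Per-interval histogram over the region [region_start, region_end).
///
/// `explicit` must be sorted by position and non-overlapping. Positions not
/// covered by any interval take `default_value`. When `count_default` is
/// false, those positions are skipped, which reproduces the missing-zeros
/// symptom in this simplified model.
fn interval_histogram(
    explicit: &[Interval],
    default_value: i32,
    region_start: u32,
    region_end: u32,
    count_default: bool,
) -> BTreeMap<i32, u64> {
    let mut hist: BTreeMap<i32, u64> = BTreeMap::new();
    let mut cursor = region_start; // first position not yet accounted for

    for iv in explicit {
        // Clip the interval to the query region.
        let s = iv.start.max(region_start);
        let e = iv.end.min(region_end);
        if s >= e {
            continue; // interval lies entirely outside the region
        }
        // Gap before this interval: every position there holds the default value.
        if count_default && s > cursor {
            *hist.entry(default_value).or_insert(0) += u64::from(s - cursor);
        }
        *hist.entry(iv.value).or_insert(0) += u64::from(e - s);
        cursor = e;
    }

    // Trailing gap after the last stored interval.
    if count_default && cursor < region_end {
        *hist.entry(default_value).or_insert(0) += u64::from(region_end - cursor);
    }
    hist
}
```

For a 30-base region with stored intervals `[20, 28) = 2` and `[28, 30) = 3` and a default of `0`, calling the function with `count_default = true` yields counts for `0`, `2`, and `3`, while `count_default = false` yields counts for `2` and `3` only.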
Hi @38, I tried building the version from yesterday's commit, but d4tools create ran for an hour before I finally killed it. I'm not sure what is wrong; I'm trying to build from a pretty small bedgraph file. I tried both with and without the -S option. The latest stable version from conda works fine. Here are the bedgraph file and genome file. bedgraph.zip
Additionally, could you please clarify whether you believe the issue is with initial creation of the D4 file or later reading of the file? I ask because I am potentially using different versions of the tools for both. My failing test is referring to a crate we've built on top of the latest tagged version here (0.3.6), and that is where we are seeing that the histogram is missing 0s.
> The latest stable version from conda works fine.
How long does the stable version from conda take for this input? I think this is related to the fact that your bedgraph file contains huge 0-valued intervals. We should encode those values in interval mode, but unfortunately this is currently not the case. So I think for your input, both the stable version and the conda version should be slow. (Also make sure you are testing against a release build.)
> Additionally, could you please clarify whether you believe the issue is with initial creation of the D4 file or later reading of the file?
For the original issue, I believe this is not related to create; you can use d4tools view to verify that. The problem is that the aggregation method doesn't handle the default value.
I have another fix for the long creation time issue committed to the branch, please let me know if this works on your side.
Thanks!
The stable version from conda took a couple of minutes (which did seem unexpectedly long), while the latest from GitHub ran for over an hour before I killed it.
If the original issue is related to reading/processing the file rather than writing, then I could use a bit of help with how to test your latest fix. I am not a Rust user and am building our crate from a Cargo.toml (here) that @sstadick built which references d4 version 0.3.6 on crates.io. How can I build our project from your sources instead of the package repository?
Hi @pamelarussell! You should be able to specify a commit in cargo as follows:
[dependencies]
d4 = { git = "https://github.com/38/d4-format", rev="<commit-hash>" }
extendr-api = { git = "https://github.com/extendr/extendr", branch="master" }
ordered-float = "2.10.0"
serde = { version = "1.0.136", features = ["derive"] }
Cargo will look for the d4 crate in the d4-format workspace by default and use the commit specified in rev. You can also use d4 = { git = "https://github.com/38/d4-format", branch="master" } to just stay on the latest from @38.
Thanks all. The latest fix seems to have worked and I am getting correct histograms now! 👍
Hi,
I am seeing some unexpected behavior in calls to d4::task::Histogram::with_bin_range(). Specifically, I am invoking the method on a region that includes some 0s and some positive values, but the returned histogram does not include the 0s. The other value counts are correct.
Here is our line of code where we are invoking this.
Here is a test where we try to use this code to compute the median of a histogram whose underlying values are 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 2 2 2 2 2 2 2 2 3 3. The test fails because the median is returning 2, which is the median of the non-zero values in the region, instead of 0. I have verified via debugging that this can be traced to the histogram at the above line of code not counting the 0s.
Here is the D4 file we are using for this test. example2.d4.zip
Thanks in advance for any help you can suggest!
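For readers following along, the arithmetic behind the failing test can be sketched in a few lines of plain Rust. This is not the code from our crate or from d4: `median_from_histogram` is a hypothetical helper, and it assumes the histogram is given as `(value, count)` pairs sorted by value and that "median" means the lower median. With the values listed above (twenty 0s, eight 2s, two 3s), including the zero bin gives a median of 0, while dropping it gives 2, which matches the symptom.

```rust
/// Hypothetical helper: lower median of a histogram given as (value, count)
/// pairs sorted by value. Returns None when the histogram holds no
/// observations.
fn median_from_histogram(hist: &[(i32, u64)]) -> Option<i32> {
    let total: u64 = hist.iter().map(|&(_, count)| count).sum();
    if total == 0 {
        return None;
    }
    // Zero-based rank of the lower median among `total` sorted observations.
    let target = (total - 1) / 2;
    let mut seen = 0u64;
    for &(value, count) in hist {
        seen += count;
        if seen > target {
            return Some(value);
        }
    }
    None // unreachable for a non-empty histogram, kept for totality
}
```

Calling it with `&[(0, 20), (2, 8), (3, 2)]` models the expected histogram, and calling it with `&[(2, 8), (3, 2)]` models the histogram with the 0s missing.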