Open quantumdot opened 7 years ago
array()
passes additional kwargs to local_coverage
(docs), which allows you to choose one of three methods for bigWig via the method
argument. These defer to bx-python and UCSC's bigWigSummary, so those packages are what determine how zeros and missing data are handled.
Thanks for your response, however after reviewing the metaseq code more thoroughly it appears that you are assigning NaN values to zero within the BigWigAdapter.summarize()
method, specifically line 171 for method get_as_array
and line 193 for method summarize
in filetype_adapters.py
. This makes it impossible to ignore NaN values when computing an average profile over many intervals.
The real issue is that I am trying to reconcile/reproduce an average profile plot originally generated with the Cistrome tool siteprobw
(a port of the sitepro
script originally released in the CEAS package which was modified to accept BigWig files). This script also uses the bx-python BigWigFile.summarize()
, however the plots appear quite different. Please see the following images for examples. Top: profiles produced by Cistrome siteproBW
. Bottom: Profiles produced using metaseq
. Red profiles correspond in both images, while black and green profiles correspond in their respective plots.
The BigWig file in question was generated from competitive hybridization microarray data, and so contains many intervals with missing data. The data contains both positive and negative real numbers. If needed, I can provide data to reproduce the problem.
You're right, thanks for pointing this out. As you can see it's been a while since I've revisited the code. I'd encourage you to check out deepTools which has more developers working on it that can invest time in it and has quite robust bigwig handling.
But regardless, this should be fixed. The goal is to have NaNs reported correctly and not set to zero, right? My guess is that I did that to make downstream processing easier, and worked fine for e.g. chipseq signal but it's a terrible way of handling it for intended-to-be-sparse bigwigs. If you could send me the example bigwigs (dalerr@niddk.nih.gov) along with the siteprobw parameters you used, I'll fix it and add tests.
Thanks for looking at this. Yes, I have used deepTools, and it is a great package. I completely understand about limited time for projects!
I actually like the method that bx-python uses to report the data which gives access to sum, sum squares, min, max, and valid count, but it is also great to have access to the raw data matrix as well. I sent an example bigwig to the email address you provided. It is approximately 17MB, which should go through, but please let me know if it does not.
@quantumdot can you give the issue-30
branch a test? Tests on travis-ci are currently failing, but for reasons related to the test environment -- local tests on this branch run fine.
@daler I gave the issue-30
branch a try, and seems to work nicely now when using method summarize
, but it looks like the issue is still present when using method get_as_array
, which I am tracing back to line 220
in filetype_adapters.py
. Should be pretty simple to extend your fix to this code as well. Below is sample output when using method=summarize
Can you give the master branch a try? If this works I'll release a new version on PyPI and bioconda.
@daler Are you sure about the branch master
? I don't see the changes incorporated in that branch.....
Whoops, wrote that before clicking the PR merge button...can you try now?
How is missing signal data in bigwig file treated? Are these intervals treated as zeros or ignored? Is there an argument for how this is treated in the
array()
method?