daler / metaseq

Framework for integrated analysis and plotting of ChIP/RIP/RNA/*-seq data
https://daler.github.io/metaseq
MIT License
87 stars 36 forks source link

How is missing data interpreted? #30

Open quantumdot opened 7 years ago

quantumdot commented 7 years ago

How is missing signal data in bigwig file treated? Are these intervals treated as zeros or ignored? Is there an argument for how this is treated in the array() method?

daler commented 7 years ago

array() passes additional kwargs to local_coverage (docs), which allows you to choose one of three methods for bigWig via the method argument. These defer to bx-python and UCSC's bigWigSummary, so those packages are what determine how zeros and missing data are handled.

quantumdot commented 7 years ago

Thanks for your response, however after reviewing the metaseq code more thoroughly it appears that you are assigning NaN values to zero within the BigWigAdapter.summarize() method, specifically line 171 for method get_as_array and line 193 for method summarize in filetype_adapters.py. This makes it impossible to ignore NaN values when computing an average profile over many intervals.

The real issue is that I am trying to reconcile/reproduce an average profile plot originally generated with the Cistrome tool siteprobw (a port of the sitepro script originally released in the CEAS package which was modified to accept BigWig files). This script also uses the bx-python BigWigFile.summarize(), however the plots appear quite different. Please see the following images for examples. Top: profiles produced by Cistrome siteproBW. Bottom: Profiles produced using metaseq. Red profiles correspond in both images, while black and green profiles correspond in their respective plots. sitepro_2017 jan 05 16-20-27 anon

untitled-1

The BigWig file in question was generated from competitive hybridization microarray data, and so contains many intervals with missing data. The data contains both positive and negative real numbers. If needed, I can provide data to reproduce the problem.

daler commented 7 years ago

You're right, thanks for pointing this out. As you can see it's been a while since I've revisited the code. I'd encourage you to check out deepTools which has more developers working on it that can invest time in it and has quite robust bigwig handling.

But regardless, this should be fixed. The goal is to have NaNs reported correctly and not set to zero, right? My guess is that I did that to make downstream processing easier, and worked fine for e.g. chipseq signal but it's a terrible way of handling it for intended-to-be-sparse bigwigs. If you could send me the example bigwigs (dalerr@niddk.nih.gov) along with the siteprobw parameters you used, I'll fix it and add tests.

quantumdot commented 7 years ago

Thanks for looking at this. Yes, I have used deepTools, and it is a great package. I completely understand about limited time for projects!

I actually like the method that bx-python uses to report the data which gives access to sum, sum squares, min, max, and valid count, but it is also great to have access to the raw data matrix as well. I sent an example bigwig to the email address you provided. It is approximately 17MB, which should go through, but please let me know if it does not.

daler commented 7 years ago

@quantumdot can you give the issue-30 branch a test? Tests on travis-ci are currently failing, but for reasons related to the test environment -- local tests on this branch run fine.

quantumdot commented 7 years ago

@daler I gave the issue-30 branch a try, and seems to work nicely now when using method summarize, but it looks like the issue is still present when using method get_as_array, which I am tracing back to line 220 in filetype_adapters.py. Should be pretty simple to extend your fix to this code as well. Below is sample output when using method=summarize untitled-2

daler commented 7 years ago

Can you give the master branch a try? If this works I'll release a new version on PyPI and bioconda.

quantumdot commented 7 years ago

@daler Are you sure about the branch master? I don't see the changes incorporated in that branch.....

daler commented 7 years ago

Whoops, wrote that before clicking the PR merge button...can you try now?