GreenleafLab / NucleoATAC

nucleosome calling using ATAC-seq
MIT License
106 stars 31 forks source link

higher resolution nucpos.bed.gz is empty #55

Open igordot opened 7 years ago

igordot commented 7 years ago

I originally posted a comment on a previous issue: https://github.com/GreenleafLab/NucleoATAC/issues/38 , but the original issue is closed, so maybe it wasn't seen. Thus, I am reposting as a new issue since there have been no updates for a few months. Sorry if you just didn't get around to responding.

In the meantime, I ran NucleoATAC on other samples and encounter the same problem every time. I tried both human and much smaller fly genome (so at least coverage should not be a problem).

AliciaSchep commented 7 years ago

Sorry, didn't see your followup on that issue!

Can you run NucleoATAC on the test data in the example folder?

nucleoatac run --bed example.bed --bam example.bam --fasta sacCer3.fa --out test_results

igordot commented 7 years ago

I didn't notice there was an example dataset. I tried it. I got an empty test_results.nucpos.bed.gz. I got 160 lines in test_results.occpeaks.bed and test_results.nucmap_combined.bed.

I didn't see any errors. This was the output:

start run at: 2017-02-13 23:20
---------Step1: Computing Occupancy and Nucleosomal Insert Distribution---------
Making figure
---------Step2: Processing Vplot------------------------------------------------
---------Step3: Obtaining nucleosome signal and calling positions---------------
---------Step4: Making combined nucleosome position map ------------------------
---------Step5: Calling NFR positions-------------------------------------------
end run at: 2017-02-13 23:23
AliciaSchep commented 7 years ago

Thanks for running the test data. Number of lines in test_results.occpeaks.bed is correct, but there should be 130 lines in the test_results.nucpos.bed.gz file, so something is definitely going wrong in execution of code independent of particulars of input data.

A couple more Q's:

NucleoATAC version? Cython version? Python version? Numpy version? Scipy version? Operating System?

Within NucleoATAC directory can you try: nosetests tests/test_var.py (need to have nosetests installed)

igordot commented 7 years ago

NucleoATAC version: 0.3.3 (just reinstalled it to be safe) Cython version: 0.23.4 Python version: 2.7.3 Numpy version: 1.12.0 Scipy version: 0.18.1 Operating System: CentOS 6.8

nosetests result:

/path/NucleoATAC/nucleoatac/NucleosomeCalling.py:86: RuntimeWarning: invalid value encountered in sqrt
  return np.sqrt(var)
FF
======================================================================
FAIL: Make sure variance calculation is close to what is obtained by simulation
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/path/NucleoATAC/tests/test_var.py", line 33, in test_sd1
    self.assertTrue(abs(sd1-sd2)<0.05*sd1)
AssertionError: False is not true

======================================================================
FAIL: Make sure variance calculation is same as would be obtained through alternate calculation
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/path/NucleoATAC/tests/test_var.py", line 43, in test_sd2
    self.assertTrue(abs(sd1-sd2)<0.001*sd1)
AssertionError: False is not true

----------------------------------------------------------------------
Ran 2 tests in 11.041s

FAILED (failures=2)
AliciaSchep commented 7 years ago

Thanks for the additional info. The versions of those packages or the python version don't seem to be a problem on my system (ubuntu), and I haven't been able to reproduce the failing test or erroneous file output. Could you additionally give the versions of all the other dependencies? And is your version installed in a virtual environment?

igordot commented 7 years ago

I am not using a virtual environment. Is that a problem?

I am not sure sure how to get just the dependencies, but here is my full pip list:

alignlib (0.1)
alignlib-lite (0.2.2)
argcomplete (0.8.0)
argh (0.26.1)
auto-sklearn (0.0.2)
avro (1.8.1)
awscli (1.6.2)
backports-abc (0.5)
backports.shutil-get-terminal-size (1.0.0)
backports.ssl-match-hostname (3.5.0.1)
bayesian-optimization (0.1.0)
bcdoc (0.12.2)
biom-format (1.3.1)
biopython (1.68)
bitstring (3.1.2)
blast (0.3.0)
bokeh (0.11.1)
boto (2.33.0)
botocore (0.73.0)
Bottleneck (1.0.0)
bsExpress (0.4.1b0)
bx-python (0.7.1, /local/apps/python/2.7.3/lib/python2.7/site-packages)
bz2file (0.98)
cairocffi (0.7)
certifi (2015.4.28)
cffi (1.9.1)
CGAT (0.2.0)
click (6.6)
cmd2 (0.6.8)
cmt (1.5.1a0)
cogent (1.5.3)
colorama (0.2.5)
ConfigArgParse (0.11.0)
configparser (3.3.0.post2)
ConfigSpace (0.2.1)
CrossMap (0.2.3)
cryptography (1.7)
csvkit (0.9.0)
cutadapt (1.12)
cvxopt (1.1.8)
cwltool (1.0.20160714182449)
cycler (0.10.0)
Cython (0.23.4)
dbf (0.94.3)
decorator (4.0.9)
deepTools (1.5.11)
distribute (0.7.3)
dnaloop (0.5.16)
docopt (0.6.2)
docutils (0.12)
drmaa (0.7.6)
emperor (0.9.3)
entrypoints (0.2.1)
enum34 (1.1.6)
firebrowse (0.1.8)
Flask (0.11.1)
fluff (0.1.1)
funcsigs (1.0.2)
functools32 (3.2.3.post2)
future (0.15.2)
futures (3.0.5)
gdata (2.0.18)
gdc-client (1.2.0)
gensim (0.12.4)
genson (0.1.0)
gevent (1.1.2)
gffutils (0.8.2)
glove-python (0.1.0)
GraphLab-Create (2.1)
GraphLab-Create-License (2.1)
graphviz (0.4.3)
greenlet (0.4.10)
h5py (2.5.0)
hgapi (1.7.1)
hiclib (0.0.0)
hifive (1.2.1)
hmmlearn (0.2.0)
html (1.16)
html5lib (0.95)
HTSeq (0.6.1p1)
httpretty (0.8.10)
humann2 (0.9.4)
iced (0.2.2)
idna (2.1)
illuminate (0.4.3.2)
intervaltree (2.0.4)
ipaddress (1.0.17)
ipdb (0.7)
ipykernel (4.3.1)
ipyparallel (5.0.1)
ipython (4.2.0)
ipython-genutils (0.1.0)
ipywidgets (5.1.3)
isodate (0.4.9)
itsdangerous (0.24)
jdcal (1.0)
Jinja2 (2.8)
jmespath (0.5.0)
joblib (0.8.4)
jsonschema (2.5.1)
jupyter (1.0.0)
jupyter-client (4.2.2)
jupyter-console (4.1.1)
jupyter-core (4.1.0)
Keras (1.0.8)
khmer (2.0)
liac-arff (2.1.0)
linecache2 (1.0.0)
lockfile (0.9.1)
lxml (3.3.5)
MACS (1.4.3)
MACS2 (2.1.1.20160309)
mageck (0.5.4)
mahotas (1.0.4)
MarkupSafe (0.23)
matplotlib (2.0.0)
matplotlib-venn (0.8)
MDP (3.5)
memory-profiler (0.31)
metaseq (0.5.5.1)
mirnylib (0.0.0)
misopy (0.4.9)
mistune (0.7.2)
mock (2.0.0)
mpi4py (1.3)
mpmath (0.19)
multipledispatch (0.4.9)
MySQL-python (1.2.5)
natsort (3.5.6)
nbconvert (4.2.0)
nbformat (4.0.1)
ndg-httpsclient (0.4.2)
nest (1.3.0)
networkx (1.10)
ngslib (1.1.18)
nibabel (1.3.0)
nipype (0.10.0)
nltk (3.0.5)
nose (1.3.0)
notebook (4.2.0)
NucleoATAC (0.3.3)
numexpr (2.4)
numpy (1.12.0)
openpyxl (2.1.1)
pandas (0.17.1)
parcel (0.1.13)
parse (1.6.6)
pathlib2 (2.1.0)
patsy (0.3.0)
pbr (1.10.0)
pexpect (4.0.1)
pickleshare (0.7.2)
PICOS (1.1.1)
PIL (1.1.7)
Pillow (3.0.0)
pip (9.0.1)
pique (0.1.7)
poretools (0.6.0)
prettytable (0.7.2)
progressbar (2.3)
progressbar2 (3.10.1)
protobuf (3.0.0b2)
psclient (2.0)
psutil (2.2.1)
ptyprocess (0.5.1)
pyasn1 (0.1.9)
pybedtools (0.6.4)
pyBigWig (0.2.8)
pycassa (1.11.0)
pycparser (2.17)
pycurl (7.19.5.1)
pyeeg (0.4.0)
pygccxml (1.7.5)
Pygments (1.6)
PyGreSQL (4.1.1)
pymc (2.3.4)
pymongo (2.6.3)
pynast (1.2.2)
pynisher (0.4.2)
pyOpenSSL (16.2.0)
pyparsing (2.1.10)
pyqi (0.3.1)
pyrfr (0.2.0)
pysam (0.10.0)
python-dateutil (2.6.0)
python-utils (2.0.0)
pytz (2016.10)
PyVCF (0.6.7)
PyYAML (3.10)
pyzmq (15.2.0)
qc (0.1)
qcli (0.1.0)
qtconsole (4.2.1)
rdflib (4.1.0)
rdflib-jsonld (0.4.0)
requests (2.9.1)
requests-toolbelt (0.7.0)
riak (2.0.2)
riak-pb (1.4.4.0)
rmats2sashimiplot (1.0.0)
rosetta (0.3)
rpy2 (2.3.9)
rsa (3.1.2)
RSeQC (2.6.1)
ruamel.ordereddict (0.4.9)
ruamel.yaml (0.11.11)
ruffus (2.2)
sampy (1.2.1)
schema-salad (1.14.20160708181155)
scikit-image (0.11.3)
scikit-learn (0.18)
scipy (0.18.1)
screed (0.9)
seaborn (0.7.0)
setuptools (31.0.0)
Shapely (1.5.15)
shellescape (3.4.1)
sima (1.3.0)
simplegeneric (0.8.1)
simplejson (3.6.5)
singledispatch (3.4.0.3)
six (1.10.0)
sklearn (0.0)
smac (0.0.1)
smart-open (1.3.3)
snp-pipeline (0.3.4)
sortedcontainers (1.5.7)
SPARQLWrapper (1.5.2)
Sphinx (1.2.1)
sphinxcontrib-programoutput (0.8)
SphinxReport (2.0)
spyder (2.3.9)
SQLAlchemy (0.9.2)
sseclient (0.0.8)
statsmodels (0.6.1)
subprocess32 (3.2.7)
tables (3.1.1)
TADLib (0.2.5.post2)
tadtool (0.61)
termcolor (1.1.0)
terminado (0.6)
Theano (0.6.0)
threadpool (1.2.7)
thrift (0.9.1)
thunder (0.1.0)
tornado (4.3)
traceback2 (1.4.0)
traitlets (4.2.1)
traits (4.4.0)
typing (3.5.2.2)
unittest2 (1.1.0)
urlgrabber (3.9.1)
urllib3 (1.19.1)
virtualenv (15.0.3)
web.py (0.37)
weblogo (3.3)
websocket (0.2.1)
Werkzeug (0.11.11)
wheel (0.29.0)
widgetsnbextension (1.2.2)
xgboost (0.4a30)
xlrd (0.9.3)
xlwt (0.7.5)
xopen (0.1.1)
AliciaSchep commented 7 years ago

I've found that installing in a virtual environment can help with some dependency issues. In particular, in some cases I have found that when installing Cython using pip that an older version of Cython installed previously would actually get imported instead of the newly installed version (but the presence of the new version was what was registered when checking package dependencies upon installation). Setting up a virtual environment and installing all the dependencies fresh in that environment solved the issue.

I haven't gotten a chance yet to test the package against all the versions you've listed, but that will be my next step in trying to reproduce the error.

igordot commented 7 years ago

I had trouble using virtualenv. After I activate it, pip still tries to install to the usual path and not to the local path. Even if I force it to install locally with --ignore-installed and I see the packages added to the site-packages in the environment, pip list still shows the old versions of the packages. Thus, virtualenv actually introduces new problems for me, so I'll need some time to troubleshoot those.

igordot commented 7 years ago

I resolved my virtualenv issues. This is my pip list before running NucleoATAC, which now looks fine to me:

appdirs (1.4.0)
cycler (0.10.0)
Cython (0.25.2)
functools32 (3.2.3.post2)
matplotlib (2.0.0)
NucleoATAC (0.3.4)
numpy (1.12.0)
packaging (16.8)
pip (9.0.1)
pyparsing (2.1.10)
pysam (0.10.0)
python-dateutil (2.6.0)
pytz (2016.10)
scipy (0.18.1)
setuptools (34.2.0)
six (1.10.0)
subprocess32 (3.2.7)
wheel (0.30.0a0)

I updated to the latest version and ran again:

nucleoatac version 0.3.4
start run at: 2017-02-26 23:08
---------Step1: Computing Occupancy and Nucleosomal Insert Distribution---------
Making figure
---------Step2: Processing Vplot------------------------------------------------
---------Step3: Obtaining nucleosome signal and calling positions---------------
---------Step4: Making combined nucleosome position map ------------------------
---------Step5: Calling NFR positions-------------------------------------------
end run at: 2017-02-26 23:11

Unfortunately, I am still getting 0 in nucpos.bed and 160 lines in occpeaks.bed.

AliciaSchep commented 7 years ago

I unfortunately still can't reproduce that output or the failing test with those package versions, making getting to the bottom of this issue difficult.

Do any of the other tests fail? With v0.3.4 all the tests can be run with "python tests.py" in the main directory?

igordot commented 7 years ago

I got some fails with that.

$ python tests.py 
....................FF..
======================================================================
FAIL: Make sure variance calculation is close to what is obtained by simulation
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/path/NucleoATAC/tests/test_var.py", line 33, in test_sd1
    self.assertTrue(abs(sd1-sd2)<0.05*sd1)
AssertionError: False is not true

======================================================================
FAIL: Make sure variance calculation is same as would be obtained through alternate calculation
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/path/NucleoATAC/tests/test_var.py", line 43, in test_sd2
    self.assertTrue(abs(sd1-sd2)<0.001*sd1)
AssertionError: False is not true

----------------------------------------------------------------------
Ran 24 tests in 386.461s

FAILED (failures=2)
AliciaSchep commented 7 years ago

Thanks, it looks like those are the same tests from tests/test_var.py run earlier. Seem like the problem is likely with the calculateCov function in the nucleoatac/multinomial_cov.pyx file, although I am confused as to why it would work differently on your system instead of mine (Ubuntu)

igordot commented 7 years ago

Do you know how I might be able to troubleshoot this? Is there a specific function that might be responsible that I can quickly test without requiring the whole test dataset?

AliciaSchep commented 7 years ago

For troubleshooting, perhaps figuring out more specifically why the tests are failing will help. If you try the following code within python in the main NucleoATAC directory (or edit file paths if in another directory):

import numpy as np
import nucleoatac.NucleosomeCalling as Nuc
import pyatac.VMat as V
from pyatac.chunkmat2d import BiasMat2D
from pyatac.chunk import ChunkList
from pyatac.bias import InsertionBiasTrack

bed_list = ChunkList.read('example/example.bed')
chunk = bed_list[0]
vmat = V.VMat.open('example/example.VMat')
biastrack = InsertionBiasTrack(chunk.chrom, chunk.start, chunk.end)
biastrack.read_track('example/example.Scores.bedgraph.gz')
biasmat = BiasMat2D(chunk.chrom,chunk.start+200,chunk.end-200,100,250)
biasmat.makeBiasMat(biastrack)
signaldist = Nuc.SignalDistribution(chunk.start+300,vmat,biasmat,35)

signaldist.simulateDist(5000)
sd1 = np.std(signaldist.scores)
sd2 = signaldist.analStd()

var_term = np.sum(signaldist.prob_mat*(1-signaldist.prob_mat)*signaldist.vmat.mat**2)
tmp = signaldist.prob_mat *signaldist.vmat.mat
cov_term = np.sum(np.outer(tmp,tmp))-np.sum(tmp**2)
sd3 = np.sqrt(signaldist.reads * (var_term - cov_term))

What are the values of sd1, sd2, and sd3?

(should all be approx equal, about 0.72)

igordot commented 7 years ago

Thanks for the test code.

I got an error for sd2:

>>> sd2 = signaldist.analStd()
nucleoatac/NucleosomeCalling.py:86: RuntimeWarning: invalid value encountered in sqrt
  return np.sqrt(var)

Everything else was fine I think:

>>> sd1
0.723933371134454
>>> sd2
nan
>>> sd3
0.72610032185706019
AliciaSchep commented 7 years ago

Ok that helps confirm that the problem is with calculateCov function specifically.

After running the first two blocks of code above (through signaldist = ...) can you try:

import pyximport; pyximport.install(setup_args={"include_dirs":np.get_include()})
from nucleoatac.multinomial_cov import calculateCov

flatv = np.ravel(signaldist.vmat.mat)
var = calculateCov(signaldist.probs, flatv, signaldist.reads)

What does var give?

If we can't figure out why that function is not working on your system, one thing we can possibly do as a temporary measure is to create a branch where that function is replaced by the method used for calculating "sd3". That is the same calculation, just a slower implementation.

One thing I will try is updating the way the Cython code is included in the package to better follow recommendations (http://cython.readthedocs.io/en/latest/src/reference/compilation.html). I will also look into why compilation of the file on different system could potentially lead to problems.

igordot commented 7 years ago

I think you meant sd2, not sd3, right?

Here are the latest results:

>>> import pyximport; pyximport.install(setup_args={"include_dirs":np.get_include()})
(None, None)
>>> from nucleoatac.multinomial_cov import calculateCov
>>> 
>>> flatv = np.ravel(signaldist.vmat.mat)
>>> var = calculateCov(signaldist.probs, flatv, signaldist.reads)
>>> var
0.5272216774009897

Is that the expected value?

AliciaSchep commented 7 years ago

Yes that is the expected value. 'signaldist.analStd()' should just do those two computations and then take the square root of var -- very weird that it seems to work when calculateCov is called directly but not when it is called within 'signaldist.analStd()'. Perhaps the issue is with way pyximport is being used

sd2 in the example above is computed using the actual method employed in NucleoATAC. sd3 is computed using an alternate, slower method

igordot commented 7 years ago

Thanks for clarifying. I guess we at least know where the problem is.

Maybe the solution could be to just add an if-else based on signaldist.analStd() output? Not very elegant, but hopefully effective.

AliciaSchep commented 7 years ago

I am not sure what that ifelse would test for, as i am not sure whether in other cases the function could give an other type of spurious signal.

I am trying to update the code to change the way the Cython code is imported, although have run into some roadblocks and unfortunately don't have a good estimate of when i'll be able to get that working and/or see if it might fix this issue

igordot commented 7 years ago

By if-else, I meant if signaldist.analStd() returns nan, then you could fall back to the slow method. Obviously you would know all the possible caveats better than me, so I understand if that doesn't really make sense.

Anyway, thanks for following up. Hopefully you will be able to fix this at some point.

Parul-Kudtarkar commented 7 years ago

Same issue, nucpos.bed.gz files empty. looking forward to the updates. Thank you so much!

AliciaSchep commented 7 years ago

One possible fix is using a docker container... I started looking into how to make one, and found that it seems like someone else has already created one:

https://biocontainers.pro/registry/#/showImages?domain=quay&repository=nucleoatac&starred=false&modified=18%2F01%2F2017

Parul-Kudtarkar commented 7 years ago

Thanks for a very prompt response. Seems the issue is OS specific and docker could be possible fix. I will keep you posted.

igordot commented 7 years ago

Unfortunately, I am running this on a cluster and can't use Docker.

Obviously it would be nice to have a Docker image for people who can use that option.

vatese commented 7 years ago

I tried to solve the issue. Found something interesting.

It has to do with the version of your gcc C compiler.

In your home are there is a directory called .pyxbld. There is where the .c and .so files compiled from multinomial_cov.pyx are placed.

I deleted that .pyxbld directory from my home area (which will force nucleoatac to re-compile multinomial_cov.pyx) and changed my gcc version. By default CentOS 6 uses gcc version 4.4.7. I have environment modules, so I am able to load other gcc versions I have compiled in the past. When I loaded gcc 4.9.3, it works fine after multinomial_cov.pyx was built again.

Maybe the problem is that the C compiler has some issues with cython, which builds a non-functional multinomial_cov.pyx (even though it compiles fine).

Hope this helps to find the exact cause.

Cheers!

igordot commented 7 years ago

Thanks for the suggestion @vatese !

I was also on CentOS 6 with gcc 4.4. I switched to gcc 6.1. Everything seems to be working now.

test.occpeaks.bed.gz is 160 lines and test.nucpos.bed.gz is 130 lines.

nosetests tests/test_var.py comes out as OK.

signaldist.analStd() returns a value.

AliciaSchep commented 7 years ago

Thanks @vatese for figuring this out! I guess I'm still not sure how to actually make the code compatible with the earlier version of gcc, but at least now we know the source of the problem and one way to get around it

sameet commented 7 years ago

Hi @AliciaSchep , I tried doing what was suggested here. Briefly, I deleted the .pyxbld from my home directory. Then i used python tests.py. It forced re-compile of the C part of the code. I used the GCC 5.4.0. I still get the following error:

======================================================================
FAIL: Make sure variance calculation is close to what is obtained by simulation
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/gpfs/ycga/home/sm2556/GitHub/NucleoATAC/tests/test_var.py", line 33, in test_sd1 
    self.assertTrue(abs(sd1-sd2)<0.05*sd1)
AssertionError: False is not true

======================================================================
FAIL: Make sure variance calculation is same as would be obtained through alternate calculation
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/gpfs/ycga/home/sm2556/GitHub/NucleoATAC/tests/test_var.py", line 43, in test_sd2
    self.assertTrue(abs(sd1-sd2)<0.001*sd1)
AssertionError: False is not true

----------------------------------------------------------------------
Ran 24 tests in 205.134s

My python is 2.7.13. What should be my next step?

[sm2556@c13n06 NucleoATAC]$ strings -a ~/.pyxbld/lib.linux-x86_64 2.7/nucleoatac/multinomial_cov.so | grep "GCC: ("
GCC: (GNU) 5.4.0
AliciaSchep commented 7 years ago

@sameet Unfortunately I'm not sure what the best next step would be :slightly_frowning_face: I still don't know the underlying cause of this bug, other than that the gcc version seems important. Given that someone reported using gcc 6.1 without this problem, if it is feasible to try that version out that might be one possible thing to try.