aertslab / pySCENIC

pySCENIC is a lightning-fast python implementation of the SCENIC pipeline (Single-Cell rEgulatory Network Inference and Clustering) which enables biologists to infer transcription factors, gene regulatory networks and cell types from single-cell RNA-seq data.
http://scenic.aertslab.org
GNU General Public License v3.0

[BUG] Binarisation taking too long #188

Closed sykim14 closed 4 years ago

sykim14 commented 4 years ago

**Describe the bug**

Hi, I am trying to binarise the AUCell matrix for further analysis, following the code in SCENICprotocol. I successfully exported the AUCell scores to .csv files. However, binarising the data is taking over 24 hours. (My single-cell data contains 3282 cells and 442 regulons.)

The code I used is:

bin_mtx, thresholds = binarize(auc_mtx)

When I stop it with Ctrl+C, it gives me:

^CTraceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/data/syk3418/anaconda3/envs/lungcancer1/lib/python3.7/site-packages/pyscenic/binarization.py", line 79, in binarize
    thresholds = derive_thresholds(auc_mtx)
  File "/data/syk3418/anaconda3/envs/lungcancer1/lib/python3.7/site-packages/pyscenic/binarization.py", line 77, in derive_thresholds
    thrs = p.starmap( derive_threshold,  [ (auc_mtx, c, seed) for c in auc_mtx.columns ] )
  File "/data/syk3418/anaconda3/envs/lungcancer1/lib/python3.7/multiprocessing/pool.py", line 276, in starmap
    return self._map_async(func, iterable, starmapstar, chunksize).get()
  File "/data/syk3418/anaconda3/envs/lungcancer1/lib/python3.7/multiprocessing/pool.py", line 651, in get
    self.wait(timeout)
  File "/data/syk3418/anaconda3/envs/lungcancer1/lib/python3.7/multiprocessing/pool.py", line 648, in wait
    self._event.wait(timeout)
  File "/data/syk3418/anaconda3/envs/lungcancer1/lib/python3.7/threading.py", line 552, in wait
    signaled = self._cond.wait(timeout)
  File "/data/syk3418/anaconda3/envs/lungcancer1/lib/python3.7/threading.py", line 296, in wait
    waiter.acquire()

Thank you.


**Please complete the following information:**
- pySCENIC version: 0.10.2
- Installation method: conda
- Run environment: conda env
- OS: linux? I am running python in conda env through linux terminal
- Package versions: [obtain using `pip freeze`, `conda list`, or skip this if using Docker/Singularity]:

# packages in environment at /data/syk3418/anaconda3/envs/lungcancer1:
#
# Name Version Build Channel
_libgcc_mutex 0.1 main
_r-mutex 1.0.0 anacondar_1
arboreto 0.1.5 py_0 bioconda
attrs 19.3.0 pypi_0 pypi
binutils_impl_linux-64 2.33.1 he6710b0_7
binutils_linux-64 2.33.1 h9595d00_15
bioconductor-annotationdbi 1.48.0 r36_0 bioconda
bioconductor-beachmat 2.2.0 r36he1b5a44_0 bioconda
bioconductor-biobase 2.46.0 r36h516909a_0 bioconda
bioconductor-biocgenerics 0.32.0 r36_0 bioconda
bioconductor-biocparallel 1.20.0 r36he1b5a44_0 bioconda
bioconductor-delayedarray 0.12.0 r36h516909a_0 bioconda
bioconductor-dropletutils 1.6.1 r36he1b5a44_0 bioconda
bioconductor-edger 3.28.0 r36hc6cf775_1 bioconda
bioconductor-genomeinfodb 1.22.0 r36_0 bioconda
bioconductor-genomeinfodbdata 1.2.2 r36_0 bioconda
bioconductor-genomicranges 1.38.0 r36h516909a_0 bioconda
bioconductor-hdf5array 1.14.0 r36h516909a_0 bioconda
bioconductor-iranges 2.20.0 r36h516909a_0 bioconda
bioconductor-limma 3.42.0 r36h516909a_0 bioconda
bioconductor-mast 1.12.0 r36_0 bioconda
bioconductor-rhdf5 2.30.0 r36he1b5a44_0 bioconda
bioconductor-rhdf5lib 1.8.0 r36h516909a_0 bioconda
bioconductor-s4vectors 0.24.0 r36h516909a_0 bioconda
bioconductor-singlecellexperiment 1.8.0 r36_0 bioconda
bioconductor-summarizedexperiment 1.16.0 r36_0 bioconda
bioconductor-xvector 0.26.0 r36h516909a_0 bioconda
bioconductor-zlibbioc 1.32.0 r36h516909a_0 bioconda
blas 2.11 openblas conda-forge
bokeh 2.1.1 py37_0
boltons 20.2.0 pypi_0 pypi
bwidget 1.9.11 1
bzip2 1.0.8 h7b6447c_0
ca-certificates 2020.6.20 hecda079_0 conda-forge
cairo 1.16.0 hcf35c78_1003 conda-forge
certifi 2020.6.20 py37hc8dfbb8_0 conda-forge
click 7.1.2 py_0
cloudpickle 1.5.0 py_0
colormath 3.0.0 py_2 conda-forge
curl 7.69.1 hbc83047_0
cycler 0.10.0 py37_0
cytoolz 0.10.1 py37h7b6447c_0
dask 1.0.0 pypi_0 pypi
dask-core 2.20.0 py_0
decorator 4.4.2 py_0
dill 0.3.2 pypi_0 pypi
distributed 1.28.1 pypi_0 pypi
fontconfig 2.13.1 h86ecdb6_1001 conda-forge
freetype 2.10.2 h5ab3b9f_0
fribidi 1.0.9 h7b6447c_0
frozendict 1.2 pypi_0 pypi
fsspec 0.7.4 py_0
gcc_impl_linux-64 7.3.0 habb00fd_1
gcc_linux-64 7.3.0 h553295d_15
gfortran_impl_linux-64 7.3.0 hdf63c60_1
gfortran_linux-64 7.3.0 h553295d_15
glib 2.65.0 h3eb4bd4_0
gmp 6.2.0 he1b5a44_2 conda-forge
graphite2 1.3.14 h23475e2_0
gsl 2.6 h294904e_0 conda-forge
gxx_impl_linux-64 7.3.0 hdf63c60_1
gxx_linux-64 7.3.0 h553295d_15
h5py 2.10.0 py37hd6299e0_1
harfbuzz 2.4.0 h9f30f68_3 conda-forge
hdf5 1.10.6 hb1b8bf9_0
heapdict 1.0.1 py_0
icu 64.2 he1b5a44_1 conda-forge
interlap 0.2.6 pypi_0 pypi
jinja2 2.11.2 py_0
joblib 0.16.0 py_0
jpeg 9d h516909a_0 conda-forge
kiwisolver 1.2.0 py37hfd86e86_0
krb5 1.17.1 h173b8e3_0
lcms2 2.11 h396b838_0
ld_impl_linux-64 2.33.1 h53a641e_7
libblas 3.8.0 11_openblas conda-forge
libcblas 3.8.0 11_openblas conda-forge
libcurl 7.69.1 h20c2e04_0
libedit 3.1.20191231 h14c3975_1
libffi 3.3 he6710b0_2
libgcc-ng 9.1.0 hdf63c60_0
libgfortran-ng 7.3.0 hdf63c60_0
libiconv 1.15 h63c8f33_5
liblapack 3.8.0 11_openblas conda-forge
liblapacke 3.8.0 11_openblas conda-forge
libllvm8 8.0.1 hc9558a2_0 conda-forge
libllvm9 9.0.1 h4a3c616_1
libopenblas 0.3.6 h5a2b251_2 anaconda
libpng 1.6.37 hbc83047_0
libssh2 1.9.0 h1ba5d50_1
libstdcxx-ng 9.1.0 hdf63c60_0
libtiff 4.1.0 h2733197_1
libuuid 2.32.1 h14c3975_1000 conda-forge
libxcb 1.14 h7b6447c_0
libxml2 2.9.10 hee79883_0 conda-forge
llvmlite 0.33.0 py37hc6ec683_1
locket 0.2.0 py37_1
loompy 2.0.17 py_0 conda-forge
lz4-c 1.9.2 he6710b0_0
make 4.2.1 h1bed415_1
markupsafe 1.1.1 py37h14c3975_1
matplotlib 3.2.2 1 conda-forge
matplotlib-base 3.2.2 py37h30547a4_0 conda-forge
msgpack-python 1.0.0 py37hfd86e86_1
multiprocessing-on-dill 3.5.0a4 pypi_0 pypi
ncurses 6.2 he6710b0_1
networkx 2.4 py_0
numba 0.50.1 py37h0573a6f_1
numpy 1.17.0 py37h99e49ec_0 r
numpy-base 1.17.0 py37h2f8d375_0 r
olefile 0.46 py37_0
openssl 1.1.1g h516909a_0 conda-forge
packaging 20.4 py_0
pandas 0.25.3 py37hb3f55d8_0 conda-forge
pango 1.42.4 h7062337_4 conda-forge
parallel 20200522 0 conda-forge
partd 1.1.0 py_0
patsy 0.5.1 py37_0
pcre 8.44 he6710b0_0
pcre2 10.34 h2f06484_0 conda-forge
perl 5.26.2 h14c3975_0
pillow 7.2.0 py37hb39fc2d_0
pip 20.1.1 py37_1
pixman 0.38.0 h7b6447c_0
psutil 5.7.0 py37h7b6447c_0
pyarrow 0.16.0 pypi_0 pypi
pycairo 1.19.1 py37h01af8b0_3 conda-forge
pyparsing 2.4.7 py_0
pyscenic 0.10.2 pypi_0 pypi
python 3.7.7 hcff3b4d_5
python-dateutil 2.8.1 py_0
python_abi 3.7 1_cp37m conda-forge
pytz 2020.1 py_0
pyyaml 5.3.1 py37h7b6447c_1
r-abind 1.4_5 r36h6115d3f_0 r
r-apcluster 1.4.8 r36he1b5a44_1 conda-forge
r-ape 5.3 r36h29659fb_0 r
r-askpass 1.0 r36h14c3975_0 r
r-assertthat 0.2.1 r36h6115d3f_0 eugene_t
r-backports 1.1.4 r36h96ca727_0 r
r-base 3.6.3 h316533a_2 conda-forge
r-base64enc 0.1_3 r36h96ca727_4 r
r-bh 1.69.0_1 r36h6115d3f_0 r
r-bibtex 0.4.2 r36h96ca727_1 r
r-biocmanager 1.30.10 r36h6115d3f_1 conda-forge
r-bit 1.1_14 r36h96ca727_0 r
r-bit64 0.9_7 r36h96ca727_0 r
r-bitops 1.0_6 r36h96ca727_4 r
r-blob 1.1.1 r36h6115d3f_0 r
r-brew 1.0_6 r36h6115d3f_4 r
r-callr 3.2.0 r36h6115d3f_0 r
r-catools 1.17.1.2 r36h29659fb_0 r
r-cli 1.1.0 r36h6115d3f_0 r
r-clipr 0.6.0 r36h6115d3f_0 r
r-clisymbols 1.2.0 r36h6115d3f_0 r
r-cluster 2.1.0 r36ha65eedd_0 eugene_t
r-codetools 0.2_16 r36h6115d3f_0 r
r-colorspace 1.4_1 r36h96ca727_0 eugene_t
r-conos 1.1.2 r36hf484d3e_0 eugene_t
r-cowplot 1.0.0 r36h6115d3f_2 conda-forge
r-crayon 1.3.4 r36h6115d3f_0 eugene_t
r-crosstalk 1.0.0 r36h6115d3f_0 r
r-curl 3.3 r36h96ca727_0 r
r-d3heatmap 0.6.1.2 r36h6115d3f_1003 conda-forge
r-data.table 1.12.2 r36h96ca727_0 r
r-dbi 1.0.0 r36h6115d3f_0 r
r-dendextend 1.13.4 r36h6115d3f_1 conda-forge
r-dendsort 0.3.3 r36h6115d3f_0 r
r-desc 1.2.0 r36h6115d3f_0 r
r-devtools 2.0.2 r36h6115d3f_0 r
r-digest 0.6.21 r36h29659fb_0 eugene_t
r-doparallel 1.0.15 r36h6115d3f_1 conda-forge
r-dorng 1.8.2 r36h6115d3f_1 conda-forge
r-dotcall64 1.0_0 r36ha65eedd_0 r
r-dplyr 0.8.0.1 r36h29659fb_0 r
r-dqrng 0.2.1 r36h0357c0b_2 conda-forge
r-egg 0.4.5 r36h6115d3f_2 conda-forge
r-entropy 1.2.1 r36h6115d3f_0 r
r-fansi 0.4.0 r36h96ca727_0 eugene_t
r-fastcluster 1.1.25 r36h29659fb_0 r
r-fields 10.3 r36h9bbef5b_1 conda-forge
r-fitdistrplus 1.1_1 r36h6115d3f_0 conda-forge
r-foreach 1.4.7 r36h6115d3f_0 eugene_t
r-formatr 1.6 r36h6115d3f_0 r
r-fs 1.2.7 r36h29659fb_0 r
r-futile.logger 1.4.3 r36h6115d3f_1003 conda-forge
r-futile.options 1.0.1 r36h6115d3f_0 r
r-future 1.18.0 r36h6115d3f_0 conda-forge
r-future.apply 1.6.0 r36h6115d3f_0 conda-forge
r-gbrd 0.4_11 r36h6115d3f_0 r
r-gclus 1.3.2 r36h6115d3f_0 r
r-gdata 2.18.0 r36h6115d3f_0 r
r-gdtools 0.2.2 r36h36050f4_1 conda-forge
r-ggplot2 3.2.1 r36h6115d3f_0 eugene_t
r-ggrepel 0.8.1 r36h29659fb_0 eugene_t
r-ggridges 0.5.2 r36h6115d3f_2 conda-forge
r-gh 1.0.1 r36h6115d3f_0 r
r-git2r 0.25.2 r36h96ca727_0 r
r-globals 0.12.5 r36h6115d3f_1 conda-forge
r-glue 1.3.1 r36h96ca727_0 eugene_t
r-gplots 3.0.1.1 r36h6115d3f_0 r
r-gridextra 2.3 r36h6115d3f_0 r
r-grr 0.9.5 r36h0357c0b_1004 conda-forge
r-gtable 0.3.0 r36h6115d3f_0 eugene_t
r-gtools 3.8.1 r36h96ca727_0 r
r-heatmaply 1.1.0 r36h6115d3f_1 conda-forge
r-hexbin 1.27.2 r36ha65eedd_0 r
r-hms 0.4.2 r36h6115d3f_0 r
r-htmltools 0.3.6 r36h29659fb_0 r
r-htmlwidgets 1.3 r36h6115d3f_0 r
r-httpuv 1.5.1 r36h29659fb_0 r
r-httr 1.4.0 r36h6115d3f_0 r
r-ica 1.0_2 r36h6115d3f_0 eugene_t
r-igraph 1.2.4.1 r36h80f5a37_0 r
r-ini 0.3.1 r36h6115d3f_0 r
r-irlba 2.3.3 r36h96ca727_0 eugene_t
r-iterators 1.0.10 r36h6115d3f_0 r
r-jsonlite 1.6 r36h96ca727_0 r
r-kernsmooth 2.23_15 r36ha65eedd_4 r
r-labeling 0.3 r36h6115d3f_4 r
r-lambda.r 1.2.3 r36h6115d3f_0 r
r-later 0.8.0 r36h29659fb_0 r
r-lattice 0.20_38 r36h96ca727_0 eugene_t
r-lazyeval 0.2.2 r36h96ca727_0 eugene_t
r-listenv 0.8.0 r36h6115d3f_1 conda-forge
r-lmtest 0.9_36 r36ha65eedd_0 r
r-locfit 1.5_9.4 r36hcdcec82_1 conda-forge
r-lsei 1.2_0.1 r36h6e990d7_0 conda-forge
r-magrittr 1.5 r36h6115d3f_4 r
r-maps 3.3.0 r36h96ca727_0 r
r-mass 7.3_51.4 r36h96ca727_0 eugene_t
r-matrix 1.2_17 r36h96ca727_0 r
r-matrix.utils 0.9.8 r36h6115d3f_1 conda-forge
r-matrixstats 0.54.0 r36h96ca727_0 r
r-memoise 1.1.0 r36h6115d3f_0 r
r-metap 1.1 r36h6115d3f_2 conda-forge
r-mgcv 1.8_29 r36h96ca727_0 eugene_t
r-mime 0.6 r36h96ca727_0 r
r-modes 0.7.0 r36h6115d3f_1002 conda-forge
r-munsell 0.5.0 r36h6115d3f_0 eugene_t
r-nlme 3.1_141 r36ha65eedd_0 eugene_t
r-npsurv 0.4_0.1 r36h6115d3f_0 conda-forge
r-openssl 1.3 r36h96ca727_0 r
r-pagoda2 0.1.0 r36hf484d3e_0 eugene_t
r-patchwork 1.0.0 r36h6115d3f_1 conda-forge
r-pbapply 1.4_2 r36h6115d3f_0 r
r-pillar 1.3.1 r36h6115d3f_0 r
r-pkgbuild 1.0.3 r36h6115d3f_0 r
r-pkgconfig 2.0.3 r36h6115d3f_0 eugene_t
r-pkgload 1.0.2 r36h29659fb_0 r
r-pkgmaker 0.31.1 r36h6115d3f_1 conda-forge
r-plogr 0.2.0 r36h6115d3f_0 r
r-plotly 4.9.0 r36h6115d3f_0 r
r-plyr 1.8.4 r36h29659fb_0 eugene_t
r-png 0.1_7 r36h96ca727_0 r
r-prettyunits 1.0.2 r36h6115d3f_0 r
r-processx 3.3.0 r36h96ca727_0 r
r-progress 1.2.0 r36h6115d3f_0 r
r-promises 1.0.1 r36h29659fb_0 r
r-ps 1.3.0 r36h96ca727_0 r
r-purrr 0.3.2 r36h96ca727_0 r
r-qap 0.1_1 r36h9bbef5b_1005 conda-forge
r-r.methodss3 1.7.1 r36h6115d3f_0 r
r-r.oo 1.22.0 r36h6115d3f_0 r
r-r.utils 2.8.0 r36h6115d3f_0 r
r-r6 2.4.0 r36h6115d3f_0 r
r-rann 2.6.1 r36h0357c0b_2 conda-forge
r-rcmdcheck 1.3.2 r36h6115d3f_0 r
r-rcolorbrewer 1.1_2 r36h6115d3f_0 eugene_t
r-rcpp 1.0.2 r36h29659fb_0 eugene_t
r-rcpparmadillo 0.9.700.2.0 r36h29659fb_0 eugene_t
r-rcppeigen 0.3.3.5.0 r36h29659fb_0 r
r-rcppprogress 0.4.1 r36h6115d3f_0 r
r-rcurl 1.98_1.2 r36hcdcec82_1 conda-forge
r-rdpack 1.0.0 r36h6115d3f_0 conda-forge
r-readr 1.3.1 r36h0357c0b_1004 conda-forge
r-registry 0.5_1 r36h6115d3f_0 r
r-remotes 2.0.4 r36h6115d3f_0 r
r-reshape2 1.4.3 r36h29659fb_0 eugene_t
r-reticulate 1.12 r36h29659fb_0 r
r-rjson 0.2.20 r36h29659fb_0 r
r-rlang 0.4.0 r36h96ca727_0 eugene_t
r-rmtstat 0.3 r36h6115d3f_0 r
r-rngtools 1.5 r36h6115d3f_1 conda-forge
r-rocr 1.0_11 r36h6115d3f_1 conda-forge
r-rook 1.1_1 r36h96ca727_0 r
r-rprojroot 1.3_2 r36h6115d3f_0 r
r-rsqlite 2.1.1 r36h29659fb_0 r
r-rstudioapi 0.10 r36h6115d3f_0 r
r-rsvd 1.0.2 r36h6115d3f_0 r
r-rtsne 0.15 r36h29659fb_0 eugene_t
r-scales 1.0.0 r36h29659fb_0 eugene_t
r-sctransform 0.2.1 r36h0357c0b_1 conda-forge
r-sdmtools 1.1_221.2 r36h516909a_1 conda-forge
r-seriation 1.2_8 r36h9bbef5b_1 conda-forge
r-sessioninfo 1.1.1 r36h6115d3f_0 r
r-seurat 3.0.2 r36h0357c0b_1 bioconda
r-shiny 1.3.2 r36h6115d3f_0 r
r-sitmo 2.0.1 r36h29659fb_0 r
r-snow 0.4_3 r36h6115d3f_0 eugene_t
r-sourcetools 0.1.7 r36h29659fb_0 r
r-spam 2.2_2 r36ha65eedd_0 r
r-stringi 1.4.3 r36h29659fb_0 eugene_t
r-stringr 1.4.0 r36h6115d3f_0 eugene_t
r-survival 2.44_1.1 r36h96ca727_0 r
r-svglite 1.2.3.2 r36h0357c0b_0 conda-forge
r-sys 3.2 r36h96ca727_0 r
r-systemfonts 0.2.3 r36hc9cbd26_0 conda-forge
r-tibble 2.1.3 r36h96ca727_0 eugene_t
r-tidyr 0.8.3 r36h29659fb_0 r
r-tidyselect 0.2.5 r36h29659fb_0 r
r-triebeard 0.3.0 r36h29659fb_0 r
r-tsne 0.1_3 r36h6115d3f_0 r
r-tsp 1.1_7 r36h96ca727_0 r
r-urltools 1.7.3 r36h0357c0b_2 conda-forge
r-usethis 1.5.0 r36h6115d3f_0 r
r-utf8 1.1.4 r36h96ca727_0 eugene_t
r-viridis 0.5.1 r36h6115d3f_0 r
r-viridislite 0.3.0 r36h6115d3f_0 eugene_t
r-webgestaltr 0.4.3 r36h0357c0b_1 conda-forge
r-webshot 0.5.1 r36h6115d3f_0 r
r-whisker 0.3_2 r36h6115d3f_4 r
r-withr 2.1.2 r36h6115d3f_0 eugene_t
r-xopen 1.0.0 r36h6115d3f_0 r
r-xtable 1.8_4 r36h6115d3f_0 r
r-yaml 2.2.0 r36h96ca727_0 r
r-zoo 1.8_6 r36h96ca727_0 r
readline 8.0 h7b6447c_0
scikit-learn 0.23.1 py37h7ea95a0_0
scipy 1.5.0 py37habc2bb6_0
seaborn 0.10.1 1 conda-forge
seaborn-base 0.10.1 py_1 conda-forge
sed 4.7 h1bed415_1000 conda-forge
setuptools 49.2.0 py37_0
six 1.15.0 py_0
sortedcontainers 2.2.2 py_0
spectra 0.0.11 py_1 conda-forge
sqlite 3.32.3 h62c20be_0
statsmodels 0.11.1 py37h8f50634_2 conda-forge
tbb 2020.0.133 pypi_0 pypi
tblib 1.6.0 py_0
threadpoolctl 2.1.0 pyh5ca1d4c_0
tk 8.6.10 hbc83047_0
tktable 2.10 h14c3975_0
toolz 0.10.0 py_0
tornado 6.0.4 py37h7b6447c_1
tqdm 4.47.0 pypi_0 pypi
typing 3.7.4.3 py37_0
typing_extensions 3.7.4.2 py_0
umap-learn 0.4.3 py37hc8dfbb8_0 conda-forge
wheel 0.34.2 py37_0
xorg-kbproto 1.0.7 h14c3975_1002 conda-forge
xorg-libice 1.0.10 h516909a_0 conda-forge
xorg-libsm 1.2.3 h84519dc_1000 conda-forge
xorg-libx11 1.6.9 h516909a_0 conda-forge
xorg-libxext 1.3.4 h516909a_0 conda-forge
xorg-libxrender 0.9.10 h516909a_1002 conda-forge
xorg-renderproto 0.11.1 h14c3975_1002 conda-forge
xorg-xextproto 7.3.0 h14c3975_1002 conda-forge
xorg-xproto 7.0.31 h14c3975_1007 conda-forge
xz 5.2.5 h7b6447c_0
yaml 0.2.5 h7b6447c_0
zict 2.0.0 py_0
zlib 1.2.11 h7b6447c_3
zstd 1.4.4 h0b5b093_3

cflerin commented 4 years ago

Hi @sykim14 ,

It's not clear from your error what's going on unfortunately. Is the process actually running and using CPU time for the entire 24 hours?

One thing to try: if you're calling that function directly, it defaults to a single core. Could you try increasing the number of workers (if you have the resources available)?

bin_mtx, thresholds = binarize(auc_mtx, num_workers=20)
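
Before committing to a long run, it may help to time the call on a handful of regulons first. The following is only a sketch: `time_on_subset` is a hypothetical helper (not part of pySCENIC), and `toy_binarize` is a stand-in that thresholds each regulon at its mean so the example runs without pySCENIC installed. It is not the algorithm pySCENIC uses.

```python
import time

import numpy as np
import pandas as pd


def time_on_subset(func, auc_mtx, n_regulons=5, **kwargs):
    """Run `func` on the first `n_regulons` columns; return (result, seconds)."""
    subset = auc_mtx.iloc[:, :n_regulons]
    start = time.perf_counter()
    result = func(subset, **kwargs)
    return result, time.perf_counter() - start


def toy_binarize(auc_mtx, num_workers=1):
    """Stand-in for illustration only: thresholds each regulon at its mean.

    This is NOT pySCENIC's algorithm; num_workers is accepted but unused.
    """
    thresholds = auc_mtx.mean(axis=0)
    bin_mtx = auc_mtx.gt(thresholds, axis="columns").astype(int)
    return bin_mtx, thresholds


# Random toy data: 100 cells (rows) by 20 regulons (columns).
rng = np.random.default_rng(0)
auc_mtx = pd.DataFrame(
    rng.random((100, 20)),
    index=[f"cell_{i}" for i in range(100)],
    columns=[f"TF{i}(+)" for i in range(20)],
)

(bin_mtx, thresholds), seconds = time_on_subset(
    toy_binarize, auc_mtx, n_regulons=5, num_workers=1
)
print(bin_mtx.shape, f"{seconds:.4f} s")
```

With pySCENIC installed, the same helper could presumably be called with the real function, e.g. `time_on_subset(binarize, auc_mtx, n_regulons=5, num_workers=8)`. Scaling the elapsed time by the ratio of total to tested regulons gives only a rough estimate of the full run.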
sykim14 commented 4 years ago

@cflerin Hi, thank you for your reply.

I tried changing num_workers, but when I run 'top' in the Linux terminal, I can see the number of cores in use increase and then drop back to 1 after a few seconds, so I don't think it is running properly.

The error message after Ctrl+C doesn't change no matter how long I run the code or how much I increase num_workers.

How long does this binarisation step usually take?

Thank you

cflerin commented 4 years ago

I just tried this on a matrix of 10k cells by 375 regulons and it completed in 5 minutes using 8 workers. Can you check if your matrix looks right (cells in rows, regulons in columns), like this:

                    ARNTL2_(+)  ASCL2_(+)  ATF1_(+)  ATF3_(+)  ATF4_(+)  ...  \
AAACCCAAGCGCCCAT-1    0.022281   0.000000  0.050656  0.046860  0.030110  ...   
AAACCCACAGAGTTGG-1    0.023544   0.000000  0.030828  0.068298  0.041143  ...   
AAACCCACAGGTATGG-1    0.022685   0.138494  0.062297  0.033759  0.029247  ...   
AAACCCACATAGTCAC-1    0.042440   0.000000  0.051211  0.039889  0.030493  ...   
AAACCCACATCCAATG-1    0.044689   0.115552  0.037881  0.043455  0.018174  ...   
...                        ...        ...       ...       ...       ...  ...   
TTTGTTGGTCTGTAAC-1    0.015865   0.000000  0.051778  0.052338  0.043716  ...   
TTTGTTGGTGCGTCGT-1    0.021624   0.000000  0.044749  0.039394  0.029402  ...   
TTTGTTGGTTTGAACC-1    0.009271   0.000000  0.037608  0.038684  0.050083  ...   
TTTGTTGTCCAAGCCG-1    0.014399   0.000000  0.052829  0.042617  0.045972  ...   
TTTGTTGTCTTACTGT-1    0.015511   0.000000  0.047092  0.041113  0.044001  ...

sykim14 commented 4 years ago

@cflerin Hi, I tried printing auc_mtx and it gives me a matrix very similar to yours. The only differences seem to be that 'Regulon' and 'Cell' appear as labels in my matrix and that my regulon names are missing the '_'. Would this be a problem? Also, when I saved this to .csv, I found that the values are written with differing numbers of decimal places (some zeros appear as 0.0 and others as 0).

Could one of these be the reason for my error?

>>> print(auc_mtx)
Regulon                         AHR(+)  ARID3A(+)  ...  ZNF91(+)  ZSCAN29(+)
Cell                                               ...                      
SAMPLE_AAACCTGGTGGCAAAC.1_2   0.000000   0.044381  ...    0.0000         0.0
SAMPLE_AAACGGGAGAGAGCTC.1_2   0.046174   0.034884  ...    0.0000         0.0
SAMPLE_AAACGGGAGATGCGAC.1_2   0.019101   0.021464  ...    0.0000         0.0
SAMPLE_AAAGTAGAGTGAAGAG.1_2   0.014412   0.012412  ...    0.0000         0.0
SAMPLE_AAATGCCGTCCATGAT.1_2   0.021487   0.028292  ...    0.0000         0.0
...                                ...        ...  ...       ...         ...
SAMPLE_TTTGCGCCAGACGTAG.1_11  0.021359   0.028727  ...    0.0000         0.0
SAMPLE_TTTGGTTAGCTACCGC.1_11  0.031224   0.015983  ...    0.0000         0.0
SAMPLE_TTTGGTTGTGCAGACA.1_11  0.027919   0.031384  ...    0.0535         0.0
SAMPLE_TTTGTCAAGGTCATCT.1_11  0.122827   0.032754  ...    0.0000         0.0
SAMPLE_TTTGTCACATGGTCAT.1_11  0.026570   0.014897  ...    0.0000         0.0

[3782 rows x 422 columns]

cflerin commented 4 years ago

The underscore in the regulon names won't be a problem here, and I think pandas should handle the decimals when reading the file. As long as the cell column is the data frame index and not a data column, it should be fine... you could check that auc_mtx.iloc[:,0] returns something like:

AAACCCAAGCGCCCAT-1    0.041676
AAACCCACAGAGTTGG-1    0.072727
AAACCCACAGGTATGG-1    0.023048
AAACCCACATAGTCAC-1    0.039112
AAACCCACATCCAATG-1    0.026668
                        ...   
TTTGTTGGTCTGTAAC-1    0.043771
TTTGTTGGTGCGTCGT-1    0.044097
TTTGTTGGTTTGAACC-1    0.043542
TTTGTTGTCCAAGCCG-1    0.031581
TTTGTTGTCTTACTGT-1    0.044414
Name: AHR_(+), Length: 10280, dtype: float64

I loaded mine with

auc_mtx = pd.read_csv('auc_mtx.tsv', sep='\t', index_col=0)

Did you maybe merge two matrices together? You can check for NAs from the join, which might cause this, with: pd.isna(auc_mtx).sum() (If you find any, I would recommend setting them to 0)
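
The checks above can be bundled into a small sanity report. This is a minimal sketch on toy data: `check_auc_matrix` is a hypothetical helper name, and the values are made up for illustration, with one missing entry added on purpose.

```python
import numpy as np
import pandas as pd


def check_auc_matrix(auc_mtx):
    """Summarise shape, missing values and non-numeric columns of an AUC matrix."""
    return {
        "shape": auc_mtx.shape,  # expected: (n_cells, n_regulons)
        "n_missing": int(auc_mtx.isna().sum().sum()),
        "non_numeric_columns": [
            col
            for col, dtype in auc_mtx.dtypes.items()
            if not pd.api.types.is_numeric_dtype(dtype)
        ],
    }


# Toy matrix: cells in rows, regulons in columns, one NaN introduced on purpose.
toy = pd.DataFrame(
    {"AHR(+)": [0.0, 0.046, np.nan], "ARID3A(+)": [0.044, 0.035, 0.021]},
    index=pd.Index(["cell_a", "cell_b", "cell_c"], name="Cell"),
)

before = check_auc_matrix(toy)
cleaned = toy.fillna(0)  # set missing AUC values to 0, as suggested above
after = check_auc_matrix(cleaned)
print(before)
print(after)
```

A non-empty `non_numeric_columns` list would suggest that the cell identifiers ended up as a data column rather than as the index.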

sykim14 commented 4 years ago

@cflerin Thanks for your reply again :)

auc_mtx.iloc[:,0] gives me...

>>> auc_mtx.iloc[:,0]
Cell
SAMPLE_AAACCTGGTGGCAAAC.1_2     0.000000
SAMPLE_AAACGGGAGAGAGCTC.1_2     0.046174
SAMPLE_AAACGGGAGATGCGAC.1_2     0.019101
SAMPLE_AAAGTAGAGTGAAGAG.1_2     0.014412
SAMPLE_AAATGCCGTCCATGAT.1_2     0.021487
                                  ...   
SAMPLE_TTTGCGCCAGACGTAG.1_11    0.021359
SAMPLE_TTTGGTTAGCTACCGC.1_11    0.031224
SAMPLE_TTTGGTTGTGCAGACA.1_11    0.027919
SAMPLE_TTTGTCAAGGTCATCT.1_11    0.122827
SAMPLE_TTTGTCACATGGTCAT.1_11    0.026570
Name: AHR(+), Length: 3782, dtype: float64

Is this similar to what you got?

pd.isna(auc_mtx).sum() gives me

>>> pd.isna(auc_mtx).sum()
Regulon
AHR(+)        0
ARID3A(+)     0
ARNT(+)       0
ARNTL(+)      0
ASCL1(+)      0
             ..
ZNF84(+)      0
ZNF841(+)     0
ZNF891(+)     0
ZNF91(+)      0
ZSCAN29(+)    0
Length: 422, dtype: int64

I don't think I merged two matrices together. I think everything is set to 0?

Thank you

cflerin commented 4 years ago

Ok, I don't see anything obviously wrong. Are you reading this file from disk? Can you post the first few lines?

Could you possibly try the binarization step on a different compute environment (fresh install, different computer, etc.)?

sykim14 commented 4 years ago

@cflerin

Oh, I was printing auc_mtx from the in-memory object, not reading it from a saved csv file. The code I used to read the csv file is:

AUCELL_MTX_FNAME = os.path.join(DATA_FOLDER, 'cancer_mp_AUCMatrix.csv')
auc_mtx_ca = pd.read_csv(AUCELL_MTX_FNAME, index_col=0)

Although I set index_col=0, I still see 'Cell' as a row label. Does this mean 'Cell' is part of my data columns rather than the index?

>>> auc_mtx_ca
                                AHR(+)  ARID3A(+)  ...  ZNF91(+)  ZSCAN29(+)
Cell                                               ...                      
SAMPLE_AAACCTGGTGGCAAAC.1_2   0.000000   0.044381  ...    0.0000         0.0
SAMPLE_AAACGGGAGAGAGCTC.1_2   0.046174   0.034884  ...    0.0000         0.0
SAMPLE_AAACGGGAGATGCGAC.1_2   0.019101   0.021464  ...    0.0000         0.0
SAMPLE_AAAGTAGAGTGAAGAG.1_2   0.014412   0.012412  ...    0.0000         0.0
SAMPLE_AAATGCCGTCCATGAT.1_2   0.021487   0.028292  ...    0.0000         0.0
...                                ...        ...  ...       ...         ...
SAMPLE_TTTGCGCCAGACGTAG.1_11  0.021359   0.028727  ...    0.0000         0.0
SAMPLE_TTTGGTTAGCTACCGC.1_11  0.031224   0.015983  ...    0.0000         0.0
SAMPLE_TTTGGTTGTGCAGACA.1_11  0.027919   0.031384  ...    0.0535         0.0
SAMPLE_TTTGTCAAGGTCATCT.1_11  0.122827   0.032754  ...    0.0000         0.0
SAMPLE_TTTGTCACATGGTCAT.1_11  0.026570   0.014897  ...    0.0000         0.0

[3782 rows x 422 columns]

I will try with a fresh installation.

Thank you

cflerin commented 4 years ago

I think the index column title is fine, and a similar matrix works for me here.
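
As a quick illustration of why the title is harmless, here is a small sketch with a made-up two-cell CSV (the cell names and values are toy data). When `index_col=0` is used, pandas keeps the header of the first column as the index name, which is what gets printed above the row labels.

```python
import io

import pandas as pd

# Toy CSV text standing in for a saved AUC matrix (hypothetical values).
csv_text = (
    "Cell,AHR(+),ARID3A(+)\n"
    "SAMPLE_A,0.000000,0.044381\n"
    "SAMPLE_B,0.046174,0.034884\n"
)

auc_mtx = pd.read_csv(io.StringIO(csv_text), index_col=0)

print(auc_mtx.index.name)         # name of the index, taken from the CSV header
print(list(auc_mtx.columns))      # only the regulon columns remain as data
print("Cell" in auc_mtx.columns)  # whether 'Cell' is a data column
```

If `"Cell"` showed up in `auc_mtx.columns`, the identifiers would have been read as data and the matrix would need reloading with `index_col=0`.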

This seems to be related to #116 but I don't have any further suggestions right now, sorry.

sykim14 commented 4 years ago

@cflerin Hi, I was able to run binarize with a fresh environment!

Thank you so much for your help!