KarchinLab / 2020plus

Classifies genes as an oncogene, tumor suppressor gene, or as a non-driver gene by using Random Forests
http://2020plus.readthedocs.org
Apache License 2.0

Dependencies problem #26

Closed vmelichar closed 6 months ago

vmelichar commented 6 months ago

Hi,

I am trying to run 20/20+ on a Linux machine with the following command:

snakemake -s Snakefile pretrained_predict -p --cores 40 --config mutations="/mnt/scratch/melichv/2020+/data/mutations_maftools.maf" data_dir="/mnt/scratch/melichv/2020+/data/" output_dir="/mnt/scratch/melichv/2020+/output/" trained_classifier="/mnt/scratch/melichv/2020+/data/2020plus_10k.Rdata"

Firstly, I was getting this error:

Failed to import the site module
Traceback (most recent call last):
  File "/home/melichv/miniconda3/envs/2020plus/lib/python3.6/site.py", line 545, in <module>
    main()
  File "/home/melichv/miniconda3/envs/2020plus/lib/python3.6/site.py", line 531, in main
    known_paths = addusersitepackages(known_paths)
  File "/home/melichv/miniconda3/envs/2020plus/lib/python3.6/site.py", line 282, in addusersitepackages
    user_site = getusersitepackages()
  File "/home/melichv/miniconda3/envs/2020plus/lib/python3.6/site.py", line 258, in getusersitepackages
    user_base = getuserbase() # this will also set USER_BASE
  File "/home/melichv/miniconda3/envs/2020plus/lib/python3.6/site.py", line 248, in getuserbase
    USER_BASE = get_config_var('userbase')
  File "/home/melichv/miniconda3/envs/2020plus/lib/python3.6/sysconfig.py", line 609, in get_config_var
    return get_config_vars().get(name)
  File "/home/melichv/miniconda3/envs/2020plus/lib/python3.6/sysconfig.py", line 558, in get_config_vars
    _init_posix(_CONFIG_VARS)
  File "/home/melichv/miniconda3/envs/2020plus/lib/python3.6/sysconfig.py", line 429, in _init_posix
    _temp = __import__(name, globals(), locals(), ['build_time_vars'], 0)
ModuleNotFoundError: No module named '_sysconfigdata_x86_64_conda_linux_gnu'

I fixed this with the solution posted here: https://stackoverflow.com/a/68685847. But then I encountered another problem that I do not know how to solve:

Traceback (most recent call last):
  File "/home/melichv/miniconda3/envs/2020plus/bin/mut_annotate", line 5, in <module>
    from prob2020.console.annotate import cli_main
  File "/home/melichv/miniconda3/envs/2020plus/lib/python3.6/site-packages/prob2020/console/annotate.py", line 12, in <module>
    import prob2020.cython.cutils as cutils
  File "__init__.pxd", line 918, in init prob2020.cython.cutils
ValueError: numpy.ufunc size changed, may indicate binary incompatibility. Expected 216 from C header, got 192 from PyObject

It is probably a problem with the version of some package. Could you please update the installation procedure?
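The "numpy.ufunc size changed" message usually indicates that a compiled extension was built against a different NumPy release than the one installed at runtime (the environment listing that follows shows numpy 1.15.4). A minimal sketch for checking the import in isolation, before launching the whole pipeline, is given here. The helper name `try_import` is hypothetical; the module name `prob2020.cython.cutils` is taken from the traceback above.

```python
import importlib


def try_import(module_name):
    """Try to import a module; return (ok, message) instead of raising."""
    try:
        importlib.import_module(module_name)
    except (ImportError, ValueError) as exc:
        # A NumPy binary mismatch surfaces as ValueError at import time,
        # a missing package as ImportError / ModuleNotFoundError.
        return False, "{}: {}".format(type(exc).__name__, exc)
    return True, "ok"


if __name__ == "__main__":
    # Module names below come from the traceback; adjust as needed.
    for name in ("numpy", "prob2020.cython.cutils"):
        ok, message = try_import(name)
        print(name, "->", message)
```

If the second import fails while the first succeeds, reinstalling the package that ships the extension against the installed NumPy (or recreating the environment) is the usual remedy.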

Conda env:

# packages in environment at /home/melichv/miniconda3/envs/2020plus:
#
# Name                    Version                   Build  Channel
_libgcc_mutex             0.1                        main
_openmp_mutex             5.1                       1_gnu
_r-mutex                  1.0.0               anacondar_1
_sysroot_linux-64_curr_repodata_hack 3                   haa98f57_10
bcrypt                    3.2.0            py36h7b6447c_0
binutils_impl_linux-64    2.35.1               h27ae35d_9
binutils_linux-64         2.35.1              h454624a_30
blas                      1.0                         mkl
brotlipy                  0.7.0           py36h27cfd23_1003
bwidget                   1.9.11                        1
bzip2                     1.0.8                h7b6447c_0
c-ares                    1.19.1               h5eee18b_0
ca-certificates           2023.12.12           h06a4308_0
cairo                     1.14.12              h8948797_3
certifi                   2021.5.30        py36h06a4308_0
cffi                      1.14.0           py36h2e261b9_0
charset-normalizer        2.0.4              pyhd3eb1b0_0
cryptography              35.0.0           py36hd23ed53_0
curl                      7.67.0               hbc83047_0
cycler                    0.11.0             pyhd3eb1b0_0
dbus                      1.13.18              hb2f20db_0
docutils                  0.17.1           py36h06a4308_1
dropbox                   5.2.1                    py36_0    bioconda
expat                     2.5.0                h6a678d5_0
filechunkio               1.6                      py36_0    bioconda
fontconfig                2.14.1               h52c9d5c_1
freetype                  2.12.1               h4a9f257_0
fribidi                   1.0.10               h7b6447c_0
ftputil                   3.2                      py36_0    bioconda
gcc_impl_linux-64         7.5.0               h7105cf2_17
gcc_linux-64              7.5.0               h8f34230_30
gfortran_impl_linux-64    7.5.0               ha8c8e06_17
gfortran_linux-64         7.5.0               h96bb648_30
giflib                    5.2.1                h5eee18b_3
glib                      2.63.1               h5a9c865_0
graphite2                 1.3.14               h295c915_1
gsl                       2.4                  h14c3975_4
gst-plugins-base          1.14.0               hbbd80ab_1
gstreamer                 1.14.0               hb453b48_1
gxx_impl_linux-64         7.5.0               h0a5bf11_17
gxx_linux-64              7.5.0               hffc177d_30
harfbuzz                  1.8.8                hffaf4a1_0
icu                       58.2                 he6710b0_3
idna                      3.3                pyhd3eb1b0_0
intel-openmp              2022.1.0          h9e868ea_3769
jinja2                    3.0.3              pyhd3eb1b0_0
jpeg                      9e                   h5eee18b_1
kernel-headers_linux-64   3.10.0              h57e8cba_10
kiwisolver                1.3.1            py36h2531618_0
krb5                      1.16.4               h173b8e3_0
lcms2                     2.12                 h3be6417_0
ld_impl_linux-64          2.35.1               h7274673_9
libcurl                   7.67.0               h20c2e04_0
libdeflate                1.0                  h14c3975_1    bioconda
libedit                   3.1.20230828         h5eee18b_0
libev                     4.33                 h7f8727e_1
libffi                    3.2.1             hf484d3e_1007
libgcc-devel_linux-64     7.5.0               hbbeae57_17
libgcc-ng                 11.2.0               h1234567_1
libgfortran-ng            7.5.0               ha8ba4b0_17
libgfortran4              7.5.0               ha8ba4b0_17
libgomp                   11.2.0               h1234567_1
libnghttp2                1.52.0               ha637b67_1
libpng                    1.6.39               h5eee18b_0
libsodium                 1.0.18               h7b6447c_0
libssh2                   1.10.0               h37d81fd_2
libstdcxx-devel_linux-64  7.5.0               hf0c5c8d_17
libstdcxx-ng              11.2.0               h1234567_1
libtiff                   4.2.0                hecacb30_2
libuuid                   1.41.5               h5eee18b_0
libwebp                   1.2.4                h11a3e52_1
libwebp-base              1.2.4                h5eee18b_1
libxcb                    1.15                 h7f8727e_0
libxml2                   2.9.14               h74e7548_0
lz4-c                     1.9.4                h6a678d5_0
make                      4.2.1                h1bed415_1
markupsafe                2.0.1            py36h27cfd23_0
matplotlib                3.3.2                h06a4308_0
matplotlib-base           3.3.2            py36h817c723_0
mkl                       2018.0.3                      1
mkl_fft                   1.0.6            py36h7dd41cf_0
mkl_random                1.0.1            py36h4414c95_1
ncurses                   6.4                  h6a678d5_0
numpy                     1.15.4           py36h1d66e8a_0
numpy-base                1.15.4           py36h81de0dd_0
olefile                   0.46               pyhd3eb1b0_0
openssl                   1.1.1w               h7f8727e_0
pandas                    0.25.3           py36he6710b0_0
pango                     1.42.4               h049681c_0
paramiko                  2.8.1              pyhd3eb1b0_0
pcre                      8.45                 h295c915_0
pillow                    8.3.1            py36h5aabda8_0
pip                       21.2.2           py36h06a4308_0
pixman                    0.40.0               h7f8727e_1
probabilistic2020         1.2.3                    pypi_0    pypi
psutil                    5.8.0            py36h27cfd23_1
pycparser                 2.21               pyhd3eb1b0_0
pynacl                    1.4.0            py36h7b6447c_1
pyopenssl                 22.0.0             pyhd3eb1b0_0
pyparsing                 3.0.4              pyhd3eb1b0_0
pyqt                      5.9.2            py36h05f1152_2
pysam                     0.15.3           py36hda2845c_1    bioconda
pysftp                    0.2.9                    py36_0    bioconda
pysocks                   1.7.1            py36h06a4308_0
python                    3.6.10               h191fe78_1
python-dateutil           2.8.2              pyhd3eb1b0_0
pytz                      2021.3             pyhd3eb1b0_0
pyyaml                    5.4.1            py36h27cfd23_1
qt                        5.9.7                h5867ecd_1
r                         3.6.0                     r36_0
r-assertthat              0.2.1             r36h6115d3f_0
r-base                    3.6.1                h9bb98a2_1
r-bh                      1.69.0_1          r36h6115d3f_0
r-bit                     1.1_14            r36h96ca727_0
r-bit64                   0.9_7             r36h96ca727_0
r-blob                    1.1.1             r36h6115d3f_0
r-boot                    1.3_20            r36h6115d3f_0
r-class                   7.3_15            r36h96ca727_0
r-cli                     1.1.0             r36h6115d3f_0
r-cluster                 2.0.8             r36ha65eedd_0
r-codetools               0.2_16            r36h6115d3f_0
r-crayon                  1.3.4             r36h6115d3f_0
r-dbi                     1.0.0             r36h6115d3f_0
r-dbplyr                  1.4.0             r36h6115d3f_0
r-digest                  0.6.18            r36h96ca727_0
r-dplyr                   0.8.0.1           r36h29659fb_0
r-fansi                   0.4.0             r36h96ca727_0
r-foreign                 0.8_71            r36h96ca727_0
r-glue                    1.3.1             r36h96ca727_0
r-kernsmooth              2.23_15           r36ha65eedd_4
r-lattice                 0.20_38           r36h96ca727_0
r-magrittr                1.5               r36h6115d3f_4
r-mass                    7.3_51.3          r36h96ca727_0
r-matrix                  1.2_17            r36h96ca727_0
r-memoise                 1.1.0             r36h6115d3f_0
r-mgcv                    1.8_28            r36h96ca727_0
r-nlme                    3.1_139           r36ha65eedd_0
r-nnet                    7.3_12            r36h96ca727_0
r-pillar                  1.3.1             r36h6115d3f_0
r-pkgconfig               2.0.2             r36h6115d3f_0
r-plogr                   0.2.0             r36h6115d3f_0
r-prettyunits             1.0.2             r36h6115d3f_0
r-purrr                   0.3.2             r36h96ca727_0
r-r6                      2.4.0             r36h6115d3f_0
r-randomforest            4.6_14            r36ha65eedd_0
r-rcpp                    1.0.1             r36h29659fb_0
r-recommended             3.6.0                     r36_0
r-rlang                   0.3.4             r36h96ca727_0
r-rpart                   4.1_15            r36h96ca727_0
r-rsqlite                 2.1.1             r36h29659fb_0
r-spatial                 7.3_11            r36h96ca727_4
r-survival                2.44_1.1          r36h96ca727_0
r-tibble                  2.1.1             r36h96ca727_0
r-tidyselect              0.2.5             r36h29659fb_0
r-utf8                    1.1.4             r36h96ca727_0
readline                  7.0                  h7b6447c_5
requests                  2.27.1             pyhd3eb1b0_0
rpy2                      2.9.4           py36r36h481b005_0
scikit-learn              0.19.2           py36h4989274_0
scipy                     0.19.1           py36h9976243_3
setuptools                58.0.4           py36h06a4308_0
sip                       4.19.8           py36hf484d3e_0
six                       1.16.0             pyhd3eb1b0_1
snakemake                 3.13.3                   py36_0    bioconda
sqlite                    3.33.0               h62c20be_0
sysroot_linux-64          2.17                h57e8cba_10
tbb                       2021.8.0             hdb19cb5_0
tbb4py                    2021.3.0         py36hd09550d_0
tk                        8.6.12               h1ccaba5_0
tktable                   2.10                 h14c3975_0
tornado                   6.1              py36h27cfd23_0
tzlocal                   2.1                      py36_0
urllib3                   1.26.8             pyhd3eb1b0_0
wheel                     0.37.1             pyhd3eb1b0_0
wrapt                     1.12.1           py36h7b6447c_1
xz                        5.2.10               h5eee18b_1
yaml                      0.2.5                h7b6447c_0
zlib                      1.2.13               h5eee18b_0
zstd                      1.5.5                hc292b87_0

ctokheim commented 6 months ago

Hi @vmelichar.

I've attached an updated environment file for installing the 2020plus environment (2020plus_environment.yml.gz). It worked for another user who had issues recently, and I just tested it myself and successfully ran the unit tests.

Before installing, I would either delete your current environment or change the name of the environment in the YAML file.

Then create the environment with conda (or mamba):

conda env create -f 2020plus_environment.yml

For me, conda was quite slow to create the environment, so I actually installed it with mamba:

conda install mamba
mamba env create -f 2020plus_environment.yml

This should install everything that is necessary, including the R dependencies. You then just need to activate the environment.

A quick way to test whether the 2020plus code is installed correctly is to run the unit tests (this is quicker than running 20/20+ on large data and only finding out later in the pipeline that there is a problem):

pip install nose
nosetests tests/test_features.py
nosetests tests/test_train.py
nosetests tests/test_classify.py

vmelichar commented 6 months ago

Thank you, @ctokheim, for the response. The environment installed perfectly and all the tests passed. But I am still getting errors when running on actual data. It might still be a problem with pandas.

This is the first error I get in the pipeline:

Traceback (most recent call last):
  File "/home/melichv/miniconda3/envs/2020plus/bin/probabilistic2020", line 8, in <module>
    sys.exit(cli_main())
  File "/home/melichv/miniconda3/envs/2020plus/lib/python3.6/site-packages/prob2020/console/probabilistic2020.py", line 284, in cli_main
    main(opts)
  File "/home/melichv/miniconda3/envs/2020plus/lib/python3.6/site-packages/prob2020/console/probabilistic2020.py", line 229, in main
    result_df = rt.main(opts, mutation_df)
  File "/home/melichv/miniconda3/envs/2020plus/lib/python3.6/site-packages/prob2020/console/randomization_test.py", line 392, in main
    opts['use_unmapped'])
  File "/home/melichv/miniconda3/envs/2020plus/lib/python3.6/site-packages/prob2020/python/count_frameshifts.py", line 37, in count_frameshift_total
    gene_df = fs_df[fs_df['Gene']==bed.gene_name]
  File "/home/melichv/miniconda3/envs/2020plus/lib/python3.6/site-packages/pandas/core/frame.py", line 2982, in __getitem__
    return self._getitem_frame(key)
  File "/home/melichv/miniconda3/envs/2020plus/lib/python3.6/site-packages/pandas/core/frame.py", line 3082, in _getitem_frame
    return self.where(key)
  File "/home/melichv/miniconda3/envs/2020plus/lib/python3.6/site-packages/pandas/core/generic.py", line 9276, in where
    cond, other, inplace, axis, level, errors=errors, try_cast=try_cast
  File "/home/melichv/miniconda3/envs/2020plus/lib/python3.6/site-packages/pandas/core/generic.py", line 9123, in _where
    axis=block_axis,
  File "/home/melichv/miniconda3/envs/2020plus/lib/python3.6/site-packages/pandas/core/internals/managers.py", line 557, in where
    return self.apply("where", **kwargs)
  File "/home/melichv/miniconda3/envs/2020plus/lib/python3.6/site-packages/pandas/core/internals/managers.py", line 436, in apply
    kwargs[k] = obj.reindex(b_items, axis=axis, copy=align_copy)
  File "/home/melichv/miniconda3/envs/2020plus/lib/python3.6/site-packages/pandas/util/_decorators.py", line 221, in wrapper
    return func(*args, **kwargs)
  File "/home/melichv/miniconda3/envs/2020plus/lib/python3.6/site-packages/pandas/core/frame.py", line 3976, in reindex
    return super().reindex(**kwargs)
  File "/home/melichv/miniconda3/envs/2020plus/lib/python3.6/site-packages/pandas/core/generic.py", line 4514, in reindex
    axes, level, limit, tolerance, method, fill_value, copy
  File "/home/melichv/miniconda3/envs/2020plus/lib/python3.6/site-packages/pandas/core/frame.py", line 3858, in _reindex_axes
    columns, method, copy, level, fill_value, limit, tolerance
  File "/home/melichv/miniconda3/envs/2020plus/lib/python3.6/site-packages/pandas/core/frame.py", line 3906, in _reindex_columns
Dropped 139 mutations after only keeping valid SNVs
    allow_dups=False,
  File "/home/melichv/miniconda3/envs/2020plus/lib/python3.6/site-packages/pandas/core/generic.py", line 4577, in _reindex_with_indexers
    copy=copy,
  File "/home/melichv/miniconda3/envs/2020plus/lib/python3.6/site-packages/pandas/core/internals/managers.py", line 1251, in reindex_indexer
    self.axes[axis]._can_reindex(indexer)
  File "/home/melichv/miniconda3/envs/2020plus/lib/python3.6/site-packages/pandas/core/indexes/base.py", line 3362, in _can_reindex
    raise ValueError("cannot reindex from a duplicate axis")
ValueError: cannot reindex from a duplicate axis

Then there is also this one:

Traceback (most recent call last):
  File "/home/melichv/miniconda3/envs/2020plus/bin/mut_annotate", line 8, in <module>
    sys.exit(cli_main())
  File "/home/melichv/miniconda3/envs/2020plus/lib/python3.6/site-packages/prob2020/console/annotate.py", line 432, in cli_main
    main(opts)
  File "/home/melichv/miniconda3/envs/2020plus/lib/python3.6/site-packages/prob2020/console/annotate.py", line 417, in main
    multiprocess_permutation(bed_dict, mut_df, opts, indel_df)
  File "/home/melichv/miniconda3/envs/2020plus/lib/python3.6/site-packages/prob2020/console/annotate.py", line 86, in multiprocess_permutation
    indel_cts_dict = indel_df['Gene'].value_counts().to_dict()
  File "/home/melichv/miniconda3/envs/2020plus/lib/python3.6/site-packages/pandas/core/generic.py", line 5179, in __getattr__
    return object.__getattribute__(self, name)
AttributeError: 'DataFrame' object has no attribute 'value_counts'

And this one:

Traceback (most recent call last):
  File "/home/melichv/miniconda3/envs/2020plus/lib/python3.6/site-packages/prob2020/python/utils.py", line 131, in wrapper
    result = f(*args, **kwds)
  File "/home/melichv/miniconda3/envs/2020plus/lib/python3.6/site-packages/prob2020/console/randomization_test.py", line 46, in singleprocess_permutation
    genes_with_mut = set(mut_df['Gene'].unique())
  File "/home/melichv/miniconda3/envs/2020plus/lib/python3.6/site-packages/pandas/core/generic.py", line 5179, in __getattr__
    return object.__getattribute__(self, name)
AttributeError: 'DataFrame' object has no attribute 'unique'
multiprocessing.pool.RemoteTraceback:
"""
Traceback (most recent call last):
  File "/home/melichv/miniconda3/envs/2020plus/lib/python3.6/multiprocessing/pool.py", line 119, in worker
    result = (True, func(*args, **kwds))
  File "/home/melichv/miniconda3/envs/2020plus/lib/python3.6/site-packages/prob2020/python/utils.py", line 131, in wrapper
    result = f(*args, **kwds)
  File "/home/melichv/miniconda3/envs/2020plus/lib/python3.6/site-packages/prob2020/console/randomization_test.py", line 46, in singleprocess_permutation
    genes_with_mut = set(mut_df['Gene'].unique())
  File "/home/melichv/miniconda3/envs/2020plus/lib/python3.6/site-packages/pandas/core/generic.py", line 5179, in __getattr__
    return object.__getattribute__(self, name)
AttributeError: 'DataFrame' object has no attribute 'unique'
"""

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/melichv/miniconda3/envs/2020plus/bin/probabilistic2020", line 8, in <module>
    sys.exit(cli_main())
  File "/home/melichv/miniconda3/envs/2020plus/lib/python3.6/site-packages/prob2020/console/probabilistic2020.py", line 284, in cli_main
    main(opts)
  File "/home/melichv/miniconda3/envs/2020plus/lib/python3.6/site-packages/prob2020/console/probabilistic2020.py", line 229, in main
    result_df = rt.main(opts, mutation_df)
  File "/home/melichv/miniconda3/envs/2020plus/lib/python3.6/site-packages/prob2020/console/randomization_test.py", line 414, in main
    permutation_result = multiprocess_permutation(bed_dict, mut_df, opts)
  File "/home/melichv/miniconda3/envs/2020plus/lib/python3.6/site-packages/prob2020/console/randomization_test.py", line 178, in multiprocess_permutation
    for chrom_result in process_results:
  File "/home/melichv/miniconda3/envs/2020plus/lib/python3.6/multiprocessing/pool.py", line 761, in next
    raise value
AttributeError: 'DataFrame' object has no attribute 'unique'

Do you think there is a problem with my data and not with 2020plus?

Thank you for your help!
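One situation that yields the "'DataFrame' object has no attribute 'unique'" message in pandas is a repeated column label: selecting that label then returns a DataFrame instead of a Series. The sketch below reproduces this on toy data. The values and the `Sample` column are hypothetical; only the `Gene` label appears in the tracebacks above, and this is not a confirmed diagnosis of the reported run.

```python
import pandas as pd

# Toy frame in which the label "Gene" appears twice (hypothetical data).
df = pd.DataFrame(
    [["TP53", "TP53", "s1"], ["KRAS", "KRAS", "s2"]],
    columns=["Gene", "Gene", "Sample"],
)

selected = df["Gene"]
# With a repeated label, the selection is a DataFrame, not a Series,
# so Series-only methods such as .unique() are unavailable.
print(type(selected).__name__)
print(hasattr(selected, "unique"))

# Listing the repeated labels points at the offending columns.
duplicated_labels = df.columns[df.columns.duplicated()].tolist()
print(duplicated_labels)
```

Running the same `columns.duplicated()` check on the loaded mutation table is a cheap way to rule this case in or out.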

vmelichar commented 6 months ago

There was indeed an error in my input file. I produced the MAF by merging multiple MAFs, but 20/20+ takes only 8 columns as input, so many rows were duplicates. I corrected the MAF so that it contains only specific columns and no duplicate rows, and now everything works.
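The cleanup described above can be sketched with pandas as follows. This is an illustration under assumptions, not the exact procedure used: the function name `clean_maf` and the toy column names other than `Gene` are hypothetical, and the actual list of columns that 20/20+ expects should be taken from its documentation.

```python
import pandas as pd


def clean_maf(df, keep_columns):
    """Keep the selected columns once each and drop fully duplicated rows."""
    # Drop repeated column labels, keeping the first occurrence of each.
    df = df.loc[:, ~df.columns.duplicated()]
    # Restrict to the wanted columns, then remove duplicate rows.
    return df[list(keep_columns)].drop_duplicates().reset_index(drop=True)


# Toy input standing in for a merged MAF (hypothetical values).
raw = pd.DataFrame(
    [
        ["TP53", "s1", "extra", "TP53"],
        ["TP53", "s1", "other", "TP53"],
        ["KRAS", "s2", "extra", "KRAS"],
    ],
    columns=["Gene", "Sample", "Note", "Gene"],
)

cleaned = clean_maf(raw, ["Gene", "Sample"])
print(cleaned)
```

Dropping the extra columns first matters here: rows that differ only in columns the tool ignores become exact duplicates once those columns are removed.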

I appreciate your help earlier.