aertslab / pySCENIC

pySCENIC is a lightning-fast python implementation of the SCENIC pipeline (Single-Cell rEgulatory Network Inference and Clustering) which enables biologists to infer transcription factors, gene regulatory networks and cell types from single-cell RNA-seq data.
http://scenic.aertslab.org
GNU General Public License v3.0

KeyError in pyscenic ctx (CLI) #103

Closed julienvibert closed 4 years ago

julienvibert commented 4 years ago

Hi, I'm trying to run the pyscenic CLI, I have already managed to run the grn step and I got an output "adjacencies.tsv", but when I proceed to the ctx step I have a KeyError. My command is the following one:

pyscenic ctx --mode dask_multiprocessing --annotations_fname $RESOURCES_FOLDER"motifs-v9-nr.hgnc-m0.001-o0.0.tbl" --num_workers 8 --output $DATA_FOLDER"regulons.csv" --expression_mtx_fname $DATA_FOLDER"exp.csv" $DATA_FOLDER"adjacencies.tsv" $DATABASE_FOLDER"hg38__refseq-r80__10kb_up_and_down_tss.mc9nr.feather" $DATABASE_FOLDER"hg38__refseq-r80__500bp_up_and_100bp_down_tss.mc9nr.feather"

And the error message I get is this one:

2019-10-25 11:44:51,593 - pyscenic.cli.pyscenic - INFO - Creating modules.

2019-10-25 11:44:53,560 - pyscenic.cli.pyscenic - INFO - Loading expression matrix.

2019-10-25 11:45:02,490 - pyscenic.utils - INFO - Calculating Pearson correlations.

2019-10-25 11:45:02,490 - pyscenic.utils - WARNING - Note on correlation calculation: the default behaviour for calculating the correlations has changed after pySCENIC verion 0.9.16. Previously, the default was to calculate the correlation between a TF and target gene using only cells with non-zero expression values (mask_dropouts=True). The current default is now to use all cells to match the behavior of the R verision of SCENIC. The original settings can be retained by setting 'rho_mask_dropouts=True' in the modules_from_adjacencies function, or '--mask_dropouts' from the CLI. Dropout masking is currently set to [False].

Traceback (most recent call last):
  File "/home/julien/anaconda3/lib/python3.7/site-packages/pandas/core/indexes/base.py", line 2897, in get_loc
    return self._engine.get_loc(key)
  File "pandas/_libs/index.pyx", line 107, in pandas._libs.index.IndexEngine.get_loc
  File "pandas/_libs/index.pyx", line 131, in pandas._libs.index.IndexEngine.get_loc
  File "pandas/_libs/hashtable_class_helper.pxi", line 1607, in pandas._libs.hashtable.PyObjectHashTable.get_item
  File "pandas/_libs/hashtable_class_helper.pxi", line 1614, in pandas._libs.hashtable.PyObjectHashTable.get_item
KeyError: 'JUN'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/julien/anaconda3/bin/pyscenic", line 10, in <module>
    sys.exit(main())
  File "/home/julien/anaconda3/lib/python3.7/site-packages/pyscenic/cli/pyscenic.py", line 408, in main
    args.func(args)
  File "/home/julien/anaconda3/lib/python3.7/site-packages/pyscenic/cli/pyscenic.py", line 133, in prune_targets_command
    modules = adjacencies2modules(args)
  File "/home/julien/anaconda3/lib/python3.7/site-packages/pyscenic/cli/pyscenic.py", line 102, in adjacencies2modules
    keep_only_activating=(args.all_modules != "yes"))
  File "/home/julien/anaconda3/lib/python3.7/site-packages/pyscenic/utils.py", line 265, in modules_from_adjacencies
    rho_threshold=rho_threshold, mask_dropouts=rho_mask_dropouts)
  File "/home/julien/anaconda3/lib/python3.7/site-packages/pyscenic/utils.py", line 136, in add_correlation
    rhos = np.array([corr_mtx[s2][s1] for s1, s2 in zip(adjacencies.TF, adjacencies.target)])
  File "/home/julien/anaconda3/lib/python3.7/site-packages/pyscenic/utils.py", line 136, in <listcomp>
    rhos = np.array([corr_mtx[s2][s1] for s1, s2 in zip(adjacencies.TF, adjacencies.target)])
  File "/home/julien/anaconda3/lib/python3.7/site-packages/pandas/core/frame.py", line 2995, in __getitem__
    indexer = self.columns.get_loc(key)
  File "/home/julien/anaconda3/lib/python3.7/site-packages/pandas/core/indexes/base.py", line 2899, in get_loc
    return self._engine.get_loc(self._maybe_cast_indexer(key))
  File "pandas/_libs/index.pyx", line 107, in pandas._libs.index.IndexEngine.get_loc
  File "pandas/_libs/index.pyx", line 131, in pandas._libs.index.IndexEngine.get_loc
  File "pandas/_libs/hashtable_class_helper.pxi", line 1607, in pandas._libs.hashtable.PyObjectHashTable.get_item
  File "pandas/_libs/hashtable_class_helper.pxi", line 1614, in pandas._libs.hashtable.PyObjectHashTable.get_item
KeyError: 'JUN'

I wondered if it was an issue with the pandas version (I had 0.23.4), so I upgraded to the latest one (0.25.2), but I still get the same error.

Thank you for your help! Best, Julien

bramvds commented 4 years ago

Dear Julien,

Could you check that the gene symbols in your expression matrix are unique? Just to make sure that this is not causing your issue. Thanks for your help.
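A minimal sketch of such a check with pandas (assuming the expression matrix is the exp.csv used in the ctx command above, with cells as rows and genes as columns):

import pandas as pd

# Hypothetical path; substitute your own expression matrix (cells as rows, genes as columns).
ex_matrix = pd.read_csv("exp.csv", index_col=0)

# List any duplicated gene symbols among the columns.
duplicated = ex_matrix.columns[ex_matrix.columns.duplicated()].unique()
print(len(duplicated), "duplicated gene symbols:", list(duplicated)[:10])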

Kindest regards, Bram

julienvibert commented 4 years ago

Dear @bramvds ,

Thanks for your reply. I just checked, all gene symbols in my expression matrix are unique.

Best, Julien

bramvds commented 4 years ago

Hi Julien,

Is there something special with JUN expression across samples/cells? It is discovered by GENIE3/GRNBoost2 as a TF and/or target gene but does not appear in the gene-gene correlation matrix derived from the single-cell expression matrix. This is the cause of the error you get.
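One quick way to look at this (a sketch, again assuming the exp.csv matrix with cells as rows and genes as columns):

import pandas as pd

# Same hypothetical exp.csv as above.
ex_matrix = pd.read_csv("exp.csv", index_col=0)

# Is JUN present among the gene names at all, and in how many cells is it detected?
print("JUN" in ex_matrix.columns)
print(int((ex_matrix["JUN"] > 0).sum()), "of", len(ex_matrix), "cells have non-zero JUN")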

Kindest regards, Bram

julienvibert commented 4 years ago

Hi Bram,

Thanks for your reply. I just checked, there is nothing special about JUN expression, at least as far as I can tell. Would it help if I could send you my two input files (expression matrix and "adjacencies.tsv" from the grn step) so that you can see for yourself?

Best, Julien

julienvibert commented 4 years ago

Hi Bram,

I have tried the same commands with another dataset to verify that the error wasn't due to this specific matrix. In fact I get exactly the same error, except that this time the problematic TF is not "JUN" but "SELENOP". I notice that "JUN" and "SELENOP" are each time the first entry of the "target" column from the output file adjacencies.tsv from the grn step. Could it be an issue with the structure of the matrix (i.e. naming or indexing of columns)?

Thanks for your help,

Julien

bramvds commented 4 years ago

Hi Julien,

Just to be sure, could you check your adjacencies file (i.e. the output from the GRN step)? The extension of the file needs to match its format (if fields are separated by commas it should be '.csv'; if the separator is a tab it should be '.tsv'). Moreover, the file should contain a header as its first line:

TF,target,importance
ZNF286B,ZNF286A,210.82799147360737
ZNF286A,ZNF286B,147.96375430051933
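A quick way to sanity-check both the separator and the header from Python (a sketch; adjacencies.tsv is the file name used in the command above):

import pandas as pd

# The separator given here must match what the file extension implies (.tsv -> tab, .csv -> comma).
adj = pd.read_csv("adjacencies.tsv", sep="\t")

print(adj.columns.tolist())  # expected: ['TF', 'target', 'importance']
print(adj.head())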

Kindest regards, Bram

julienvibert commented 4 years ago

Hi Bram,

Thanks for your reply. I have just checked, my file is "adjacencies.tsv" and it is correctly tab-separated:

head adjacencies.tsv
TF      target  importance
MAF     SELENOP 183.99765918421974
MAF     RNASE1  175.62542977180237

Anyway, I just managed to run the whole pipeline using the Python API in a Jupyter notebook, so I guess we can leave this problem unsolved, especially if I'm the only person to have encountered it.

Thank you again for all your time!

Best,

Julien

cflerin commented 4 years ago

I think I've figured out what happened here after running into a similar issue recently. If there are genes present in the network output (adjacencies) that are missing from the gene expression matrix, then this KeyError will occur. This could happen for instance if some further filtering was done after running GRNBoost2, or if the wrong expression matrix was given in the CTX step.
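A minimal way to check for such a mismatch (a sketch assuming the file names from the commands above; adjust paths and matrix orientation to your own data):

import pandas as pd

# Hypothetical paths; use your own adjacencies file and expression matrix.
adj = pd.read_csv("adjacencies.tsv", sep="\t")
ex_matrix = pd.read_csv("exp.csv", index_col=0)  # cells as rows, genes as columns

# Genes reported by GRNBoost2/GENIE3 that are absent from the expression matrix.
network_genes = set(adj["TF"]).union(adj["target"])
missing = network_genes - set(ex_matrix.columns)
print(len(missing), "network genes missing from the expression matrix:", sorted(missing)[:10])

If this set is non-empty, the ctx step can fail with a KeyError like the ones above.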

lucygarner commented 4 years ago

Hi,

I am trying to run pyscenic ctx on the output of arboreto_with_multiprocessing.py and I am getting an error that looks related to this one, although I am not sure whether it is.

My commands are as follows:

python arboreto_with_multiprocessing.py data/merged_all_analysed.loom resources/tfs_list/lambert2018.txt --output results/adjacencies.csv --num_workers 20

pyscenic ctx -o results/reg.csv --annotations_fname resources/motif_annotation/motifs-v9-nr.hgnc-m0.001-o0.0.tbl --num_workers 24 --expression_mtx_fname data/merged_all_analysed.loom --cell_id_attribute CellID --gene_attribute Gene results/adjacencies_arboreto.csv resources/cistarget/hg38__refseq-r80__10kb_up_and_down_tss.mc9nr.feather resources/cistarget/hg38__refseq-r80__500bp_up_and_100bp_down_tss.mc9nr.feather

I am using the same input expression matrix for both commands.

The error I get is as follows:

Traceback (most recent call last):
  File "/data/user/lucy/py36-v1/conda-install/envs/pyscenic/bin/pyscenic", line 8, in <module>
    sys.exit(main())
  File "/data/user/lucy/py36-v1/conda-install/envs/pyscenic/lib/python3.7/site-packages/pyscenic/cli/pyscenic.py", line 421, in main
    args.func(args)
  File "/data/user/lucy/py36-v1/conda-install/envs/pyscenic/lib/python3.7/site-packages/pyscenic/cli/pyscenic.py", line 140, in prune_targets_command
    modules = adjacencies2modules(args)
  File "/data/user/lucy/py36-v1/conda-install/envs/pyscenic/lib/python3.7/site-packages/pyscenic/cli/pyscenic.py", line 109, in adjacencies2modules
    keep_only_activating=(args.all_modules != "yes"))
  File "/data/user/lucy/py36-v1/conda-install/envs/pyscenic/lib/python3.7/site-packages/pyscenic/utils.py", line 268, in modules_from_adjacencies
    rho_threshold=rho_threshold, mask_dropouts=rho_mask_dropouts)
  File "/data/user/lucy/py36-v1/conda-install/envs/pyscenic/lib/python3.7/site-packages/pyscenic/utils.py", line 132, in add_correlation
    genes = list(set(adjacencies[COLUMN_NAME_TF]).union(set(adjacencies[COLUMN_NAME_TARGET])))
  File "/data/user/lucy/py36-v1/conda-install/envs/pyscenic/lib/python3.7/site-packages/pandas/core/frame.py", line 2995, in __getitem__
    indexer = self.columns.get_loc(key)
  File "/data/user/lucy/py36-v1/conda-install/envs/pyscenic/lib/python3.7/site-packages/pandas/core/indexes/base.py", line 2899, in get_loc
    return self._engine.get_loc(self._maybe_cast_indexer(key))
  File "pandas/_libs/index.pyx", line 107, in pandas._libs.index.IndexEngine.get_loc
  File "pandas/_libs/index.pyx", line 131, in pandas._libs.index.IndexEngine.get_loc
  File "pandas/_libs/hashtable_class_helper.pxi", line 1607, in pandas._libs.hashtable.PyObjectHashTable.get_item
  File "pandas/_libs/hashtable_class_helper.pxi", line 1614, in pandas._libs.hashtable.PyObjectHashTable.get_item
KeyError: 'TF'

Do you have any suggestions as to how to fix this?

Best, Lucy

Conda environment:

# Name                    Version                   Build  Channel
_libgcc_mutex             0.1                 conda_forge    conda-forge
_openmp_mutex             4.5                      1_llvm    conda-forge
arboreto                  0.1.5                    pypi_0    pypi
attrs                     19.3.0                   pypi_0    pypi
bokeh                     2.0.1            py37hc8dfbb8_0    conda-forge
boltons                   20.1.0                   pypi_0    pypi
ca-certificates           2020.4.5.1           hecc5488_0    conda-forge
certifi                   2020.4.5.1       py37hc8dfbb8_0    conda-forge
click                     7.1.2              pyh9f0ad1d_0    conda-forge
cloudpickle               1.4.1                      py_0    conda-forge
cytoolz                   0.10.1           py37h516909a_0    conda-forge
dask                      1.0.0                      py_1    conda-forge
dask-core                 1.0.0                      py_0    conda-forge
decorator                 4.4.2                    pypi_0    pypi
dill                      0.3.1.1                  pypi_0    pypi
distributed               1.28.1                   py37_0    conda-forge
freetype                  2.10.2               he06d7ca_0    conda-forge
frozendict                1.2                      pypi_0    pypi
h5py                      2.10.0                   pypi_0    pypi
heapdict                  1.0.1                      py_0    conda-forge
interlap                  0.2.6                    pypi_0    pypi
jinja2                    2.11.2             pyh9f0ad1d_0    conda-forge
joblib                    0.15.1                   pypi_0    pypi
jpeg                      9d                   h516909a_0    conda-forge
ld_impl_linux-64          2.34                 h53a641e_4    conda-forge
libblas                   3.8.0               16_openblas    conda-forge
libcblas                  3.8.0               16_openblas    conda-forge
libffi                    3.2.1             he1b5a44_1007    conda-forge
libgcc-ng                 9.2.0                h24d8f2e_2    conda-forge
libgfortran-ng            7.5.0                hdf63c60_6    conda-forge
liblapack                 3.8.0               16_openblas    conda-forge
libopenblas               0.3.9                h5ec1e0e_0    conda-forge
libpng                    1.6.37               hed695b0_1    conda-forge
libstdcxx-ng              9.2.0                hdf63c60_2    conda-forge
libtiff                   4.1.0                hc7e4089_6    conda-forge
libwebp-base              1.1.0                h516909a_3    conda-forge
llvm-openmp               10.0.0               hc9558a2_0    conda-forge
llvmlite                  0.32.1                   pypi_0    pypi
locket                    0.2.0                      py_2    conda-forge
loompy                    3.0.6                    pypi_0    pypi
lz4-c                     1.9.2                he1b5a44_1    conda-forge
markupsafe                1.1.1            py37h8f50634_1    conda-forge
msgpack-python            0.6.2            py37hc9558a2_0    conda-forge
multiprocessing-on-dill   3.5.0a4                  pypi_0    pypi
ncurses                   6.1               hf484d3e_1002    conda-forge
networkx                  2.4                      pypi_0    pypi
numba                     0.49.1                   pypi_0    pypi
numpy                     1.18.4           py37h8960a57_0    conda-forge
numpy-groupies            0+unknown                pypi_0    pypi
olefile                   0.46                       py_0    conda-forge
openssl                   1.1.1g               h516909a_0    conda-forge
packaging                 20.4               pyh9f0ad1d_0    conda-forge
pandas                    0.25.3           py37hb3f55d8_0    conda-forge
partd                     1.1.0                      py_0    conda-forge
pillow                    7.1.2            py37h718be6c_0    conda-forge
pip                       20.1.1                     py_1    conda-forge
psutil                    5.7.0            py37h8f50634_1    conda-forge
pyarrow                   0.16.0                   pypi_0    pypi
pyparsing                 2.4.7              pyh9f0ad1d_0    conda-forge
pyscenic                  0.10.2                   pypi_0    pypi
python                    3.7.6           cpython_h8356626_6    conda-forge
python-dateutil           2.8.1                      py_0    conda-forge
python_abi                3.7                     1_cp37m    conda-forge
pytz                      2020.1             pyh9f0ad1d_0    conda-forge
pyyaml                    5.3.1            py37h8f50634_0    conda-forge
readline                  8.0                  hf8c457e_0    conda-forge
scikit-learn              0.23.1                   pypi_0    pypi
scipy                     1.4.1                    pypi_0    pypi
setuptools                47.1.1           py37hc8dfbb8_0    conda-forge
six                       1.15.0             pyh9f0ad1d_0    conda-forge
sortedcontainers          2.1.0                      py_0    conda-forge
sqlite                    3.30.1               hcee41ef_0    conda-forge
tbb                       2020.0.133               pypi_0    pypi
tblib                     1.6.0                      py_0    conda-forge
threadpoolctl             2.1.0                    pypi_0    pypi
tk                        8.6.10               hed695b0_0    conda-forge
toolz                     0.10.0                     py_0    conda-forge
tornado                   6.0.4            py37h8f50634_1    conda-forge
tqdm                      4.46.1                   pypi_0    pypi
typing_extensions         3.7.4.2                    py_0    conda-forge
umap-learn                0.4.3                    pypi_0    pypi
wheel                     0.34.2                     py_1    conda-forge
xz                        5.2.5                h516909a_0    conda-forge
yaml                      0.2.5                h516909a_0    conda-forge
zict                      2.0.0                      py_0    conda-forge
zlib                      1.2.11            h516909a_1006    conda-forge
zstd                      1.4.4                h6597ccf_3    conda-forge

cflerin commented 4 years ago

@lc822 , what is the header of your results/adjacencies.csv file? It should be something like this (note that this example is tab-delimited, not comma-separated):

TF      target  importance
SPI1    TYROBP  58.97375087447331
RPL35   RPS18   58.142358119139345
RPS4X   RPL30   57.76453883874825
...

lucygarner commented 4 years ago

Yes it looks like that.

TF      target  importance
ZBTB32  IFNG    569.8034320534202
YBX1    RPS2    357.7026283177716
ZBTB32  SEC61G  316.593626747196
ZBTB32  SEC61B  309.9030666846719
...

It appears to be tab delimited.

Acribbs commented 4 years ago

Hi @cflerin,

Just an observation, without knowing anything about the code implementation (I'm part of a team working alongside @lc822 ): could the error be related to the header being looked up in the pandas hash table?

File "pandas/_libs/hashtable_class_helper.pxi", line 1614, in pandas._libs.hashtable.PyObjectHashTable.get_item
KeyError: 'TF'

The reason I ask is that the error is a KeyError for "TF", but looking at the tab-delimited file above, "TF" is part of the header rather than a gene name.

cflerin commented 4 years ago

Hi @lc822 , @Acribbs ,

Indeed, it seems like pandas is looking for a gene named "TF", which should be part of the header.

Could you try renaming the file to end with .tsv if it's really tab-separated? If your file is actually tab-delimited but named with a .csv extension, this will cause an issue with the file delimiter detection, which is based on the file extension.

This actually seems to be a bug in the arboreto_with_multiprocessing.py script: it always writes tab-separated output, even though you requested a comma-separated file, and the ctx step then tries to parse the .csv file with commas. I'll make a fix for this.
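In the meantime, a possible workaround is to rename the file, or to rewrite it with an extension that matches its actual separator (a sketch; the file names are the ones used above):

import pandas as pd

# The file written by arboreto_with_multiprocessing.py is tab-separated even though it was named .csv.
adj = pd.read_csv("results/adjacencies_arboreto.csv", sep="\t")

# Re-save with an extension that matches the separator, so the ctx step detects the delimiter correctly.
adj.to_csv("results/adjacencies_arboreto.tsv", sep="\t", index=False)

Then pass results/adjacencies_arboreto.tsv to pyscenic ctx instead of the .csv file.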