chanzuckerberg / cellxgene-census

CZ CELLxGENE Discover Census
https://chanzuckerberg.github.io/cellxgene-census/
MIT License

Installation Issues for Geneformer #950

Closed: drneavin closed this issue 7 months ago

drneavin commented 7 months ago

Describe the bug

I'm running into issues loading some of the libraries required to run Geneformer with the Census model. I installed the required packages listed in the tutorial's Requirements section. However, when I tried to import them in Python, I hit errors that I couldn't resolve. Specifically, this applies to the following import:

from cellxgene_census.experimental.ml.huggingface import GeneformerTokenizer

The first time I try to import it, I receive the following error:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/directflow/SCCGGroupShare/projects/lacgra/.conda/envs/cellxgene/lib/python3.11/site-packages/cellxgene_census/experimental/ml/__init__.py", line 5, in <module>
    from .pytorch import ExperimentDataPipe, Stats, experiment_dataloader
  File "/directflow/SCCGGroupShare/projects/lacgra/.conda/envs/cellxgene/lib/python3.11/site-packages/cellxgene_census/experimental/ml/pytorch.py", line 13, in <module>
    import psutil
ModuleNotFoundError: No module named 'psutil'

I then installed psutil, and the import throws a similar error about torchdata:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/directflow/SCCGGroupShare/projects/DrewNeavin/software/anaconda3/envs/census_geneformer/lib/python3.11/site-packages/cellxgene_census/experimental/ml/__init__.py", line 5, in <module>
    from .pytorch import ExperimentDataPipe, Stats, experiment_dataloader
  File "/directflow/SCCGGroupShare/projects/DrewNeavin/software/anaconda3/envs/census_geneformer/lib/python3.11/site-packages/cellxgene_census/experimental/ml/pytorch.py", line 18, in <module>
    import torchdata.datapipes.iter as pipes
ModuleNotFoundError: No module named 'torchdata'

So I installed torchdata, and the import then throws an error about scipy and matrix:

Traceback (most recent call last):
  File "/directflow/SCCGGroupShare/projects/DrewNeavin/software/anaconda3/envs/census_geneformer/lib/python3.11/site-packages/scipy/__init__.py", line 137, in __getattr__
    return globals()[name]
           ~~~~~~~~~^^^^^^
KeyError: 'matrix'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/directflow/SCCGGroupShare/projects/DrewNeavin/software/anaconda3/envs/census_geneformer/lib/python3.11/site-packages/cellxgene_census/experimental/ml/__init__.py", line 5, in <module>
    from .pytorch import ExperimentDataPipe, Stats, experiment_dataloader
  File "/directflow/SCCGGroupShare/projects/DrewNeavin/software/anaconda3/envs/census_geneformer/lib/python3.11/site-packages/cellxgene_census/experimental/ml/pytorch.py", line 41, in <module>
    class _SOMAChunk:
  File "/directflow/SCCGGroupShare/projects/DrewNeavin/software/anaconda3/envs/census_geneformer/lib/python3.11/site-packages/cellxgene_census/experimental/ml/pytorch.py", line 51, in _SOMAChunk
    X: scipy.matrix
       ^^^^^^^^^^^^
  File "/directflow/SCCGGroupShare/projects/DrewNeavin/software/anaconda3/envs/census_geneformer/lib/python3.11/site-packages/scipy/__init__.py", line 139, in __getattr__
    raise AttributeError(
AttributeError: Module 'scipy' has no attribute 'matrix'

However, I can import scipy itself, and it does not appear to have a matrix attribute:

>>> print(hasattr(scipy, 'matrix'))
False
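The same check can be scripted so that it also compares against NumPy. This is only a minimal sketch: the helper name has_attribute is hypothetical, and the idea that the failing annotation relied on SciPy re-exporting NumPy's matrix class at the top level is an assumption, not something confirmed in this thread.

```python
import importlib


def has_attribute(module_name, attr):
    """Report whether module_name exposes attr; return None if the module is missing."""
    try:
        module = importlib.import_module(module_name)
    except ModuleNotFoundError:
        return None
    # hasattr() returns False when the module's __getattr__ raises AttributeError,
    # which is what the traceback above shows for scipy.matrix.
    return hasattr(module, attr)


# Assumption: comparing scipy against numpy shows where a matrix class still lives.
print("scipy.matrix:", has_attribute("scipy", "matrix"))
print("numpy.matrix:", has_attribute("numpy", "matrix"))
```

If the first line prints False and the second prints True, the installed SciPy does not provide a top-level matrix name even though NumPy does, which would be consistent with the AttributeError above.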

I think it's also worth noting that one of the packages (I think Geneformer) requires a Python version between 3.7 and 3.11. I have tried three different Python versions (3.10.13, 3.11.0 and 3.11.7), all with the same result.

To Reproduce

  1. Start a new conda environment with Python between 3.7 and 3.10
  2. Install cellxgene-census with pip install -U cellxgene-census
  3. Install Git LFS with conda: conda install conda-forge::git-lfs
  4. Install Geneformer following its installation instructions:
    git lfs install
    git clone https://huggingface.co/ctheodoris/Geneformer
    cd Geneformer
    pip install .
  5. Install AWS command line interface with conda: conda install conda-forge::awscli
  6. Start python and import packages:

    import cellxgene_census
    import json
    import warnings
    
    warnings.filterwarnings("ignore")
    
    from transformers import BertForSequenceClassification
    from transformers import Trainer
    from geneformer import DataCollatorForCellClassification
    from geneformer import TranscriptomeTokenizer
    from geneformer import EmbExtractor
    from cellxgene_census.experimental import get_embedding
    from cellxgene_census.experimental.ml.huggingface import GeneformerTokenizer
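Before running the imports above, it may help to check in one pass which supporting modules are absent, rather than discovering them one traceback at a time. The sketch below is an illustration only: the function name missing_modules is hypothetical, and the candidate list is taken from the modules named in this thread, not from any official requirements file.

```python
import importlib.util

# Module names mentioned in the tracebacks and follow-up comments of this thread.
CANDIDATES = ["psutil", "torchdata", "accelerate", "scipy"]


def missing_modules(names):
    """Return the top-level module names that cannot be found in this environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


missing = missing_modules(CANDIDATES)
if missing:
    print("Not installed:", ", ".join(missing))
else:
    print("All candidate modules were found.")
```

Note that this only detects missing modules. It would not catch the scipy.matrix AttributeError, which occurs even though scipy itself is installed.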

Expected behavior

I expect the packages to import without issue, but I run into the errors described above and cannot work out how to fix them.

Environment


HPC:

NAME="CentOS Linux"
VERSION="7 (Core)"
ID="centos"
ID_LIKE="rhel fedora"
VERSION_ID="7"
PRETTY_NAME="CentOS Linux 7 (Core)"
ANSI_COLOR="0;31"
CPE_NAME="cpe:/o:centos:centos:7"
HOME_URL="https://www.centos.org/"
BUG_REPORT_URL="https://bugs.centos.org/"

CENTOS_MANTISBT_PROJECT="CentOS-7"
CENTOS_MANTISBT_PROJECT_VERSION="7"
REDHAT_SUPPORT_PRODUCT="centos"
REDHAT_SUPPORT_PRODUCT_VERSION="7"

I can also send the environment YAML file, but GitHub doesn't allow it as an attachment.
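In place of an attached environment file, the relevant package versions can be printed and pasted inline. This is a rough sketch: the helper installed_versions is hypothetical, and the distribution names in PACKAGES (in particular "geneformer") are assumptions about how the packages are registered with pip.

```python
import sys
from importlib import metadata

# Assumed distribution names for the packages discussed in this thread.
PACKAGES = ["cellxgene-census", "geneformer", "scipy", "torchdata", "psutil", "accelerate"]


def installed_versions(names):
    """Map each distribution name to its installed version, or None if it is absent."""
    versions = {}
    for name in names:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


print("python", sys.version.split()[0])
for name, version in installed_versions(PACKAGES).items():
    print(name, version or "not installed")
```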

Thanks for your assistance. I'm excited to try Geneformer with the CellxGene Census model.

bkmartinjr commented 7 months ago

I believe this was just fixed (but not yet released) by PR #944.

CC: @atolopko-czi

drneavin commented 7 months ago

Hi @bkmartinjr, thanks for the fast reply. That's great news! I'll wait for the new release with this fix and see whether it resolves the problem, unless there's a dev version that I could test, or an older version that would work?

drneavin commented 7 months ago

I can confirm that the updates in PR https://github.com/chanzuckerberg/cellxgene-census/pull/944 resolve this error when the update is installed directly from GitHub, as suggested by @pablo-gar in PR #951. I also had to install accelerate (pip install accelerate -U) and torchdata (pip install torchdata).

Thanks!

bkmartinjr commented 7 months ago

Thank you. The fix will be in the next release (coming fairly soon).