deweylab / CellO

CellO: Gene expression-based hierarchical cell type classification using the Cell Ontology
MIT License

loop of ufunc does not support argument 0 #8

Closed mbcouger closed 2 years ago

mbcouger commented 3 years ago

Hello Matt,

When trying to replicate the scanpy protocol with my Raw data I get the following error after the classifier finishes:

TypeError: loop of ufunc does not support argument 0 of type SparseCSRView which has no callable exp method

Have you encountered this before?

Many Thanks, Brian

mbernste commented 3 years ago

Hi Brian,

Ah yes, I believe this is because you are using a sparse matrix. Unfortunately, CellO does not yet support sparse matrices, but that is something we need to implement.

If you have a Scipy sparse matrix (https://docs.scipy.org/doc/scipy/reference/sparse.html) called X, you can call X.todense() to convert it to a dense matrix.
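As a small illustration of the conversion described above, the sketch below builds a toy SciPy CSR matrix (the values are made up) and converts it to dense form. `todense()` returns a `numpy.matrix`, while `toarray()` returns a plain `numpy.ndarray`; either can then be passed to element-wise NumPy functions such as `np.exp`.

```python
import numpy as np
from scipy import sparse

# Toy 2x3 expression matrix stored in CSR format (values are made up).
X = sparse.csr_matrix(np.array([[0.0, 1.0, 0.0],
                                [2.0, 0.0, 3.0]]))

X_matrix = X.todense()  # numpy.matrix (always 2-D)
X_array = X.toarray()   # plain numpy.ndarray

# Element-wise NumPy functions work on the dense form.
unlogged = np.exp(X_array) - 1
```

Note that a dense copy stores every zero explicitly, so memory use grows with the full number of cells times genes.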

Best, Matt

mbcouger commented 3 years ago

Hi Matt,

So basically, my overview is that I have around 700,000 cells across 5 datasets. I have processed them all following essentially the same protocol described on scanpy's PBMC 10x main page, which arrives at an adata object with normalized, regressed, neighborhood-based clusters derived from highly differentially expressed genes. I have manually identified most of these clusters, but I am also very interested in what your pipeline identifies them as (I am a long-time fan of the Dewey lab's work, and I was also a co-author on a paper with Colin a few years back).

So now my effort is to run CellO on these. These are my current endpoints with the data:

If I run on the filtered data (adata) I get the following message:

"ValueError: n_components=3000 must be between 1 and min(n_samples, n_features)=2895 with svd_solver='randomized'"

If I train on the raw data (adata.raw) and try to use this model on the filtered data I get:

"Error. The genes present in data matrix do not match those expected by the classifier. Please train a classifier on this input gene set by either using the cello_train_model.py program or by running cello_classify with the '-t' flag.

An exception has occurred, use %tb to see the full traceback. AttributeError: 'Raw' object has no attribute 'obs'"

If I train on adata.raw and then run on adata.raw I get: "AttributeError: 'Raw' object has no attribute 'obs'" (presumably because I have not run any of the clustering on adata.raw)

If I cluster adata.raw with the protocol you have provided I run into the sparse matrix issue on this thread. Is there an easy way to run the sparse matrix fix in the scanpy pipeline?

Sorry if this is a bother, but I think you might run into these issues in the future with people running 10x/scanpy. If you get them ironed out, I bet the scanpy group would be happy to have a protocol up within the ecosystem, which could bring a bunch of citation traction.

Cheers, Brian https://scholar.google.com/citations?user=-G8WTd0AAAAJ&hl=en

mbernste commented 3 years ago

Hi Brian,

That's interesting to hear that you have worked with the Dewey Lab in the past! Hopefully CellO is able to hold up to your manual annotations. I would be interested to hear where/if it goes wrong. I know it does not perform as well for some cell types as others due to some cell types not having as much training data.

Regarding your first error: "ValueError n_components=3000 must be...", basically the number of features/genes you have looks to be too small to run with CellO. CellO requires at least 3000 features, but to be honest we have not tested it with so few. To get results in line with what we report in the manuscript, I would suggest running it on as many features as possible. Running CellO on the unfiltered data is probably a better route to go. We actually have a manuscript forthcoming in the journal STAR Protocols that will provide more details around best practices for running CellO. Though hopefully I can explain everything sufficiently here.
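The bound in that error message is a general property of PCA: the number of components cannot exceed the smaller of the two matrix dimensions. A minimal sketch of that check follows. The helper name `max_pca_components` is hypothetical, 3000 and 2895 are taken from the error message, and the sample count is made up, on the assumption that the feature count is the limiting dimension here.

```python
def max_pca_components(n_samples: int, n_features: int) -> int:
    """Largest number of components a PCA can extract from a matrix of this shape."""
    return min(n_samples, n_features)


requested = 3000  # value reported in the error message
# n_samples is a made-up number; 2895 is the limit reported in the error message.
limit = max_pca_components(n_samples=10_000, n_features=2895)
fits = requested <= limit

print(f"requested={requested}, limit={limit}, fits={fits}")
```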

Regarding your attempt to train CellO using the genes from the full dataset and then run it on the filtered dataset, you get the error "Error. The genes present in data matrix do not match...." This is because when you train a CellO model based on a given input dataset, it does not actually use that dataset as training data. Rather, it uses a built-in training set but trains new models using only the genes that match the provided input dataset. This ensures you have a model that is trained on the same set of genes as the dataset that you plan to run CellO on. Does this make sense?
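To see why a model trained on one gene set is rejected for another, it can help to compare the two gene lists directly. The sketch below is only an illustration of that idea, not CellO's own check: the helper `compare_gene_sets` and the gene identifiers are made up.

```python
def compare_gene_sets(model_genes, data_genes):
    """Return genes the model expects but the data lacks, and vice versa."""
    model_set, data_set = set(model_genes), set(data_genes)
    missing_from_data = sorted(model_set - data_set)
    unexpected_in_data = sorted(data_set - model_set)
    return missing_from_data, unexpected_in_data


# Made-up gene identifiers for illustration only.
model_genes = ["GENE_A", "GENE_B", "GENE_C", "GENE_D"]  # e.g. genes in the unfiltered data
data_genes = ["GENE_A", "GENE_C"]                        # e.g. genes left after filtering

missing, unexpected = compare_gene_sets(model_genes, data_genes)
print("missing from data:", missing)
print("not expected by model:", unexpected)
```

If either list is non-empty, the model and the dataset were built on different gene sets, which is the situation the error message describes.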

Regarding your attempt to train CellO on adata.raw and run CellO on adata.raw, this is actually the best way to run CellO on your data among the methods you have described in your previous message. However, the error you are receiving is a bit puzzling to me: "AttributeError: 'Raw' object has no attribute 'obs'". My understanding is that adata.raw IS an AnnData object and therefore must have an obs variable associated with it. I will look into trying to reproduce this using my own adata.raw object.

Lastly, your suggestion to incorporate CellO within Scanpy is a goal that we do have for the future. Hopefully CellO will work well for the community as a quick and dirty way to annotate single cell data at least as a first pass or to supplement manual annotation. Hopefully I can iron out these kinks.

Best, Matt

mbernste commented 3 years ago

Oh, and if you would like to "transfer" the Leiden clusters from your filtered data to your unfiltered data, my understanding is that you could run the command:

adata.raw.obs['leiden'] = adata.obs['leiden']

This will copy the column called "leiden" from the dataframe adata.obs to adata.raw.obs

This assumes that the cells remain in the same order and are not filtered at all between adata.obs and adata.raw
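The ordering assumption can be checked explicitly before copying labels. The following is a minimal pandas-only sketch with made-up cell barcodes: it tests whether two cell indexes match and, if they do not, aligns the labels by barcode rather than by position.

```python
import pandas as pd

# Made-up cell barcodes: the unfiltered set has one extra cell.
unfiltered_cells = pd.Index(["cell1", "cell2", "cell3", "cell4"])
filtered_obs = pd.DataFrame(
    {"leiden": ["0", "1", "0"]},
    index=pd.Index(["cell1", "cell2", "cell4"]),
)

# True only if both tables list the same cells in the same order.
same_cells = filtered_obs.index.equals(unfiltered_cells)

# Align by barcode instead of by position; cells without a label get NaN.
leiden_unfiltered = filtered_obs["leiden"].reindex(unfiltered_cells)
```

Cells that were dropped during filtering end up with a missing label, so they would need to be removed or labeled before clustering-based steps.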

mbcouger commented 3 years ago

Hi Matt,

Awesome, it looks like a great tool, and I think it would get a lot of traction: Scanpy makes clean data and scales very well, but there are currently no automated cell annotation tools in its protocols/ecosystem. As for my effort, transferring obs still throws the same error. If I rerun on the data, I encounter the sparse matrix problem (below). Do you have a suggestion for what code I need to run to transform this? I run "from scipy import sparse" and then adata.todense()

and get:

AttributeError Traceback (most recent call last)
in
----> 1 adata.todense()

AttributeError: 'AnnData' object has no attribute 'todense'

Please bear with me on the syntax stuff, most of it is new to me.

Cheers, Brian

Writing trained model to SLTrain.model.dill
Found CellO resources at '/home/cheney/Downloads/NormalJune5th/filtered_feature_bc_matrix/resources'.
/home/cheney/.local/lib/python3.8/site-packages/pandas/core/arrays/categorical.py:2487: FutureWarning: The `inplace` parameter in pandas.Categorical.remove_unused_categories is deprecated and will be removed in a future version.
res = method(*args, **kwargs)
---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
~/.local/lib/python3.8/site-packages/scipy/sparse/base.py in __getattr__(self, attr)
    686 else:
--> 687 raise AttributeError(attr + " not found")
    688

AttributeError: exp not found

The above exception was the direct cause of the following exception:

TypeError Traceback (most recent call last)
in
     16 sc.tl.leiden(adata, resolution=1.0)
     17 model_prefix = "SLTrain" # <-- The trained model will be stored in a file called GSM3516666_LX682_NORMAL.model.dill
---> 18 cello.scanpy_cello(
     19 adata,
     20 'leiden',

~/anaconda3/lib/python3.8/site-packages/cello/scanpy_cello.py in cello(adata, clust_key, rsrc_loc, algo, out_prefix, model_file, log_dir, term_ids, remove_anatomical_subterms)
    144
    145 # Run classification
--> 146 results_df, finalized_binary_results_df, ms_results_df = ce.predict(
    147 adata,
    148 mod,

~/anaconda3/lib/python3.8/site-packages/cello/cello.py in predict(ad, mod, algo, clust_key, log_dir, remove_anatomical_subterms, rsrc_loc)
    230
    231 # Compute raw classifier probabilities
--> 232 results_df, cell_to_clust = _raw_probabilities(
    233 ad,
    234 mod,

~/anaconda3/lib/python3.8/site-packages/cello/cello.py in _raw_probabilities(ad, mod, algo, clust_key, log_dir)
    427 )
    428
--> 429 ad_clust = _combine_by_cluster(ad)
    430 # If there's only one cluster, expand dimensions of expression
    431 # matrix. AnnData shrinks it, so we need to keep it as a Numpy

~/anaconda3/lib/python3.8/site-packages/cello/cello.py in _combine_by_cluster(ad, clust_key)
    471 cells = ad.obs.loc[ad.obs[clust_key] == clust].index
    472 X_clust = ad[cells,:].X
--> 473 x_clust = _aggregate_expression(X_clust)
    474 X_mean_clust.append(x_clust)
    475 clusters.append(str(clust))

~/anaconda3/lib/python3.8/site-packages/cello/cello.py in _aggregate_expression(X)
    452 to form a psuedo-bulk expression profile.
    453 """
--> 454 X = (np.exp(X)-1) / 1e6
    455 x_clust = np.sum(X, axis=0)
    456 sum_x_clust = float(sum(x_clust))

TypeError: loop of ufunc does not support argument 0 of type SparseCSRView which has no callable exp method

mbernste commented 3 years ago

Hi Brian,

I have looked into the issue around the error AttributeError: 'Raw' object has no attribute 'obs'. It turns out that I misunderstood the AnnData.Raw object. I didn't realize that the AnnData.Raw object ONLY has an X matrix and a var dataframe, but does not have its own obs dataframe (https://anndata.readthedocs.io/en/latest/anndata.AnnData.raw.html). AnnData.Raw is NOT itself an AnnData object, but rather an instance of the class "Raw".

I do have a bit of a hack to work around this, though it is not ideal. Basically, I would suggest simply instantiating a new AnnData object based on adata.raw:

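from anndata import AnnData

# Build a standalone AnnData from the unfiltered (raw) matrix
adata_new = AnnData(
    X=adata.raw.X,      # raw expression matrix
    var=adata.raw.var,  # raw gene metadata
    obs=adata.obs       # cluster labels from the processed object
)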

This new object will have raw's expression matrix, raw's gene metadata, but will have the original adata's clustering information. Then, you can feed adata_new to CellO.

Going forward it will be important to implement an option to run CellO on adata.raw. But hopefully the above workaround works for now...

mbernste commented 3 years ago

Regarding converting adata's sparse matrix to a dense matrix, the syntax should be the following:

adata.X = adata.X.todense()

This will essentially rewrite adata's expression matrix (the "X" variable) to be a dense version of itself.

Going forward, enabling CellO to work on sparse matrices is clearly an ability that we will need to implement for the next release.
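For reference, the step that fails in the traceback above, `(np.exp(X)-1) / 1e6` followed by a column sum, can be written so that it accepts either dense or sparse input. The sketch below is not CellO's implementation. It only mirrors those two lines from the traceback, and the helper name `sum_unlogged_expression` is hypothetical. It relies on the `expm1()` method of SciPy sparse matrices, which maps zeros to zeros and so keeps the matrix sparse.

```python
import numpy as np
from scipy import sparse


def sum_unlogged_expression(X):
    """Undo a natural-log(x + 1) transform, scale by 1e6, and sum over cells (rows).

    Accepts a dense array or a SciPy sparse matrix and returns a 1-D array
    with one summed value per gene (column).
    """
    if sparse.issparse(X):
        # expm1 maps 0 -> 0, so the result stays sparse.
        unlogged = X.expm1() / 1e6
    else:
        unlogged = np.expm1(np.asarray(X)) / 1e6
    return np.asarray(unlogged.sum(axis=0)).ravel()


# Toy log-transformed matrix: 2 cells x 2 genes (values are made up).
dense = np.log1p(np.array([[0.0, 1.0], [2.0, 3.0]]))
totals_dense = sum_unlogged_expression(dense)
totals_sparse = sum_unlogged_expression(sparse.csr_matrix(dense))
```

Both calls give the same per-gene totals, so a sparse matrix would not need to be converted to dense form before this step.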