Closed mbcouger closed 2 years ago
Hi Brian,
Ah yes, I believe this is because you are using a sparse matrix. Unfortunately, CellO does not yet support sparse matrices, but that is something we need to implement.
If you have a SciPy sparse matrix (https://docs.scipy.org/doc/scipy/reference/sparse.html) called X, you can call X.todense() to convert it to a dense matrix.
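As a minimal sketch of that conversion on a toy matrix (the values are made up for illustration): for a csr_matrix, todense() returns a numpy.matrix, while toarray() returns a plain numpy.ndarray, which some downstream code may handle more predictably.

```python
import numpy as np
from scipy import sparse

# Toy 3-cell x 4-gene matrix in CSR format (values are made up).
X = sparse.csr_matrix(np.array([[0, 2, 0, 1],
                                [3, 0, 0, 0],
                                [0, 0, 5, 0]], dtype=np.float32))

X_dense_matrix = X.todense()  # numpy.matrix
X_dense_array = X.toarray()   # numpy.ndarray

print(type(X_dense_matrix).__name__, type(X_dense_array).__name__)
```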
Best, Matt
Hi Matt,
So basically my overview is that I have around 700,000 cells across 5 datasets. I have processed them all following essentially the same protocol described on scanpy's PBMC 10x main page, which arrives at an adata object with normalized/regressed/neighborhooded clusters based on highly differentially expressed genes. I have manually identified most of these clusters, but I am also very interested in what your pipeline identifies them as (I am a long-time fan of the Dewey lab's work, and I was also a co-author on a paper with Colin a few years back).
So now my efforts are to run CellO on these. These are my current endpoints with the data:
If I run on the filtered data (adata) I get the following message:
"ValueError: n_components=3000 must be between 1 and min(n_samples, n_features)=2895 with svd_solver='randomized'"
If I train on the raw data (adata.raw) and try to use this model on the filtered data I get: "Error. The genes present in data matrix do not match those expected by the classifier. Please train a classifier on this input gene set by either using the cello_train_model.py program or by running cello_classify with the '-t' flag.
An exception has occurred, use %tb to see the full traceback. AttributeError: 'Raw' object has no attribute 'obs'"
If I train on adata.raw and then run on adata.raw I get: "AttributeError: 'Raw' object has no attribute 'obs'" (presumably because I have not run any of the clustering on adata.raw)
If I cluster adata.raw with the protocol you have provided I run into the sparse matrix issue on this thread. Is there an easy way to run the sparse matrix fix in the scanpy pipeline?
Sorry if this is a bother, but I think you might run into these issues in the future with people running 10x/scanpy. If you get them ironed out I bet the scanpy group would be happy to have a protocol up with the ecosystem, which could bring a bunch of citation traction.
Cheers, Brian https://scholar.google.com/citations?user=-G8WTd0AAAAJ&hl=en
Hi Brian,
That's interesting to hear that you have worked with the Dewey Lab in the past! Hopefully CellO is able to hold up to your manual annotations. I would be interested to hear where/if it goes wrong. I know it does not perform as well for some cell types as others due to some cell types not having as much training data.
Regarding your first error: "ValueError n_components=3000 must be...", basically the number of features/genes you have looks to be too small to run with CellO. CellO requires at least 3000 features, but to be honest we have not tested it with so few. To get results in line with what we report in the manuscript, I would suggest running it on as many features as possible. Running CellO on the unfiltered data is probably a better route to go. We actually have a manuscript forthcoming in the journal STAR Protocols that will provide more details around best practices for running CellO, though hopefully I can explain everything sufficiently here.
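The constraint behind that ValueError can be sketched as follows. This is only an illustration, assuming the message comes from a PCA step that asks for 3000 components and that 2895 is the number of genes left after filtering; the helper name, the cell count, and the unfiltered gene count are hypothetical.

```python
def check_n_components(n_components, n_samples, n_features):
    """Return True if a PCA with n_components fits a matrix of this shape."""
    limit = min(n_samples, n_features)
    return 1 <= n_components <= limit

# Hypothetical shapes: 10,000 cells, with 2,895 genes after filtering
# versus 20,000 genes in the unfiltered matrix.
filtered_ok = check_n_components(3000, n_samples=10_000, n_features=2_895)
unfiltered_ok = check_n_components(3000, n_samples=10_000, n_features=20_000)

print(filtered_ok, unfiltered_ok)
```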
Regarding your attempt to train CellO using the genes from the full dataset and then run it on the filtered dataset, you get the error "Error. The genes present in data matrix do not match...." This is because when you train a CellO model based on a given input dataset, it does not actually use that dataset as training data. Rather, it uses a built-in training set, but trains new models using only the genes that match the provided input dataset. This ensures you have a model trained on the same set of genes as the dataset you plan to run CellO on. A model trained against the genes in adata.raw therefore expects that gene set, which the filtered adata no longer contains. Does this make sense?
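To illustrate why the gene sets have to line up, here is a small sketch with made-up gene IDs. It is not CellO's actual check; the function name and the gene lists are hypothetical.

```python
def genes_match(model_genes, data_genes):
    """Return (ok, missing); ok is True when the data has every gene the model expects."""
    missing = sorted(set(model_genes) - set(data_genes))
    return len(missing) == 0, missing

# Genes a model was trained on, taken from the unfiltered matrix (made-up IDs).
model_genes = ["GENE_A", "GENE_B", "GENE_C", "GENE_D"]
# The filtered matrix keeps only a subset of those genes.
filtered_genes = ["GENE_A", "GENE_C"]

ok, missing = genes_match(model_genes, filtered_genes)
print(ok, missing)
```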
Regarding your attempt to train CellO on adata.raw and run CellO on adata.raw, this is actually the best way to run CellO on your data among the methods you have described in your previous message. However, the error you are receiving is a bit puzzling to me: "AttributeError: 'Raw' object has no attribute 'obs'". My understanding is that adata.raw IS an AnnData object and therefore must have an obs variable associated with it. I will look into trying to reproduce this using my own adata.raw object.
Lastly, your suggestion to incorporate CellO within Scanpy is a goal that we do have for the future. Hopefully CellO will work well for the community as a quick and dirty way to annotate single cell data at least as a first pass or to supplement manual annotation. Hopefully I can iron out these kinks.
Best, Matt
Oh, and if you would like to "transfer" the Leiden clusters from your filtered data to your unfiltered data, my understanding is that you could run the command:
adata.raw.obs['leiden'] = adata.obs['leiden']
This will copy the column called "leiden" from the dataframe adata.obs to adata.raw.obs. This assumes that the cells remain in the same order and are not filtered at all between adata.obs and adata.raw.
Hi Matt,
Awesome, looks like a great tool, and I think it would get a lot of traction: Scanpy makes clean data and scales very well, but there are currently no automated cell annotation tools in the protocols/ecosystem. In regards to my effort, transferring obs still throws the same error. If I rerun the data I encounter the sparse matrix problem (below). Do you have a suggestion on what code I need to run to transform this? I ran "from scipy import sparse" followed by adata.todense():
AttributeError Traceback (most recent call last)
Hi Brian,
I have looked into the issue around the error AttributeError: 'Raw' object has no attribute 'obs'. It turns out that I misunderstood the AnnData.Raw object. I didn't realize that the AnnData.Raw object ONLY has an X matrix and a var dataframe, but does not have its own obs dataframe (https://anndata.readthedocs.io/en/latest/anndata.AnnData.raw.html). AnnData.Raw is NOT itself an AnnData object, but rather an instance of the class "Raw".
I do have a bit of a hack to work around this, though it is not ideal. Basically, I would suggest simply instantiating a new AnnData object based on adata.raw:
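from anndata import AnnData

# Combine raw's expression matrix and gene metadata with adata's cell annotations
adata_new = AnnData(
    X=adata.raw.X,
    var=adata.raw.var,
    obs=adata.obs
)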
This new object will have raw's expression matrix and raw's gene metadata, but will have the original adata's clustering information. Then, you can feed adata_new to CellO.
Going forward it will be important to implement an option to run CellO on adata.raw. But hopefully the above workaround works for now...
Regarding converting adata's sparse matrix to a dense matrix, the syntax should be the following:
adata.X = adata.X.todense()
This will essentially rewrite adata's expression matrix (the "X" variable) to be a dense version of itself.
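A slightly more defensive version of this step, sketched on a toy matrix rather than a real AnnData object: it converts only when the input is actually sparse, and it estimates the dense size first, since a dense copy of a very large dataset may not fit in memory. The helper names are hypothetical, and the 700,000 x 20,000 shape is an assumed example (the gene count is not given in this thread).

```python
import numpy as np
from scipy import sparse

def dense_size_gb(n_cells, n_genes, dtype=np.float32):
    """Approximate memory needed for a dense n_cells x n_genes matrix, in GB."""
    return n_cells * n_genes * np.dtype(dtype).itemsize / 1e9

def to_dense_if_sparse(X):
    """Return a dense ndarray; convert only if X is a SciPy sparse matrix."""
    return X.toarray() if sparse.issparse(X) else np.asarray(X)

# Toy stand-in for adata.X (values are made up).
X = sparse.csr_matrix(np.array([[0, 1, 0], [2, 0, 0]], dtype=np.float32))
X_dense = to_dense_if_sparse(X)

print(X_dense.shape, dense_size_gb(700_000, 20_000))
```

With a real AnnData object, the result would be assigned back in the same way as above, for example adata.X = to_dense_if_sparse(adata.X).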
Going forward, enabling CellO to work on sparse matrices is clearly an ability that we will need to implement for the next release.
Hello Matt,
When trying to replicate the scanpy protocol with my Raw data I get the following error after the classifier finishes:
TypeError: loop of ufunc does not support argument 0 of type SparseCSRView which has no callable exp method
Have you encountered this before?
Many Thanks, Brian