Open pcm32 opened 1 year ago
The object that fails has the following dtypes:
>>> wmi.var.dtypes
gene_symbols object
mito category
n_cells_by_counts int64
mean_counts float32
log1p_mean_counts float32
pct_dropout_by_counts float64
total_counts float32
log1p_total_counts float32
n_counts float32
n_cells int64
dtype: object
The AnnData object where the gene metadata (including mito) is loaded a priori, and which doesn't fail, looks like this:
>>> wm_ni.var.dtypes
gene_symbols object
mito bool
n_cells_by_counts int64
mean_counts float32
log1p_mean_counts float32
pct_dropout_by_counts float64
total_counts float32
log1p_total_counts float32
n_counts float32
n_cells int64
dtype: object
So it seems that the QC step is willing to work with a bool column but not with a category one (the code is actually setting that column to category at https://github.com/ebi-gene-expression-group/scanpy-scripts/blob/develop/scanpy_scripts/lib/_filter.py#L40).
The following reproduces the error: the same indexing works on wm_ni (bool mito column) but fails on wmi (category mito column):
>>> wm_ni.X[:, wm_ni.var['mito'].values].sum(axis=1)
matrix([[34.],
[40.],
[42.],
...,
[24.],
[54.],
[25.]], dtype=float32)
>>> wmi.X[:, wmi.var['mito'].values].sum(axis=1)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python3.9/site-packages/scipy/sparse/_index.py", line 47, in __getitem__
    row, col = self._validate_indices(key)
  File "/usr/local/lib/python3.9/site-packages/scipy/sparse/_index.py", line 168, in _validate_indices
    col = self._asindices(col, N)
  File "/usr/local/lib/python3.9/site-packages/scipy/sparse/_index.py", line 190, in _asindices
    if max_indx >= length:
TypeError: '>=' not supported between instances of 'str' and 'int'
>>> wmi.var['mito'].dtypes
CategoricalDtype(categories=['False', 'True'], ordered=False)
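For illustration, here is a minimal sketch of one possible workaround: turning a categorical column with the string categories 'False'/'True' back into a boolean mask before indexing the sparse matrix. The toy matrix and the names `X`, `mito_cat`, `mito_mask` and `mito_counts` are made up for this example and are not taken from scanpy-scripts.

```python
import numpy as np
import pandas as pd
from scipy import sparse

# Toy stand-ins (hypothetical data) for wmi.X and wmi.var['mito'].
X = sparse.csr_matrix(np.array([[1, 2, 3], [4, 5, 6]], dtype=np.float32))
mito_cat = pd.Series(["False", "True", "True"], dtype="category")

# The categories are the strings 'False'/'True', so compare against the
# string 'True' to rebuild a real boolean mask.
mito_mask = (mito_cat.astype(str) == "True").to_numpy(dtype=bool)

# Boolean column indexing now works on the sparse matrix.
mito_counts = np.asarray(X[:, mito_mask].sum(axis=1)).ravel()
print(mito_counts)
```

Comparing against the string is deliberate: casting such a column straight to bool would not help, since any non-empty string, including 'False', is truthy.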
Now, the question is why we are explicitly setting that var column to categorical. At least I can say that switching to bool there doesn't seem to break the main SCXA workflow downstream.
The change was introduced at https://github.com/ebi-gene-expression-group/scanpy-scripts/pull/70/files#diff-d4f03c482ed8ddbd6f6e9754d2e42001963362aa3958ee56918f9210747ef2f4R39 to allow negative filtering searches as attempted in #69.
Running a first filter step (genes or cells) when no mito column is given as part of the cell metadata generates a mito column that pandas probably treats as logical (bool), instead of the possibly categorical column obtained when it is read from the metadata file. This leads to the following error:
Most likely, categorical columns (judging by their pandas dtype) get excluded from that qc_vars list while boolean/logical ones do not, or possibly the other way around.
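One way to make the QC step independent of how the column was produced would be to normalise every qc_vars column to bool first. This is only a sketch under that assumption: `as_bool_mask` is a hypothetical helper, not an existing scanpy or scanpy-scripts function, and the `var` table below is toy data.

```python
import pandas as pd

def as_bool_mask(col: pd.Series) -> pd.Series:
    """Return a bool Series from a bool column or a 'True'/'False' string or categorical column."""
    if pd.api.types.is_bool_dtype(col):
        return col.astype(bool)
    # Normalise text such as 'True', 'false', ' TRUE ' before comparing.
    as_text = col.astype(str).str.strip().str.lower()
    unknown = set(as_text.unique()) - {"true", "false"}
    if unknown:
        raise ValueError(f"cannot interpret values as boolean: {sorted(unknown)}")
    return (as_text == "true").astype(bool)

# Toy gene table with the same flag stored both ways (hypothetical data).
var = pd.DataFrame({
    "mito_bool": [False, True, True],
    "mito_cat": pd.Categorical(["False", "True", "True"]),
})

for name in ["mito_bool", "mito_cat"]:
    var[name] = as_bool_mask(var[name])

print(var.dtypes)
```

Raising on unexpected values, rather than silently coercing them, keeps a mislabelled metadata column from being counted as a QC variable.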