Calculations of mito directly generate columns with dtypes that break qc calculations on subsequent filtering

pcm32 commented 1 year ago

Running a first filter step (genes or cells) when there are no mito columns given as part of the cell metadata generates a mito column that is considered logical probably by pandas (instead of possibly categorical when read from the metadata file). This leads into the following error:

Traceback (most recent call last):
  File "/usr/local/tools/_conda/envs/__scanpy-scripts@1.1.2/bin/scanpy-filter-genes", line 10, in <module>
    sys.exit(FILTER_CMD())
  File "/usr/local/tools/_conda/envs/__scanpy-scripts@1.1.2/lib/python3.9/site-packages/click/core.py", line 829, in __call__
    return self.main(*args, **kwargs)
  File "/usr/local/tools/_conda/envs/__scanpy-scripts@1.1.2/lib/python3.9/site-packages/click/core.py", line 782, in main
    rv = self.invoke(ctx)
  File "/usr/local/tools/_conda/envs/__scanpy-scripts@1.1.2/lib/python3.9/site-packages/click/core.py", line 1066, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/usr/local/tools/_conda/envs/__scanpy-scripts@1.1.2/lib/python3.9/site-packages/click/core.py", line 610, in invoke
    return callback(*args, **kwargs)
  File "/usr/local/tools/_conda/envs/__scanpy-scripts@1.1.2/lib/python3.9/site-packages/scanpy_scripts/cmd_utils.py", line 46, in cmd
    func(adata, **kwargs)
  File "/usr/local/tools/_conda/envs/__scanpy-scripts@1.1.2/lib/python3.9/site-packages/scanpy_scripts/cmd_utils.py", line 288, in matrix_function
    func(
  File "/usr/local/tools/_conda/envs/__scanpy-scripts@1.1.2/lib/python3.9/site-packages/scanpy_scripts/lib/_filter.py", line 75, in filter_anndata
    sc.pp.calculate_qc_metrics(
  File "/usr/local/tools/_conda/envs/__scanpy-scripts@1.1.2/lib/python3.9/site-packages/scanpy/preprocessing/_qc.py", line 306, in calculate_qc_metrics
    obs_metrics = describe_obs(
  File "/usr/local/tools/_conda/envs/__scanpy-scripts@1.1.2/lib/python3.9/site-packages/scanpy/preprocessing/_qc.py", line 123, in describe_obs
    X[:, adata.var[qc_var].values].sum(axis=1)
  File "/usr/local/tools/_conda/envs/__scanpy-scripts@1.1.2/lib/python3.9/site-packages/scipy/sparse/_index.py", line 33, in __getitem__
    row, col = self._validate_indices(key)
  File "/usr/local/tools/_conda/envs/__scanpy-scripts@1.1.2/lib/python3.9/site-packages/scipy/sparse/_index.py", line 147, in _validate_indices
    col = self._asindices(col, N)
  File "/usr/local/tools/_conda/envs/__scanpy-scripts@1.1.2/lib/python3.9/site-packages/scipy/sparse/_index.py", line 169, in _asindices
    if max_indx >= length:
TypeError: '>=' not supported between instances of 'str' and 'int'

Most likely categorical columns (from their pandas dtype) get excluded from that qc_vars list, but not for boolean/logical possibly (or the other way around).

pcm32 commented 1 year ago

The object the fails has the following dtypes:

>>> wmi.var.dtypes
gene_symbols               object
mito                     category
n_cells_by_counts           int64
mean_counts               float32
log1p_mean_counts         float32
pct_dropout_by_counts     float64
total_counts              float32
log1p_total_counts        float32
n_counts                  float32
n_cells                     int64
dtype: object

the AnnData object where the gene metadata gets loaded (with mito) apriori (and doesn't fail) looks like this:

wm_ni.var.dtypes
gene_symbols              object
mito                        bool
n_cells_by_counts          int64
mean_counts              float32
log1p_mean_counts        float32
pct_dropout_by_counts    float64
total_counts             float32
log1p_total_counts       float32
n_counts                 float32
n_cells                    int64
dtype: object

so it seems that the following qc trigger is willing to go with bool but not category (the code is actually setting that column to category at https://github.com/ebi-gene-expression-group/scanpy-scripts/blob/develop/scanpy_scripts/lib/_filter.py#L40).

pcm32 commented 1 year ago

And this line then reproduces the error:

>>> wm_ni.X[:, wm_ni.var['mito'].values].sum(axis=1)
matrix([[34.],
        [40.],
        [42.],
        ...,
        [24.],
        [54.],
        [25.]], dtype=float32)
>>> wmi.X[:, wmi.var['mito'].values].sum(axis=1)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python3.9/site-packages/scipy/sparse/_index.py", line 47, in __getitem__
    row, col = self._validate_indices(key)
  File "/usr/local/lib/python3.9/site-packages/scipy/sparse/_index.py", line 168, in _validate_indices
    col = self._asindices(col, N)
  File "/usr/local/lib/python3.9/site-packages/scipy/sparse/_index.py", line 190, in _asindices
    if max_indx >= length:
TypeError: '>=' not supported between instances of 'str' and 'int'
>>> wmi.var['mito'].dtypes
CategoricalDtype(categories=['False', 'True'], ordered=False)

Now, the question is why we might be explicitly setting that var column to categorical. At least I can say that moving to using bool there doesn't seem to break the SCXA main workflow downstream.

pcm32 commented 1 year ago

The change was introduced at https://github.com/ebi-gene-expression-group/scanpy-scripts/pull/70/files#diff-d4f03c482ed8ddbd6f6e9754d2e42001963362aa3958ee56918f9210747ef2f4R39 to allow negative filtering searches as attempted in #69 .

ebi-gene-expression-group / scanpy-scripts

Calculations of mito directly generate columns with dtypes that break qc calculations on subsequent filtering #116