Hide singlet cell groups before marker detection

pinin4fjords commented 3 years ago

This PR addresses an issue whereby single clusters lead to division by 0 errors. I've suggested a fix in the Scanpy codebase (see https://github.com/theislab/scanpy/pull/1490), but we need this before we're likely to have a new Scanpy release, so I'm proposing an analagous fix here, whereby singlet clusters are hidden before rank_gene_groups() is run.

Edit: I've also had to pin back h5py. Versions >= 3.0.0 cause errors like:

 ✗ Filter cells and genes from a raw object
   (in test file scanpy-scripts-tests.bats, line 126)
     `run rm -f $filter_obj && eval "$scanpy filter $filter_opt $read_obj $filter_obj"' failed
   Traceback (most recent call last):
     File "/Users/jmanning/miniconda3/envs/scanpy-scripts-test/bin/scanpy-cli", line 10, in <module>
       sys.exit(cli())
     File "/path/to/click/core.py", line 829, in __call__
       return self.main(*args, **kwargs)
     File "/path/to/click/core.py", line 782, in main
       rv = self.invoke(ctx)
     File "/path/to/click/core.py", line 1259, in invoke
       return _process_result(sub_ctx.command.invoke(sub_ctx))
     File "/path/to/click/core.py", line 1066, in invoke
       return ctx.invoke(self.callback, **ctx.params)
     File "/path/to/click/core.py", line 610, in invoke
       return callback(*args, **kwargs)
     File "/path/to/scanpy_scripts/cmd_utils.py", line 43, in cmd
       func(adata, **kwargs)
     File "/path/to/scanpy_scripts/lib/_filter.py", line 37, in filter_anndata
       k_mito = gene_names.str.startswith('MT-')
     File "/path/to/pandas/core/strings.py", line 2000, in wrapper
       raise TypeError(msg)
   TypeError: Cannot use .str.startswith with values of inferred dtype 'bytes'.

... because of the way .var is read from the HDF5 with strings becoming bytestrings. We'll eventually need to update usages like https://github.com/ebi-gene-expression-group/scanpy-scripts/blob/398bd0e1704566ae4ced978f3f5e2ced1f5e948a/scanpy_scripts/lib/_filter.py#L37, but for now let's just pin back h5py.

nh3 commented 3 years ago

This is a good one! Honestly, I haven't read through scanpy's code or run any test to fully understand the impact of setting some group labels to None (which becomes NaN for categorical variables), e.g. on the DE test and on the plotting (such as rank_genes_groups_dotplot(). Obviously our existing CI tests won't cover it nor do those in scanpy. Presumably you've tested the fix locally and happy to approve on that basis.

pinin4fjords commented 3 years ago

Actually the dot plots seem to be okay, though there's a warning:


>>> sc.tl.rank_genes_groups(adata, groupby = groupby)
WARNING: Default of the method has been changed to 't-test' from 't-test_overestim_var'
/path/to/anndata/_core/anndata.py:1192: FutureWarning: is_categorical is deprecated and will be removed in a future version.  Use is_categorical_dtype instead
  if is_string_dtype(df[key]) and not is_categorical(df[key])
>>> sc.pl.rank_genes_groups_dotplot(adata, groupby = groupby)
/path/to/anndata/_core/anndata.py:1192: FutureWarning: is_categorical is deprecated and will be removed in a future version.  Use is_categorical_dtype instead
  if is_string_dtype(df[key]) and not is_categorical(df[key])
WARNING: dendrogram data not found (using key=dendrogram_louvain_resolution_3.0). Running `sc.tl.dendrogram` with default parameters. For fine tuning it is recommended to run `sc.tl.dendrogram` independently.
WARNING: Groups are not reordered because the `groupby` categories and the `var_group_labels` are different.
categories: 0, 1, 2, etc.
var_group_labels: 0, 1, 2, etc.```

nh3 commented 3 years ago

ok, that looks good then. Thanks for the PR!

nh3 commented 3 years ago

Seems it still triggers "ZeroDivisionError". Perhaps I did something wrong.

pinin4fjords commented 3 years ago

Seems it still triggers "ZeroDivisionError". Perhaps I did something wrong.

I'll take a look

nh3 commented 3 years ago

Looks like adding

adata.obs[groupby].cat.remove_unused_categories(inplace=True)

after setting singlets to None, suppress the error message

pinin4fjords commented 3 years ago

Looks like adding
adata.obs[groupby].cat.remove_unused_categories(inplace=True)
after setting singlets to None, suppress the error message

Nice- thanks. Was on the brink of working that out!

nh3 commented 3 years ago

Setting a resolution of 30 give rise to >300 clusters, making diffexp extremely slow. Not sure if we still need that test in CI, or if we should create singlets in a better way, eg sub-clustering by setting "restrict_to". It's entirely up to you.

pinin4fjords commented 3 years ago

Setting a resolution of 30 give rise to >300 clusters, making diffexp extremely slow. Not sure if we still need that test in CI, or if we should create singlets in a better way, eg sub-clustering by setting "restrict_to".

Yes, and resulting dotplot image is too big to create. I'll figure something out- thanks for the help!

nh3 commented 3 years ago

Looks producing plot image is also troubled by having too many clustering, although the plotting itself looks ok. Entirely up to you if you want to remove those tests, now that we know the fix works.

pinin4fjords commented 3 years ago

You might want to take a final look @nh3 . I'm just using the 'groups' argument as suggested by @ivirshup rather than the previous mechanism, and I've just passed in a text file with arbitrary groups to test the singlets.

nh3 commented 3 years ago

Yes! This is a smart approach. And the test is light too. Very good indeed!

pinin4fjords commented 3 years ago

Thanks @nh3 !

ebi-gene-expression-group / scanpy-scripts

Hide singlet cell groups before marker detection #88