Closed pinin4fjords closed 3 years ago
This is a good one! Honestly, I haven't read through scanpy's code or run any test to fully understand the impact of setting some group labels to None (which becomes NaN for categorical variables), e.g. on the DE test and on the plotting (such as rank_genes_groups_dotplot()
. Obviously our existing CI tests won't cover it nor do those in scanpy. Presumably you've tested the fix locally and happy to approve on that basis.
Actually the dot plots seem to be okay, though there's a warning:
>>> sc.tl.rank_genes_groups(adata, groupby = groupby)
WARNING: Default of the method has been changed to 't-test' from 't-test_overestim_var'
/path/to/anndata/_core/anndata.py:1192: FutureWarning: is_categorical is deprecated and will be removed in a future version. Use is_categorical_dtype instead
if is_string_dtype(df[key]) and not is_categorical(df[key])
>>> sc.pl.rank_genes_groups_dotplot(adata, groupby = groupby)
/path/to/anndata/_core/anndata.py:1192: FutureWarning: is_categorical is deprecated and will be removed in a future version. Use is_categorical_dtype instead
if is_string_dtype(df[key]) and not is_categorical(df[key])
WARNING: dendrogram data not found (using key=dendrogram_louvain_resolution_3.0). Running `sc.tl.dendrogram` with default parameters. For fine tuning it is recommended to run `sc.tl.dendrogram` independently.
WARNING: Groups are not reordered because the `groupby` categories and the `var_group_labels` are different.
categories: 0, 1, 2, etc.
var_group_labels: 0, 1, 2, etc.```
ok, that looks good then. Thanks for the PR!
Seems it still triggers "ZeroDivisionError". Perhaps I did something wrong.
Seems it still triggers "ZeroDivisionError". Perhaps I did something wrong.
I'll take a look
Looks like adding
adata.obs[groupby].cat.remove_unused_categories(inplace=True)
after setting singlets to None
, suppress the error message
Looks like adding
adata.obs[groupby].cat.remove_unused_categories(inplace=True)
after setting singlets to
None
, suppress the error message
Nice- thanks. Was on the brink of working that out!
Setting a resolution of 30 give rise to >300 clusters, making diffexp extremely slow. Not sure if we still need that test in CI, or if we should create singlets in a better way, eg sub-clustering by setting "restrict_to". It's entirely up to you.
Setting a resolution of 30 give rise to >300 clusters, making diffexp extremely slow. Not sure if we still need that test in CI, or if we should create singlets in a better way, eg sub-clustering by setting "restrict_to".
Yes, and resulting dotplot image is too big to create. I'll figure something out- thanks for the help!
Looks producing plot image is also troubled by having too many clustering, although the plotting itself looks ok. Entirely up to you if you want to remove those tests, now that we know the fix works.
You might want to take a final look @nh3 . I'm just using the 'groups' argument as suggested by @ivirshup rather than the previous mechanism, and I've just passed in a text file with arbitrary groups to test the singlets.
Yes! This is a smart approach. And the test is light too. Very good indeed!
Thanks @nh3 !
This PR addresses an issue whereby single clusters lead to division by 0 errors. I've suggested a fix in the Scanpy codebase (see https://github.com/theislab/scanpy/pull/1490), but we need this before we're likely to have a new Scanpy release, so I'm proposing an analagous fix here, whereby singlet clusters are hidden before rank_gene_groups() is run.
Edit: I've also had to pin back h5py. Versions >= 3.0.0 cause errors like:
... because of the way .var is read from the HDF5 with strings becoming bytestrings. We'll eventually need to update usages like https://github.com/ebi-gene-expression-group/scanpy-scripts/blob/398bd0e1704566ae4ced978f3f5e2ced1f5e948a/scanpy_scripts/lib/_filter.py#L37, but for now let's just pin back h5py.