chanzuckerberg / cellxgene-census

CZ CELLxGENE Discover Census
https://chanzuckerberg.github.io/cellxgene-census/
MIT License
84 stars 22 forks source link

Add convenience wrapper for querying axis metadata #1037

Closed hthomas-czi closed 3 months ago

hthomas-czi commented 8 months ago

Description

This would be similar to cellxgene_census.get_anndata and reduce the amount of code users have to write.

Current

cell_metadata = census["census_data"]["homo_sapiens"].obs.read(
    value_filter = "sex == 'female' and cell_type in ['microglial cell', 'neuron']",
    column_names = ["assay", "cell_type", "tissue", "tissue_general", "suspension_type", "disease"]
)

Proposal Edited by Pablo

cell_metadata = cellxgene_census.get_obs(
    census = census,
    organism = "Homo sapiens",
    obs_value_filter = "sex == 'female' and cell_type in ['microglial cell', 'neuron']",
    column_names = {"obs": ["assay", "cell_type", "tissue", "tissue_general", "suspension_type", "disease"]},
)

Questions

ebezzi commented 8 months ago

The census argument is necessary because open_soma returns a SOMA object, and not a Census (internal) object. To change it, we should create a wrapper class.

ivirshup commented 5 months ago

Should the arguments to get_obs include obs? In my draft PR I've provided the arguments: value_filter, coords, and column_names.

prathapsridharan commented 5 months ago

@pablo-gar @ivirshup @ebezzi - I believe some clarification is needed here. _To me, the proposal seems to indicate a replacement ofobs.read() with get_obs()_. However, obs.read() is a tiledbsoma call that returns the result as an iterator overArrow Tables:

https://tiledbsoma.readthedocs.io/en/latest/_autosummary/tiledbsoma.DataFrame.html#tiledbsoma.DataFrame

I think the intention here is for get_obs() to return a pandas dataframe but still preserving the ability to call obs.read(). If I understand the intention correctly, it might be worth distinguishing this clearly by comment or preferably thinking about a better name than get_obs() - perhaps get_obs_df()?

If get_obs() is intended to return an iterator over Arrow Tables then I am little skeptical about its value. The purpose of SOMA is to offer an easy to use API for out-of-core processing and I don't think writing a wrapper around the out-of-core parts of the SOMA API is useful.

ebezzi commented 5 months ago

I believe this proposes to return a Pandas Dataframe. After all, I can't even think of a use case that does not involve reading obs into a dataframe. get_obs_df seems like a good naming suggestion.

ivirshup commented 5 months ago

I think the intention here is for get_obs() to return a pandas dataframe

This was my assumption and how I wrote #1151. I think it could be nice if there was an option to return alternative classes, but pandas seems like a good place to start.

prathapsridharan commented 5 months ago

Should the arguments to get_obs include obs? In my draft PR I've provided the arguments: value_filter, coords, and column_names.

@ivirshup - Can you explain a bit more about why get_obs would need to include obs in its function arguments? Do you mean obs the Dataframe object?

ivirshup commented 5 months ago

Can you explain a bit more about why get_obs would need to include obs in its function arguments?

Ah, I mean "should the argument names include the prefix obs_, because the proposal from @pablo-gar included it for one of them.

So I would think the options for signature are:

What I did

def get_obs(
    census: soma.Collection,
    organism: str,
    *,
    value_filter: Optional[str] = None,
    coords: Optional[SparseDFCoord] = slice(None),
    column_names: Optional[Sequence[str]] = None,
) -> pd.DataFrame:

This is similar to the argument names from SomaExperiment.obs.read

I prefer the this one since it matches it's consistent with tiledbsoma and it's more concise. But I'm also new to the package, so not a super strongly held opinion.

Prefixing obs_

def get_obs(
    census: soma.Collection,
    organism: str,
    *,
    obs_value_filter: Optional[str] = None,
    obs_coords: Optional[SparseDFCoord] = slice(None),
    obs_column_names: Optional[Sequence[str]] = None,
) -> pd.DataFrame:

and similar for var.

This is close to what Pablo writes above, except I've updated the column_names to obs_column_names argument to follow the convention proposed in #1035.

prathapsridharan commented 5 months ago

@ivirshup - Thanks! Yea I agree with you in that _I prefer we do not put a "obs" prefix to the arguments since the function name, get_obs(), and its return type, pd.Dataframe, makes it clear that we are operating on obs. Moreoever, the function docstring would also make it clear what value_filter, coords, etc mean in detail_

pablo-gar commented 5 months ago

Agreed on not having obs/var prefix for the arguments

ivirshup commented 5 months ago

Necro-ing this issue a bit. As I go through the docs (specifically the notebooks) for another PR I am seeing a lot of places where there's code like:

obs_df = census["census_data"][experiment_name].obs.read(
    value_filter="tissue_general == 'central nervous system'",
    column_names=["soma_joinid", "cell_type"],
)

Should I go through and replace these instances with code that looks like:

cellxgene_census.get_obs(
    census,
    experiment_name,
    value_filter="tissue_general == 'central nervous system'",
    column_names=["soma_joinid", "cell_type"],
)

?

pablo-gar commented 5 months ago

Good catch! Yes we should update these docs.

prathapsridharan commented 5 months ago

After the read() there should be a .concat().to_pandas() and ultimately obs.read(...).concat().to_pandas() is what needs to be replaced I think

ivirshup commented 5 months ago

I will reopen this to make sure that change is tracked. Lemme know if I should open a fresh issue instead.