Simplify column_names argument in get_anndata() method

hthomas-czi commented 8 months ago

Separate the get_anndata() method column_names argument into obs_column_names and var_column_names which accept an array instead of an object.

Current

adata = cellxgene_census.get_anndata(
    census = census,
    organism = "Homo sapiens",
    var_value_filter = "feature_id in ['ENSG00000161798', 'ENSG00000188229']",
    obs_value_filter = "sex == 'female' and cell_type in ['microglial cell', 'neuron']",
    column_names = {"obs": ["assay", "cell_type", "tissue", "tissue_general", "suspension_type", "disease"]}
)

Proposed

adata = cellxgene_census.get_anndata(
    census = census,
    organism = "Homo sapiens",
    var_value_filter = "feature_id in ['ENSG00000161798', 'ENSG00000188229']",
    obs_value_filter = "sex == 'female' and cell_type in ['microglial cell', 'neuron']",
    obs_column_names = ["assay", "cell_type", "tissue", "tissue_general", "suspension_type", "disease"]
    var_column_names =  ["feature_id", "feature_name"]

)

edited by Pablo

ivirshup commented 5 months ago

Currently this argument is passed right through to soma.ExperimentAxisQuery.to_anndata, which has the argument column_names.

Should this change be made over there?

I think it would be odd if the argument column_names was deprecated here, but still used in somacore. Especially if the new form of the argument isn't made available in somacore.

ivirshup commented 5 months ago

Emanuele suggested @aaronwolen may have thoughts on whether a change in somacore is on the table here.

Personally, I'm not sure this change is worth creating inconsistency between the APIs. But do like this change overall.

prathapsridharan commented 5 months ago

@ivirshup @ebezzi @pablo-gar - I think this is a tricky design issue and the following are my thoughts. I would like to hear your views on this:

It seems rather confusing to me that we would allow column_names, obs_column_names, and var_column_names and then write some logic to ensure mutual exclusion. The interface has not gained any clarity with this approach.

I would suggest we keep the function signature of get_anndata() as it is right now or replace column_names with obs_column_names and var_column_names in census. If soma also wants to adopt obs_column_names and var_column_names because the maintainers think it is an improvement then that would be fine but not a requirement.

Here is why:

I think we should distinguish between the public interface of census and soma. They don't necessarily have to be consistent because one is an application (census - with application specific nouns and verbs to describe its functionality) built on top of another lower level API (soma - which is a general data model supporting out-of-core processing with a query API).

For example, a census API method could wrap several soma API methods to offer some functionality and in doing so can reasonably have function and argument names that makes sense for the census application and doesn't exactly match soma function signatures.

You could imagine other applications built on top of soma that might define function names and arguments that communicate the nouns and verbs in the application better than soma could.

The fact that census does intentionally expose soma functions does make API design tricky but you could view it as census allowing you to drop down to lower level to get more fine grained control. For example, census exposes get_obs() -> pd.Dataframe, and if the user needs finer control can use obs.read() to get an iterator over arrow tables and can then materialize it in memory if they wish. Similarly, get_anndata() is a user friendly function for census users whereas soma.ExperimentAxisQuery.to_anndata is lower level and more general.

I think we should avoid doing things like the above where we support both soma type arguments and census type arguments that effectively have the same meaning and exist alongside each other in the same census API method.

In general, I think we should put more weight on on what makes usage of the census API easier rather than consistency with form of the underlying soma API.

ivirshup commented 5 months ago

It seems rather confusing to me that we would allow column_names, obs_column_names, and var_column_names and then write some logic to ensure mutual exclusion. The interface has not gained any clarity with this approach.

I have update that code slightly so that now it either errors or throws a warning if column_names is ever passed (I already had the test for it, but not the actual code). I think this should be more clear.

I think that argument is fair, but I guess I assumed there was some desire for users to access tiledbsoma for more complex use cases. Then it would be nice for them if the APIs are the same. It could also ease maintenance if the APIs don't drift much and many arguments can be passed straight through.

chanzuckerberg / cellxgene-census

Simplify column_names argument in get_anndata() method #1035