Closed hthomas-czi closed 3 months ago
The census
argument is necessary because open_soma
returns a SOMA object, and not a Census (internal) object. To change it, we should create a wrapper class.
Should the arguments to get_obs
include obs
? In my draft PR I've provided the arguments: value_filter
, coords
, and column_names
.
@pablo-gar @ivirshup @ebezzi - I believe some clarification is needed here. _To me, the proposal seems to indicate a replacement ofobs.read()
with get_obs()
_. However, obs.read()
is a tiledbsoma
call that returns the result as an iterator overArrow Tables
:
I think the intention here is for get_obs()
to return a pandas dataframe but still preserving the ability to call obs.read()
. If I understand the intention correctly, it might be worth distinguishing this clearly by comment or preferably thinking about a better name than get_obs()
- perhaps get_obs_df()
?
If get_obs()
is intended to return an iterator over Arrow Tables then I am little skeptical about its value. The purpose of SOMA is to offer an easy to use API for out-of-core processing and I don't think writing a wrapper around the out-of-core parts of the SOMA API is useful.
I believe this proposes to return a Pandas Dataframe. After all, I can't even think of a use case that does not involve reading obs into a dataframe. get_obs_df
seems like a good naming suggestion.
I think the intention here is for get_obs() to return a pandas dataframe
This was my assumption and how I wrote #1151. I think it could be nice if there was an option to return alternative classes, but pandas seems like a good place to start.
Should the arguments to
get_obs
includeobs
? In my draft PR I've provided the arguments:value_filter
,coords
, andcolumn_names
.
@ivirshup - Can you explain a bit more about why get_obs
would need to include obs
in its function arguments? Do you mean obs
the Dataframe
object?
Can you explain a bit more about why get_obs would need to include obs in its function arguments?
Ah, I mean "should the argument names include the prefix obs_
, because the proposal from @pablo-gar included it for one of them.
So I would think the options for signature are:
def get_obs(
census: soma.Collection,
organism: str,
*,
value_filter: Optional[str] = None,
coords: Optional[SparseDFCoord] = slice(None),
column_names: Optional[Sequence[str]] = None,
) -> pd.DataFrame:
This is similar to the argument names from SomaExperiment.obs.read
I prefer the this one since it matches it's consistent with tiledbsoma and it's more concise. But I'm also new to the package, so not a super strongly held opinion.
obs_
def get_obs(
census: soma.Collection,
organism: str,
*,
obs_value_filter: Optional[str] = None,
obs_coords: Optional[SparseDFCoord] = slice(None),
obs_column_names: Optional[Sequence[str]] = None,
) -> pd.DataFrame:
and similar for var.
This is close to what Pablo writes above, except I've updated the column_names
to obs_column_names
argument to follow the convention proposed in #1035.
@ivirshup - Thanks! Yea I agree with you in that _I prefer we do not put a "obs" prefix to the arguments since the function name, get_obs()
, and its return type, pd.Dataframe
, makes it clear that we are operating on obs
. Moreoever, the function docstring would also make it clear what value_filter
, coords
, etc mean in detail_
Agreed on not having obs/var prefix for the arguments
Necro-ing this issue a bit. As I go through the docs (specifically the notebooks) for another PR I am seeing a lot of places where there's code like:
obs_df = census["census_data"][experiment_name].obs.read(
value_filter="tissue_general == 'central nervous system'",
column_names=["soma_joinid", "cell_type"],
)
Should I go through and replace these instances with code that looks like:
cellxgene_census.get_obs(
census,
experiment_name,
value_filter="tissue_general == 'central nervous system'",
column_names=["soma_joinid", "cell_type"],
)
?
Good catch! Yes we should update these docs.
After the read()
there should be a .concat().to_pandas()
and ultimately obs.read(...).concat().to_pandas()
is what needs to be replaced I think
I will reopen this to make sure that change is tracked. Lemme know if I should open a fresh issue instead.
Description
This would be similar to
cellxgene_census.get_anndata
and reduce the amount of code users have to write.Current
Proposal Edited by Pablo
Questions
census
argument necessary? Are there other values forcensus
? Could the convenience methods be defined on the census object rather thancellxgene_census
?