Provide access to knowledge artifacts in Census API

hthomas-czi commented 4 months ago

Users should be able to programmatically access the same data we surface in our tools, including:

Cell type descriptions
Canonical marker genes
Computational marker genes

User Quotes

Include description per cell type to easily interpret results (i.e. marker lists to easily cross-reference against publicly available sources)

Via Max on April 24, 2024:

I am a [bioinformatics scientist] at [company], and I am interested in leveraging the CZI dataset to identify and evaluate marker genes across multiple cell types. I can use CellGuide on the web browser to find marker genes for individual cell types and even download csv files that contain expression data, but they do not contain the effect sizes/specificity scores. Without having to repeat the algorithm to find computationally derived markers, is there a programmatic way to retrieve the effect sizes/specificity scores already available on CellGuide?

pablo-gar commented 4 months ago

We need to have an initial conversation with @ambrosejcarr to understand what the priority of such a unified API is for CELLxGENE Discover, and then talk to @dsadgat about its potential home. Census is an option.

pablo-gar commented 4 months ago

Adding another use case: The ability to formulate a Census query from the Collection/Dataset page filter. The user would want to utilize the filters for a census query:

see https://cziscience.slack.com/archives/C04LMG88VKJ/p1711395634420199

From @hthomas-czi

Hi Yanay, Thank you for the message! I’m the lead designer on the CELLxGENE team. To my knowledge, there isn’t currently an easy way to do this, but it’s a really interesting idea. I’d like to see if @Brian Raymor has any input. However, I’d love to understand a bit more about your request. When you say API query, do you mean either cellxgene_census.download_source_h5ad or cellxgene_census.get_anndata or something else? Or to export searches from the datasets page to a CSV? Could you say a little more about what you mean by this as well? Thank you!

From Yanay

Any of the API calls that use filtering. This came up while I was doing a broad search of datasets using multiple disease types. To get around it, I had to copy the raw html and then modify it a bit. It would be great if there was just some button to export the query for use in obs_value_filter
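For illustration, here is a minimal sketch of what such an "export query" button could produce. Nothing here exists in the product: `build_obs_value_filter` is a hypothetical helper, the shape of the `ui_filters` dictionary is an assumption about how page filters might be represented, and the column names are only examples. It assumes selected values contain no quote characters.

```python
from typing import List, Mapping, Sequence


def build_obs_value_filter(filters: Mapping[str, Sequence[str]]) -> str:
    """Turn {column: [selected values]} into a value-filter string.

    Hypothetical helper, not part of cellxgene_census.
    """
    clauses: List[str] = []
    for column, values in filters.items():
        if not values:
            continue  # no selection for this column means no constraint
        for value in values:
            # Assumption: values never contain quotes, so no escaping is needed.
            if "'" in value or '"' in value:
                raise ValueError(f"unsupported quote character in {value!r}")
        if len(values) == 1:
            clauses.append(f"{column} == '{values[0]}'")
        else:
            quoted = ", ".join(f"'{v}'" for v in values)
            clauses.append(f"{column} in [{quoted}]")
    return " and ".join(clauses)


# Example selections, as they might come from the Datasets page filters.
ui_filters = {
    "disease": ["COVID-19", "influenza"],
    "tissue_general": ["lung"],
}
obs_filter = build_obs_value_filter(ui_filters)
print(obs_filter)
# disease in ['COVID-19', 'influenza'] and tissue_general == 'lung'
```

The resulting string is what a user would then pass as `obs_value_filter` to calls such as `cellxgene_census.get_anndata`.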

pablo-gar commented 4 months ago

I had a conversation with @ambrosejcarr on this, TODO:

Wait for CELLxGENE priorities for H2 2024, they should be ready in early April
Align this cross-API functionality to the priorities and craft a summary proposal (e.g. half-pager)
Present to CELLxGENE team

ambrosejcarr commented 3 months ago

Adjustment: priorities will be ready in the first half of May. In April we will be compiling the inputs to set priorities.

(In reply to pablo-gar's comment of Mon, Mar 25, 2024, 5:02 PM.)

pablo-gar commented 2 months ago

First iteration of the half-pager, still waiting for CELLxGENE priorities from @ambrosejcarr

Half-page proposal: Programmatic access to CELLxGENE knowledge artifacts

pablo-gar commented 2 months ago

See another request from a user in support of this: https://cziscience.slack.com/archives/C04LMG88VKJ/p1714658879449619

pablo-gar commented 2 months ago

Anonymized user request in support of this:

I can use CellGuide on the web browser to find marker genes for individual cell types and even download csv files that contain expression data, but they do not contain the effect sizes/specificity scores. Without having to repeat the algorithm to find computationally derived markers, is there a programmatic way to retrieve the effect sizes/specificity scores already available on CellGuide?

atarashansky commented 2 months ago

I can consult on this if necessary when it comes time for implementation.

MaximilianLombardo commented 2 months ago

Relevant user request: wanting to download top marker genes for all cell types in a desired tissue in gene expression:

What is the smartest way to download the top 25 computational marker genes for all cell types of human liver (as example)?
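To make the request concrete: if the computational markers were available as a flat table, "top N per cell type" would be a simple group-and-sort. The sketch below assumes hypothetical field names (`cell_type`, `gene`, `marker_score`) and uses placeholder rows, not real CellGuide data.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Mapping


def top_markers_per_cell_type(
    rows: Iterable[Mapping[str, object]],
    n: int = 25,
    score_key: str = "marker_score",
) -> Dict[str, List[str]]:
    """Return the n highest-scoring genes for each cell type."""
    by_cell_type = defaultdict(list)
    for row in rows:
        by_cell_type[row["cell_type"]].append(row)
    return {
        cell_type: [
            r["gene"]
            for r in sorted(group, key=lambda r: r[score_key], reverse=True)[:n]
        ]
        for cell_type, group in by_cell_type.items()
    }


# Placeholder rows; names and scores are made up for illustration only.
rows = [
    {"cell_type": "cell_type_1", "gene": "GENE_A", "marker_score": 0.9},
    {"cell_type": "cell_type_1", "gene": "GENE_B", "marker_score": 1.4},
    {"cell_type": "cell_type_1", "gene": "GENE_C", "marker_score": 0.2},
    {"cell_type": "cell_type_2", "gene": "GENE_A", "marker_score": 2.0},
]
top = top_markers_per_cell_type(rows, n=2)
print(top)
# {'cell_type_1': ['GENE_B', 'GENE_A'], 'cell_type_2': ['GENE_A']}
```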

ivirshup commented 1 month ago

Discussed this a bit with @ebezzi last week. Some points that came up:

The computationally derived markers should probably get versioned/computed alongside a census release. These could even be part of the census object.
This approach probably doesn't work for the canonical marker genes or CellGuide descriptions, but we could grab and freeze these as part of the census build process.

Some technical approaches:

We could put these as tables in the census
We could make a new bucket and put a bunch of JSON in it
We could put REST endpoints on the website

atarashansky commented 1 month ago

@ebezzi @ivirshup

The computationally derived markers should probably get versioned/computed alongside a census release. These could even be part of the census object

Regarding "computed": This will be difficult. The computationally-derived markers are computed straight from the WmgSnapshot. Refactoring the algorithm to operate on Census directly is a huge endeavor and will have substantial performance implications. Updating the Census builder to generate a partial WmgSnapshot to mitigate this will result in duplicate code across codebases. Duplicating the computational marker genes across codebases also will result in a lot of duplicated code. Refactoring the marker gene pipeline into a separate repo introduces more overhead.

I would recommend treating the computational marker genes the same as the canonical markers and cell guide descriptions.

prathapsridharan commented 1 month ago

@atarashansky @ebezzi @ivirshup

If the pipeline that creates the "WMG Cube" wrote some extra metadata about the census schema version, couldn't a census API function use that schema version, together with the timestamp of when the "WMG Cube" was generated, to decide whether the computational and canonical marker genes can be read? Also, if there is an accessible REST endpoint on the WMG side for this, then census is adequately decoupled from details such as where the artifacts are stored.

ebezzi commented 1 month ago

I don't really want to add 3rd party dependencies to the Census (at query time) as that would drastically change the SLA we can offer to our users. What we could do if we don't wanna adapt the algorithm is to bundle the WMG builder with the Census builder (which we wanted to do anyway), and add an additional step that reads the marker genes from the WMG artifact and adds them back to the Census. This is cleaner but still requires some thinking as there might be easier solutions.

prathapsridharan commented 1 month ago

It is a dependency in our own system (WMG API).

Dependency between census and other systems we own already exists. For instance, the builder reads a manifest file from a REST endpoint to eventually download the datasets

I understand that it might be undesirable to make our query API (which is user facing and runs many times) depend on a system owned by another team. So perhaps an approach similar to how the builder deals with h5ad downloads can be applied here, whereby a REST endpoint returns a manifest file and the builder reads the manifest file to download the marker genes cube. The manifest file might include:

Location of the marker genes cube
The schema of the marker genes cube
The census schema version used to generate the marker genes cube
The timestamp of when the marker genes cube was created

Census API acceptance tests could then contain a test to read the marker genes cube, slice it, etc. It might be useful for the manifest file to include additional_metadata, like the TileDB version used to create the marker genes cube, for debugging purposes.
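As a rough sketch of the manifest idea, the fields listed above could be modelled like this. All key names, the example values, and the equality-based compatibility rule are assumptions for illustration; no such manifest or endpoint exists yet.

```python
import json
from dataclasses import dataclass, field
from datetime import datetime
from typing import Any, Dict


@dataclass(frozen=True)
class MarkerGenesManifest:
    """Hypothetical manifest describing one marker genes cube."""

    location: str                   # where the cube is stored
    cube_schema: Dict[str, str]     # column name -> type of the cube
    census_schema_version: str      # Census schema used to build the cube
    created_at: datetime            # when the cube was created
    additional_metadata: Dict[str, Any] = field(default_factory=dict)


def parse_manifest(text: str) -> MarkerGenesManifest:
    """Parse the JSON body that the (assumed) REST endpoint would return."""
    raw = json.loads(text)
    return MarkerGenesManifest(
        location=raw["location"],
        cube_schema=raw["cube_schema"],
        census_schema_version=raw["census_schema_version"],
        created_at=datetime.fromisoformat(raw["created_at"]),
        additional_metadata=raw.get("additional_metadata", {}),
    )


def is_compatible(manifest: MarkerGenesManifest, builder_schema_version: str) -> bool:
    """Assumed rule: the builder only ingests cubes built on the same schema."""
    return manifest.census_schema_version == builder_schema_version


# Placeholder manifest; every value below is illustrative, not real.
example = """
{
  "location": "s3://example-bucket/marker-genes-cube/",
  "cube_schema": {"cell_type_ontology_term_id": "string", "marker_score": "float32"},
  "census_schema_version": "0.0.0",
  "created_at": "2024-01-01T00:00:00+00:00",
  "additional_metadata": {"tiledb_version": "0.0.0"}
}
"""
manifest = parse_manifest(example)
print(is_compatible(manifest, "0.0.0"))
```

An acceptance test on the Census side could then start from `parse_manifest`, check `is_compatible`, and only afterwards open and slice the cube.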

atarashansky commented 1 month ago

I might be missing something, but I am not sure if we need anything more than the timestamp of the snapshot itself. Pick the highest version number in s3://cellxgene-wmg-prod/snapshots/ and pick the latest snapshot in s3://cellxgene-wmg-prod/snapshots/{version}/. The timestamp tells you exactly when it was generated, from which you can infer what Census build it was using (WMG always reads from latest - if it's not compatible with the Census schema version, WMG pipeline fails early and does not build anything).
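The selection rule I'm describing is roughly the following. This is only a sketch over a list of object keys: it assumes version directories are named with an integer (optionally prefixed with `v`) and that snapshot directories are integer timestamps, which should be checked against the actual bucket layout. Listing the bucket itself is left out.

```python
import re
from typing import Iterable, Optional, Tuple

# Assumed key layout: snapshots/<version>/<timestamp>/...
_KEY_PATTERN = re.compile(r"^snapshots/v?(\d+)/(\d+)/")


def latest_snapshot(keys: Iterable[str]) -> Optional[Tuple[int, int]]:
    """Return (version, timestamp) of the newest snapshot, or None.

    Tuples compare element by element, so the highest version wins first
    and the latest timestamp within that version wins second.
    """
    best: Optional[Tuple[int, int]] = None
    for key in keys:
        match = _KEY_PATTERN.match(key)
        if match is None:
            continue  # ignore keys that do not look like snapshot contents
        candidate = (int(match.group(1)), int(match.group(2)))
        if best is None or candidate > best:
            best = candidate
    return best


# Placeholder keys for illustration only.
keys = [
    "snapshots/v1/1700000000/cube",
    "snapshots/v2/1600000000/cube",
    "snapshots/v2/1650000000/cube",
    "snapshots/readme.txt",
]
print(latest_snapshot(keys))
# (2, 1650000000)
```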

This being said, it's also trivial to add metadata to the snapshot that explicitly documents the Census build date.

@ebezzi In any case, I would opt for one of these two solutions rather than bundling the WMG builder with Census. I don't think that's exactly what I had in mind... I'm not sure what you mean by "bundle", but I would prefer the pipelines to remain independent pipelines that are orchestrated by some job manager.

ivirshup commented 1 month ago

I've put together a little proof of concept on how one could access these things right now: https://gist.github.com/ivirshup/e7cc5b717bad6fd32460525765e10c9b

There's no versioning or anything, which is probably necessary at some point (e.g. you want to find marker genes for all datasets for a particular version of census).

chanzuckerberg / cellxgene-census

Provide access to knowledge artifacts in Census API #1031