chanzuckerberg / single-cell

A collection of documents that reflect various design decisions that have been made for the cellxgene project.

Automatically update landing page "hero numbers" #344

Closed ambrosejcarr closed 1 year ago

ambrosejcarr commented 2 years ago

Context: Currently, splash page hero numbers need to be manually updated. This creates some lag between accumulating data and conveying it to users.

See single-cell-ux.

Story: CELLxGENE Discover users have an up-to-date view of the magnitude of data it provides access to.

Requirements:

brianraymor commented 1 year ago

@pablo-gar @atolopko-czi - are these numbers calculated on a regular basis for WMG?

atolopko-czi commented 1 year ago

> @pablo-gar @atolopko-czi - are these numbers calculated on a regular basis for WMG?

The cell count, gene count, and dataset count (but not cell type count) will be calculated for WMG data with this upcoming PR. However, these numbers will differ from the full corpus since there is some dataset filtering occurring for WMG specifically. Architecturally, I think we should have the dataset processing pipeline compute these values if they're needed for live reporting.

pablo-gar commented 1 year ago

> However, these numbers will differ from the full corpus since there is some dataset filtering occurring for WMG specifically.

Correct, so we shouldn't use the WMG numbers.

> Architecturally, I think we should have the dataset processing pipeline compute these values if they're needed for live reporting.

@ainfeld does something like this for the dashboards; she may have some insights into this.

brianraymor commented 1 year ago

It would be easy enough to add an incrementer in collection publication, although revision publications would require a bit more nuance.
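A minimal sketch of what such an incrementer might look like, assuming the publication step can see the cell count of every dataset in the collection being published. The class and method names (`HeroCounters`, `publish`, `delete`) are hypothetical and not part of the existing codebase. The nuance for revisions is handled by remembering what was last published for each collection and applying only the difference.

```python
class HeroCounters:
    """Running totals for the landing page (hypothetical helper, not an existing API)."""

    def __init__(self):
        self.dataset_count = 0
        self.cell_count = 0
        # collection_id -> {dataset_id: cell_count} as of the last publication
        self._published = {}

    def publish(self, collection_id, datasets):
        """Apply a first publication or a revision.

        `datasets` maps dataset id to cell count for the full collection
        as it will appear after this publication.
        """
        previous = self._published.get(collection_id, {})
        # Apply only the delta, so a revision does not double count.
        self.dataset_count += len(datasets) - len(previous)
        self.cell_count += sum(datasets.values()) - sum(previous.values())
        self._published[collection_id] = dict(datasets)

    def delete(self, collection_id):
        """Withdraw a published collection and subtract its contribution."""
        self.publish(collection_id, {})
        self._published.pop(collection_id, None)


counters = HeroCounters()
counters.publish("c1", {"d1": 100, "d2": 50})  # first publication
counters.publish("c1", {"d1": 120})            # revision: d1 updated, d2 removed
counters.publish("c2", {"d3": 30})
print(counters.dataset_count, counters.cell_count)
```

This only covers additive numbers such as dataset and cell counts; a count of distinct values cannot be maintained by simple addition and subtraction in the same way.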

atolopko-czi commented 1 year ago

The dataset count and primary cell count can be determined from the analytics system. The unique cell type count is more difficult to determine since this requires performing a set union operation across all dataset cells. We have that data from the WMG pipeline, but it's explicitly filtered for WMG needs; however, its cell type counts may be close enough. Still, I would not suggest having an architecture that compiles stats from multiple subsystems.
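To illustrate why the cell type count differs from the other two numbers, here is a small sketch under assumed inputs. The record layout (`primary_cell_count`, `cell_type_ontology_term_ids`) and the function name `corpus_stats` are hypothetical, and the values are made up for illustration. Dataset and cell counts are simple sums, while the cell type count needs a set union because the same cell type appears in many datasets.

```python
def corpus_stats(datasets):
    """Compute hero-number style totals from per-dataset summaries.

    Each item is assumed to be a dict with a primary cell count and the
    cell type ontology term ids observed in that dataset.
    """
    cell_types = set()
    cell_count = 0
    for ds in datasets:
        cell_count += ds["primary_cell_count"]
        # Union, not sum: a term seen in several datasets counts once.
        cell_types |= set(ds["cell_type_ontology_term_ids"])
    return {
        "datasets": len(datasets),
        "cells": cell_count,
        "cell_types": len(cell_types),
    }


# Illustrative data only.
example = [
    {"primary_cell_count": 1000,
     "cell_type_ontology_term_ids": ["CL:0000084", "CL:0000236"]},
    {"primary_cell_count": 500,
     "cell_type_ontology_term_ids": ["CL:0000084", "CL:0000540"]},
]
stats = corpus_stats(example)
print(stats)
```

Summing the per-dataset distinct counts here would give 4, while the union gives 3, which is why this number cannot be assembled from independent per-dataset totals.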

This requirement is pushing us towards needing something like SOMA's full cell-based corpus. However, SOMA's TileDB backend is not efficient for full-corpus aggregation calculations, though it might be performant enough. This assumes a SOMA corpus is available when this story needs to be embarked upon. If not, I would start considering adding an intermediate cell-based corpus representation in Spark that would ultimately support WMG, analytics, and SOMA pipelines. As an added benefit, it would also support ad hoc SQL-based queries via Databricks notebooks (more conveniently than equivalent SOMA/TileDB queries).

ainfeld commented 1 year ago

I believe all of these should be in the Data corpus dashboard. Would it be possible to use some parts of my ETL pipeline? My script checks every morning for new datasets added or updated.

brianraymor commented 1 year ago

Does your script manage:

  1. Revisions - a published dataset is updated.
  2. Deletions - a published dataset is deleted. This is an edge case because we have only deleted/withdrawn one published collection since the beginning of time.

ainfeld commented 1 year ago

Yep, it deals with revisions and deletions. It had dealt with deletions before, but then for some reason I changed the merge; I've now reverted that change. Every morning it also confirms that the number of unique collection and dataset ids from the API matches the number of unique collection and dataset ids from my data pull.
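A check of this kind could be sketched roughly as follows. This is not the actual ETL script; the function name `reconcile` and the id values are hypothetical. Comparing the id sets rather than only the counts also shows which ids are out of sync, which helps when a deletion or revision was missed.

```python
def reconcile(api_ids, pulled_ids):
    """Compare ids reported by the API with ids present in the local data pull."""
    api, pulled = set(api_ids), set(pulled_ids)
    return {
        "counts_match": len(api) == len(pulled),
        # Published upstream but absent locally (e.g. a new dataset not yet pulled).
        "missing_from_pull": sorted(api - pulled),
        # Present locally but gone upstream (e.g. a deleted or replaced dataset).
        "stale_in_pull": sorted(pulled - api),
    }


report = reconcile(
    api_ids=["ds-1", "ds-2", "ds-4"],
    pulled_ids=["ds-1", "ds-2", "ds-3"],
)
print(report)
```

In this example the counts match even though the sets differ, so a count-only comparison would not catch the discrepancy.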

atolopko-czi commented 1 year ago

I don't think analytics can provide distinct cell type counts across the corpus, but correct me if I'm wrong.

ainfeld commented 1 year ago

I may be misunderstanding cell type counts. Are we referring to the ontological metadata field? What my script does is import all of the cxg files, appends, and counts distinct cell types across all of the datasets.

atolopko-czi commented 1 year ago

> I may be misunderstanding cell type counts. Are we referring to the ontological metadata field? What my script does is import all of the cxg files, appends, and counts distinct cell types across all of the datasets.

That's my understanding of the cell type count as well, and it sounds like analytics provides exactly what we need then! 🎉 I clearly didn't correctly grok the code wrt cell type count.

brianraymor commented 1 year ago

@pablo-gar has volunteered to manually update the constants on Mondays starting Feb 27 2023 until this automation can be scheduled.

brianraymor commented 1 year ago

Per discussion in #single-cell-planning, Data Viz is now responsible for addressing this. The requirements must be updated to reflect the H2 focus on census numbers rather than corpus numbers.

pablo-gar commented 1 year ago

@signechambers1 @dsadgat I've been manually updating these numbers on a weekly basis. Should I transfer this responsibility to someone in DataViz now?

signechambers1 commented 1 year ago

@pablo-gar not yet! Thanks for updating; this ticket captures the work to automate the process. If you wouldn't mind doing it until someone on the team can pick it up, that would be much appreciated.

niknak33 commented 1 year ago

Based on conversations with Pablo and Brian R, I will move this to a P0 in the next sprint. This will take the manual task off Pablo and move it into our automation process. For reference: https://czi-sci.slack.com/archives/C03UGMKS0K0/p1690829684091969

atolopko-czi commented 1 year ago

@niknak33 @dsadgat Once an engineer is assigned, please loop me into the design discussions. One reason this hasn't been taken on earlier is that our current data architecture doesn't make it easy to generate these numbers. While Analytics computes these numbers, I want to understand if there's a cleaner way to compute these values for web publication that doesn't involve a dependency on the analytics system. This issue was one of the motivations behind the Shared Integrated Corpus proposal.

dsadgat commented 1 year ago

@atolopko-czi I thought we were going to reuse the ETL Amanda has already built?

atolopko-czi commented 1 year ago

> @atolopko-czi I thought we were going to reuse the ETL Amanda has already built?

I'm suggesting that it's a good time to take another look at the existing system and assess whether we can improve it as part of this effort, since we're adding a dependency on a design that is not optimal (i.e. querying CXG files).

atarashansky commented 1 year ago

Here is the latest proposal on how to automate hero number updating. After syncing with @niknak33 and @metakuni, Solution 3 in the document is the preferred course of action moving forward: https://docs.google.com/document/d/1UIw-nyNmRF6ob46WLOnxXhyT46J-jcEglzJ9bxGV2BU/edit#heading=h.98t25nypl0d1

Waiting for @atolopko-czi to return and review before executing.

atolopko-czi commented 1 year ago

Added comments to doc. I have architectural concerns about solution 3, in particular, but provided a revised idea. Also, I think we need to record in this document precisely how Product wants counts to be computed.