ambrosejcarr closed this issue 1 year ago
@pablo-gar @atolopko-czi - are these numbers calculated on a regular basis for WMG?
> @pablo-gar @atolopko-czi - are these numbers calculated on a regular basis for WMG?
The cell count, gene count, and dataset count (but not cell type count) will be calculated for WMG data with this upcoming PR. However, these numbers will differ from the full corpus since there is some dataset filtering occurring for WMG specifically. Architecturally, I think we should have the dataset processing pipeline compute these values if they're needed for live reporting.
> However, these numbers will differ from the full corpus since there is some dataset filtering occurring for WMG specifically.

Correct, so we shouldn't use the WMG numbers.
> Architecturally, I think we should have the dataset processing pipeline compute these values if they're needed for live reporting.

@ainfeld does something like this for the dashboards; she may have some insights into this.
It would be easy enough to add an incrementer in collection publication, although revision publications would require a bit more nuance.
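The incrementer idea above could be sketched roughly as follows. This is a hypothetical illustration, not the actual Discover schema or publication hooks: a new collection simply adds its counts, while a revision applies a delta, since datasets may be added, removed, or replaced with different cell counts.

```python
# Hypothetical sketch of count incrementing on publication events.
# All names here are illustrative, not real Discover code.
from dataclasses import dataclass


@dataclass
class CorpusTotals:
    datasets: int = 0
    cells: int = 0

    def on_collection_publish(self, dataset_cell_counts: list) -> None:
        # New collection: simply add its datasets and cells.
        self.datasets += len(dataset_cell_counts)
        self.cells += sum(dataset_cell_counts)

    def on_revision_publish(self, old_counts: list, new_counts: list) -> None:
        # Revision: apply the delta rather than a simple increment,
        # since datasets may be added, removed, or replaced.
        self.datasets += len(new_counts) - len(old_counts)
        self.cells += sum(new_counts) - sum(old_counts)


totals = CorpusTotals()
totals.on_collection_publish([100, 250])       # 2 datasets, 350 cells
totals.on_revision_publish([100, 250], [120])  # revision shrinks to 1 dataset, 120 cells
```

The revision case is where the nuance lives: a naive increment-only counter would drift as soon as a revision removes or replaces a dataset.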
The dataset count and primary cell count can be determined from the analytics system. The unique cell type count is more difficult to determine, since it requires a set union operation across all dataset cells. We have that data from the WMG pipeline, but it's explicitly filtered for WMG needs; however, its cell type counts may be close enough. Still, I would not recommend an architecture that compiles stats from multiple subsystems.
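To make the set-union point concrete: the same ontology term appears in many datasets, so per-dataset counts cannot simply be summed. A minimal illustration with made-up CL terms:

```python
# Illustrative only: why unique cell type counts require a set union
# rather than a per-dataset sum. Terms below are arbitrary examples.
per_dataset_cell_types = [
    {"CL:0000236", "CL:0000624"},  # dataset A
    {"CL:0000624", "CL:0000625"},  # dataset B
    {"CL:0000236"},                # dataset C
]

# Summing per-dataset counts over-counts shared terms.
naive_sum = sum(len(s) for s in per_dataset_cell_types)       # 5

# The union gives the true corpus-wide distinct count.
unique_cell_types = set().union(*per_dataset_cell_types)      # 3 distinct terms

print(naive_sum, len(unique_cell_types))
```

The union itself is cheap; the difficulty described above is that computing it honestly requires visiting every dataset's cells in one place.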
This requirement is pushing us towards needing something like SOMA's full cell-based corpus. However, SOMA's TileDB backend is not efficient for full-corpus aggregation calculations, though it might be performant enough. This assumes a SOMA corpus is available when this story needs to be embarked upon. If not, I would start considering adding an intermediate cell-based corpus representation in Spark that would ultimately support WMG, analytics, and SOMA pipelines. As an added benefit, it would also support ad hoc SQL-based queries via Databricks notebooks (more conveniently than equivalent SOMA/TileDB queries).
I believe all of these should be in the (Data corpus dashboard). Would it be possible to use some parts of my ETL pipeline? My script checks every morning for new datasets added or updated.
Does your script manage revisions and deletions?
Yep, it deals with revisions and deletions. It had handled deletions before, but at some point I changed the merge logic; I've now reverted that change. Every morning it also confirms that the numbers of unique collection and dataset ids from the API match the numbers of unique collection and dataset ids from my data pull.
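The morning consistency check described here might look something like the sketch below. The function name and record shape are hypothetical stand-ins, not the actual ETL code:

```python
# Hedged sketch of the API-vs-data-pull consistency check.
# Each record is a hypothetical (collection_id, dataset_id) pair.
def ids_match(api_records, local_records):
    api_collections = {c for c, _ in api_records}
    api_datasets = {d for _, d in api_records}
    local_collections = {c for c, _ in local_records}
    local_datasets = {d for _, d in local_records}
    return (api_collections == local_collections
            and api_datasets == local_datasets)


api = [("col1", "ds1"), ("col1", "ds2"), ("col2", "ds3")]
local = [("col1", "ds1"), ("col1", "ds2"), ("col2", "ds3")]
print(ids_match(api, local))                          # matches
print(ids_match(api, local + [("col2", "ds4")]))      # stale deletion fails
```

Comparing id sets rather than counts catches the deletion case: a dataset lingering in the pull after deletion would fail the check even if totals happened to agree.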
I don't think analytics can provide distinct cell type counts across the corpus, but correct me if I'm wrong.
I may be misunderstanding cell type counts. Are we referring to the ontological metadata field? What my script does is import all of the cxg files, appends, and counts distinct cell types across all of the datasets.
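The append-and-count step described here can be illustrated with pandas. Reading the actual .cxg files (TileDB-backed) is omitted; the DataFrames below are stand-ins for each dataset's obs table, and the `cell_type` column name reflects the ontological metadata field discussed above:

```python
# Minimal pandas illustration of: import each dataset's obs, append,
# count distinct cell types across all datasets. Data is made up.
import pandas as pd

obs_a = pd.DataFrame({"cell_type": ["B cell", "T cell", "T cell"]})
obs_b = pd.DataFrame({"cell_type": ["T cell", "NK cell"]})

combined = pd.concat([obs_a, obs_b], ignore_index=True)
distinct_cell_types = combined["cell_type"].nunique()
print(distinct_cell_types)  # 3
```

Note this counts distinct label values; if the field holds ontology term ids, the same approach applies unchanged.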
> I may be misunderstanding cell type counts. Are we referring to the ontological metadata field? What my script does is import all of the cxg files, appends, and counts distinct cell types across all of the datasets.
That's my understanding of the cell type count as well, and it sounds like analytics provides exactly what we need then! 🎉 I clearly didn't correctly grok the code wrt cell type count.
@pablo-gar has volunteered to manually update the constants on Mondays starting Feb 27 2023 until this automation can be scheduled.
Per discussion in #single-cell-planning, Data Viz is now responsible for addressing this. The requirements must be updated to reflect the H2 focus on census numbers rather than corpus numbers.
@signechambers1 @dsadgat I've been manually updating these numbers on a weekly basis. Should I transfer this responsibility to someone in DataViz now?
@pablo-gar not yet! thanks for updating, this ticket captures the work to automate the process. If you wouldn't mind doing it until someone on the team can pick it up that would be much appreciated.
Based on conversations with Pablo and Brian R, we will move this to P0 in the next sprint. This takes the manual task off Pablo's plate and moves it into our automation process. https://czi-sci.slack.com/archives/C03UGMKS0K0/p1690829684091969 - for reference.
@niknak33 @dsadgat Once an engineer is assigned, please loop me into the design discussions. One reason this hasn't been taken on earlier is that our current data architecture doesn't make it easy to generate these numbers. While Analytics computes these numbers, I want to understand if there's a cleaner way to compute these values for web publication that doesn't involve a dependency on the analytics system. This issue was one of the motivations behind the Shared Integrated Corpus proposal.
@atolopko-czi I thought we were going to reuse the ETL amanda has already built?
> @atolopko-czi I thought we were going to reuse the ETL amanda has already built?
I'm suggesting that it's a good time to take another look at the existing system and assess whether we can improve it as part of this effort, since we're adding a dependency on a design that is not optimal (i.e. querying CXG files).
Here is the latest proposal on how to automate hero number updating. After syncing with @niknak33 and @metakuni, Solution 3 in the document is the preferred course of action moving forward: https://docs.google.com/document/d/1UIw-nyNmRF6ob46WLOnxXhyT46J-jcEglzJ9bxGV2BU/edit#heading=h.98t25nypl0d1
Waiting for @atolopko-czi to return and review before executing.
Added comments to the doc. I have architectural concerns about Solution 3 in particular, but have provided a revised idea. Also, I think we need to record in this document precisely how Product wants the counts to be computed.
Context: Currently, splash page hero numbers need to be manually updated. This creates some lag between accumulating data and conveying it to users.
See single-cell-ux.
Story: CELLxGENE Discover users have an up-to-date view of the magnitude of data it provides access to.
Requirements: