chanzuckerberg / single-cell-data-portal

The data portal supporting the submission, exploration, and management of projects and datasets to cellxgene.
MIT License

Paginate WMG queries by tissue while retaining support for cell type filtering #6402

Open atarashansky opened 8 months ago

atarashansky commented 8 months ago

Goal: We want to paginate WMG queries by tissue, similar to how it was done in WMG v1.

Considerations: To enable tissue-aggregated statistics for collapsed tissues and the cell type filter functionality, we need two additional endpoint modes and data artifacts. One endpoint mode will fetch tissue-aggregated stats from a tissue-aggregated cube. The other will fetch specific cell types across all tissues from a cube optimized for slicing by cell type.
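The two extra modes could be modeled roughly as below. This is only a sketch; the mode and cube names are assumptions for illustration, not the portal's actual API.

```typescript
// Hypothetical request modes for the two additional endpoints.
type QueryMode =
  | { kind: "tissue_aggregated" } // stats for collapsed tissues
  | { kind: "cell_type_filtered"; cellTypes: string[] }; // slice by cell type across tissues

// Pick the data artifact (cube) a request should be served from.
// Cube names are illustrative placeholders.
function cubeFor(mode: QueryMode): string {
  switch (mode.kind) {
    case "tissue_aggregated":
      return "tissue_aggregated_cube";
    case "cell_type_filtered":
      return "cell_type_optimized_cube";
  }
}
```

A discriminated union like this keeps the routing exhaustive: adding a third mode later forces every switch over `QueryMode` to handle it.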

Some implementation notes:

In WMG v1, whenever we added a new tissue or gene, we would resubmit a query that included all PREVIOUSLY SELECTED tissues and genes. This is very inefficient.

Instead, we need a generalized strategy for computing the difference between the data we already have and the data we want to display. We should query only for that difference and append the new data to the existing data. This “diffing” approach will apply to the tissue filter/expansion, the cell type filter, and the add-gene affordance. If a secondary filter changes, all data needs to be re-fetched. The key question is whether the current hook-based architecture can support a diffing strategy, or whether we need to store and read the heatmap data from the reducer store. If the latter is true, this could require a fairly meaty rearchitecting of the frontend.
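A minimal sketch of the diffing idea, assuming a very simple selection shape. The types and function names here are illustrative, not the real store.

```typescript
// What the UI has already loaded vs. what it now wants to display.
interface Selection {
  tissues: string[];
  genes: string[];
}

// Returns only the tissues/genes that still need to be fetched;
// already-loaded data is kept and appended to.
function computeDiff(loaded: Selection, desired: Selection): Selection {
  const missing = (have: string[], want: string[]) =>
    want.filter((x) => !have.includes(x));
  return {
    tissues: missing(loaded.tissues, desired.tissues),
    genes: missing(loaded.genes, desired.genes),
  };
}
```

A real implementation would also have to fetch the cross terms (new genes for already-loaded tissues and new tissues for already-selected genes) and invalidate everything when a secondary filter changes, as noted above.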

Regarding our current request-clobbering logic: we noticed that it does not actually prevent clobbered requests from being sent to the server; it only instructs the browser to ignore the response. This means that if we add multiple genes in a row with enough time between them, we send multiple unnecessary requests to the backend. We can remove clobbering and instead update the cell type filter and add-genes dropdown to only trigger a request when the user clicks out of the dropdown (similar to the secondary filters).
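The commit-on-close idea could be sketched like this. The class and callback names are hypothetical; in the real app this state would live in React.

```typescript
// Selections made while the dropdown is open are buffered; a single
// request fires only when the user clicks out of the dropdown.
class GeneDropdown {
  private pending: string[] = [];
  requestsSent = 0;

  select(gene: string): void {
    // No request here — just buffer the selection (deduplicated).
    if (!this.pending.includes(gene)) this.pending.push(gene);
  }

  // Called when the user clicks out, like the secondary filters.
  close(send: (genes: string[]) => void): void {
    if (this.pending.length > 0) {
      send(this.pending); // one request for all buffered genes
      this.requestsSent += 1;
      this.pending = [];
    }
  }
}
```

Compared with clobbering, this never puts the extra requests on the wire in the first place, so there is nothing for the browser to ignore.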

Bugs found during discussion (these might need to be fixed before embarking on the rearchitecture):

  - Cell type filter is not crossfiltering into tissues.
  - Cell type filter isn’t being crossfiltered by other filters either.
  - Spinal cord is empty when European is selected and BEST4+ + bergmann glial cell are selected, and brain doesn’t auto-expand even though it has a matching cell type.
  - Blood/Breast + leukocyte is removing blood for some reason, even though it HAS leukocyte. And in this case, the cross-filtering DOES work??

prathapsridharan commented 8 months ago

@atarashansky @tihuan @prathapsridharan met on 01/04/2024 to understand the complexities involved. After discussion, we agree that this is a significant amount of work, involving changes to the data structure used in the frontend and a new backend API that better suits the frontend's query semantics and patterns. We also agree that this will significantly improve performance by limiting both the number of network calls made by the frontend and the data size sent over the network by the backend. That is, this solution is very likely needed if we want to hit the desired load times stated in the performance requirements document.

It is very hard to give a high-confidence estimate of how long this will take, but a rough estimate puts it at a minimum of 5 weeks with 2 engineers. Rather than fully committing to this large undertaking, we can start by concretely exploring with measurements and POCs to identify and mitigate risks early.

Here is a rough list of tasks we could take on one at a time and decide to abandon/descope if things prove to be too complicated:

  1. [3-5 days] Profile frontend performance after the query returns (using Chrome dev tools, Datadog, or simple timestamps) to determine whether the bulk of the frontend rendering time (as much as 7 seconds after the query has returned when selecting 50 genes and grouping by disease) is due to heatmap rendering or to deserializing/iterating through the frontend data structures.

  2. [1 sprint] Do a POC of the new frontend data structure and caching scheme. We can run this scheme alongside the existing one to check that the differential queries it would generate are correct; it doesn't have to actually send the queries during the POC. Alternatively, we can POC the new scheme in a different branch. The goal is to mitigate risk: identify problems we hadn't thought of, demonstrate solutions to known problems, etc. This might also be a good way to inform the design of the backend API; if we understand well what the frontend wants to query for, we can design an efficient backend for it (client-driven API design).

  3. [1 sprint] Based on (2), create a new version of the API (v3?).

  4. [1 sprint] Productionize (2) and (3): unit tests, functional tests, and integration (e2e) tests, and retire the old frontend and v2 API code.

As mentioned, this is at least 5 weeks' worth of work; (2) in particular will take some time.
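The "simple timestamps" option from task (1) could be as small as the sketch below: wrap each post-query phase, accumulate its duration, and compare deserialization time against heatmap rendering time. The phase names and helper are made up for illustration.

```typescript
// Accumulated wall-clock time per phase, in milliseconds.
const timings: Record<string, number> = {};

// Run fn, record how long it took under the given phase name,
// and pass its return value through.
function timed<T>(phase: string, fn: () => T): T {
  const start = performance.now();
  try {
    return fn();
  } finally {
    timings[phase] = (timings[phase] ?? 0) + (performance.now() - start);
  }
}
```

Usage would look like `timed("deserialize", () => parseResponse(raw))` and `timed("render", () => drawHeatmap(data))` (both callees hypothetical), after which `timings` shows where the 7 seconds actually go.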