Problem

We have been instructed to accept submissions in AnnData files where gene expression quantifications have been generated by external groups and raw data are not available. We have no control over the processing methods used or the quantification metric represented (CPM, TPM, scaled, unscaled).

This could create a number of problems for us, e.g.:

Expression measures not matching SCXA expectations - this could be a different normalisation, for example TPM, which is problematic for comparison between cells. The values could also have undergone various transformations (log, zero-centering, scaling).
Expression matrices not matching markers - unless careful steps are taken to retain the correct matrices during processing, any markers present in an object may be only peripherally related to the expression matrices, and visualisation of expression values for those markers will not produce good results.
...

Solutions

Some things I think we need to head off issues in displaying these datasets:

SCXA needs a concept of expression measure.
- Just for display purposes we need to remove hard-coding of CPM values in the marker gene plots for each experiment and have that sourced from the DB.
- Unless expression measure is the same as SCXA experiments (basically CPM, without scaling), experiments supplied in this way need to be excluded from any place where cross-experiment comparison is envisaged or things will look very strange indeed.
We need to be able to turn off certain components per experiment, e.g. for some experiments we may only be able to show the t-SNE/UMAP panel.
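
As a rough sketch of the two ideas above (a per-experiment expression measure and per-experiment component switches), something like the following could work. All names here (`ExperimentDisplayConfig`, the component names, the example accessions) are hypothetical and do not come from the SCXA code base; the only rule taken from this issue is that an experiment is comparable across experiments when its measure is unscaled CPM.

```python
from dataclasses import dataclass

# Hypothetical component names; the real web app may name these differently.
ALL_COMPONENTS = frozenset(
    {"cell_plots", "marker_genes", "gene_search", "cell_type_wheel"}
)


@dataclass(frozen=True)
class ExperimentDisplayConfig:
    """Display settings for one experiment, assumed to be sourced from the DB."""

    accession: str
    expression_unit: str  # e.g. "CPM" or "TPM"
    scaled: bool = False
    enabled_components: frozenset = ALL_COMPONENTS

    def comparable_across_experiments(self) -> bool:
        # Only unscaled CPM matches what SCXA-processed experiments contain.
        return self.expression_unit.upper() == "CPM" and not self.scaled

    def shows(self, component: str) -> bool:
        # A component is rendered only if it is switched on for this experiment.
        return component in self.enabled_components


# Hypothetical accessions, for illustration only.
atlas_exp = ExperimentDisplayConfig("E-EXAMPLE-1", "CPM")
external_exp = ExperimentDisplayConfig(
    "E-EXAMPLE-2", "TPM", enabled_components=frozenset({"cell_plots"})
)
```

With this shape, the marker gene plot would label its axis with `expression_unit` rather than a hard-coded "CPM", and any cross-experiment view would filter on `comparable_across_experiments()`.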

Linked specific issues:

https://github.com/ebi-gene-expression-group/atlas-web-single-cell/issues/226

Probably this is a better place for this @pinin4fjords @alfonsomunozpomer :

Cellplots
- Drop downs for plot type and parameters (right hand side)
  - We add to the database a flag field for the default and a column for the source (either author or atlas). The source might in the end be distinguished by the accession namespace. (started in https://github.com/ebi-gene-expression-group/atlas-schemas/pull/27)
  - Then the frontend no longer picks the default and we get it from the database.
  - We remove the plot options from the frontend and we depend on the query result.
  - The type should be correctly capitalized in the database so that the frontend doesn't need to deal with that (and that is conserved in the URL built for the subsequent query). (started in https://github.com/ebi-gene-expression-group/db-scxa/pull/60 - this is data production side only)
  - The cellplots drop downs and their interaction with the actual cell plots, all should be covered by stories https://www.pivotaltracker.com/story/show/181712919 and URL generation related stories https://www.pivotaltracker.com/story/show/181984754, https://www.pivotaltracker.com/story/show/181713000, https://www.pivotaltracker.com/story/show/181984913 and https://www.pivotaltracker.com/story/show/181984899 (the hardest one).
- Metadata dropdown
  - We use the existing IDF field EAExpectedClusters to signal either a preferred metadata field or a clustering number for coloring (already read by the web app). If this is not set, the current mechanism of picking the inferred cell type takes precedence and, failing that, the desired cluster from the TSV file is used (this is all currently implemented like that). For the AnnData ingestion, the curators should always pick a preferred metadata field; if they haven't specified a default, loading should fail. The default is currently resolved at the backend when the HTML for an experiment is built: the initial query sets all these values. This happens in ExperimentPageContentService. Because we are forcing a default through the IDF file (for which there is a mechanism implemented) and checking this on loading (atlas-prod part), there is no further need for stories here.
  - The story https://www.pivotaltracker.com/story/show/181984899 also covers this topic.
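
The fallback order for the default coloring described under "Metadata dropdown" can be sketched as below. This is an illustration only, not the logic in ExperimentPageContentService: the function name, the `inferred_cell_type` field name and the error type are all assumptions made for the example.

```python
from typing import Optional, Sequence


class MissingDefaultError(ValueError):
    """Raised when no default coloring can be determined at load time."""


def pick_default_coloring(
    ea_expected_clusters: Optional[str],
    metadata_fields: Sequence[str],
    tsv_selected_k: Optional[int],
    is_external_anndata: bool,
) -> str:
    # 1. A curator-supplied EAExpectedClusters value (metadata field or k) wins.
    if ea_expected_clusters:
        return ea_expected_clusters
    # 2. Externally analysed AnnData experiments must declare a default.
    if is_external_anndata:
        raise MissingDefaultError("EAExpectedClusters is required for AnnData")
    # 3. Otherwise prefer the inferred cell type (hypothetical field name).
    if "inferred_cell_type" in metadata_fields:
        return "inferred_cell_type"
    # 4. Finally fall back to the cluster number marked in the clusters TSV.
    if tsv_selected_k is not None:
        return str(tsv_selected_k)
    raise MissingDefaultError("no default coloring available")
```

Failing early in step 2 is what moves the check to loading time, so the web application never has to guess a default for an external experiment.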
Marker genes
- There could be no marker genes
  - In this case we show the side tab with the text "marker genes unavailable"
  - Covered by https://www.pivotaltracker.com/story/show/181985517
- When we have marker genes
  - The new component handles metadata fields and greys out (we can discuss this) metadata fields that have no marker genes (current behaviour - no extra stories needed here).
  - Units of expression
    - This could go either in the experiments table or the analytics table; the latter would open the possibility of more than one unit per experiment. The transfer of potentially different units would need to be written in the YAML as a units field, and we would only accept one unit per experiment. However, we think that in the short term it is unlikely that we will accept experiments with more than one expression matrix and hence potentially more than one expression unit. This means that the most reasonable path, and the one of least resistance, is to use a single unit per experiment on the experiments table. On the data production side we need to make sure that the schema has the unit on that table and load that data (https://www.pivotaltracker.com/n/projects/2167404/stories/181985803). For the web application this would entail https://www.pivotaltracker.com/story/show/181985809.
    - On the loading web cli, we need to be open to the possibility of a YAML file being present with a unit; if so, add it to the loading of the initial table. However, this is not so relevant now, as we have decided to have the atlas-sc-web-cli extract this from the IDF file. @anjaf or the curators might have reservations about having to input the ExpressionUnit in the IDF. Using the IDF file is convenient because the Atlas web code already reads it but has no awareness of the YAML. This is covered by https://www.pivotaltracker.com/story/show/181985803 prod side.
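
To make the IDF option concrete, here is a minimal sketch of pulling a single unit out of a tab-delimited IDF. It assumes the unit is stored in a row labelled either `ExpressionUnit` or `Comment[ExpressionUnit]` (the exact label is not settled in this issue), and that experiments without the field fall back to CPM; both are assumptions, and this is not the atlas-sc-web-cli implementation.

```python
import csv
import io
from typing import Optional


def read_idf_field(idf_text: str, name: str) -> Optional[str]:
    """Return the first non-empty value of an IDF row, or None if absent."""
    # Accept both the bare label and the Comment[...] form (assumed labels).
    wanted = {name.lower(), f"comment[{name}]".lower()}
    for row in csv.reader(io.StringIO(idf_text), delimiter="\t"):
        if row and row[0].strip().lower() in wanted:
            values = [value.strip() for value in row[1:] if value.strip()]
            return values[0] if values else None
    return None


def expression_unit(idf_text: str, default: str = "CPM") -> str:
    # One unit per experiment; the CPM fallback is an assumption for
    # experiments processed by Atlas, which do not need to declare a unit.
    return read_idf_field(idf_text, "ExpressionUnit") or default
```

The returned value would then be written to the unit column on the experiments table at load time.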
Cell type wheel
- We distinguish between Atlas experiments and externally provided AnnData experiments by accession namespace.
  - We accept external experiments in the wheel as a metadata query result
  - We add to the heatmap only those with CPMs (https://www.pivotaltracker.com/story/show/181986617) and we distinguish Atlas experiments from external experiments with the colored dot that we use in bulk to distinguish proteomics from RNA-Seq (https://www.pivotaltracker.com/story/show/181986522).
  - Initially, we would list the non-CPM experiments below the heatmap, with a note that they match the query (https://www.pivotaltracker.com/story/show/181986724).
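
The wheel behaviour above amounts to a partition of the query hits, roughly as follows. The `WheelHit` type and the `E-EXT-` prefix are hypothetical placeholders; the real external accession namespace is not specified here.

```python
from typing import Iterable, List, NamedTuple, Tuple

# Hypothetical accession prefix for externally analysed experiments.
EXTERNAL_PREFIX = "E-EXT-"


class WheelHit(NamedTuple):
    accession: str
    expression_unit: str


def is_external(accession: str) -> bool:
    # Source (Atlas vs external) is decided by accession namespace alone.
    return accession.startswith(EXTERNAL_PREFIX)


def split_hits(hits: Iterable[WheelHit]) -> Tuple[List[WheelHit], List[WheelHit]]:
    """Split hits into (heatmap rows, experiments listed below the heatmap)."""
    heatmap: List[WheelHit] = []
    listed_below: List[WheelHit] = []
    for hit in hits:
        # Only CPM experiments are comparable enough to share a heatmap.
        if hit.expression_unit.upper() == "CPM":
            heatmap.append(hit)
        else:
            listed_below.append(hit)
    return heatmap, listed_below
```

An external experiment that happens to use CPM still goes into the heatmap; `is_external` only decides which colored dot it gets.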

ebi-gene-expression-group / atlas-web-single-cell

Changes required to accept externally analysed datasets #218
