We have been instructed to accept submissions as AnnData files where gene expression quantifications have been generated by external groups and raw data are not available. We have no control over the processing methods used or the quantification metric represented (CPM/TPM, scaled/unscaled).
This could create a number of problems for us, e.g.:
Expression measures not matching SCXA expectations: the normalisation could differ, for example TPM, which is problematic for comparison between cells. The values could also have undergone various transformations (log, zero-centering, scaling).
Expression matrices not matching markers: unless careful steps are taken to retain the correct matrices during processing, any markers present in an object may be only peripherally related to the expression matrices, and visualisation of expression values for those markers will not produce good results.
...
Solutions
Some things I think we need in order to head off issues in displaying these datasets:
SCXA needs a concept of expression measure.
Just for display purposes, we need to remove the hard-coding of CPM values in the marker gene plots for each experiment and have the unit sourced from the DB.
Unless the expression measure is the same as for SCXA experiments (basically CPM, without scaling), experiments supplied in this way need to be excluded from any place where cross-experiment comparison is envisaged, or things will look very strange indeed.
We need to be able to turn off certain components per experiment, e.g. for some experiments maybe we can only show the t-SNE/UMAP panel.
Then the frontend no longer picks the default and we get it from the database.
We remove the plot options from the frontend and we depend on the query result.
The type should be correctly capitalized in the database so that the frontend doesn't need to deal with that (and that is conserved in the URL built for the subsequent query). (started in https://github.com/ebi-gene-expression-group/db-scxa/pull/60 - this is data production side only)
We use the existing IDF field EAExpectedClusters to signal either a preferred metadata field or a clustering number for coloring (already read by the web app). If this is not set, the current mechanism of picking the inferred cell type takes precedence, and if not, the cluster from the TSV file is used (this is all currently implemented like that). For AnnData ingestion, the curators should always pick a preferred metadata field; if they haven't specified a default, loading should fail at that point. The default is currently resolved at the backend when the HTML for an experiment is built: the initial query sets this value. This happens in ExperimentPageContentService. Because we are forcing a default through the IDF file (for which there is a mechanism implemented) and checking this on loading (atlas-prod part), there is no further need for stories here.
The new component handles metadata fields and greys out (we can discuss this) metadata fields that have no marker genes (current behaviour - no extra stories needed here).
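The default coloring rule described above (EAExpectedClusters first, then inferred cell type, then the cluster from the TSV file, with AnnData loads failing when no default is set) could be sketched roughly as follows. This is only an illustration of the precedence order, not the code in ExperimentPageContentService or atlas-prod: the function name, its parameters, the exception class and the way an "inferred cell type" field is recognised are all assumptions made for the example.

```python
from typing import Optional, Sequence


class MissingDefaultColoringError(ValueError):
    """Hypothetical error: an AnnData experiment has no curator-chosen default."""


def resolve_default_coloring(
    expected_clusters: Optional[str],
    metadata_fields: Sequence[str],
    tsv_default_cluster: Optional[str],
    is_anndata: bool,
) -> str:
    """Return the metadata field or clustering number to color by.

    All parameter names are illustrative assumptions, not real Atlas APIs.
    """
    # 1. A value supplied through the IDF field EAExpectedClusters always wins.
    if expected_clusters:
        return expected_clusters

    # 2. AnnData experiments must carry a curator-chosen default; fail the load.
    if is_anndata:
        raise MissingDefaultColoringError(
            "AnnData experiment has no EAExpectedClusters value in the IDF"
        )

    # 3. Otherwise prefer an inferred cell type field, if one exists.
    #    (Matching on this prefix is an assumption about field naming.)
    for field in metadata_fields:
        if field.replace("_", " ").lower().startswith("inferred cell type"):
            return field

    # 4. Finally fall back to the cluster marked in the clusters TSV file.
    if tsv_default_cluster is not None:
        return tsv_default_cluster

    raise MissingDefaultColoringError("no default coloring could be determined")
```

Raising at step 2 rather than falling through keeps the failure at the loading point, which is where the text above says it should be caught.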
Units of expression
This could go either in the experiments table or in the analytics table, which would open the possibility of more than one unit per experiment. The transfer of potentially different units would need to be written in the YAML as a units field, and we would only accept one unit per experiment. However, we think that in the short term it is unlikely that we will accept experiments with more than one expression matrix, and hence potentially more than one expression unit. This means that the most reasonable path, and the one of least resistance, is to use a single unit per experiment on the experiments table. On the data production side we need to make sure that the schema has the unit on that table and load that data (https://www.pivotaltracker.com/n/projects/2167404/stories/181985803). For the web application this would entail https://www.pivotaltracker.com/story/show/181985809.
On the loading web CLI, we need to be open to the possibility of a YAML file being present with a unit; if so, add it to the loading of the initial table. However, this is not so relevant now, as we have decided to have atlas-sc-web-cli extract this from the IDF file. @anjaf or the curators might have reservations about having to input the ExpressionUnit in the IDF. Using the IDF file is convenient because the Atlas web code already reads it, but has no awareness of the YAML. This is covered by https://www.pivotaltracker.com/story/show/181985803 on the prod side.
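As a rough illustration of taking the unit from the IDF and using it to gate cross-experiment comparison, here is a minimal sketch. It assumes the IDF is tab-delimited and that the unit is stored as a `Comment[ExpressionUnit]` row, in the same style as `Comment[EAExpectedClusters]`; the exact field syntax, the helper names and the set of comparable units are assumptions for the example, not what atlas-sc-web-cli actually does.

```python
import csv
import io
from typing import Optional

# Assumption: only unscaled CPM is comparable with existing SCXA experiments.
COMPARABLE_UNITS = {"CPM"}


def read_idf_comment(idf_text: str, name: str) -> Optional[str]:
    """Return the first value of Comment[name] in a tab-delimited IDF, if any."""
    wanted = f"comment[{name}]".lower()
    for row in csv.reader(io.StringIO(idf_text), delimiter="\t"):
        if len(row) >= 2 and row[0].strip().lower() == wanted:
            value = row[1].strip()
            return value or None
    return None


def allow_cross_experiment_comparison(unit: Optional[str]) -> bool:
    """Experiments with an unknown or non-CPM unit are kept out of comparisons."""
    return unit is not None and unit.upper() in COMPARABLE_UNITS


# Usage with a made-up IDF fragment:
idf = (
    "Investigation Title\tExample external AnnData experiment\n"
    "Comment[ExpressionUnit]\tTPM\n"
    "Comment[EAExpectedClusters]\tcell_type\n"
)
unit = read_idf_comment(idf, "ExpressionUnit")
comparable = allow_cross_experiment_comparison(unit)
```

Treating a missing unit as not comparable is a deliberately conservative choice for the sketch; whether a missing unit should instead fail the load is a separate decision.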
Cell type wheel
We distinguish between Atlas experiments and externally provided AnnData experiments by accession namespace.
We accept external experiments in the wheel as a metadata query result.
Linked specific issues: