Closed hannes-ucsc closed 3 weeks ago
@hannes-ucsc spike to outline solutions @theathorn to bring up to leadership
I asked Kylee about calculating the number of cells from the output matrix:
Kylee: I don't think we calculate a metric like this already, but one could maybe pull the size of the matrix and call that the number of cells, in this case because it's from pipeline outputs it'd be all cells no matter whether they are empty or have even just 1 UMI. As long as we clearly define the number (maybe even with a tooltip?) then that could be fine.
Kishori Konwar: I agree with @kdegs, you will get total no of cells but most of them would be filtered out in a downstream analysis. For example, in a 10x sample you would get 200k cells whereas after filtering you would end up with only 20k.
There is a way to get the number of cells from the final loom file after the filtering. It is the number of columns. It needs a bit more work to get it to the metadata and index it for the browser.
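The difference between the raw and filtered counts discussed above can be sketched in a few lines. This is a minimal illustration only, assuming a genes x cells layout (cells as columns, as described for the loom file) held in a plain list of lists. The function names and the `min_umis` threshold are hypothetical; real downstream filtering is more involved than a single UMI cutoff.

```python
def raw_cell_count(matrix):
    """Return the number of columns (cells) in a genes x cells matrix."""
    return len(matrix[0]) if matrix else 0


def filtered_cell_count(matrix, min_umis):
    """Count the cells whose total UMI count is at least min_umis.

    min_umis is a hypothetical stand-in for real downstream filtering.
    """
    # zip(*matrix) transposes the rows (genes) into columns (cells).
    totals = [sum(column) for column in zip(*matrix)]
    return sum(1 for total in totals if total >= min_umis)


# Toy matrix: 3 genes (rows) x 5 cells (columns), two of them empty.
matrix = [
    [0, 5, 0, 1, 0],
    [0, 3, 0, 0, 0],
    [0, 7, 0, 0, 2],
]
print(raw_cell_count(matrix))
print(filtered_cell_count(matrix, min_umis=2))
```

The raw count includes empty columns, which is why it can be far larger than the filtered count.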
So there are three cell counts: the estimate (which is absent in 10x), the raw and the filtered counts?
posting the comment I wrote on slack here:
The number in the browser is calculated from a field on the cell_suspension entity. Sometimes this is not filled out by a contributor and there is no way to infer it from a publication. We would interpret this field as the number of cells that go into the machine rather than the number of cells that come out after filtering/analysis. Perhaps it is worth adding an optional field to the analysis_file schema so that the number of cells can be filled in for each matrix: this would be unfiltered cells in the intermediate matrix and filtered cells in the 'project-level' matrix. It would then be a matter for the browser to decide which cell count is displayed, whereas a user would have access to all counts if they interrogate the metadata.
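To make the proposal concrete, here is a rough sketch of what an optional per-matrix count could look like. The field name `matrix_cell_count`, the helper function and the file names are all hypothetical; nothing here reflects an agreed schema change.

```python
def annotate_analysis_file(analysis_file, cell_count=None):
    """Return a copy of an analysis_file record with an optional cell count.

    "matrix_cell_count" is a hypothetical field name used for illustration.
    """
    annotated = dict(analysis_file)
    if cell_count is None:
        # The field is optional, so a missing count leaves the record as-is.
        return annotated
    is_int = isinstance(cell_count, int) and not isinstance(cell_count, bool)
    if not is_int or cell_count < 0:
        raise ValueError("cell count must be a non-negative integer")
    annotated["matrix_cell_count"] = cell_count
    return annotated


# Hypothetical records: an intermediate (unfiltered) and a project-level
# (filtered) matrix, using the 10x figures mentioned earlier in the thread.
intermediate = annotate_analysis_file({"file_name": "sample.loom"}, 200000)
project_level = annotate_analysis_file({"file_name": "project.loom"}, 20000)
unannotated = annotate_analysis_file({"file_name": "contributed.loom"})
print(intermediate, project_level, unannotated)
```

Keeping the field optional means matrices without a known count stay valid, and the browser can decide which of the available counts to show.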
> So there are three cell counts: the estimate (which is absent in 10x), the raw and the filtered counts?
Just to clarify: this count is not always absent, but it is indeed a piece of metadata that is pretty hard to get for 10x datasets (Usually scientists give an overall cell count, not per cell suspension, the only reason we can get it easily for SS2 is that there is 1 cell/cell suspension)
> I asked Kylee about calculating the number of cells from the output matrix:
> Kylee: I don't think we calculate a metric like this already, but one could maybe pull the size of the matrix and call that the number of cells, in this case because it's from pipeline outputs it'd be all cells no matter whether they are empty or have even just 1 UMI. As long as we clearly define the number (maybe even with a tooltip?) then that could be fine.
> Kishori Konwar: I agree with @KDegs, you will get total no of cells but most of them would be filtered out in a downstream analysis. For example, in a 10x sample you would get 200k cells whereas after filtering you would end up with only 20k
I have to say I do prefer to get the pre-filtered cell count estimate, as that is usually how we specify it at the project level (as @mshadbolt explained above, we try to fill the number of cells that goes into the machine)
Right right, so the cell suspension cell count estimate is only filled in upon ingestion, not updated by analysis. Not sure if we've thought through how to update metadata based on analysis results, but it's an interesting idea we could pursue in the future.
For now, I agree it's not clear that the cell count at the cell suspension level is helpful, maybe it'd be best to not include that column in the 'files' tab?
@ESapenaVentura
> So there are three cell counts: the estimate (which is absent in 10x), the raw and the filtered counts?

> Just to clarify: this count is not always absent, but it is indeed a piece of metadata that is pretty hard to get for 10x datasets (Usually scientists give an overall cell count, not per cell suspension, the only reason we can get it easily for SS2 is that there is 1 cell/cell suspension)
Got it.
@kbergin
> Right right, so the cell suspension cell count estimate is only filled in upon ingestion, not updated by analysis. Not sure if we've thought through how to update metadata based on analysis results, but it's an interesting idea we could pursue in the future.
I think having analysis update the cell suspension in the primary subgraph would be too involved. It would open up a can of worms. It should be sufficient to annotate the analysis_file entity (the one describing the matrix file) with the number of cells in the matrix, like @mshadbolt suggests.
Additionally, if, at ingest time, the overall number (or estimated number) of cells is known for a given project, but not for individual cell suspensions in that project, we should also consider annotating the project entity with that number.
I'm still not clear about estimated vs raw vs filtered cell counts.
@mshadbolt @kbergin I think we should go ahead and annotate analysis_file with the actual cell count. Would you be able to commit the necessary resources from your team to do so?
so for wranglers in terms of work this would mean:
analysis_file schema to record cell count
We have also been thinking of adding the 'estimated cells' to the project entity. Often papers say something in the abstract like 'we analysed x thousand cells for our awesome cell atlas', or whatever, so the use case for this is for users to get a quick idea of the scale of the study.
The counts on the matrix files, by contrast, are for users who are digging a bit deeper and want to know more precise numbers per file.
I am still not clear on which number we would display at the project level. Indeed, I am not really sure what a user is most interested in seeing either.
Agreed on 1. We should get that going sooner rather than later, once we agree on whether to annotate the project or not. Both schema changes should occur at the same time, in the same PR, IMO.
2 would be optional for CGMs, but required for DCP/2-generated matrices. @kbergin?
I would support annotating projects. If a project has both per-project and per-cell suspension estimates we'd have to figure out what to display. Otherwise, we'd just display the estimate that's available.
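The fallback rule described here could look roughly like the sketch below. The function name and the `prefer` parameter are hypothetical; `prefer` only marks the spot where the still-open decision (what to show when both estimates exist) would plug in, and it does not settle that question.

```python
def estimate_to_display(project_estimate, suspension_total, prefer="project"):
    """Pick the cell count estimate to display for a project.

    Either argument may be None when that estimate is not available.
    `prefer` is a hypothetical knob for the undecided both-present case.
    """
    if prefer not in ("project", "suspension"):
        raise ValueError("prefer must be 'project' or 'suspension'")
    if project_estimate is not None and suspension_total is not None:
        # Both estimates exist: defer to the (yet to be agreed) preference.
        return project_estimate if prefer == "project" else suspension_total
    if project_estimate is not None:
        return project_estimate
    # Only the per-suspension total remains, or nothing at all (None).
    return suspension_total


print(estimate_to_display(50000, None))
print(estimate_to_display(None, 384))
print(estimate_to_display(50000, 48000, prefer="suspension"))
```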
We won't be able to include this in this first pass of processing; we're really tight on the deadline as it is. We can add it as an update later, or it will be added the next time we reprocess.
@hannes-ucsc: "Project does not exist any more and there are no loom files in DCP42 with a zero cell count. There are only unspecified and non-zero values."
The cell suspensions in the primary bundles don't have any estimated cell counts:
I'm guessing this is expected for 10x experiments. It's the reason the cell counts are shown as 0 in the above screenshots (they should be Unspecified, but that's another bug: https://github.com/DataBiosphere/azul/issues/2271).
I assume that the actual cell count is known after the matrices have been generated. That number should be exposed in the Data Browser somehow.
This issue manifests on the samples and projects tab as well.
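As a small illustration of the 0 versus Unspecified distinction, the sketch below keeps "no estimate" separate from a real zero when summing per-cell-suspension estimates. The function names are hypothetical and this is not how the Data Browser or Azul is implemented. Summing only the known values when some estimates are missing is also an assumption.

```python
def total_estimated_cells(estimates):
    """Sum per-cell-suspension estimates, or return None if none are known."""
    known = [estimate for estimate in estimates if estimate is not None]
    # An empty list of known values means "no information", not zero cells.
    return sum(known) if known else None


def format_cell_count(total):
    """Render a total for display, using 'Unspecified' for a missing value."""
    return "Unspecified" if total is None else str(total)


# A 10x-style project whose cell suspensions carry no estimates at all.
print(format_cell_count(total_estimated_cells([None, None, None])))
# An SS2-style project with one cell per cell suspension.
print(format_cell_count(total_estimated_cells([1, 1, 1])))
```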