DataBiosphere / azul

Metadata indexer and query service used for AnVIL, HCA, LungMAP, and CGP
Apache License 2.0

DCP/2-generated matrices have estimated cell count of 0 #2737

Closed hannes-ucsc closed 3 weeks ago

hannes-ucsc commented 3 years ago

[screenshot]

The cell suspensions in the primary bundles don't have any estimated cell counts:

{
    "describedBy": "https://schema.humancellatlas.org/type/biomaterial/13.3.0/cell_suspension",
    "schema_type": "biomaterial",
    "biomaterial_core": {
        "biomaterial_id": "PP002_suspension",
        "biomaterial_name": "Stim_lung_donor_1_suspension",
        "biomaterial_description": "AntiCD3/AntiCD28-activated cells from lung of Donor 1 after negative selection to enrich CD3+ cells",
        "ncbi_taxon_id": [
            9606
        ]
    },
    "cell_morphology": {
        "percent_cell_viability": 82,
        "cell_viability_method": "Countess II automated cell coutner",
        "cell_viability_result": "pass"
    },
    "genus_species": [
        {
            "text": "Homo Sapiens",
            "ontology": "NCBITaxon:9606",
            "ontology_label": "Homo sapiens"
        }
    ],
    "selected_cell_types": [
        {
            "text": "CD3+ T cells (negative selection)",
            "ontology": "CL:0000084",
            "ontology_label": "T cell"
        }
    ],
    "provenance": {
        "document_id": "1bcd1629-fba5-4dd5-92d3-97f7d896464d",
        "submission_date": "2019-09-13T17:52:11.136Z",
        "update_date": "2019-09-13T17:52:15.206Z"
    }
}

I'm guessing this is expected for 10x experiments. It's the reason the cell counts are shown as 0 in the screenshots above (they should be Unspecified, but that's another bug: https://github.com/DataBiosphere/azul/issues/2271).
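To illustrate the difference between 0 and Unspecified, here is a minimal sketch of how a missing estimate could be kept distinct from an explicit zero. It assumes the optional field on the cell suspension is named `estimated_cell_count`; the two helper functions are hypothetical and are not Azul's actual code.

```python
from typing import Iterable, Optional


def total_estimated_cells(cell_suspensions: Iterable[dict]) -> Optional[int]:
    """Sum the estimates that are present; return None if none are present.

    Assumption: the optional field is called `estimated_cell_count`.
    """
    counts = [
        cs["estimated_cell_count"]
        for cs in cell_suspensions
        if cs.get("estimated_cell_count") is not None
    ]
    # An empty list means "nobody told us", which is not the same as zero cells.
    return sum(counts) if counts else None


def display_cell_count(total: Optional[int]) -> str:
    """Render a missing total as 'Unspecified' instead of '0'."""
    return "Unspecified" if total is None else str(total)
```

With a suspension like the one above, which has no estimate at all, `display_cell_count(total_estimated_cells([suspension]))` yields `Unspecified` rather than `0`.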

I assume that the actual cell count is known after the matrices have been generated. That number should be exposed in the Data Browser somehow.

This issue manifests on the samples and projects tab as well.

melainalegaspi commented 3 years ago

@hannes-ucsc to spike and outline solutions; @theathorn to bring this up to leadership.

theathorn commented 3 years ago

I asked Kylee about calculating the number of cells from the output matrix:

Kylee: I don't think we calculate a metric like this already, but one could pull the size of the matrix and call that the number of cells. Because the matrix comes from pipeline outputs, that would be all cells, whether they are empty or have just 1 UMI. As long as we clearly define the number (maybe even with a tooltip?), that could be fine.

Kishori Konwar: I agree with @kdegs. You will get the total number of cells, but most of them would be filtered out in a downstream analysis. For example, in a 10x sample you would get 200k cells, whereas after filtering you would end up with only 20k.
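A toy sketch of the raw versus filtered distinction described above, assuming a genes x cells layout so that each column is one cell barcode. The `min_umis` threshold is a hypothetical stand-in for whatever filtering a downstream analysis actually applies; it is not the pipeline's method.

```python
from typing import List


def raw_cell_count(matrix: List[List[int]]) -> int:
    """Number of columns, i.e. every barcode, empty or not."""
    return len(matrix[0]) if matrix else 0


def filtered_cell_count(matrix: List[List[int]], min_umis: int) -> int:
    """Number of columns whose total UMI count is at least `min_umis`.

    Assumption: a plain UMI threshold stands in for real filtering.
    """
    n_cells = raw_cell_count(matrix)
    totals = [sum(row[j] for row in matrix) for j in range(n_cells)]
    return sum(1 for total in totals if total >= min_umis)


# Toy matrix: 3 genes (rows) x 4 barcodes (columns).
matrix = [
    [0, 5, 1, 0],
    [0, 7, 0, 0],
    [0, 9, 0, 2],
]
```

Here `raw_cell_count(matrix)` counts all four barcodes, including the empty first column, while `filtered_cell_count(matrix, min_umis=2)` keeps only the two columns with at least 2 UMIs.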

khajoue2 commented 3 years ago

There is a way to get the number of cells from the final loom file after filtering: it is the number of columns. It needs a bit more work to get that number into the metadata and index it for the browser.

hannes-ucsc commented 3 years ago

So there are three cell counts: the estimate (which is absent in 10x), the raw and the filtered counts?

mshadbolt commented 3 years ago

Posting the comment I wrote on Slack here:

The number in the browser is calculated from a field on the cell_suspension entity. Sometimes a contributor does not fill this out and there is no way to infer it from a publication. We interpret this field as the number of cells that go into the machine, rather than the number of cells that come out after filtering/analysis. Perhaps it is worth adding an optional field to the analysis_file schema so that the number of cells can be filled in for each matrix: unfiltered cells for the intermediate matrix and filtered cells for the 'project-level' matrix. It would then be up to the browser to decide which cell count is displayed, while a user would have access to all counts if they interrogate the metadata.

ESapenaVentura commented 3 years ago

> So there are three cell counts: the estimate (which is absent in 10x), the raw and the filtered counts?

Just to clarify: this count is not always absent, but it is a piece of metadata that is pretty hard to get for 10x datasets. Scientists usually give an overall cell count, not one per cell suspension; the only reason we can get it easily for SS2 is that there is 1 cell per cell suspension.

> I asked Kylee about calculating the number of cells from the output matrix:
>
> Kylee: I don't think we calculate a metric like this already, but one could maybe pull the size of the matrix and call that the number of cells, in this case because it's from pipeline outputs it'd be all cells no matter whether they are empty or have even just 1 UMI. As long as we clearly define the number (maybe even with a tooltip?) then that could be fine.
>
> Kishori Konwar: I agree with @KDegs, you will get total no of cells but most of them would be filtered out in a downstream analysis. For example, in a 10x sample you would get 200k cells whereas after filtering you would end up with only 20k

I have to say I do prefer to get the pre-filtered cell count estimate, as that is usually how we specify it at the project level (as @mshadbolt explained above, we try to fill in the number of cells that go into the machine).

kbergin commented 3 years ago

Right right, so the cell suspension cell count estimate is only filled in upon ingestion, not updated by analysis. Not sure if we've thought through how to update metadata based on analysis results, but it's an interesting idea we could pursue in the future.

For now, I agree it's not clear that the cell count at the cell suspension level is helpful. Maybe it'd be best to not include that column in the 'files' tab?

hannes-ucsc commented 3 years ago

@ESapenaVentura

> So there are three cell counts: the estimate (which is absent in 10x), the raw and the filtered counts?
>
> Just to clarify: this count is not always absent, but it is indeed a piece of metadata that is pretty hard to get for 10x datasets (Usually scientists give an overall cell count, not per cell suspension, the only reason we can get it easily for SS2 is that there is 1 cell/cell suspension)

Got it.

@kbergin

> Right right, so the cell suspension cell count estimate is only filled in upon ingestion, not updated by analysis. Not sure if we've thought through how to update metadata based on analysis results, but it's an interesting idea we could pursue in the future.

I think having analysis update the cell suspension in the primary subgraph would be too involved. It would open up a can of worms. It should be sufficient to annotate the analysis_file entity (the one describing the matrix file) with the number of cells in the matrix, like @mshadbolt suggests.
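To make the suggestion concrete, here is a rough sketch of annotating only the document that describes the matrix file and leaving the cell suspension untouched. The field name `matrix_cell_count` and the helper are hypothetical; the actual name and placement would be decided by the schema change.

```python
from typing import Optional


def annotate_matrix_cell_count(analysis_file: dict,
                               cell_count: Optional[int]) -> dict:
    """Return a copy of an analysis_file document with an optional cell count.

    Assumption: `matrix_cell_count` is a made-up field name. The field is
    optional, so it is simply left out when the count is unknown.
    """
    annotated = dict(analysis_file)  # shallow copy; the input is not modified
    if cell_count is not None:
        if cell_count < 0:
            raise ValueError("cell count cannot be negative")
        annotated["matrix_cell_count"] = cell_count
    return annotated


# Illustrative document, not a real one from a bundle.
loom_doc = {"schema_type": "file", "file_core": {"file_name": "example.loom"}}
annotated_doc = annotate_matrix_cell_count(loom_doc, 20000)
```

Nothing in the primary subgraph changes in this sketch; the count lives only on the entity that describes the matrix.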

Additionally, if, at ingest time, the overall number (or estimated number) of cells is known for a given project, but not for individual cell suspensions in that project, we should also consider annotating the project entity with that number.

I'm still not clear about estimated vs raw vs filtered cell counts.

hannes-ucsc commented 3 years ago

@mshadbolt @kbergin I think we should go ahead and annotate analysis_file with the actual cell count. Would you be able to commit the necessary resources from your team to do so?

mshadbolt commented 3 years ago

So for wranglers, in terms of work, this would mean:

  1. Adding an optional field to the analysis_file schema to record cell count
  2. Wranglers annotating this field for CGMs on an ongoing basis? Or do we mostly see this as important for the files DCP generates?

We have also been thinking of adding the 'estimated cells' to the project entity. Papers often say something in the abstract like 'we analysed x thousand cells for our awesome cell atlas', so the use case is for users to get a quick idea of the scale of the study.

The counts on the matrix files, by contrast, are for users who are digging a bit deeper and want to know more precise numbers per file.

I am still not clear on which number we would display at the project level; indeed, I am not really sure what a user is most interested in seeing either.

hannes-ucsc commented 3 years ago

Agreed on 1. We should get that going sooner rather than later, once we agree on whether to annotate the project or not. Both schema changes should occur at the same time, in the same PR, IMO.

2 would be optional for CGMs, but required for DCP/2-generated matrices. @kbergin?

I would support annotating projects. If a project has both per-project and per-cell suspension estimates we'd have to figure out what to display. Otherwise, we'd just display the estimate that's available.
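A sketch of that display rule, for discussion only. The precedence when both estimates exist is undecided here, so preferring the per-project number is purely an assumption, and the function name is hypothetical.

```python
from typing import Iterable, Optional


def cell_count_to_display(project_estimate: Optional[int],
                          suspension_estimates: Iterable[Optional[int]]
                          ) -> Optional[int]:
    """Pick the estimate to show for a project; None means Unspecified."""
    known = [count for count in suspension_estimates if count is not None]
    suspension_total = sum(known) if known else None
    if project_estimate is not None:
        # Assumption: when both exist, the per-project estimate wins.
        # This is the open question, not a settled rule.
        return project_estimate
    # Otherwise show whatever is available, which may be nothing at all.
    return suspension_total
```

Note that the per-suspension total only covers suspensions that have an estimate, so it can understate a project in which some suspensions are annotated and others are not.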

kbergin commented 3 years ago

We won't be able to include this in the first pass of processing; we're really tight on the deadline as it is. We can add it as an update later, or it will be added the next time we reprocess.

bvizzier-ucsc commented 3 weeks ago

@hannes-ucsc: "Project does not exist any more and there are no loom files in DCP42 with a zero cell count. There are only unspecified and non-zero values."