Re-visit implementation decisions around `Cell` object

bussec commented 8 months ago

A discussion around the necessity and the architecture of the Cell object and its associated object recently arose in #705. Indeed the current relations and links between Cell, CellExpression, CellReactivity and other objects (incl. Receptor and Rearrangement) appear overly complicated. Various solutions have been proposed, including:

Cell should become a container for other Cell* objects, which would then not exists as independent top-level objects anymore
Cell should be completely removed and cells would only be represented in a data set via a cell_id in various other objects, but not as an object on its own.

scharch commented 8 months ago

Under option 2, does the ADC /cell endpoint go away? If so, how would I find all B Cells expressing, say, CD11c?

javh commented 8 months ago

Some more options:

Cell becomes not a container, but a cell metadata object at the same level as CellExpression and CellReactivity with records that are keyed on cell_id in the same way. (Pretty similar to option 2, because Cell wouldn't exist in the same form.)
We keep both structures that currently seem to exist as alternative ways to represent single cell data. Ie, users can either (a) use Cell as a container that includes arrays of CellExpression/CellReactivity (and thus exclude cell_id, repertoire_id, etc from CellExpression/CellReactivity) or (b) use CellExpression/CellReactivity as top-level objects that include cell_id, repertoire_id, etc, but then any refs to CellExpression/CellReactivity/Rearrangement in Cell should not be used.

Option 3 would appear to accomplish want @scharch is asking for.

Option 4 seems super messy to me, but perhaps the state of different implementations already.

scharch commented 8 months ago

I think that I prefer option 1, though it does conflict with our normal way of coding 1:N relationships. Option 2 doesn't seem realistic to me, both because of the endpoint question and because I can imagine properties of Cells that aren't encoded elsewhere (eg #769 or whether a cell is B or T). Option 3 seems like the most likely solution, and I think is basically what we already have? I don't think I understand Option 4...

bcorrie commented 8 months ago

This looks suspiciously like this discussion

https://github.com/airr-community/airr-standards/issues/409#issuecomment-874562821

Are you sure you want to re-open this can of worms... 8-)

bcorrie commented 8 months ago

Well, I took the time to try and dig out some history.

A diagram (https://github.com/airr-community/airr-standards/files/6768307/cell-object-representation-and-linkage.pdf) with something that is close to what we have now, originally with expression data in expression_tabular contained in the Cell but ending up landing on Option 2 in the PDF. https://github.com/airr-community/airr-standards/issues/409#issuecomment-874562821
Discussion about removing the expression data (expression_tabular) from the Cell object and have an external object. https://github.com/airr-community/airr-standards/issues/409#issuecomment-893789702
Pull request and discussion around the implementation of CellExpression. https://github.com/airr-community/airr-standards/pull/574
Notes on final discussion at the 2022-04 Standards call: https://github.com/airr-community/airr-standards/pull/574#pullrequestreview-944700799

bcorrie commented 8 months ago

@javh says:

Cell should become a container for other Cell* objects, which would then not exists as independent top-level objects anymore

@scharch says:

I think that I prefer option 1, though it does conflict with our normal way of coding 1:N relationships.

I need to understand what you mean by "container".

Is this suggesting having all of the expression data stored as part of the object as in:

With the addition of an array of CellReactivity obects to Cell as well?

There would be no external CellExpression or CellReactivity objects, because they are all stored within the Cell "container"?

Or do you mean that the Cell object has fields for an array of expression_id and reactivity_id that are related to the Cell?

{
"cell_id":"1",
"rearrangement_ids":[],
"expression_ids":[],
"reactivity_ids":[]
...
}

We did discuss all of these options extensively (see above), so it is unclear to me what the issue is with the current solution.

scharch commented 8 months ago

Or do you mean that the Cell object has fields for an array of expression_id and reactivity_id that are related to the Cell?

This one, to my understanding

javh commented 8 months ago

I don't think I understand Option 4...

I think Option 4 is closest to what we have now, hence the confusion.

javh commented 8 months ago

Or do you mean that the Cell object has fields for an array of expression_id and reactivity_id that are related to the Cell?

This would be option 1. Which would also imply that CellExpression and CellReactivity would not include cell_id and repertoire_id as those are present in the parent object Cell.

bcorrie commented 8 months ago

Indeed the current relations and links between Cell, CellExpression, CellReactivity and other objects (incl. Receptor and Rearrangement) appear overly complicated.

I really don't see this as complicated at all. In fact it is about as simple as it can get.

If you identify a Cell in a study, you have a Cell object with a cell_id.
If you identify a Rearrangement that is associated with a Cell, the rearrangement record points to that Cell using the cell_id in that rearrangement record.
If you identify a expressed gene (CellExpression) that is associated with a Cell, the cell expression record points to that Cell using the cell_id in the cell expression record.
If you identify a cell reactivity (CellReactivity) that is associated withe a Cell, the cell reactivity record points to that Cell using the cell_id in the cell reactivity record.

javh commented 8 months ago

I really don't see this as complicated at all. In fact it is about as simple as it can get.

What you've described would be option 3, which would involve paring down Cell (eg, removing rearrangements).

bcorrie commented 8 months ago

Or do you mean that the Cell object has fields for an array of expression_id and reactivity_id that are related to the Cell?

This would be option 1. Which would also imply that CellExpression and CellReactivity would not include cell_id and repertoire_id as those are present in the parent object Cell.

You realize that this what a Cell will look like right - 1361 IDs to expression records for this one Cell.

This is what you get when you have an N:1 relationship with large "N" store in the "1"

{
    "cell_id": "641e1abd42e3dd9c961f2d8b",
    "rearrangements": [array of SMALL number of rearrangement],
    "receptors": [array of SMALL  number of receptors],
    "reactivity_ids":[array of UNKNOWN number of reactivity measurements],
    "expression_ids":
    [
    #### 1300 EXPRESSION IDS DELETED
    #### This Cell has 1361 measured expression values
    #### I enclose 61 to be representative 
    #### I realize that the "expression_id" key would normally not be there.
    "expression_id": "6412758e215fcb89f8e26ac4"
    "expression_id": "6412758e215fcb89f8e26ac5"
    "expression_id": "6412758e215fcb89f8e26ac6"
    "expression_id": "6412758e215fcb89f8e26ac7"
    "expression_id": "6412758e215fcb89f8e26ac8"
    "expression_id": "6412758e215fcb89f8e26ac9"
    "expression_id": "6412758e215fcb89f8e26aca"
    "expression_id": "6412758e215fcb89f8e26acb"
    "expression_id": "6412758e215fcb89f8e26acc"
    "expression_id": "6412758e215fcb89f8e26acd"
    "expression_id": "6412758e215fcb89f8e26ace"
    "expression_id": "6412758e215fcb89f8e26acf"
    "expression_id": "6412758e215fcb89f8e26ad0"
    "expression_id": "6412758e215fcb89f8e26ad1"
    "expression_id": "6412758e215fcb89f8e26ad2"
    "expression_id": "6412758e215fcb89f8e26ad3"
    "expression_id": "6412758e215fcb89f8e26ad4"
    "expression_id": "6412758e215fcb89f8e26ad5"
    "expression_id": "6412758e215fcb89f8e26ad6"
    "expression_id": "6412758e215fcb89f8e26ad7"
    "expression_id": "6412758e215fcb89f8e26ad8"
    "expression_id": "6412758e215fcb89f8e26ad9"
    "expression_id": "6412758e215fcb89f8e26ada"
    "expression_id": "6412758e215fcb89f8e26adb"
    "expression_id": "6412758e215fcb89f8e26adc"
    "expression_id": "6412758e215fcb89f8e26add"
    "expression_id": "6412758e215fcb89f8e26ade"
    "expression_id": "6412758e215fcb89f8e26adf"
    "expression_id": "6412758e215fcb89f8e26ae0"
    "expression_id": "6412758e215fcb89f8e26ae1"
    "expression_id": "6412758e215fcb89f8e26ae2"
    "expression_id": "6412758e215fcb89f8e26ae3"
    "expression_id": "6412758e215fcb89f8e26ae4"
    "expression_id": "6412758e215fcb89f8e26ae5"
    "expression_id": "6412758e215fcb89f8e26ae6"
    "expression_id": "6412758e215fcb89f8e26ae7"
    "expression_id": "6412758e215fcb89f8e26ae8"
    "expression_id": "6412758e215fcb89f8e26ae9"
    "expression_id": "6412758e215fcb89f8e26aea"
    "expression_id": "6412758e215fcb89f8e26aeb"
    "expression_id": "6412758e215fcb89f8e26aec"
    "expression_id": "6412758e215fcb89f8e26aed"
    "expression_id": "6412758e215fcb89f8e26aee"
    "expression_id": "6412758e215fcb89f8e26aef"
    "expression_id": "6412758e215fcb89f8e26af0"
    "expression_id": "6412758e215fcb89f8e26af1"
    "expression_id": "6412758e215fcb89f8e26af2"
    "expression_id": "6412758e215fcb89f8e26af3"
    "expression_id": "6412758e215fcb89f8e26af4"
    "expression_id": "6412758e215fcb89f8e26af5"
    "expression_id": "6412758e215fcb89f8e26af6"
    "expression_id": "6412758e215fcb89f8e26af7"
    "expression_id": "6412758e215fcb89f8e26af8"
    "expression_id": "6412758e215fcb89f8e26af9"
    "expression_id": "6412758e215fcb89f8e26afa"
    "expression_id": "6412758e215fcb89f8e26afb"
    "expression_id": "6412758e215fcb89f8e26afc"
    "expression_id": "6412758e215fcb89f8e26afd"
    "expression_id": "6412758e215fcb89f8e26afe"
    "expression_id": "6412758e215fcb89f8e26aff"

    ],
    "repertoire_id": "635af09e3891ed552fe4dca1",
    "data_processing_id": "635af09e3891ed552fe4dca1",
    "expression_study_method": "single-cell transcriptome",
    "expression_raw_doi": null,
    "expression_index": null,
    "virtual_pairing": false,
}

scharch commented 8 months ago

I think Option 4 is closest to what we have now, hence the confusion.

Really??

users can either (a) use Cell as a container that includes arrays of CellExpression/CellReactivity

This is not currently the case, as I read the schema

Rearrangement is the exception, with bi-directional links. I will have to dig further to recall why we left that that way.

bcorrie commented 8 months ago

I really don't see this as complicated at all. In fact it is about as simple as it can get.

What you've described would be option 3, which would involve paring down Cell (eg, removing rearrangements).

I don't object to leaving the rearrangement and receptor arrays. They are small N. I do object to storing an N:1 large N relationship in the 1

scharch commented 8 months ago

This is what you get when you have an N:1 relationship with large "N" store in the "1"

@bcorrie what study is this? That seems to me like using CellExpression incorrectly when expression_raw_doi should be used instead to link out to an external resource like GEO.

bcorrie commented 8 months ago

users can either (a) use Cell as a container that includes arrays of CellExpression/CellReactivity

This is not currently the case, as I read the schema

Rearrangement is the exception, with bi-directional links. I will have to dig further to recall why we left that that way.

Yes, the current spec does not have the ability to store an array of Expression or Reactivity objects in Cell (no bi-directional links). That is because they are large N.

Rearrangement and Receptor were kept in Cell for convenience as I recall, because they are small N.

I don't object to removing the Rearrangement and Receptor arrays in Cell to be consistent.

javh commented 8 months ago

This is not currently the case, as I read the schema

Rearrangement is the exception, with bi-directional links. I will have to dig further to recall why we left that that way.

Yeah, not in terms of exactly which fields are present. I meant in terms of the logic. I don't see the schema as currently aligned with any option fully, hence the thread...

I think the easiest thing to do (and my personal preference) is to remove the bi-directional fields from Cell and go with option 3. Which seems to be were we landed in #409, but the bi-directional fields were left hanging.

bcorrie commented 8 months ago

This is what you get when you have an N:1 relationship with large "N" store in the "1"

@bcorrie what study is this? That seems to me like using CellExpression incorrectly when expression_raw_doi should be used instead to link out to an external resource like GEO.

The Cell object has the ability to point to an external GEO record:

https://github.com/airr-community/airr-standards/blob/372604733fabdd51cf4e3fa270bffc8ffd991a38/specs/airr-schema.yaml#L4565

The CellExpression object captures the actual property measurement:

https://github.com/airr-community/airr-standards/blob/372604733fabdd51cf4e3fa270bffc8ffd991a38/specs/airr-schema.yaml#L4647

        property_type:
            type: string
            description: >
                Keyword describing the property type and detection method used to measure the property value.
                The following keywords are recommended, but custom property types are also valid:
                "mrna_expression_by_read_count",
                "protein_expression_by_fluorescence_intensity", "antigen_bait_binding_by_fluorescence_intensity",
                "protein_expression_by_dna_barcode_count" and "antigen_bait_binding_by_dna_barcode_count".
            title: Property type and detection method
            x-airr:
                miairr: defined
                nullable: false
                adc-query-support: true
                name: Property type and detection method
        property:
            $ref: '#/Ontology'
            title: Property information
            description: >
                Name of the property observed, typically a gene or antibody identifier (and label) from a 
                canonical resource such as Ensembl (e.g. ENSG00000275747, IGHV3-79) or 
                Antibody Registry (ABREG:1236456, Purified anti-mouse/rat/human CD27 antibody).
            example:
                id: ENSG:ENSG00000275747
                label: IGHV3-79
            x-airr:
                miairr: defined
                adc-query-support: true
                format: ontology
                name: Property information
        value:
            type: number
            description: Level at which the property was observed in the experiment (non-normalized).
            title: Property value
            example: 3
            x-airr:
                miairr: defined
                nullable: true
                adc-query-support: true
                name: Property value

So you have the opportunity to do both. CellExpression is the measured genes and their expression levels from the study.

Your average Cell is going to have many 100s of these. So your array of expression_ids is going to be very large.

bcorrie commented 8 months ago

We store these in the repository (for a specific cell):

$ curl -d '{"filters":{"op":"=","content":{"field":"cell_id","value":"641e1abd42e3dd9c961f2d8b"}}}' https://covid19-1.ireceptor.org/airr/v1/expression | egrep "expression_id" | wc -l
1361

Each record looks like this:

$ curl -d '{"size":1}' https://covid19-1.ireceptor.org/airr/v1/expression
{"Info":{ 
[INFO STUFF DELETED]
}, "CellExpression":[
{
    "expression_id": "6412758e215fcb89f8e26ac3",
    "cell_id": "641e1abd42e3dd9c961f2d8b",
    "repertoire_id": "PRJCA002413-ERS1-IG-CELL",
    "data_processing_id": "PRJCA002413-ERS1",
    "property": {
        "label": "AURKAIP1",
        "id": "ENSG:ENSG00000175756"
    },
    "value": 1,
    "adc_annotation_cell_id": "AAACCTGCACCGATAT-1",
    "ir_annotation_set_metadata_id_expression": "635af09e3891ed552fe4dca1",
    "sample_processing_id": "PRJCA002413-ERS1-IG-CELL",
    "ir_created_at_expression": "2023-03-16T01:49:02.180026+00:00",
    "ir_updated_at_expression": "2023-03-16T01:49:02.180026+00:00"
}]}

scharch commented 8 months ago

I don't run a repository, so I guess it's no sweat of my back, but that's a turrrrible way to store RNAseq data and not at all what I thought CellExpression was for.

Beyond that, seems like we have consensus on Option 3, plus/minus bi-directional Cell-Rearrangement links...

javh commented 8 months ago

I'd like to hear @bussec weigh in. The central problem here is that we're not all operating on the same set of assumptions of what the model should be. Once we all get on the same page w.r.t. assumptions, then it should be easy to figure out what to do.

bcorrie commented 8 months ago

This is not currently the case, as I read the schema Rearrangement is the exception, with bi-directional links. I will have to dig further to recall why we left that that way.

Yeah, not in terms of exactly which fields are present. I meant in terms of the logic. I don't see the schema as currently aligned with any option fully, hence the thread...

I think the easiest thing to do (and my personal preference) is to remove the bi-directional fields from Cell and go with option 3. Which seems to be were we landed in #409, but the bi-directional fields were left hanging.

I don't object to this, we actually don't store them to avoid consistency problems:

curl -d '{"size":1}' https://covid19-1.ireceptor.org/airr/v1/cell

{
    "cell_id": "641e1abd42e3dd9c961f2d8b",
    "rearrangements": null,
    "receptors": null,
    "repertoire_id": "PRJCA002413-ERS1-IG-CELL",
    "data_processing_id": "PRJCA002413-ERS1",
    "expression_study_method": "single-cell transcriptome",
    "expression_raw_doi": null,
    "expression_index": null,
    "virtual_pairing": false,
    "ir_annotation_set_metadata_id_cell": "635af09e3891ed552fe4dca1",
    "adc_annotation_cell_id": "AAACCTGCACCGATAT-1",
    "sample_processing_id": "PRJCA002413-ERS1-IG-CELL",
    "ir_created_at_cell": "2023-03-24T21:48:45.792352+00:00",
    "ir_updated_at_cell": "2023-03-24T21:48:45.792352+00:00"
}

bcorrie commented 8 months ago

I don't run a repository, so I guess it's no sweat of my back, but that's a turrrrible way to store RNAseq data and not at all what I thought CellExpression was for.

The data isn't necessarily stored in that way in the repository, that is how the /expression endpoint gives it to you if you ask for a JSON response (which is the default). We store it as a big flat mongo document that is very fast to query if indexed correctly (which we are pretty good at). Maybe we run into challenges when we have 5B expression values, but for now it is not a challenge at all...

Although we have none of these at the moment, the API could trivially produce TSV files, matrix files, h5ad, etc. We discussed that here: https://github.com/airr-community/airr-standards/issues/409

The data is there, the data is queryable, but we don't have great use cases, and our discussions with 10X suggested that implementing any specific h5ad type of file would be a pain to support and by definition niche. Maybe that is changing but that is where we are at today.

javh commented 8 months ago

I don't object to this, we actually don't store them to avoid consistency problems:

Perfect illustration of the problem. Nulling the field when you could populate the data implies the schema is flawed. You're probably making the correct choice, and it is technically legal to null those required fields, but it's also bypassing the model implied by the Cell schema because the schema isn't working for you.

bcorrie commented 8 months ago

I don't object to this, we actually don't store them to avoid consistency problems:

Perfect illustration of the problem. Nulling the field when you could populate the data implies the schema is flawed.

Yes, I don't necessarily disagree.

You're probably making the correct choice, and it is technically legal to null those required fields, but it's also bypassing the model implied by the Cell schema because the schema isn't working for you.

The schema is working for us just fine (see screenshot below), and as you say we are following the schema as it is specified. We just don't use that feature.

Almost all of the issues we raise are things that we run into when we are trying to use the specification "in real life". But we have to pick our battles as to which issues we raise. Having a field that someone at some point expressed a desire to have that we don't need isn't going to be a battle that I pick to debate over. We have enough issues as it is and we raise issues that come up when things don't work for us 8-)

I can't say that the use case for rearrangements and receptors arrays isn't important for something. We added it at some point for a reason. We decided (maybe not intentionally) not to remove it when we removed expression_tabular. So for me it isn't a burning issue, and I can see why it might be useful to have. If we decide to remove it that is fine too. I don't necessarily mind the ability to have that two way relationship if it is desired, but I can see it can be confusing.

bcorrie commented 8 months ago

The data is there, the data is queryable, but we don't have great use cases, and our discussions with 10X suggested that implementing any specific h5ad type of file would be a pain to support and by definition niche. Maybe that is changing but that is where we are at today.

With scirpy's support of the AIRR TSV format, it is likely that this might be the best way to return GEX data:

https://scirpy.scverse.org/en/latest/glossary.html#term-AIRR

But the API can't do that yet. We actually have a AIRR JSON Expression to h5ad converter (yes its scary and slow and hacky) 8-)

https://github.com/sfu-ireceptor/gateway/blob/master/public/gateway_utilities/gateway-airr-to-h5ad.py

So if you download Cell/Expression data from the Gateway, you can use this to convert the data to h5ad files and then use those as input to tools like Conga and CellTypist (these are Apps we have built into the Gateway).

javh commented 8 months ago

Here's some diagrams with what I think the implications are for options 1 and 3. For fields, minus = remove and plus = add.

scharch commented 8 months ago

I still don't understand how option 3 is different from what we currently have....

javh commented 8 months ago

I still don't understand how option 3 is different from what we currently have....

Mostly because we currently have bi-directional links in Cell. Though, I think there are assumptions of hierarchy that get us stuck in circles. If it's just me who is confused, well, then, that's super easy to fix (I hope!).

javh commented 8 months ago

The +/- fields are what would change. I don't think I made that clear initially.

airr-community / airr-standards

Re-visit implementation decisions around `Cell` object #768