Closed bussec closed 7 months ago
Under option 2, does the ADC /cell
endpoint go away? If so, how would I find all B Cell
s expressing, say, CD11c?
Some more options:
Cell
becomes not a container, but a cell metadata object at the same level as CellExpression
and CellReactivity
with records that are keyed on cell_id
in the same way. (Pretty similar to option 2, because Cell
wouldn't exist in the same form.)Cell
as a container that includes arrays of CellExpression
/CellReactivity
(and thus exclude cell_id
, repertoire_id
, etc from CellExpression
/CellReactivity
) or (b) use CellExpression
/CellReactivity
as top-level objects that include cell_id
, repertoire_id
, etc, but then any refs to CellExpression
/CellReactivity
/Rearrangement
in Cell
should not be used.Option 3 would appear to accomplish want @scharch is asking for.
Option 4 seems super messy to me, but perhaps the state of different implementations already.
I think that I prefer option 1, though it does conflict with our normal way of coding 1:N relationships.
Option 2 doesn't seem realistic to me, both because of the endpoint question and because I can imagine properties of Cell
s that aren't encoded elsewhere (eg #769 or whether a cell is B or T).
Option 3 seems like the most likely solution, and I think is basically what we already have?
I don't think I understand Option 4...
This looks suspiciously like this discussion
https://github.com/airr-community/airr-standards/issues/409#issuecomment-874562821
Are you sure you want to re-open this can of worms... 8-)
Well, I took the time to try and dig out some history.
expression_tabular
contained in the Cell but ending up landing on Option 2 in the PDF. https://github.com/airr-community/airr-standards/issues/409#issuecomment-874562821expression_tabular
) from the Cell object and have an external object. https://github.com/airr-community/airr-standards/issues/409#issuecomment-893789702@javh says:
Cell
should become a container for otherCell*
objects, which would then not exists as independent top-level objects anymore
@scharch says:
I think that I prefer option 1, though it does conflict with our normal way of coding 1:N relationships.
I need to understand what you mean by "container".
Is this suggesting having all of the expression data stored as part of the object as in:
With the addition of an array of CellReactivity
obects to Cell
as well?
There would be no external CellExpression
or CellReactivity
objects, because they are all stored within the Cell
"container"?
Or do you mean that the Cell
object has fields for an array of expression_id
and reactivity_id
that are related to the Cell
?
{
"cell_id":"1",
"rearrangement_ids":[],
"expression_ids":[],
"reactivity_ids":[]
...
}
We did discuss all of these options extensively (see above), so it is unclear to me what the issue is with the current solution.
Or do you mean that the
Cell
object has fields for an array ofexpression_id
andreactivity_id
that are related to theCell
?
This one, to my understanding
I don't think I understand Option 4...
I think Option 4 is closest to what we have now, hence the confusion.
Or do you mean that the
Cell
object has fields for an array ofexpression_id
andreactivity_id
that are related to theCell
?
This would be option 1. Which would also imply that CellExpression
and CellReactivity
would not include cell_id
and repertoire_id
as those are present in the parent object Cell
.
Indeed the current relations and links between
Cell
,CellExpression
,CellReactivity
and other objects (incl.Receptor
andRearrangement
) appear overly complicated.
I really don't see this as complicated at all. In fact it is about as simple as it can get.
Cell
in a study, you have a Cell
object with a cell_id
.Rearrangement
that is associated with a Cell
, the rearrangement record points to that Cell
using the cell_id
in that rearrangement record.CellExpression
) that is associated with a Cell
, the cell expression record points to that Cell
using the cell_id
in the cell expression record.CellReactivity
) that is associated withe a Cell
, the cell reactivity record points to that Cell
using the cell_id
in the cell reactivity record.I really don't see this as complicated at all. In fact it is about as simple as it can get.
What you've described would be option 3, which would involve paring down Cell
(eg, removing rearrangements
).
Or do you mean that the
Cell
object has fields for an array ofexpression_id
andreactivity_id
that are related to theCell
?This would be option 1. Which would also imply that
CellExpression
andCellReactivity
would not includecell_id
andrepertoire_id
as those are present in the parent objectCell
.
You realize that this what a Cell will look like right - 1361 IDs to expression records for this one Cell.
This is what you get when you have an N:1 relationship with large "N" store in the "1"
{
"cell_id": "641e1abd42e3dd9c961f2d8b",
"rearrangements": [array of SMALL number of rearrangement],
"receptors": [array of SMALL number of receptors],
"reactivity_ids":[array of UNKNOWN number of reactivity measurements],
"expression_ids":
[
#### 1300 EXPRESSION IDS DELETED
#### This Cell has 1361 measured expression values
#### I enclose 61 to be representative
#### I realize that the "expression_id" key would normally not be there.
"expression_id": "6412758e215fcb89f8e26ac4"
"expression_id": "6412758e215fcb89f8e26ac5"
"expression_id": "6412758e215fcb89f8e26ac6"
"expression_id": "6412758e215fcb89f8e26ac7"
"expression_id": "6412758e215fcb89f8e26ac8"
"expression_id": "6412758e215fcb89f8e26ac9"
"expression_id": "6412758e215fcb89f8e26aca"
"expression_id": "6412758e215fcb89f8e26acb"
"expression_id": "6412758e215fcb89f8e26acc"
"expression_id": "6412758e215fcb89f8e26acd"
"expression_id": "6412758e215fcb89f8e26ace"
"expression_id": "6412758e215fcb89f8e26acf"
"expression_id": "6412758e215fcb89f8e26ad0"
"expression_id": "6412758e215fcb89f8e26ad1"
"expression_id": "6412758e215fcb89f8e26ad2"
"expression_id": "6412758e215fcb89f8e26ad3"
"expression_id": "6412758e215fcb89f8e26ad4"
"expression_id": "6412758e215fcb89f8e26ad5"
"expression_id": "6412758e215fcb89f8e26ad6"
"expression_id": "6412758e215fcb89f8e26ad7"
"expression_id": "6412758e215fcb89f8e26ad8"
"expression_id": "6412758e215fcb89f8e26ad9"
"expression_id": "6412758e215fcb89f8e26ada"
"expression_id": "6412758e215fcb89f8e26adb"
"expression_id": "6412758e215fcb89f8e26adc"
"expression_id": "6412758e215fcb89f8e26add"
"expression_id": "6412758e215fcb89f8e26ade"
"expression_id": "6412758e215fcb89f8e26adf"
"expression_id": "6412758e215fcb89f8e26ae0"
"expression_id": "6412758e215fcb89f8e26ae1"
"expression_id": "6412758e215fcb89f8e26ae2"
"expression_id": "6412758e215fcb89f8e26ae3"
"expression_id": "6412758e215fcb89f8e26ae4"
"expression_id": "6412758e215fcb89f8e26ae5"
"expression_id": "6412758e215fcb89f8e26ae6"
"expression_id": "6412758e215fcb89f8e26ae7"
"expression_id": "6412758e215fcb89f8e26ae8"
"expression_id": "6412758e215fcb89f8e26ae9"
"expression_id": "6412758e215fcb89f8e26aea"
"expression_id": "6412758e215fcb89f8e26aeb"
"expression_id": "6412758e215fcb89f8e26aec"
"expression_id": "6412758e215fcb89f8e26aed"
"expression_id": "6412758e215fcb89f8e26aee"
"expression_id": "6412758e215fcb89f8e26aef"
"expression_id": "6412758e215fcb89f8e26af0"
"expression_id": "6412758e215fcb89f8e26af1"
"expression_id": "6412758e215fcb89f8e26af2"
"expression_id": "6412758e215fcb89f8e26af3"
"expression_id": "6412758e215fcb89f8e26af4"
"expression_id": "6412758e215fcb89f8e26af5"
"expression_id": "6412758e215fcb89f8e26af6"
"expression_id": "6412758e215fcb89f8e26af7"
"expression_id": "6412758e215fcb89f8e26af8"
"expression_id": "6412758e215fcb89f8e26af9"
"expression_id": "6412758e215fcb89f8e26afa"
"expression_id": "6412758e215fcb89f8e26afb"
"expression_id": "6412758e215fcb89f8e26afc"
"expression_id": "6412758e215fcb89f8e26afd"
"expression_id": "6412758e215fcb89f8e26afe"
"expression_id": "6412758e215fcb89f8e26aff"
],
"repertoire_id": "635af09e3891ed552fe4dca1",
"data_processing_id": "635af09e3891ed552fe4dca1",
"expression_study_method": "single-cell transcriptome",
"expression_raw_doi": null,
"expression_index": null,
"virtual_pairing": false,
}
I think Option 4 is closest to what we have now, hence the confusion.
Really??
users can either (a) use
Cell
as a container that includes arrays ofCellExpression
/CellReactivity
This is not currently the case, as I read the schema
Rearrangement
is the exception, with bi-directional links. I will have to dig further to recall why we left that that way.
I really don't see this as complicated at all. In fact it is about as simple as it can get.
What you've described would be option 3, which would involve paring down
Cell
(eg, removingrearrangements
).
I don't object to leaving the rearrangement and receptor arrays. They are small N. I do object to storing an N:1 large N relationship in the 1
This is what you get when you have an N:1 relationship with large "N" store in the "1"
@bcorrie what study is this? That seems to me like using CellExpression
incorrectly when expression_raw_doi
should be used instead to link out to an external resource like GEO.
users can either (a) use
Cell
as a container that includes arrays ofCellExpression
/CellReactivity
This is not currently the case, as I read the schema
Rearrangement
is the exception, with bi-directional links. I will have to dig further to recall why we left that that way.
Yes, the current spec does not have the ability to store an array of Expression or Reactivity objects in Cell
(no bi-directional links). That is because they are large N.
Rearrangement and Receptor were kept in Cell
for convenience as I recall, because they are small N.
I don't object to removing the Rearrangement and Receptor arrays in Cell to be consistent.
This is not currently the case, as I read the schema
Rearrangement
is the exception, with bi-directional links. I will have to dig further to recall why we left that that way.
Yeah, not in terms of exactly which fields are present. I meant in terms of the logic. I don't see the schema as currently aligned with any option fully, hence the thread...
I think the easiest thing to do (and my personal preference) is to remove the bi-directional fields from Cell
and go with option 3. Which seems to be were we landed in #409, but the bi-directional fields were left hanging.
This is what you get when you have an N:1 relationship with large "N" store in the "1"
@bcorrie what study is this? That seems to me like using
CellExpression
incorrectly whenexpression_raw_doi
should be used instead to link out to an external resource like GEO.
The Cell
object has the ability to point to an external GEO record:
The CellExpression
object captures the actual property measurement:
property_type:
type: string
description: >
Keyword describing the property type and detection method used to measure the property value.
The following keywords are recommended, but custom property types are also valid:
"mrna_expression_by_read_count",
"protein_expression_by_fluorescence_intensity", "antigen_bait_binding_by_fluorescence_intensity",
"protein_expression_by_dna_barcode_count" and "antigen_bait_binding_by_dna_barcode_count".
title: Property type and detection method
x-airr:
miairr: defined
nullable: false
adc-query-support: true
name: Property type and detection method
property:
$ref: '#/Ontology'
title: Property information
description: >
Name of the property observed, typically a gene or antibody identifier (and label) from a
canonical resource such as Ensembl (e.g. ENSG00000275747, IGHV3-79) or
Antibody Registry (ABREG:1236456, Purified anti-mouse/rat/human CD27 antibody).
example:
id: ENSG:ENSG00000275747
label: IGHV3-79
x-airr:
miairr: defined
adc-query-support: true
format: ontology
name: Property information
value:
type: number
description: Level at which the property was observed in the experiment (non-normalized).
title: Property value
example: 3
x-airr:
miairr: defined
nullable: true
adc-query-support: true
name: Property value
So you have the opportunity to do both. CellExpression is the measured genes and their expression levels from the study.
Your average Cell is going to have many 100s of these. So your array of expression_ids
is going to be very large.
We store these in the repository (for a specific cell):
$ curl -d '{"filters":{"op":"=","content":{"field":"cell_id","value":"641e1abd42e3dd9c961f2d8b"}}}' https://covid19-1.ireceptor.org/airr/v1/expression | egrep "expression_id" | wc -l
1361
Each record looks like this:
$ curl -d '{"size":1}' https://covid19-1.ireceptor.org/airr/v1/expression
{"Info":{
[INFO STUFF DELETED]
}, "CellExpression":[
{
"expression_id": "6412758e215fcb89f8e26ac3",
"cell_id": "641e1abd42e3dd9c961f2d8b",
"repertoire_id": "PRJCA002413-ERS1-IG-CELL",
"data_processing_id": "PRJCA002413-ERS1",
"property": {
"label": "AURKAIP1",
"id": "ENSG:ENSG00000175756"
},
"value": 1,
"adc_annotation_cell_id": "AAACCTGCACCGATAT-1",
"ir_annotation_set_metadata_id_expression": "635af09e3891ed552fe4dca1",
"sample_processing_id": "PRJCA002413-ERS1-IG-CELL",
"ir_created_at_expression": "2023-03-16T01:49:02.180026+00:00",
"ir_updated_at_expression": "2023-03-16T01:49:02.180026+00:00"
}]}
I don't run a repository, so I guess it's no sweat of my back, but that's a turrrrible way to store RNAseq data and not at all what I thought CellExpression
was for.
Beyond that, seems like we have consensus on Option 3, plus/minus bi-directional Cell
-Rearrangement
links...
I'd like to hear @bussec weigh in. The central problem here is that we're not all operating on the same set of assumptions of what the model should be. Once we all get on the same page w.r.t. assumptions, then it should be easy to figure out what to do.
This is not currently the case, as I read the schema
Rearrangement
is the exception, with bi-directional links. I will have to dig further to recall why we left that that way.Yeah, not in terms of exactly which fields are present. I meant in terms of the logic. I don't see the schema as currently aligned with any option fully, hence the thread...
I think the easiest thing to do (and my personal preference) is to remove the bi-directional fields from
Cell
and go with option 3. Which seems to be were we landed in #409, but the bi-directional fields were left hanging.
I don't object to this, we actually don't store them to avoid consistency problems:
curl -d '{"size":1}' https://covid19-1.ireceptor.org/airr/v1/cell
{
"cell_id": "641e1abd42e3dd9c961f2d8b",
"rearrangements": null,
"receptors": null,
"repertoire_id": "PRJCA002413-ERS1-IG-CELL",
"data_processing_id": "PRJCA002413-ERS1",
"expression_study_method": "single-cell transcriptome",
"expression_raw_doi": null,
"expression_index": null,
"virtual_pairing": false,
"ir_annotation_set_metadata_id_cell": "635af09e3891ed552fe4dca1",
"adc_annotation_cell_id": "AAACCTGCACCGATAT-1",
"sample_processing_id": "PRJCA002413-ERS1-IG-CELL",
"ir_created_at_cell": "2023-03-24T21:48:45.792352+00:00",
"ir_updated_at_cell": "2023-03-24T21:48:45.792352+00:00"
}
I don't run a repository, so I guess it's no sweat of my back, but that's a turrrrible way to store RNAseq data and not at all what I thought
CellExpression
was for.
The data isn't necessarily stored in that way in the repository, that is how the /expression endpoint gives it to you if you ask for a JSON response (which is the default). We store it as a big flat mongo document that is very fast to query if indexed correctly (which we are pretty good at). Maybe we run into challenges when we have 5B expression values, but for now it is not a challenge at all...
Although we have none of these at the moment, the API could trivially produce TSV files, matrix files, h5ad, etc. We discussed that here: https://github.com/airr-community/airr-standards/issues/409
The data is there, the data is queryable, but we don't have great use cases, and our discussions with 10X suggested that implementing any specific h5ad type of file would be a pain to support and by definition niche. Maybe that is changing but that is where we are at today.
I don't object to this, we actually don't store them to avoid consistency problems:
Perfect illustration of the problem. Nulling the field when you could populate the data implies the schema is flawed. You're probably making the correct choice, and it is technically legal to null those required fields, but it's also bypassing the model implied by the Cell
schema because the schema isn't working for you.
I don't object to this, we actually don't store them to avoid consistency problems:
Perfect illustration of the problem. Nulling the field when you could populate the data implies the schema is flawed.
Yes, I don't necessarily disagree.
You're probably making the correct choice, and it is technically legal to null those required fields, but it's also bypassing the model implied by the
Cell
schema because the schema isn't working for you.
The schema is working for us just fine (see screenshot below), and as you say we are following the schema as it is specified. We just don't use that feature.
Almost all of the issues we raise are things that we run into when we are trying to use the specification "in real life". But we have to pick our battles as to which issues we raise. Having a field that someone at some point expressed a desire to have that we don't need isn't going to be a battle that I pick to debate over. We have enough issues as it is and we raise issues that come up when things don't work for us 8-)
I can't say that the use case for rearrangements
and receptors
arrays isn't important for something. We added it at some point for a reason. We decided (maybe not intentionally) not to remove it when we removed expression_tabular
. So for me it isn't a burning issue, and I can see why it might be useful to have. If we decide to remove it that is fine too. I don't necessarily mind the ability to have that two way relationship if it is desired, but I can see it can be confusing.
The data is there, the data is queryable, but we don't have great use cases, and our discussions with 10X suggested that implementing any specific h5ad type of file would be a pain to support and by definition niche. Maybe that is changing but that is where we are at today.
With scirpy's support of the AIRR TSV format, it is likely that this might be the best way to return GEX data:
https://scirpy.scverse.org/en/latest/glossary.html#term-AIRR
But the API can't do that yet. We actually have a AIRR JSON Expression to h5ad converter (yes its scary and slow and hacky) 8-)
So if you download Cell/Expression data from the Gateway, you can use this to convert the data to h5ad files and then use those as input to tools like Conga and CellTypist (these are Apps we have built into the Gateway).
Here's some diagrams with what I think the implications are for options 1 and 3. For fields, minus = remove and plus = add.
I still don't understand how option 3 is different from what we currently have....
I still don't understand how option 3 is different from what we currently have....
Mostly because we currently have bi-directional links in Cell
. Though, I think there are assumptions of hierarchy that get us stuck in circles. If it's just me who is confused, well, then, that's super easy to fix (I hope!).
The +/- fields are what would change. I don't think I made that clear initially.
A discussion around the necessity and the architecture of the
Cell
object and its associated object recently arose in #705. Indeed the current relations and links betweenCell
,CellExpression
,CellReactivity
and other objects (incl.Receptor
andRearrangement
) appear overly complicated. Various solutions have been proposed, including:Cell
should become a container for otherCell*
objects, which would then not exists as independent top-level objects anymoreCell
should be completely removed and cells would only be represented in a data set via acell_id
in various other objects, but not as an object on its own.