Closed bcorrie closed 2 years ago
@bcorrie The comment for GeneExpression says "Expression data is associated with a cell_id and the related repertoire_id and data_processing_id as cell_id is not guaranteed to be unique outside the data processing for a single repertoire", which seems to conflict with having a /cell/{cell_id} entry point. How are we going to keep the same semantics for /cell/{cell_id} as we do for the repertoire and rearrangement entry points, where the ID uniquely identifies the object?
I assume we are going to need to do something similar whereby local files can have a weaker uniqueness constraint but that uniqueness needs to be stronger for the ADC. Do we consider it like sequence_id for rearrangements, where we allow the repository to assign its own? But then what do we do about the original cell_id assigned by the user, which might be important to maintain for some reason?
But then what do we do about the original cell_id assigned by the user which might be important to maintain for some reason?
I haven't given this terribly much thought, but off the top of my head, I can't think of such a case. At most, the depositor might want to know which ADC record corresponds to a particular local record.
I think we have to be very careful here... cell_id in the Cell object is the same cell_id in the Rearrangement and GeneExpression object. From a 10X perspective, this would be produced by the annotation tool, and it must maintain consistency as it is contained in multiple records. So I don't think we can allow this to get changed by the repository. Certainly the repository can't change it unless it changes it everywhere else in the repository...
I think if we need a repository wide identifier in this case we probably would need a repository specific field? This is what sequence_id became, but in that case we agreed the tool/repository could overwrite it because it didn't perform any other function. This isn't true for cell_id.
I believe the same is true of clone_id now that I think of it, as it is also included in the Rearrangement object. Since it is generated by an annotation tool, it will not be guaranteed to be unique except in the context of the data processing that produced it...
I think we have to be very careful here... cell_id in the Cell object is the same cell_id in the Rearrangement and GeneExpression object.
@bcorrie true, I just took it for granted that if the field was being changed it would get updated simultaneously in all appropriate places. Maybe that's naive of me...
I think we have to be very careful here... cell_id in the Cell object is the same cell_id in the Rearrangement and GeneExpression object.
@bcorrie true, I just took it for granted that if the field was being changed it would get updated simultaneously in all appropriate places. Maybe that's naive of me...
Not really, and we could do that, but it seems like a bad idea to be changing a field that is used by an annotation tool to reference other entities internally, in particular if that field maps to a field in the AIRR schema that also refers to other entities...
That seems very different to me than overwriting a field that never gets used anywhere, as in sequence_id.
I am not arguing that we absolutely shouldn't, but I think we need to think about this a bit.
I think we have to be very careful here... cell_id in the Cell object is the same cell_id in the Rearrangement and GeneExpression object.
@bcorrie true, I just took it for granted that if the field was being changed it would get updated simultaneously in all appropriate places. Maybe that's naive of me...
Not really, and we could do that, but it seems like a bad idea to be changing a field that is used by an annotation tool to reference other entities internally, in particular if that field maps to a field in the AIRR schema that also refers to other entities...
Yeah, I'm agreeing with @bcorrie's perspective here. I could definitely see there being many output files, e.g. from a 10x CellRanger run, which have the cell_id stuck in them, where those files aren't going to be loaded into the ADC. However, users might download them and use them in 10x's Loupe tool, for example. Having to go through all those files to ensure cell_id is consistent sounds like a maintenance headache; much better to leave them as is.
Along the lines of our PID work #347 #465, maybe have a cell_pid field which is the unique, Findable (FAIR), object identifier that is really only needed once the object is put into a public repository, and thus needs to be persistent.
Doing some work on trying to curate a full 10X study. The real nitty-gritty details only come out when you dive really deep!!!
It seems to me that if one is storing both rearrangements and cells from a single cell study (10X to be specific) then we want Cell.data_processing_id == Rearrangement.data_processing_id for any Cell and Rearrangement object that was produced with a single 10X run. So the cell_id uniqueness criteria is really within a data_processing_id, no???
For example, cell_id in the 10X airr_rearrangements.tsv is talking about the same cell_id as in barcodes.tsv if and only if they come from the same DataProcessing, and therefore they should have the same data_processing_id??? This makes sense because it is that common data processing that is creating those files where the cell_id actually means something between those files. Using cell_id across data processing runs with Cell Ranger is just plain wrong...
In order to find all the cells from a 10X run I search /airr/v1/cells for a repertoire_id/data_processing_id pair.
In order to find all the rearrangements from a single cell with a specific cell_id from the above list of cells, I search /airr/v1/rearrangement for a cell_id/repertoire_id/data_processing_id combination, and that would give me the paired chain rearrangements for that cell.
Can anyone poke holes in this logic?
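For illustration, the two searches described above could be written as ADC API query bodies roughly like this. It is only a sketch: the helper names and the example IDs are made up, and it assumes the repository exposes repertoire_id, data_processing_id and cell_id as queryable fields on both endpoints.

```python
import json


def eq(field, value):
    """Build one ADC API equality filter."""
    return {"op": "=", "content": {"field": field, "value": value}}


def all_of(*filters):
    """Combine several filters with a logical AND."""
    return {"op": "and", "content": list(filters)}


def cells_query(repertoire_id, data_processing_id):
    """Query body for: all cells from one 10X run."""
    return {"filters": all_of(
        eq("repertoire_id", repertoire_id),
        eq("data_processing_id", data_processing_id),
    )}


def rearrangements_query(cell_id, repertoire_id, data_processing_id):
    """Query body for: all rearrangements belonging to one cell of that run."""
    return {"filters": all_of(
        eq("cell_id", cell_id),
        eq("repertoire_id", repertoire_id),
        eq("data_processing_id", data_processing_id),
    )}


if __name__ == "__main__":
    # Hypothetical identifiers, for illustration only.
    body = rearrangements_query("AAACCTGAGAAACCAT-1", "rep-1", "dp-1")
    print(json.dumps(body, indent=2))
```

The second body would be POSTed to the rearrangement endpoint; dropping the cell_id filter from it gives the first query.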
It seems to me that if one is storing both rearrangements and cells from a single cell study (10X to be specific) then we want Cell.data_processing_id == Rearrangement.data_processing_id for any Cell and Rearrangement object that was produced with a single 10X run.
Yes, assuming they are produced together then they should have the same data_processing_id. Though 10x tools don't really know anything about data_processing_id, so it's blank unless you assign it yourself.
So the cell_id uniqueness criteria is really within a data_processing_id, no???
Almost. Multiple repertoires might be processed in the same data processing, so you should also consider repertoire_id. Actually, I would generally consider cell_id unique within a repertoire_id and only worry about data_processing_id if I know that I'm mixing multiple data processings. The same point above applies, i.e. 10x tools don't really know anything about repertoire_id, so it's blank unless you assign it yourself.
In order to find all the cells from a 10X run I search /airr/v1/cells for a repertoire_id/data_processing_id pair.
In order to find all the rearrangements from a single cell with a specific cell_id from the above list of cells, I search /airr/v1/rearrangement for a cell_id/repertoire_id/data_processing_id combination, and that would give me the paired chain rearrangements for that cell.
Well now this confuses things because up above I read that as being about local files, local identifiers, and 10x tools. Once you jump to the ADC, it throws in the question about PIDs (i.e. all the discussions over at #347). Let's make the assumption (which I know you are arguing against in #347, but play along for a sec :-D) that when you load the data in the repository, you assign global identifiers for cell_id. In that case, you can find all the rearrangements just by using cell_id.
If you kept 10x's cell_id when loading into the repository, yes, the cell_id/repertoire_id/data_processing_id combo is correct, assuming also you assigned repertoire_id and data_processing_id correctly yourself because the 10x tools don't know anything about them.
In order to find all the cells from a 10X run I search /airr/v1/cells for a repertoire_id/data_processing_id pair.
In order to find all the rearrangements from a single cell with a specific cell_id from the above list of cells, I search /airr/v1/rearrangement for a cell_id/repertoire_id/data_processing_id combination, and that would give me the paired chain rearrangements for that cell.
Well now this confuses things because up above I read that as being about local files, local identifiers, and 10x tools. Once you jump to the ADC, it throws in the question about PIDs (i.e. all the discussions over at #347).
I don't think it confuses things for queries for a single repository. repertoire_id is unique in a repository, data_processing_id is unique within a Repertoire, and cell_id is unique within a DataProcessing. So cell_id/repertoire_id/data_processing_id on rearrangements will give me all of the rearrangements from that Cell, but no rearrangements from other Subjects/Samples/DataProcessings. So such a query should give me the paired chain rearrangements for that cell_id.
To make it globally unique in the entire ADC, the above is not sufficient, but it is sufficient at the repository level - which is what we are trying to ensure at the moment... 8-)
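To make the scoping argument concrete, here is a small sketch of the check it implies: a bare cell_id may repeat across repertoires, while the repertoire_id/data_processing_id/cell_id triple should not. The records and the helper name are invented for illustration and are not taken from any repository.

```python
from collections import Counter


def duplicate_keys(records, key_fields):
    """Return every key (as a tuple) that occurs in more than one record."""
    counts = Counter(tuple(r.get(f) for f in key_fields) for r in records)
    return [key for key, n in counts.items() if n > 1]


# Hypothetical Cell records: the same 10X barcode shows up in two repertoires.
cells = [
    {"repertoire_id": "rep-1", "data_processing_id": "dp-1", "cell_id": "AAACCTGAGAAACCAT-1"},
    {"repertoire_id": "rep-2", "data_processing_id": "dp-1", "cell_id": "AAACCTGAGAAACCAT-1"},
    {"repertoire_id": "rep-1", "data_processing_id": "dp-1", "cell_id": "AAACCTGAGCATCATC-1"},
]

# cell_id alone collides across repertoires...
print(duplicate_keys(cells, ["cell_id"]))
# ...but the composite key is unique within this (single) repository.
print(duplicate_keys(cells, ["repertoire_id", "data_processing_id", "cell_id"]))
```

The same check says nothing about uniqueness across repositories, which is the part the composite key does not cover.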
To make it globally unique in the entire ADC, the above is not sufficient, but it is sufficient at the repository level - which is what we are trying to ensure at the moment... 8-)
I understand. That's why I offered the two possibilities.
Now what about the converse problem? Let's say you had a rearrangement record, which contains a cell_id, and you can retrieve the cell information with /airr/v1/cell/{cell_id}. Or can you?
Added an ADC specific cell_id to address most (all) of the controversy in #409.
As per discussion at AIRR Standards, we think we can merge this... Objections???
@bcorrie Calling the object GeneExpression is specific for a certain type of measurement. My understanding from the 2021-12 call was that multiple measurement types would be supported by a single object.
Change this to CellExpression - captures the ability to generalize the type of information that is measured.
@bussec can you add an example for property - currently we have:
property:
    $ref: '#/Ontology'
    description: Name of the property observed, typically a identifier from a canonical resource such as Ensembl and its label (e.g. ENSG00000275747, IGHV3-79)
    title: Gene name
    nullable: true
    example:
        id: ENSG:ENSG00000275747
        label: IGHV3-79
    x-airr:
        miairr: defined
        adc-query-support: true
        format: ontology
        name: Name of the property
@bcorrie
id: ABREG:1236456
label: Purified anti-mouse/rat/human CD27 antibody

- curie_prefix: ABREG
  iri_prefix:
    - "http://antibodyregistry.org/AB_"
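As a quick sanity check of the prefix mapping proposed above, this sketch expands the example CURIE into an IRI by swapping the ABREG prefix for its iri_prefix. The function name and the dictionary layout are assumptions made for illustration; this is not how the AIRR library resolves CURIEs.

```python
def expand_curie(curie, prefix_map):
    """Replace the CURIE prefix with its IRI prefix from prefix_map."""
    prefix, sep, local_id = curie.partition(":")
    if not sep or prefix not in prefix_map:
        raise ValueError(f"Unknown or malformed CURIE: {curie!r}")
    return prefix_map[prefix] + local_id


# Mapping taken from the example above (curie_prefix -> first iri_prefix).
PREFIX_MAP = {"ABREG": "http://antibodyregistry.org/AB_"}

print(expand_curie("ABREG:1236456", PREFIX_MAP))
# prints http://antibodyregistry.org/AB_1236456
```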
@bussec do you want to re-review the above changes? 8-)
Needed to update GeneExpression -> CellExpression in the APIs. Moved the adc custom query to be an adc_ field in the spec rather than in the API, similar to what we have for adc_update_date etc.
Closes #409
Please discuss as to whether this captures our current understanding.