airr-community / airr-standards

AIRR Community Data Standards
https://docs.airr-community.org
Creative Commons Attribution 4.0 International

Updates to cell object as per #409 #574

Closed bcorrie closed 2 years ago

bcorrie commented 2 years ago

Closes #409

Please discuss as to whether this captures our current understanding.

schristley commented 2 years ago

@bcorrie The comment for GeneExpression says "Expression data is associated with a cell_id and the related repertoire_id and data_processing_id as cell_id is not guaranteed to be unique outside the data processing for a single repertoire", which seems to conflict with having a /cell/{cell_id} entry point. How are we going to keep the same semantics for /cell/{cell_id} like we do for the repertoire and rearrangement entry points, where the ID uniquely identifies the object?

I assume we are going to need to do something similar whereby local files can have a weaker uniqueness constraint but that uniqueness needs to be stronger for the ADC. Do we consider it like sequence_id for rearrangements where we allow the repository to assign its own? But then what do we do about the original cell_id assigned by the user which might be important to maintain for some reason?

scharch commented 2 years ago

> But then what do we do about the original cell_id assigned by the user which might be important to maintain for some reason?

I haven't given this terribly much thought, but off the top of my head, I can't think of such a case. At most, the depositor might want to know which ADC record corresponds to a particular local record.

bcorrie commented 2 years ago

I think we have to be very careful here... cell_id in the Cell object is the same cell_id in the Rearrangement and GeneExpression object. From a 10X perspective, this would be produced by the annotation tool, and it must maintain consistency as it is contained in multiple records. So I don't think we can allow this to get changed by the repository. Certainly the repository can't change it unless it changes it everywhere else in the repository...

I think if we need a repository-wide identifier in this case we probably would need a repository-specific field? This is what sequence_id became, but in that case we agreed the tool/repository could overwrite it because it didn't perform any other function. This isn't true for cell_id.

I believe the same is true of clone_id now that I think of it, as it is also included in the Rearrangement object. Since it is generated by an annotation tool, it will not be guaranteed to be unique except in the context of the data processing that produced it...

scharch commented 2 years ago

> I think we have to be very careful here... cell_id in the Cell object is the same cell_id in the Rearrangement and GeneExpression object.

@bcorrie true, I just took it for granted that if the field was being changed it would get updated simultaneously in all appropriate places. Maybe that's naive of me...

bcorrie commented 2 years ago

> > I think we have to be very careful here... cell_id in the Cell object is the same cell_id in the Rearrangement and GeneExpression object.
>
> @bcorrie true, I just took it for granted that if the field was being changed it would get updated simultaneously in all appropriate places. Maybe that's naive of me...

Not really, and we could do that, but it seems like a bad idea to be changing a field that is used by an annotation tool to reference other entities internally, in particular if that field maps to a field in the AIRR schema that also refers to other entities...

That seems very different to me than overwriting a field that never gets used anywhere, as in sequence_id.

I am not arguing that we absolutely shouldn't, but I think we need to think about this a bit.

schristley commented 2 years ago

> > > I think we have to be very careful here... cell_id in the Cell object is the same cell_id in the Rearrangement and GeneExpression object.
> >
> > @bcorrie true, I just took it for granted that if the field was being changed it would get updated simultaneously in all appropriate places. Maybe that's naive of me...
>
> Not really, and we could do that, but it seems like a bad idea to be changing a field that is used by an annotation tool to reference other entities internally, in particular if that field maps to a field in the AIRR schema that also refers to other entities...

Yeah, I agree with @bcorrie's perspective here. I could definitely see there being many output files, e.g. from a 10x CellRanger run, which have the cell_id stuck in them, where those files aren't going to be loaded into the ADC. However, users might download them and use them in 10x's Loupe tool, for example. Having to go through all those files to ensure cell_id is consistent sounds like a maintenance headache; much better to leave them as is.

Along the lines of our PID work #347 #465 , maybe have a cell_pid field which is the unique, Findable (FAIR), object identifier that is really only needed once the object is put into a public repository, and thus needs to be persistent.
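To make the cell_pid idea concrete, here is a minimal sketch of a repository adding its own identifier at load time while leaving the tool-assigned cell_id alone. The namespace URL, the `assign_cell_pid` helper, and the choice of a name-based UUID are all assumptions for illustration; nothing here is part of the AIRR schema or of any repository's implementation.

```python
import json
import uuid

# Hypothetical namespace for a single repository; any fixed UUID would work.
REPOSITORY_NAMESPACE = uuid.uuid5(uuid.NAMESPACE_URL, "https://repository.example.org")


def assign_cell_pid(record):
    """Return a copy of a Cell record with a repository-level cell_pid added.

    The original cell_id is kept as-is so that it still matches the files
    written by the annotation tool.
    """
    # Serialise the local identifiers unambiguously before hashing them.
    local_key = json.dumps(
        [record["repertoire_id"], record["data_processing_id"], record["cell_id"]]
    )
    updated = dict(record)
    updated["cell_pid"] = str(uuid.uuid5(REPOSITORY_NAMESPACE, local_key))
    return updated


# Same barcode in two repertoires: cell_id collides, cell_pid does not.
cells = [
    {"cell_id": "AAACCTGAGAAACCAT-1", "repertoire_id": "rep-1", "data_processing_id": "dp-1"},
    {"cell_id": "AAACCTGAGAAACCAT-1", "repertoire_id": "rep-2", "data_processing_id": "dp-1"},
]
with_pids = [assign_cell_pid(c) for c in cells]
```

A name-based UUID is used here only because it is deterministic, so reloading the same record yields the same identifier; a repository could just as well mint identifiers some other way.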

bcorrie commented 2 years ago

Doing some work on trying to curate a full 10X study. The real nitty-gritty details only come out when you dive really deep!!!

It seems to me that if one is storing both rearrangements and cells from a single cell study (10X to be specific) then we want Cell.data_processing_id == Rearrangement.data_processing_id for any Cell and Rearrangement object that was produced with a single 10X run. So the cell_id uniqueness criteria is really within a data_processing_id, no???

For example, cell_id in the 10X airr_rearrangements.tsv is talking about the same cell_id as in barcodes.tsv if and only if they come from the same DataProcessing, and therefore they should have the same data_processing_id??? This makes sense because it is that common data processing that creates those files, where the cell_id actually means something between those files. Using cell_id across data processing runs with Cell Ranger is just plain wrong...

In order to find all the cells from a 10X run I search /airr/v1/cells for a repertoire_id/data_processing_id pair

In order to find all the rearrangements from a single cell with a specific cell_id from the above list of cells, I search /airr/v1/rearrangement for a cell_id/repertoire_id/data_processing_id, and that would give me the paired chain rearrangements for that cell.

Can anyone poke holes in this logic?
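The two searches described above could be sketched as request bodies like the following. This assumes the ADC API filter structure (an `op` with a `content` object, combined with `and`); the helper names and the identifier values are made up for illustration, and no request is actually sent.

```python
import json


def eq(field, value):
    """Build a single equality clause for one field."""
    return {"op": "=", "content": {"field": field, "value": value}}


def and_filter(*clauses):
    """Combine clauses so that all of them must match."""
    return {"filters": {"op": "and", "content": list(clauses)}}


def cells_query(repertoire_id, data_processing_id):
    """All cells from one run: a repertoire_id/data_processing_id pair."""
    return and_filter(
        eq("repertoire_id", repertoire_id),
        eq("data_processing_id", data_processing_id),
    )


def rearrangements_query(cell_id, repertoire_id, data_processing_id):
    """All rearrangements for one cell within that same run."""
    return and_filter(
        eq("cell_id", cell_id),
        eq("repertoire_id", repertoire_id),
        eq("data_processing_id", data_processing_id),
    )


# Placeholder identifiers, not taken from any real study.
cell_payload = cells_query("rep-1", "dp-1")
rearrangement_payload = rearrangements_query("AAACCTGAGAAACCAT-1", "rep-1", "dp-1")
print(json.dumps(rearrangement_payload, indent=2))
```

The first body would be posted to the cell entry point and the second to the rearrangement entry point; the only difference between them is the extra cell_id clause.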

schristley commented 2 years ago

> It seems to me that if one is storing both rearrangements and cells from a single cell study (10X to be specific) then we want Cell.data_processing_id == Rearrangement.data_processing_id for any Cell and Rearrangement object that was produced with a single 10X run.

Yes, assuming they are produced together then they should have the same data_processing_id. Though 10x tools don't really know anything about data_processing_id so it's blank unless you assign it yourself.

> So the cell_id uniqueness criteria is really within a data_processing_id, no???

Almost. Multiple repertoires might be processed in the same data processing, so you should also consider repertoire_id. Actually, I would generally consider cell_id unique within a repertoire_id and only worry about data_processing_id if I know that I'm mixing multiple data processings. The same point above applies, i.e. 10x tools don't really know anything about repertoire_id so it's blank unless you assign it yourself.

> In order to find all the cells from a 10X run I search /airr/v1/cells for a repertoire_id/data_processing_id pair
>
> In order to find all the rearrangements from a single cell with a specific cell_id from the above list of cells, I search /airr/v1/rearrangement for a cell_id/repertoire_id/data_processing_id, and that would give me the paired chain rearrangements for that cell.

Well, now this confuses things, because up above I read that as being about local files, local identifiers, and 10x tools. Once you jump to the ADC, it throws in the question about PIDs (i.e. all the discussions over at #347). Let's make the assumption (which I know you are arguing against in #347, but play along for a sec :-D) that when you load the data in the repository, you assign global identifiers for cell_id. In that case, you can find all the rearrangements just by using cell_id.

If you kept 10x's cell_id when loading into the repository, yes the cell_id/repertoire_id/data_processing_id combo is correct, assuming also you assigned repertoire_id and data_processing_id correctly yourself because the 10x tools don't know anything about them.

bcorrie commented 2 years ago

> > In order to find all the cells from a 10X run I search /airr/v1/cells for a repertoire_id/data_processing_id pair
> >
> > In order to find all the rearrangements from a single cell with a specific cell_id from the above list of cells, I search /airr/v1/rearrangement for a cell_id/repertoire_id/data_processing_id, and that would give me the paired chain rearrangements for that cell.
>
> Well now this confuses things because up above I read that as being about local files, local identifiers, and 10x tools. Once you jump to the ADC, it throws in the question about PIDs (i.e. all the discussions over at #347).

I don't think it confuses things for queries for a single repository. repertoire_id is unique in a repository, data_processing_id is unique within a Repertoire, and cell_id is unique within a DataProcessing. So cell_id/repertoire_id/data_processing_id on rearrangements will give me all of the rearrangements from that Cell, but no rearrangements from other Subjects/Samples/DataProcessings. So such a query should give me the paired chain rearrangements for that cell_id.

To make it globally unique in the entire ADC, the above is not sufficient, but it is sufficient at the repository level - which is what we are trying to ensure at the moment... 8-)
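As a rough illustration of the repository-level uniqueness described above, the sketch below treats (repertoire_id, data_processing_id, cell_id) as a composite key and groups rearrangement records by it. The records and the `cell_key` and `group_by_cell` helpers are invented for this example and are not taken from any repository code.

```python
from collections import defaultdict


def cell_key(record):
    """Composite key that is assumed to be unique within one repository."""
    return (record["repertoire_id"], record["data_processing_id"], record["cell_id"])


def group_by_cell(rearrangements):
    """Map each composite cell key to the sequence_ids that belong to it."""
    groups = defaultdict(list)
    for rearrangement in rearrangements:
        groups[cell_key(rearrangement)].append(rearrangement["sequence_id"])
    return dict(groups)


# Made-up records: the barcode AAAC-1 appears in two different repertoires.
rearrangements = [
    {"sequence_id": "s1", "cell_id": "AAAC-1", "repertoire_id": "rep-1", "data_processing_id": "dp-1", "locus": "IGH"},
    {"sequence_id": "s2", "cell_id": "AAAC-1", "repertoire_id": "rep-1", "data_processing_id": "dp-1", "locus": "IGK"},
    {"sequence_id": "s3", "cell_id": "AAAC-1", "repertoire_id": "rep-2", "data_processing_id": "dp-1", "locus": "IGH"},
]
paired = group_by_cell(rearrangements)
```

With the composite key, the heavy and light chain from rep-1 end up together and the rep-2 record with the same barcode stays separate.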

schristley commented 2 years ago

> To make it globally unique in the entire ADC, the above is not sufficient, but it is sufficient at the repository level - which is what we are trying to ensure at the moment... 8-)

I understand. That's why I offered the two possibilities.

Now what about the converse problem. Let's say you had a rearrangement record, which contains a cell_id, and you can retrieve the cell information with /airr/v1/cell/{cell_id}. Or can you?
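The converse problem can be seen in a toy lookup. This sketch assumes the repository kept the tool-assigned cell_id and resolves a cell by that value alone; the records and the `find_cells` helper are hypothetical.

```python
def find_cells(cells, cell_id):
    """Return every Cell record whose cell_id matches, ignoring other fields."""
    return [cell for cell in cells if cell["cell_id"] == cell_id]


# Made-up Cell records from two repertoires that share a barcode.
cells = [
    {"cell_id": "AAAC-1", "repertoire_id": "rep-1", "data_processing_id": "dp-1"},
    {"cell_id": "AAAC-1", "repertoire_id": "rep-2", "data_processing_id": "dp-1"},
    {"cell_id": "TTTG-1", "repertoire_id": "rep-1", "data_processing_id": "dp-1"},
]

matches = find_cells(cells, "AAAC-1")
# More than one match means the ID alone does not identify a single object.
is_ambiguous = len(matches) > 1
```

Under that assumption a single-object entry point keyed only on cell_id has no unique answer for the shared barcode, which is the gap a repository-level identifier would close.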

bcorrie commented 2 years ago

Added an ADC specific cell_id to address most (all) of the controversy in #409.

As per discussion at AIRR Standards, we think we can merge this... Objections???

bussec commented 2 years ago

@bcorrie Calling the object GeneExpression is specific for a certain type of measurement. My understanding from the 2021-12 call was that multiple measurement types would be supported by a single object.

bcorrie commented 2 years ago

Changed this to CellExpression, which captures the ability to generalize the type of information that is measured.

bcorrie commented 2 years ago

@bussec can you add an example for property - currently we have:

        property:
            $ref: '#/Ontology'
            description: Name of the property observed, typically an identifier from a canonical resource such as Ensembl and its label (e.g. ENSG00000275747, IGHV3-79)
            title: Gene name
            nullable: true
            example:
                id: ENSG:ENSG00000275747
                label: IGHV3-79
            x-airr:
                miairr: defined
                adc-query-support: true
                format: ontology
                name: Name of the property

bussec commented 2 years ago

@bcorrie

bcorrie commented 2 years ago

@bussec do you want to re-review the above changes? 8-)

Needed to update GeneExpression -> CellExpression in the APIs.

Moved the adc custom query to be an adc_ field in the spec rather than in the API, similar to what we have for adc_update_date etc.