airr-community / airr-standards

AIRR Community Data Standards
https://docs.airr-community.org
Creative Commons Attribution 4.0 International
Cell object specifications - ScXT #320

Closed sharikrish closed 1 year ago

sharikrish commented 4 years ago

Following a discussion with @franasa and @bussec, we have come up with the schema below for the Cell object.

Cell:
    discriminator: AIRR
    type: object
    required:
        - cell_id 
        - rearrangements  
        - virtual
    properties:
        cell_id:
            type: string
            description: >
                Identifier defining the cell of origin for the query sequence.
            example: W06_046_091
            x-airr:
                miairr: false
                required: true
                nullable: false
                adc-api-optional: false
                name: Cell index
        rearrangements:
            type: array
            description: >
                  List of rearrangement identifiers defined for the Rearrangement object
            items:
                type: string
            example: [id1, id2] 
            x-airr:
                miairr: false
                required: true
                nullable: true
                adc-api-optional: false
                name: 
        virtual:
            type: boolean
            description: >
                boolean to indicate if pairing was inferred.
            x-airr:
                miairr: false
                required: true
                nullable: false
                adc-api-optional: true
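
As a rough illustration of what the `required` and `nullable` settings above imply for a client, here is a minimal sketch of a record check. The function name `validate_cell` and the list-of-problems return style are assumptions made for this example; they are not part of the AIRR reference library.

```python
from typing import Any, Dict, List

# Required fields and expected Python types, taken from the draft schema above.
REQUIRED_FIELDS = {
    "cell_id": str,
    "rearrangements": list,
    "virtual": bool,
}
# Per the x-airr blocks, only `rearrangements` is nullable.
NULLABLE_FIELDS = {"rearrangements"}


def validate_cell(record: Dict[str, Any]) -> List[str]:
    """Return a list of problems found in a Cell record (empty if none)."""
    problems = []
    for name, expected in REQUIRED_FIELDS.items():
        if name not in record:
            problems.append(f"missing required field: {name}")
            continue
        value = record[name]
        if value is None:
            if name not in NULLABLE_FIELDS:
                problems.append(f"field may not be null: {name}")
            continue
        if not isinstance(value, expected):
            problems.append(f"wrong type for {name}: {type(value).__name__}")
    # Items of `rearrangements` are declared as strings in the draft.
    items = record.get("rearrangements") or []
    if isinstance(items, list) and not all(isinstance(i, str) for i in items):
        problems.append("rearrangements must contain only strings")
    return problems
```

An empty list means the record satisfies the draft as written; it says nothing about whether the referenced rearrangements actually exist.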

From an API perspective, we have implemented a cell endpoint that provides all cell-associated properties, such as rearrangements, the virtual flag, etc., as per the discussion in #273.

| HTTP method | URL | Action |
| --- | --- | --- |
| GET, POST | http://[hostname]/api/cells | Retrieve list of cells |
| GET, POST | http://[hostname]/api/cell/[cell_id] | Retrieve a cell |

For instance, upon a query the cell endpoint would return JSON such as the following:

curl -i -H 'Content-Type: application/json' -X POST 'http://127.0.0.1:5000/api/cells?cell_id=1086'

{
  "Cells": [
    {
      "cell_id": "1086", 
      "rearrangements": ["507","678"], 
      "virtual": false
    }
  ]
}
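
For readers who want to consume this response programmatically, the following is a small sketch that maps each cell to its rearrangement identifiers. It only parses the example body shown above; the helper name `rearrangements_by_cell` is made up for illustration, and no request is sent.

```python
import json
from typing import Dict, List

# Response body copied from the example above.
RESPONSE_BODY = """
{
  "Cells": [
    {"cell_id": "1086", "rearrangements": ["507", "678"], "virtual": false}
  ]
}
"""


def rearrangements_by_cell(body: str) -> Dict[str, List[str]]:
    """Map each cell_id in a cells response to its rearrangement identifiers."""
    payload = json.loads(body)
    result = {}
    for cell in payload.get("Cells", []):
        # `rearrangements` is nullable in the draft schema, so treat null as empty.
        result[cell["cell_id"]] = cell.get("rearrangements") or []
    return result


links = rearrangements_by_cell(RESPONSE_BODY)
```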

Some of the questions that we would like to address are as follows:

  1. In the Cell object specification, miairr is set to false, which differs from the cell_id properties defined in the Rearrangement object. In addition, single-cell experiments obviously have a cell_id associated with them, so it cannot be null. Hence, it might be logical to have extension properties for cell_id similar to those discussed for race or ethnicity etc. in Human population genetics extension #318, so that the properties defined in the extension are triggered if a study has a cell_id. The question is how to centralize the cell_id properties with the original YAML file.

  2. rearrangements defined in the Cell object would contain a list of the rearrangement_ids associated with the cell_id. In a single-cell context, a cell can have one, several, or sometimes no rearrangements. For instance, when one queries for a cell_id, all the rearrangements associated with that cell_id need to be retrieved, and one of the best ways to define rearrangements would be as a list containing all rearrangement_ids, as defined for the clone object #294 [Line:2359]

scharch commented 4 years ago

Hence, it might be logical to have extension properties for cell_id similar to those discussed for race or ethnicity etc.

@srlak These should be properties of the sample, not of the cell. Similarly, rearrangement_id alone isn't enough to recover the sequences. To solve both, Cell should include required, non-nullable fields for repertoire_id and sample_id.

bussec commented 4 years ago

Hence, it might be logical to have extension properties for cell_id similar to those discussed for race or ethnicity etc.

@srlak These should be properties of the sample, not of the cell.

@scharch (assuming that this comment referred to race and ethnicity): They are actually at the Subject level, but IMO were just there as an example for the HPG-XT (#318) and the (currently unresolved) question of how to determine whether an XT is active or not.

Similarly, rearrangement_id alone isn't enough to recover the sequences. To solve both, Cell should include required, non-nullable fields for repertoire_id and sample_id.

Because rearrangement_id is not a globally unique identifier? No objections against adding the other fields, which then would be:

Correct?

scharch commented 4 years ago

@bussec Correct.

bcorrie commented 4 years ago

@bussec @srlak in the original single cell extension proposal here:

https://github.com/airr-community/airr-standards/blob/single_cell_ext/docs/miairr/miairr_single_cell_extension.rst

There were fields around DOIs, keywords, and expression value/expression marker data...

Should we have place holders for this in this spec?

bcorrie commented 4 years ago

One question as to the rearrangements array in the Cell object. If each rearrangement has a cell_id field, and this presumably matches the cell_id in the Cell object, do we need to have a list of rearrangements at the Cell level? Now each Cell presumably would only ever be linked with a small number of rearrangements so maybe this is not the end of the world from a data size perspective, but linking both ways can lead to data consistency issues...

We can accomplish this link by having one or the other, but we don't need both... We might want both for efficiency, but we don't need them...
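
To make the consistency concern concrete, here is a minimal sketch of the kind of check a repository would need if the link is stored in both directions. The function name and the sample records are hypothetical; only the field names `cell_id`, `rearrangement_id` and `rearrangements` come from the discussion.

```python
from typing import List, Set, Tuple


def find_link_mismatches(
    cells: List[dict], rearrangements: List[dict]
) -> Set[Tuple[str, str]]:
    """Return (cell_id, rearrangement_id) pairs recorded on one side only."""
    # Links as stated by Cell.rearrangements.
    from_cells = {
        (cell["cell_id"], rid)
        for cell in cells
        for rid in (cell.get("rearrangements") or [])
    }
    # Links as stated by Rearrangement.cell_id.
    from_rearrangements = {
        (r["cell_id"], r["rearrangement_id"])
        for r in rearrangements
        if r.get("cell_id") is not None
    }
    # Symmetric difference: anything present in exactly one direction.
    return from_cells ^ from_rearrangements


# Made-up records: rearrangement 678 points at a different cell than the Cell object claims.
example_cells = [{"cell_id": "1086", "rearrangements": ["507", "678"]}]
example_rearrangements = [
    {"rearrangement_id": "507", "cell_id": "1086"},
    {"rearrangement_id": "678", "cell_id": "2000"},
]
mismatches = find_link_mismatches(example_cells, example_rearrangements)
```

With a one-directional link, this check (and the class of error it detects) would not be needed at all.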

franasa commented 4 years ago

What if the Rearrangement object, instead of having a duplicate of the cell_id, inherits the whole defined Cell object? This would then also be triggered only when the single cell extension is required. Does this make sense, and could it solve the potential consistency issues? It's true that for small lists of rearrangement_ids it might not make a big difference, but as far as I understand from @srlak's advances on the API (correct me if I am wrong), it gets more tricky when we start defining how to include other single-cell specifics such as flow cytometry markers.

schristley commented 4 years ago

sample_id: 1-to-1 reference, as an individual cell can only come from a single tube

sample_id or sample_processing_id? sample_id refers to just the MiAIRR sample object excluding cell processing down to sequencing, while sample_processing_id refers to the set of objects from sample down to sequencing.

schristley commented 4 years ago

If each rearrangement has a cell_id field, and this presumably matches the cell_id in the Cell object, do we need to have a list of rearrangements at the Cell level?

@bussec I guess we need to deprecate cell_id in rearrangements?

@srlak Should there be a receptors array with the list of receptors in the cell?

javh commented 4 years ago

The Cell object isn't compatible with storing Rearrangement records in a TSV, so you would still need the cell_id in Rearrangement for both the flat/simple representation of single-cell and as a way to link from Rearrangement to Cell records (outside of a json/yaml/database representation of Rearrangement records).
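
As a sketch of the flat representation described here, the snippet below derives the Cell-to-Rearrangement links from a Rearrangement TSV using only the `cell_id` column. The three rows are invented for illustration, and the helper name `cells_from_tsv` is an assumption.

```python
import csv
import io
from collections import defaultdict
from typing import Dict, List

# A tiny, made-up Rearrangement TSV with only the columns needed here.
TSV_TEXT = (
    "rearrangement_id\tcell_id\tlocus\n"
    "507\t1086\tIGH\n"
    "678\t1086\tIGK\n"
    "901\t\tIGH\n"
)


def cells_from_tsv(text: str) -> Dict[str, List[str]]:
    """Group rearrangement_id values by cell_id, skipping rows without a cell_id."""
    groups = defaultdict(list)
    reader = csv.DictReader(io.StringIO(text), delimiter="\t")
    for row in reader:
        if row["cell_id"]:
            groups[row["cell_id"]].append(row["rearrangement_id"])
    return dict(groups)


cell_links = cells_from_tsv(TSV_TEXT)
```

Note that this only works when each rearrangement belongs to at most one cell, which is exactly the limitation raised in the following comments.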

schristley commented 4 years ago

the cell_id in Rearrangement for both the flat/simple representation of single-cell and as a way to link from Rearrangement to Cell records.

I was thinking we wanted to deprecate cell_id in rearrangements because it allows that rearrangement to only be associated with one cell. Based upon @bussec's comment, it seems like we need to support an n-to-n relationship.

javh commented 4 years ago

I was thinking we wanted to deprecate cell_id in rearrangements because it allows that rearrangement to only be associated with one cell. Based upon @bussec's comment, it seems like we need to support an n-to-n relationship.

I don't think we want to do that. That would mean dropping single-cell support from the TSV (and other simple tabular representations). We need to support an n-to-n relationship, but not require it, because n-to-n is not the most common use case.

I'm viewing Cell like Alignment - a solution for when a single reference alignment result (which is what Rearrangement supports) isn't sufficient.

sharikrish commented 4 years ago

* `repertoire_id`: 1-to-n reference, as our repertoire definition allows overlapping repertoires (e.g. `IgG class-switched B cells` and `CD27+ B cells`).

@bussec @scharch Would this then mean that cell_id is associated with multiple repertoire_id which then points to the associated meta-data for each repertoire_id?

There were fields around DOIs, keywords, and expression value/expression marker data...

Should we have place holders for this in this spec?

@bcorrie : Yes, you are right, and we will include them, but first we need to sort out the cell_id mapping and linkage (explained below). Apart from that, we have updated the spec as shown below. In addition, consider a case where two repertoires from a single sample contain different types of B cells (e.g. IgG+ B cells and CD27+ B cells). These repertoires would share rearrangements that are simply duplicated under different rearrangement_ids and then linked to a cell_id with its expression data etc. The question is how we link all of them to the Cell object (see also #181).

Cell:
    discriminator: AIRR
    type: object
    required:
        - cell_id #redefined cell_id > how to centralize it in the yaml 
        - rearrangements  
        - virtual
    properties:
        cell_id:
            type: string
            description: >
                Identifier defining the cell of origin for the query sequence.
            example: W06_046_091
            x-airr:
                miairr: false
                required: true
                nullable: false
                adc-api-optional: false
                name: Cell index
        rearrangements:
            type: array
            description: >
                  Array of rearrangement identifiers defined for the Rearrangement object
            items:
                type: string
            example: [id1, id2] #empty vs NULL? 
            x-airr:
                miairr: false
                required: true
                nullable: true
                adc-api-optional: false
        raw_data:
            type: object
            description:
            properties:
                study_method: 
                    type: string
                    enum:
                        - flow cytometry
                        - single-cell transcriptome
                    description: >
                       keyword describing the methodology used to assess expression. The values for this field MUST come from a controlled vocabulary
                doi:
                    type: string
                    description: >
                      DOI of raw data set containing the current event
                index:
                    type: string
                    description: >
                      Index addressing the current event within the raw data set.
        expression:
            type: object
            description: > 
                Expression definitions for single-cell
            properties:
                expression_marker:
                    type: string
                    description: >
                       standardized designation of the transcript or epitope
                    example: CD27
                expression_value:
                    type: integer
                    description: >
                       transformed and normalized expression level.
                    example: 14567
        virtual:
            type: boolean
            description: >
                boolean to indicate if pairing was inferred.
            x-airr:
                miairr: false
                required: true
                nullable: false # assuming only done for sc experiments, otherwise does not exist
                adc-api-optional: true

If each rearrangement has a cell_id field, and this presumably matches the cell_id in the Cell object, do we need to have a list of rearrangements at the Cell level?

@bcorrie : Yes. This makes sense. A naive question: cell_id is not a required field in the Rearrangement object; shouldn't it be a required field if it refers to a single-cell study?

sharikrish commented 4 years ago

@srlak Should there be a receptors array with the list of receptors in the cell?

@schristley Indeed, we should then have a list of receptor_id in the cell.

scharch commented 4 years ago

* `repertoire_id`: 1-to-n reference, as our repertoire definition allows overlapping repertoires (e.g. `IgG class-switched B cells` and `CD27+ B cells`).

@bussec @scharch Would this then mean that cell_id is associated with multiple repertoire_id which then points to the associated meta-data for each repertoire_id?

Oh, ugh, this could actually get really messy: a rearrangement_id is supposed to be universally unique, but we haven't (AFAIK) allowed for a single Rearrangement object to be linked to multiple Repertoires (ie Rearrangement includes only a single repertoire_id field). If so, then a Cell that is linked to multiple Repertoires might also require rearrangement_ids within each one?? I guess the uniqueness would at least mean that we don't have to explicitly specify which rearrangement_ids go with which Repertoire...

Overall, I think it would be better to restrict each Cell to a single Repertoire, even if that might mean a proliferation of duplicate Cell objects --it's at least consistent with how we expect Rearrangements to be handled in a similar case.

scharch commented 4 years ago

@bcorrie : Yes. This makes sense. A naive question: cell_id is not a required field in the Rearrangement object; shouldn't it be a required field if it refers to a single-cell study?

@srlak in principle, sure, but I think we want to avoid "contingently-required" fields...

scharch commented 4 years ago

@srlak raw_data and (especially) expression seem like they need to be arrays.
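
A sketch of what this suggestion might look like in a record, with `expression` holding one entry per marker. The key names `expression_marker` and `expression_value` and the second marker's value are assumptions for illustration; the final layout was still under discussion at this point.

```python
from typing import Optional

# Hypothetical Cell record in which `expression` is a list of entries,
# one per marker, rather than a single object.
cell = {
    "cell_id": "W06_046_091",
    "rearrangements": ["id1", "id2"],
    "virtual": False,
    "expression": [
        {"expression_marker": "CD27", "expression_value": 14567},
        {"expression_marker": "IgG", "expression_value": 230},  # made-up value
    ],
}


def marker_value(record: dict, marker: str) -> Optional[int]:
    """Return the stored value for one marker, or None if it was not measured."""
    for entry in record.get("expression") or []:
        if entry["expression_marker"] == marker:
            return entry["expression_value"]
    return None
```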

sharikrish commented 4 years ago

Overall, I think it would be better to restrict each Cell to a single Repertoire, even if that might mean a proliferation of duplicate Cell objects --it's at least consistent with how we expect Rearrangements to be handled in a similar case.

Let's say we want to evaluate CD27 and IgG expression using flow cytometry; four regions were delineated: CD27+ (let's say A), CD27+IgG+ (B), CD27-IgG+ (C) and CD27-IgG- (D). Total RNA is extracted from the sorted populations [A, B and C], followed by sequencing. In order to screen for CD27+ in the sequenced data, populations A and B are pooled; similarly, B and C are pooled to screen for IgG+.

We assume the current Repertoire definitions, where rearrangement_ids are globally unique and each Cell is restricted to a single Repertoire.

In this case, we assume A is associated with two rearrangement_ids [e.g. 101, 102], B with two rearrangement_ids [201, 202] and C with two rearrangement_ids [301, 302]. For simplicity we assign just two rearrangement_ids to each, but there could be more than two. CD27+ will then end up with four rearrangement_ids, say [401, 402, 403, 404], that are the same as [101, 102, 201, 202] but carry new identifiers because rearrangement_ids should be unique. Similarly, IgG+ will be a combination of [201, 202, 301, 302] with different identifiers, say [501, 502, 503, 504]. Is this assumption correct?

Again for simplicity, let's consider rearrangement_ids to be the same as cell_ids. In this study we will have only [101, 102, 201, 202, 301, 302], and 401 and 501 will have 101 as cell_id as they come from that same cell. Similarly, 402 and 502 will have 201 as cell_id, and so on. Is this correct?

However, a potential problem with these definitions arises if a user queries for a particular cell_id, let's say 101, and the rearrangements associated with 101 are 101, 401 and 501, which are triplicates containing exactly the same redundant information. Wouldn't this mean multiple records holding the same information, artificially inflating the amount of information we provide per cell_id?

scharch commented 4 years ago

@srlak I think that's exactly my point, yes, but I want to rework your nomenclature a bit to make sure we are really talking about the same thing:

Population A (CD27+IgG-) contains cells (not cell_ids) A1, A2... and similarly for populations B, C, and D. Cell A1 contains rearrangements (not rearrangement_ids) A1H and A1L. Similarly for cells A2, B1, and so on. Repertoire R1 is defined as CD27+ (populations A+B) and repertoire R2 is defined as IgG+ (so B+C). Thus cells B1, B2... and rearrangements B1H, B1L, B2H, B2L... are included in both repertoires.

Under the current Rearrangement schema, there are therefore otherwise identical Rearrangement objects with rearrangement_ids B1H.R1 and B1H.R2 (or B1L.R1/B1L.R2, B2H.R1/B2H.R2 etc etc). However, under the proposed Cell schema, there is a single Cell object with cell_id=B1.global and repertoire_id=[R1,R2]. What, then, should be the value of rearrangement_id for this Cell? It seems like it would have to be [B1H.R1,B1H.R2,B1L.R1,B1L.R2], which, as you point out is redundant and potentially confusing. I also think there is probably potential for inefficiencies in data retrieval, since you can't know ahead of time which rearrangement_ids are in which Repertoire (at least without further complicating the Cell data structure).

The most elegant solution, of course, would be to have a single Rearrangement object for B1H (etc) with rearrangement_id=B1H.global and repertoire_id=[R1,R2]. I worry, though, that that would irreparably break the TSV format. (Although maybe not, if repertoire_id is mostly considered an ADC API field --@javh?) I could also see it causing problems with the value of repertoire_id needing to be updated for potentially millions of records if I later create a new Repertoire R3 to look at all four populations together (and then maybe R4, R5, and R6 when I decide I need more data and sort another vial from the same donor and time point). Even if the update itself isn't a problem, it would probably be very hard to keep track of/know when an update is in order.

Barring that, then, my proposal is to instead emend the Cell schema so that we have "duplicate" Cell objects in the same way that there are "duplicate" Rearrangements. So you would have

{ "cell_id":"B1.R1", "repertoire_id":"R1", "rearrangement_id":["B1H.R1","B1L.R1"]}
{ "cell_id":"B1.R2", "repertoire_id":"R2", "rearrangement_id":["B1H.R2","B1L.R2"]}

I think that takes care of most of the problems. It is true that if you queried the API for (say) all IgG+ Cells that the response would contain these duplicates, but that's already true if you ask for eg all VH1-69-JH5 Rearrangements --you would get both B1H.R1 and B1H.R2. The sample_processing_ids can be compared as a partial filter, but I'm not sure there's a great answer.
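
The "duplicate" Cell objects proposed above can be generated mechanically. The sketch below reproduces the two example records; the `<name>.<repertoire>` identifier pattern is only the naming used in this comment for illustration, not a proposed rule, and the helper name is made up.

```python
from typing import Dict, List


def per_repertoire_cells(
    cell: str, chains: List[str], repertoires: List[str]
) -> List[Dict[str, object]]:
    """Build one Cell object per repertoire using '<name>.<repertoire>' identifiers."""
    return [
        {
            "cell_id": f"{cell}.{rep}",
            "repertoire_id": rep,
            "rearrangement_id": [f"{chain}.{rep}" for chain in chains],
        }
        for rep in repertoires
    ]


cells = per_repertoire_cells("B1", ["B1H", "B1L"], ["R1", "R2"])
```

Each additional repertoire that includes the cell adds one more Cell object and one more set of Rearrangement records, which is the cost weighed in the rest of the thread.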

schristley commented 4 years ago

@srlak @scharch One thing that I've been trying to do with thinking about these (clone, cell, receptor, etc) schema enhancements is avoiding solutions that require duplicating rearrangements. The rearrangement data is large, and duplicating that data (that is, all annotation values are identical except for a field like cell_id or clone_id) is a waste of time and space. Now in some cases, this duplication is created by the researcher, e.g. one repertoire contains sequencing data A while a second repertoire contains sequencing data A+B, but that's their decision and I hope we can avoid "requiring/forcing" users to create duplication.

I currently don't see why a cell needs to be restricted to a single repertoire, but let me walk through this analysis first to see where it leads. Let's stick with the four biological populations (A, B, C, D). Now here you have the choices:

  1. Do pooling at the biological level, combine tubes
  2. Do pooling at the analysis level, combine repertoires

Sounds like we are mainly talking about 2 (otherwise why sort just to re-combine). That means we sequence A, B, C, D separately. Now we have choices for the repertoires:

  1. Create repertoire for each population, pool by creating a list (set) of repertoires: [A, B], [B, C].
  2. Create repertoire for each population, pool by creating new repertoires which includes multiple populations: A+B, B+C.

Hopefully you can see that with 1, there is no duplication of rearrangements. There are always 4 repertoires for the 4 populations, analysis is done on a single repertoire or a list of repertoires. With 2, there is the possibility of duplicate rearrangements because multiple repertoires reference the same sequence data.

I don't consider either 1 or 2 to be right or wrong, as I think you can come up with valid use cases for both scenarios, but going back to my first paragraph, I prefer 1 over 2 because it represents the same concept (pooling data) but doesn't entail creating extra rearrangement records in order to implement that pooling/grouping.
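
To put rough numbers on the difference between the two pool options, here is a small sketch that counts how many Rearrangement records have to be stored in each case. The population contents and pool names are invented, and the function is illustrative only.

```python
from typing import Dict, List


def stored_records(
    populations: Dict[str, List[str]],
    pools: Dict[str, List[str]],
    copy_into_pools: bool,
) -> int:
    """Count Rearrangement records that must be stored for a set of pools."""
    base = sum(len(ids) for ids in populations.values())
    if not copy_into_pools:
        # Pool option 1: a pool is just a list of existing repertoires.
        return base
    # Pool option 2: every pooled repertoire holds its own copy of the records.
    copies = sum(len(populations[p]) for members in pools.values() for p in members)
    return base + copies


# Made-up data: two rearrangements per sorted population.
populations = {"A": ["A1H", "A1L"], "B": ["B1H", "B1L"], "C": ["C1H", "C1L"]}
pools = {"CD27+": ["A", "B"], "IgG+": ["B", "C"]}

option_1 = stored_records(populations, pools, copy_into_pools=False)
option_2 = stored_records(populations, pools, copy_into_pools=True)
```

In this toy case option 1 stores 6 records and option 2 stores 14, because population B is copied into both pooled repertoires in addition to its own.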

Now, let's consider what we actually require for cells. In the biology, a cell is a single entity and should in theory be represented with a single inferred cell (cell_id) though lots of things (experimental and/or computational) may break that 1-to-1 relationship. An open question is how to handle cells that are found in multiple repertoires, e.g. repertoires A, [A, B] or A+B.

  3. Can cells found in multiple repertoires have the same cell_id?
  4. Can cells found in multiple repertoires have different cell_id?

This question is irrespective of how we pool repertoires. Option 4 is an easy yes, this is where the "cell inference algorithm" treats the repertoires as independent, any correspondence in cell_id would just be coincidence. Option 3 is a harder yes, as it requires the "cell inference algorithm" to know about those multiple repertoires (and possibly process them together) to ensure the same cell_id is used across them for the same cell.

Note that option 3 is pretty similar to clone argument I made, i.e. the desire to track a clone across multiple repertoires, which is facilitated by having the same clone_id. Presumably there might be a similar desire to track a cell across multiple repertoires.

These options imply different things for the Cell schema. Option 4 is easier, because each cell_id is unique, the Cell schema only needs to store information for a single data processing run of the "cell inference algorithm". I think the original Cell schema above satisfies this.

Option 3 is harder because the single Cell schema with that single cell_id must store information about (possibly) multiple data processing runs of the "cell inference algorithm". Here it is important to know which repertoires were processed together. Let's say a cell is in repertoire A, and it would also be in pooled repertoire A+B and [A, B]. If the Cell object just listed a set of repertoires, that wouldn't inherently give processing information about which rearrangements were used. I think we need to determine if Option 3 is important enough that we design the schema to handle it?

Finally, back to the pooling/grouping options. For option 4, if pool option 1 is used, then there is no duplication of rearrangements, so multiple cells may point to the same rearrangement. But of course, the rearrangement's cell_id cannot point back. If pool option 2 is used, then each repertoire has its own set of rearrangements, and thus multiple cells point to different rearrangements. In this case, the rearrangement's cell_id can point back but only because we created all of the extra rearrangement records.

Okay, after walking through this, I feel that users can actually use either pool option 1 or pool option 2 with the schema. I still prefer that we encourage users to use option 1, and I don't see any reason why we need to restrict a cell to a single repertoire.

scharch commented 4 years ago

@schristley In your example, I was proposing to answer option 3 as "no" under the assumption that pool option 2 would be the default, making implementation of 3 hard. I can see the merits of encouraging pool option 1, but I only just realized that it's already in the documentation ...I wonder how many people read carefully past "Repertoire 1-to-n with Sample" at the beginning of the line.

Is there a case under pooling option 1 in which a Sample would be included in multiple Repertoires (as opposed to a list of Repertoires)? If not, then the answer to 3 is still no, since a Cell must by definition be associated with a single Sample. (As opposed to a Receptor, which could be associated with multiple Cells both within and across Repertoires...)

schristley commented 4 years ago

I can see the merits of encouraging pool option 1, but I only just realized that it's already in the documentation. I wonder how many people read carefully past "Repertoire 1-to-n with Sample" at the beginning of the line.

@scharch Do you mean pool option 2? I consider that to be more like pool option 2. We don't have anything in the AIRR Data Model currently that explicitly supports pool option 1, though the topic was discussed a little with the clones schema. I consider pool option 1 to be like defining repertoire groups (control, treatment, etc) for intra- and inter-group comparisons.

But I think I understand your point, people not reading carefully may assume that array is the way to put "samples" together for comparison purposes. Even worse, a tool might write code that solidifies that assumption, causing endless confusion. I think if we had a standard way to define repertoire groups (i.e. pool option 1) that might eliminate that confusion.

And even so, when reading it, I realize that it is still imprecise. Using the word Sample isn't exactly correct, it should probably be SampleProcessing to signify the whole sequence of steps from Sample to SequencingRun.

Is there a case under pooling option 1 in which a Sample would be included in multiple Repertoires (as opposed to a list of Repertoires)?

There is but I think it would only be a case where the researcher explicitly does that, versus where the AIRR Data Model requires that. The simplest use case I can imagine is when you have replicates, say you pull 3 aliquots from the same tube and sequence each separately. You might have 3 repertoires for those 3 replicates so you can analyze them separately and compare, but then you might put those 3 replicates together to get a more complete "repertoire" for other analysis. The combined repertoire will duplicate rearrangements, and a cell that resides in a sample would appear in two repertoires.

javh commented 4 years ago

I'm a bit behind on this, but a couple earlier things...

(1) I didn't notice the raw_data and expression bits earlier. What's the intent here? If this is just meant to store a small number of genes or surface markers, then I think an array (as @scharch mentioned) in this object is okay.

However, if the intent is to store a large number of features, then I don't think this will work because you'll need to use a gene x cell sparse matrix to store and analyze that kind of data. That's deeply ingrained in scRNA-seq analysis tools. So maybe just links to the appropriate matrices that are keyed on cell_id.
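
For readers unfamiliar with the structure, here is a toy sketch of a gene x cell sparse matrix keyed on cell_id, using only nested dictionaries. Real scRNA-seq tools use dedicated sparse-matrix formats; the counts and function names below are invented purely to show the idea of storing non-zero entries and slicing out one cell.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def build_sparse_matrix(
    triples: Iterable[Tuple[str, str, int]]
) -> Dict[str, Dict[str, int]]:
    """Store only non-zero (gene, cell_id, count) entries as gene -> cell_id -> count."""
    matrix = defaultdict(dict)
    for gene, cell_id, count in triples:
        if count:
            matrix[gene][cell_id] = count
    return {gene: dict(cols) for gene, cols in matrix.items()}


def expression_for_cell(matrix: Dict[str, Dict[str, int]], cell_id: str) -> Dict[str, int]:
    """Slice one column (one cell) out of the gene x cell matrix."""
    return {gene: cols[cell_id] for gene, cols in matrix.items() if cell_id in cols}


# Made-up counts; the zero entry is dropped rather than stored.
matrix = build_sparse_matrix(
    [("CD27", "1086", 12), ("CD19", "1086", 0), ("CD27", "1087", 3)]
)
```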

(2) A rearrangement is supposed to be an observation. Whether or not you have observations that qualify as duplicate, by whatever criteria, is something I think we should avoid trying to design around. You can algorithmically collapse duplicate entries then assign those collapsed observations a new rearrangement_id to reduce storage requirements. Trying to map each Rearrangement record to multiple Repertoires seems like a rabbit hole.

scharch commented 4 years ago

What's the intent here? If this is just meant to store a small number of genes or surface markers, then I think an array (as @scharch mentioned) in this object is okay.

However, if the intent is to store a large number of features, then I don't think this will work

Hmm I was thinking like you sorted on a panel of 4-12 markers and you want to store the MFI for each probe for each cell. But yeah, for something like transcriptome data it should link out to a DOI. Not sure exactly where to draw the line...

bcorrie commented 4 years ago

(1) I didn't notice the raw_data and expression bits earlier. What's the intent here? If this is just meant to store a small number of genes or surface markers, then I think an array (as @scharch mentioned) in this object is okay.

However, if the intent is to store a large number of features, then I don't think this will work because you'll need to use a gene x cell sparse matrix to store and analyze that kind of data. That's deeply ingrained in scRNA-seq analysis tools. So maybe just links to the appropriate matrices that are keyed on cell_id.

I think this is an important question... Perhaps we should create a separate issue for this discussion. I think we will be able to get some input from 10X on this in a couple of weeks (they are busy at the moment). The sparse matrix approach seems to me to be required at least as part of the solution (if we want to support platforms like 10X), probably referring to external data through a DOI (too large to store as part of the actual AIRR-seq data)?

javh commented 4 years ago

Yeah, it's probably a separate topic. The cartoon here is a good representation of the data structure typically used for analysis of RNA-seq, scRNA-seq, etc: https://www.bioconductor.org/help/course-materials/2019/BSS2019/04_Practical_CoreApproachesInBioconductor.html

Where "Samples" are cells. So repertoire data would be some sort of column data (sample/cell annotations). Not necessarily obeying 1-to-1 rules, because that can be worked around with the implementation as long as there is a mechanism to reduce the data into 1-to-1 Cell:Receptor relationships or slice 1-to-many out somehow.

sharikrish commented 4 years ago

However, if the intent is to store a large number of features, then I don't think this will work because you'll need to use a gene x cell sparse matrix to store and analyze that kind of data. That's deeply ingrained in scRNA-seq analysis tools. So maybe just links to the appropriate matrices that are keyed on cell_id.

Indeed. Storing it via sparse matrix seems to be a good approach and as @bcorrie mentioned the idea was to link to a DOI for external data.

schristley commented 4 years ago

Hey @bussec, we discussed this a little bit in the recent CRWG call, but I wanted to follow up with some more detail.

One of the reasons for the design of Repertoire as a composite object was so that an ADC API query could return all of that relevant data in a single request. CRWG had initially thought about having endpoints like /study, /subject, /sample and so on, which you can see would mean the user would have to do many separate queries to get the same data. Plus it introduces problems with identifiers because subject_id, sample_id, and so on, are not guaranteed to be unique...

We discussed that with these new objects (Cell, Receptor, Clone) that we can/should go with a normal-form design. The implication is that multiple query requests will need to be performed to get "all" the data. For example, let's take the query example above. As the cell object only holds the rearrangement identifiers, you would need to do additional URL requests to get the rearrangement data:

# individual requests
curl https://host/airr/v1/rearrangement/507
curl https://host/airr/v1/rearrangement/678

# or a combined request
curl --data '{"filters":{"op":"in","content":{"field":"rearrangement_id","value":["507","678"]}}}' https://host/airr/v1/rearrangement
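
The combined request body can also be built from the identifiers returned by the cell query. This is a minimal sketch that only constructs the JSON shown in the curl example; the function name is an assumption and no request is actually sent.

```python
import json
from typing import List


def rearrangement_filter(rearrangement_ids: List[str]) -> str:
    """Build the JSON body for the combined request shown above."""
    body = {
        "filters": {
            "op": "in",
            "content": {"field": "rearrangement_id", "value": rearrangement_ids},
        }
    }
    # Compact separators reproduce the body used in the curl example.
    return json.dumps(body, separators=(",", ":"))


payload = rearrangement_filter(["507", "678"])
```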

Now I'm fine with this, and maybe you already realized this, so we are all good. The same applies for repertoire_id, receptor_id, etc.

When writing this I noticed something, how do we know where (i.e. the host) the rearrangements are stored so that we can do those additional requests?

bcorrie commented 4 years ago

When writing this I noticed something, how do we know where (i.e. the host) the rearrangements are stored so that we can do those additional requests?

Presumably if one does

curl https://host/airr/v1/cell

and gets back something like:

{ "Cells": [ { "cell_id": "1086", "rearrangements": ["507","678"], "virtual": false } ]}

One would go to the same host...

curl --data '{"filters":{"op":"in","content":{"field":"rearrangement_id","value":["507","678"]}}}' https://host/airr/v1/rearrangement

schristley commented 4 years ago

One would go to the same host...

I guess my question wasn't clear. What if the rearrangements are stored in another data repository? Does this mean that I'm not allowed to download a study from IPA, do some new cell analysis, then load that cell analysis into VDJServer? Or if I do, I have to duplicate the whole study into VDJServer so that links can be followed?

I was aware of this issue before and didn't give it much thought but I think it might be an issue in iReceptor+. If a user queries some data, which comes from multiple data repositories, then does some analysis, some downstream process (visualization, load/publish of results, etc) may need to follow identifiers. I can see how information about the original data repository might get lost.

sharikrish commented 4 years ago

We would like to discuss the relations between the different entities for the cell schema.

relationship

Cell is n-to-1 with Repertoire: A direct/observed repertoire holds multiple cells and a cell can be only found in one direct/observed repertoire

Cell is n-to-n with Rearrangement: A rearrangement may be observed in multiple cells and a cell can contain multiple rearrangements

Cell is n-to-1 with Clone: A clone can be represented by multiple cells and a cell represents a single clone

Cell is n-to-n with Receptor: A cell would contain multiple receptors and a receptor can be present in multiple cells

These are the best entity relationships we could define; let us know if you see any discrepancies.

Link for complete documentation on cell object specification : https://docs.google.com/document/d/1vOOxk2-gvw8fKMs9MqSJ_M6a5_RBxz57I1Q3CvF9jwI/edit?usp=sharing

bcorrie commented 4 years ago

Cell is n-to-n with Rearrangement: A rearrangement may be observed in multiple cells and a cell can contain multiple rearrangements

I don't think this is correct, is it? Remember that a rearrangement is an observed sequence from a specific experiment that is annotated in a certain way. Therefore, is a rearrangement not 1:1 with a Cell? That is, a specific observation of a sequence must come from a specific cell, and each rearrangement record has a single cell_id that identifies the cell from which it was observed???

Am I confused???

bcorrie commented 4 years ago

Sorry, I think that should be N to 1... That is a Cell may contain more than one rearrangement, but a rearrangement can only be observed from one Cell.

scharch commented 4 years ago

@bcorrie we have all sorts of use cases where raw reads are grouped and manipulated together in various ways (eg UMI consensus generation). It would be perfectly reasonable/valid to also collapse identical Rearrangements from multiple cells into a single line/object, resulting in the N-to_n relationship. See also #340:

  • Tools should not assume that sequence_id contains a value that references a sequence in the raw sequencing files.

sharikrish commented 4 years ago

Sorry, I think that should be N to 1... That is a Cell may contain more than one rearrangement, but a rearrangement can only be observed from one Cell.

I agree with this point: at an abstract level we first observe that a cell has multiple rearrangements, and each rearrangement is associated with a single cell. However, as @scharch mentioned, when we group the reads we do observe identical rearrangements from multiple cells, and that is why it is N:N.

bcorrie commented 4 years ago

@bcorrie we have all sorts of use cases where raw reads are grouped and manipulated together in various ways (eg UMI consensus generation). It would be perfectly reasonable/valid to also collapse identical Rearrangements from multiple cells into a single line/object, resulting in the N-to_n relationship. See also #340:

I may be confused, but...

I would question whether that grouping is still a Rearrangement or not... That sounds like a different type of object that is grouping rearrangements logically across cells based on some "identity" criterion into something that has a complex N-N relationship. Would this group be from within the same Repertoire or could the grouping span Repertoires? And what would a typical "identity" criterion be (how do you determine if two Rearrangements are identical)?

As you say, we already represent "collapsed" rearrangements (e.g. merge reads by removing duplicates, building UMI consensus sequences, or aggregating clonotypes as per https://github.com/airr-community/airr-standards/issues/246#issuecomment-592728685) but this seems different. When you collapse to get a consensus sequence, you essentially throw away the lower level rearrangement information and replace it with a consensus rearrangement that represents N more "fundamental" rearrangements. You observed 1 thing N times, so you represent it as 1 thing with a count of N. This grouping makes sense because of the nature of how the sequencing is performed.

The grouping we are talking about here seems more biological to me (I know, it is scary when I start trying to talk biology). My understanding (which may be totally wrong) is that in this context you want to have a grouping of Rearrangements where a given Rearrangement (an annotation of a sequence) from a specific Repertoire is grouped with other Rearrangements from other Repertoires (or maybe just the same Repertoire???) based on some "identity" criteria. From my understanding, this isn't really a single Rearrangement that comes from N cells, but rather a conceptual object (a Chain?) that captures the essence of the rearrangement (the identity criteria). You then want to be able to say that you saw the equivalent of that object (Chain C) in N cells and that they are all the same (based on an identity criterion). It seems like this is more similar to the Receptor concept to me without the paired nature of the Receptor, where perhaps a Receptor is composed of two Chains???

I think the bottom line for me is that it seems like there should be some value in saying that this Rearrangement came from cell A and that Rearrangement came from that cell B, and they are considered "identical" and here is that thing that defines their identity (the Chain)...

bussec commented 4 years ago

@bcorrie Two notes from the single-cell side, that might resolve some of your concerns:

  1. Cell will always partition a set of Rearrangements, as we assume that a. each experimentally observed nucleic acid MUST be derived from a single cell b. that we have experimental means (e.g. barcodes) to unambiguously perform this assignment. Therefore cell:rearrangement should be 1:N.
  2. However, we are assuming a strict and non-overlapping definition of Repertoire for the schema to work. We discussed e.g. about overlapping cell populations above, and for the current schema, overlapping populations (i.e. Repertoires) MUST NOT exist (we likely will have to introduce a new object to perform grouping). Otherwise either pretty much everything becomes N:N or we would have to start copying large amounts of cell and rearrangement objects, which I consider nonsensical as single-cell is a lot about preserving these identities.
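The partition assumption in point 1 can be checked mechanically. The following is only a sketch, assuming Cell records shaped like the proposed schema; the function name `shared_rearrangements` and the sample identifiers are made up for illustration.

```python
from collections import Counter

def shared_rearrangements(cells):
    """Return rearrangement IDs referenced by more than one Cell.

    An empty result means the Cells partition the Rearrangements (1:N holds).
    """
    counts = Counter(
        rid
        for cell in cells
        # set() ignores repeats inside one Cell; 'or []' handles a null array
        for rid in set(cell.get("rearrangements") or [])
    )
    return sorted(rid for rid, n in counts.items() if n > 1)

# Illustrative data: rearrangement "678" is referenced by two cells.
cells = [
    {"cell_id": "c1", "rearrangements": ["507", "678"]},
    {"cell_id": "c2", "rearrangements": ["901"]},
    {"cell_id": "c3", "rearrangements": ["678"]},
]
print(shared_rearrangements(cells))  # ['678']
```

A repository that wants to enforce 1:N could reject a load when this list is non-empty; one that allows N:N would simply not run the check.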

javh commented 4 years ago

I'm experiencing much ambivalence. I'm inclined to agree with @bcorrie. I think @bussec's comment might address the concern, but... It looks like there are two competing use cases defined in the schema:

  1. Genes are features. Rearrangement/Receptors are annotations on Cells. Ie, you have a matrix that is Genes x Cells/Samples and a data frame that is keyed by Cells/Samples with N Rearrangement/Receptor annotations per cell.
  2. Rearrangement/Receptors are features. Here you have a matrix that is Rearrangement/Receptor x Cells/Samples.

In the first case, Cells is 1:N with Rearrangement/Receptor and the individual Rearrangements are observations within cells.

In the second case, Rearrangement/Receptor is 1:N with Cells/Samples/Repertoires, because the question you are trying to answer is how often do I see this feature (unique Receptor sequence/pair) in multiple cells/samples. Super common and very important, but not what I thought the purpose of the schema was.

Is that correct? Are we trying to capture both these cases in the same schema?

scharch commented 4 years ago

@javh: only scenario 1. But that still admits for N-to-N mapping (sort of). The idea is that the rows of the Rearrangement/Receptor data frame that you describe could be duplicated between N cells. However, it's not a full mapping, because we really only expect 1 cell_id per Rearrangement/Receptor in those TSVs. So I think in practice, it would still typically be N-to-1 as @bcorrie wants, the only question is whether to allow N-to-N if the user decides that the rearrangement-to-cell mapping isn't important ...which I think means I've talked myself into agreeing with @bcorrie after all, though for a somewhat different reason.

sharikrish commented 4 years ago

However, it's not a full mapping, because we really only expect 1 cell_id per Rearrangement/Receptor in those TSVs. So I think in practice, it would still typically be N-to-1 as @bcorrie wants, the only question is whether to allow N-to-N if the user decides that the rearrangement-to-cell mapping isn't important ..

I agree and as everyone pointed out we always expect 1 cell_id per Rearrangement but on grouping we do observe N:N (unique rearrangements in multiple cells). So, as @scharch pointed out, for us it would be important to know if we need to allow the N:N Cell:Rearrangement relationship in the Schema?

schristley commented 4 years ago

it would be important to know if we need to allow the N:N Cell:Rearrangement relationship in the Schema?

A simple (though maybe uncommon) scenario where multiple cells can reference the same rearrangement is running the "cell inference algorithm" multiple times (e.g. with different parameters) on the same repertoire.

The proposed Cell schema does support N:N and I think we should keep it. The singular cell_id in rearrangements doesn't support N:N but is fine for the more common 1:N scenario. Tools will just need to be aware of that and use the rearrangements array in Cell when necessary.
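As a rough sketch of what "being aware of that" could look like in a tool: prefer the `rearrangements` arrays on the Cell objects and fall back to the singular `cell_id` on the Rearrangement only when no Cell references it. The function name `cells_for` and the run-prefixed cell identifiers are hypothetical.

```python
def cells_for(rearrangement, cells):
    """Return the cell_ids linked to a rearrangement.

    Prefers Cell.rearrangements (supports N:N); falls back to the singular
    cell_id stored on the Rearrangement record (1:N only).
    """
    rid = rearrangement["rearrangement_id"]
    linked = [c["cell_id"] for c in cells if rid in (c.get("rearrangements") or [])]
    if linked:
        return linked
    cell_id = rearrangement.get("cell_id")  # nullable on the Rearrangement side
    return [cell_id] if cell_id is not None else []

# Two runs of a cell inference step over the same repertoire (illustrative IDs).
cells = [
    {"cell_id": "runA_1", "rearrangements": ["507", "678"]},
    {"cell_id": "runB_1", "rearrangements": ["507"]},
]
r507 = {"rearrangement_id": "507", "cell_id": "runA_1"}
r999 = {"rearrangement_id": "999", "cell_id": "runA_2"}

print(cells_for(r507, cells))  # both runs reference 507
print(cells_for(r999, cells))  # only the singular cell_id is available
```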

wyattmcdonnell commented 4 years ago

CCing myself into this to get some 10x involvement and get caught up to speed @wyattmcdonnell

bcorrie commented 4 years ago

@wyattmcdonnell good luck, it is a long conversation 8-)

Documentation for the proposed extension, with an early idea on the contents of the Cell object, is on the single_cell_ext branch here:

https://github.com/airr-community/airr-standards/blob/single_cell_ext/docs/miairr/miairr_single_cell_extension.rst

schristley commented 4 years ago

@sharikrish As there has been quite a bit of discussion so far, it would probably be a good time to put the Cell schema into airr-schema.yaml on a branch that incorporates all the comments, then people can see a more fully realized schema.

bussec commented 4 years ago

@schristley Regarding your post above: Yes, this is also the way we think about it. Is this (as a general approach of the ADC API) documented somewhere?

Concerning the question of the host, we assumed that this would be the same FQDN and any type of redirection would be transparent to the user. But for a larger federation of repos like iR+, it would be worthwhile to think about mechanisms for this. But this is probably a generic issue and not only related to single-cell.

schristley commented 4 years ago

Is this (as a general approach of the ADC API) documented somewhere?

Not really, except maybe somewhat in the CRWG minutes where it was decided early on to have only two API entrypoints. What drove a lot of that decision was ease-of-use for the end user. When we got to talking about multiple entrypoints and having a "clean, conceptual" interface for each object (study, sample, etc), a non-technical person would invariably ask, "so I can download everything in one file right? I don't have to download a bunch of files and figure out how they all go together?" Furthermore, when we collected a bunch of use case queries, it became clear that many combined fields from different objects (e.g. TCR [pcr_target object] in humans [subject object]).

From there it became a discussion about how to structure all the MiAIRR metadata to support that simple API, balanced against a schema that would be easy to use by analysis tools, and by data entry screens. For me, the composite design pattern seemed the most appropriate, as it allowed for all the objects in MiAIRR to be treated uniformly (for query and access). One could argue the ADC API design follows the facade design pattern, which hides the complexity of the MiAIRR metadata and its relationships behind a simple interface.

I don't know if I want to take a general AIRR stance, that it should be this way or that way (whatever "this" and "that" are) for all AIRR schema. I'm fine with going on a case-by-case basis. I do think it's important to take 3 stakeholders into account: the users doing queries, the analysis tools, and data entry, when thinking about how the data will be used. When it comes to these new objects (cell, receptor, etc.), my intuition tells me that packing more and more into the repertoire object will cause us problems later, so better for these to be loosely coupled. But when it comes to how each of those objects is organized, I feel there are trade-offs to be considered.

Imagine this, you can make a valid argument about what a user wants when querying for cells, that 1) a user rarely queries for a single cell object, but almost always wants a set of cells, and 2) they will invariably always need both the repertoire metadata and the rearrangement annotations. Are 1) and 2) true? If yes, then maybe the normal-form is a poor design for that common query behavior. It implies that the user will need to perform many (thousands?) additional URL requests to gather all of the information they need. Analysis tools may not care, because they likely assume there are just two files, one with cells and one with rearrangements, and linking the two is no problem.
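For the analysis-tool side of that argument, linking the two files is a plain join on the identifier. This is a sketch only, using in-memory lists in place of the two files; `attach_rearrangements` is a made-up name and the `v_call` values are illustrative.

```python
def attach_rearrangements(cells, rearrangements):
    """Replace each Cell's list of rearrangement IDs with the full records.

    IDs that have no matching record are dropped rather than raising.
    """
    by_id = {r["rearrangement_id"]: r for r in rearrangements}
    joined = []
    for cell in cells:
        ids = cell.get("rearrangements") or []
        records = [by_id[i] for i in ids if i in by_id]
        joined.append({**cell, "rearrangements": records})
    return joined

# Stand-ins for "one file with cells and one with rearrangements".
cells = [{"cell_id": "1086", "rearrangements": ["507", "678"], "virtual": False}]
rearrangements = [
    {"rearrangement_id": "507", "v_call": "IGHV1-69*01"},
    {"rearrangement_id": "678", "v_call": "IGKV3-20*01"},
]

for cell in attach_rearrangements(cells, rearrangements):
    print(cell["cell_id"], [r["v_call"] for r in cell["rearrangements"]])
```

Against an API, the same join costs one extra request per batch of identifiers, which is the overhead being weighed here.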

Here's another example. Would we expect that users would like to answer queries such as "give me all the cells for a specific V gene?" A query like this on a /cell entrypoint might be:

{
  "filters": {
    "op":"=",
    "content": {
      "field":"rearrangements.v_call",
      "value":"IGHV6-foo*bar"
    }
  }
}

which isn't immediately supported by the proposed structure. Or does it mean that Cell, like Clone, should have some of those rearrangement fields promoted up to the Cell object?
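Without promoted fields, the query has to be resolved in two steps: find the matching rearrangements, then find the cells that reference them. A sketch of that workaround on in-memory records follows; `cells_with_v_call` is a hypothetical name and the gene name is the placeholder from the query above.

```python
def cells_with_v_call(cells, rearrangements, v_call):
    """Two-step lookup: rearrangements matching v_call, then the cells that reference them."""
    hits = {r["rearrangement_id"] for r in rearrangements if r.get("v_call") == v_call}
    return [
        c["cell_id"]
        for c in cells
        if hits.intersection(c.get("rearrangements") or [])
    ]

# Illustrative records; "IGHV6-foo*bar" is the thread's placeholder value.
rearrangements = [
    {"rearrangement_id": "507", "v_call": "IGHV6-foo*bar"},
    {"rearrangement_id": "678", "v_call": "IGKV-other"},
    {"rearrangement_id": "901", "v_call": "IGHV6-foo*bar"},
]
cells = [
    {"cell_id": "c1", "rearrangements": ["507", "678"]},
    {"cell_id": "c2", "rearrangements": ["678"]},
    {"cell_id": "c3", "rearrangements": ["901"]},
]
print(cells_with_v_call(cells, rearrangements, "IGHV6-foo*bar"))  # ['c1', 'c3']
```

Promoting `v_call` onto Cell would collapse this into a single filter, at the cost of duplicating rearrangement data on every Cell.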

So in a long-winded way (can you tell I'm in quarantine ;-D), I don't think the design is set, we will probably want more discussions about the trade-offs. It might be very useful for CRWG to gather query use cases for these new objects.

sharikrish commented 4 years ago

@sharikrish As there has been quite a bit of discussion so far, it would probably be a good time to put the Cell schema into airr-schema.yaml on a branch that incorporates all the comments, then people can see a more fully realized schema.

@schristley : Proposed schema is available now for review in #358

scharch commented 4 years ago

Imagine this, you can make a valid argument about what a user wants when querying for cells, that 1) a user rarely queries for a single cell object, but almost always wants a set of cells, and 2) they will invariably always need both the repertoire metadata and the rearrangement annotations. Are 1) and 2) true?

Yes

Here's another example. Would we expect that users would like to answer queries such as "give me all the cells for a specific V gene?"

Definitely!

bussec commented 4 years ago

Had another look at the definitions in the schema, I think there are still a couple of details that we need to hash out before we can close this:

bcorrie commented 4 years ago

  • Can we refer to a cell_id definition, instead of copying it (like for SampleProcessing)?

Do you mean this definition: https://github.com/airr-community/airr-standards/blob/b0202934e816e181d624163aafe97bcbe28f2d5b/specs/airr-schema.yaml#L2551

And by copy do you mean having the same YAML description?

When sample_processing_id is defined for both Repertoire and Rearrangement the YAML is duplicated, so not sure what you mean here???

bcorrie commented 4 years ago

  • Can we tolerate a nullable status for cell_id?

Again, is this cell_id in the Cell object? I would think that a Cell isn't a Cell without a cell_id (nullable:false), but not 100% sure about that. It also seems logical that cell_id in Rearrangement is nullable:true. I don't think they need to be the same do they?
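If the two definitions do end up differing in that way, the checks would look roughly like this. It is a sketch under that assumption only (Cell.cell_id non-nullable, Rearrangement.cell_id nullable); nothing here has been decided, and both function names are invented.

```python
def validate_cell(cell):
    """Assumed rule: a Cell must carry a non-null string cell_id."""
    if not isinstance(cell.get("cell_id"), str):
        raise ValueError("Cell.cell_id must be a non-null string")

def validate_rearrangement(rearrangement):
    """Assumed rule: cell_id on a Rearrangement may be null, else must be a string."""
    cell_id = rearrangement.get("cell_id")
    if cell_id is not None and not isinstance(cell_id, str):
        raise ValueError("Rearrangement.cell_id must be a string or null")

validate_cell({"cell_id": "1086"})                                   # passes
validate_rearrangement({"rearrangement_id": "507", "cell_id": None})  # passes

try:
    validate_cell({"cell_id": None})
except ValueError as err:
    print(err)
```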