airr-community / airr-standards

AIRR Community Data Standards
https://docs.airr-community.org
Creative Commons Attribution 4.0 International

How do I capture known antigen/epitope reactivity to AIRR objects (Rearrangement/Cell) #781

Closed bcorrie closed 3 weeks ago

bcorrie commented 8 months ago

Moving discussion of linking AIRR objects of known epitope/antigen reactivity/specificity to this issue and out of #776

javh commented 7 months ago

Discussion starts here: https://github.com/airr-community/airr-standards/pull/776#issuecomment-2027483756

javh commented 7 months ago

From the call:

Reactivity:

Rearrangement:

bcorrie commented 7 months ago

PR with initial attempt created in #784

Need to look at reactivity_method and reactivity_readout controlled vocabulary terms.

For reactivity_method I added two: annotated, for a reactivity record that is annotated from an external source (e.g. IEDB), and inferred, for one that is inferred using a computational method (e.g. tcrmatch).

For reactivity_readout I could really only think of adding a confidence readout that for inference would capture the confidence of the inference and for annotated would suggest the "level of annotation" somehow.

bcorrie commented 7 months ago

This might be silly, but...

By "level of annotation" I mean capturing how complete the match is between the entity in the ADC and the entity in the external repository (e.g. IEDB). It is kind of a quality score that describes how exact the match is between the two repositories.

In IEDB we can have receptors with the full V domain for both chains (a Receptor in the ADC definition) all the way down to, I believe, a single CDR3 from one chain (I would need to confirm that).

In the ADC we can have a similar range of completeness for Cells, from both chains fully annotated down to a Cell with a single rearrangement in which junction_aa, v_call, d_call, and j_call are not necessarily all populated.

When annotating a Cell/Rearrangement with a Reactivity record that maps to an external resource, how important is it that we capture that completeness? It can be computed by checking the external record (IEDB) against the Cell/Rearrangements, but when we assign the Reactivity we already know the level of completeness, so should we not annotate it?
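
One way such a completeness value could be computed is sketched below. This is only an illustration under stated assumptions: the function name `match_completeness` and the fraction-based score are hypothetical and not part of the AIRR schema, and the example records are placeholders that reuse the junction and gene calls from the example Rearrangement elsewhere in this thread.

```python
# Hypothetical scoring; neither the function nor the score is part of the AIRR schema.
KEY_FIELDS = ("junction_aa", "v_call", "d_call", "j_call")


def match_completeness(adc_record, external_record, fields=KEY_FIELDS):
    """Fraction of `fields` populated in both records with identical values."""
    agreeing = sum(
        1
        for f in fields
        if adc_record.get(f) and adc_record.get(f) == external_record.get(f)
    )
    return agreeing / len(fields)


# Placeholder records for illustration only.
full = {
    "junction_aa": "CASSLQSSYNSPLHF",
    "v_call": "TRBV11-2*01",
    "d_call": None,  # not populated, so it never counts as agreeing
    "j_call": "TRBJ1-6*02",
}
cdr3_only = {"junction_aa": "CASSLQSSYNSPLHF"}

score_full = match_completeness(full, full)       # 3 of 4 fields agree
score_cdr3 = match_completeness(full, cdr3_only)  # 1 of 4 fields agree
```

A repository could store a value like this at annotation time instead of recomputing it from the external record later.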

bcorrie commented 7 months ago

From the meeting:

Drop the old fields:

Create an example for this for an "observed" rearrangement that was observed in IEDB for discussion.

bcorrie commented 7 months ago

Here is an example of using Reactivity to associate a Cell with observed reactivity in an AIRR-seq study:

https://github.com/airr-community/airr-standards/issues/704#issuecomment-1891063546

bcorrie commented 7 months ago

Example of annotated Rearrangement with Reactivity:

curl -d '{"filters": {"op": "=", "content": { "field": "sequence_id", "value":"636f50e6c90058221e522b92" }},"fields":["v_call", "j_call", "junction_aa"]}' https://t1d-1.ireceptor.org/airr/v1/rearrangement

    "junction_aa": "CASSLQSSYNSPLHF",
    "v_call": "TRBV11-2*01",
    "j_call": "TRBJ1-6*02"

This is the beta chain of a known receptor in IEDB: https://www.iedb.org/receptor/182992 which is specific to this epitope: https://www.iedb.org/epitope/1616345
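
For readers who prefer a script to curl, the query above could be issued along the following lines. This is a sketch using only the Python standard library: the endpoint and filter body are copied from the curl command, while the helper names `build_query` and `fetch_rearrangement` are hypothetical, and the shape of the response is not assumed here.

```python
import json
import urllib.request

# Endpoint taken from the curl command above.
URL = "https://t1d-1.ireceptor.org/airr/v1/rearrangement"


def build_query(sequence_id, fields):
    """Build the same filter body as the curl command."""
    return {
        "filters": {
            "op": "=",
            "content": {"field": "sequence_id", "value": sequence_id},
        },
        "fields": list(fields),
    }


def fetch_rearrangement(sequence_id, fields, url=URL, timeout=30):
    """POST the query and return the decoded JSON response (needs network access)."""
    body = json.dumps(build_query(sequence_id, fields)).encode("utf-8")
    request = urllib.request.Request(url, data=body, method="POST")
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)


# Building the body does not touch the network.
query = build_query("636f50e6c90058221e522b92", ["v_call", "j_call", "junction_aa"])
```

Calling `fetch_rearrangement` with the same arguments would send the request; it is kept separate so the query body can be inspected or tested offline.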

My equivalence criterion for annotation is that the CDR3, V, and J calls need to be an exact match. This Rearrangement meets it.

I want to annotate this so I create a Reactivity object as follows:

reactivity_id: XXXXXXX # ID for this object
cell_id: null # Not associated with a cell
ligand_type: peptide
antigen_type: peptide
antigen_source_species: Homo sapiens
antigen: {Insulin, UniProt:P01308}
peptide_sequence_aa: QKRGIVEQCCTSICS
peptide_sequence_start: 87
peptide_sequence_end: 101
reactivity_method: observed
reactivity_ref: [IEDB_RECEPTOR:182992]

I then set Rearrangement.reactivity_id = XXXXXXX for this Rearrangement.

If I have another Rearrangement that matches, I can also set its Rearrangement.reactivity_id = XXXXXXX.
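
The matching and linking steps described above can be sketched as follows. This is a minimal illustration, not iReceptor's implementation: the helper names `is_exact_match` and `link_reactivity` are hypothetical, records are plain dicts, and the reference chain simply reuses the values returned by the query above, since the thread states that they match the IEDB beta chain. How IEDB itself encodes gene calls is not shown here.

```python
# Hypothetical helpers; the field names come from the examples in this thread.
MATCH_FIELDS = ("junction_aa", "v_call", "j_call")


def is_exact_match(rearrangement, reference_chain, fields=MATCH_FIELDS):
    """True only if every field is populated in both records and identical."""
    return all(
        rearrangement.get(f) is not None
        and rearrangement.get(f) == reference_chain.get(f)
        for f in fields
    )


def link_reactivity(rearrangements, reference_chain, reactivity_id):
    """Set reactivity_id on every matching Rearrangement; return how many were linked."""
    linked = 0
    for rearrangement in rearrangements:
        if is_exact_match(rearrangement, reference_chain):
            rearrangement["reactivity_id"] = reactivity_id
            linked += 1
    return linked


# Beta chain values taken from the query result above.
reference_chain = {
    "junction_aa": "CASSLQSSYNSPLHF",
    "v_call": "TRBV11-2*01",
    "j_call": "TRBJ1-6*02",
}

rearrangements = [
    {"sequence_id": "636f50e6c90058221e522b92", **reference_chain},
    # Made-up non-matching record, for illustration only.
    {"sequence_id": "example-2", "junction_aa": "CASSXXXXF",
     "v_call": "TRBV11-2*01", "j_call": "TRBJ1-6*02"},
]

n_linked = link_reactivity(rearrangements, reference_chain, "XXXXXXX")
```

Only the first record ends up with a `reactivity_id`; the second is left untouched because its junction differs.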

The only problem I see is that there is no way for me to differentiate this Reactivity from one that is actually observed in this study directly.

I would still lean toward giving reactivity_method a third enum value, "annotated", which would be used in this case. Annotated implies observed, but not observed directly in this study. That is, I am annotating a Rearrangement with an observed reactivity, but the observation came from an external source.

bcorrie commented 7 months ago

FYI, in the T1D repository, there are 21 Rearrangements (out of >100 Million Rearrangements) that match on this CDR3/V/J

If I were to fully annotate this repository for that reactivity record, I would have 21 Rearrangements that would have XXXXXXX in their reactivity_id field.

I should also note that there are 347 such Rearrangements in the ADC (out of 5 Billion Rearrangements) that could ultimately be annotated in this manner (if the repositories chose to do this).

If you flip this around, what we have done is taken one Receptor from IEDB and annotated all of the Rearrangements (347 of them) in the ADC that match on the CDR3/V/J exact match criteria.

There are 180297 Receptors from Homo sapiens in IEDB. For the sake of argument, let's say that on average we get a similar match to the above (350 Rearrangements/Receptor). We would then have 63 Million Rearrangements across all of the Repertoires in the ADC annotated with some sort of antigen/epitope specificity. That is ~1% of the 5 Billion Rearrangements in the ADC annotated with some sort of disease specificity. How interesting would that be???
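
The back-of-the-envelope estimate above works out as follows. The inputs are the figures quoted in this comment; the 350 Rearrangements per Receptor average is the assumption made "for the sake of argument", not a measured value.

```python
iedb_human_receptors = 180_297       # Homo sapiens Receptors in IEDB (from the thread)
rearrangements_per_receptor = 350    # assumed average, rounded up from the 347 seen above
adc_total = 5_000_000_000            # approximate Rearrangements in the ADC

estimated = iedb_human_receptors * rearrangements_per_receptor
fraction = estimated / adc_total

print(f"{estimated:,} annotated Rearrangements ({fraction:.2%} of the ADC)")
# prints: 63,103,950 annotated Rearrangements (1.26% of the ADC)
```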

Who knows what the real frequency per IEDB Receptor might be, but it would be interesting to find out!

bcorrie commented 7 months ago

One other quick observation, the alpha chain for the IEDB Receptor is TRAV9-2/CALRTDRGSTLGRLYF/TRAJ18. There are 26 of these in the ADC, so ultimately we would have 347 + 26 Rearrangements in the ADC point to Reactivity record XXXXXXX (IEDB_RECEPTOR:182992) if the repositories in the ADC were to completely annotate the ADC for IEDB Receptor IEDB_RECEPTOR:182992.

In reality each repository would have a different Reactivity record, but they would all point to IEDB_RECEPTOR:182992

bcorrie commented 7 months ago

For completeness, here is an example of a Reactivity object created from an observation in an actual AIRR-seq study. This is detected with a dextramer barcode and the Cell in the ADC does not match any Receptors in IEDB for this Epitope.

reactivity_id: YYYYYYY
cell_id: CCCCCCC
ligand_type: MHC:peptide
antigen_type: peptide
antigen: {"id": "NCBI:YP_009725307.1", "label": "RNA-dependent RNA polymerase (nsp12)"}
antigen_source_species: {"id": "NCBITAXON:2697049", "label": "Severe acute respiratory syndrome coronavirus 2"}
peptide: DTDFVNEFY
peptide_start: 738
peptide_end: 746
mhc_class: MHC-I
mhc_gene_1: {"id": "MRO:0000046", "label": "HLA-A"}
mhc_allele_1: HLA-A*01:01
reactivity_method: observed

This is what we would have stored previously, but with the suggestion that we don't store these things any more, we lose this information.

reactivity_method: dextramer barcoding
reactivity_readout: barcode count
reactivity_value: 23
reactivity_unit: absolute count

This seems important to me, are we sure we want to drop this?

bcorrie commented 7 months ago

@javh @bussec any comment on the above, or should I go ahead and make the changes from the last meeting? I still hesitate to remove reactivity_method, reactivity_readout, reactivity_value, and reactivity_unit because of the use case above...

bcorrie commented 7 months ago

My suggestion is to consider giving reactivity_method the possible string values observed, annotated, and inferred. If observed, one can/should fill out reactivity_method, reactivity_readout, reactivity_value, and reactivity_unit; if annotated or inferred, they can/should be left blank...

We would need to figure out distinct names for the two fields that are both called reactivity_method in this case.

Thoughts?
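
The rule suggested above could be checked along these lines. This is a sketch under stated assumptions: since the two fields currently share the name reactivity_method, the enum is given the placeholder name `evidence_type` here, which is hypothetical and not a proposed field name, and `check_reactivity` is likewise an illustrative helper.

```python
# Assay fields as named in this thread; "evidence_type" is a hypothetical
# placeholder for the observed/annotated/inferred enum.
ASSAY_FIELDS = (
    "reactivity_method",
    "reactivity_readout",
    "reactivity_value",
    "reactivity_unit",
)
EVIDENCE_TYPES = {"observed", "annotated", "inferred"}


def check_reactivity(record):
    """Return a list of problems; an empty list means the record follows the rule."""
    evidence = record.get("evidence_type")
    if evidence not in EVIDENCE_TYPES:
        return [f"unknown evidence_type: {evidence!r}"]
    filled = [f for f in ASSAY_FIELDS if record.get(f) is not None]
    if evidence != "observed" and filled:
        return [f"{evidence} record should leave blank: {', '.join(filled)}"]
    return []


observed = {
    "evidence_type": "observed",
    "reactivity_method": "dextramer barcoding",
    "reactivity_readout": "barcode count",
    "reactivity_value": 23,
    "reactivity_unit": "absolute count",
}
inferred_with_assay = {"evidence_type": "inferred", "reactivity_value": 23}

problems_observed = check_reactivity(observed)
problems_inferred = check_reactivity(inferred_with_assay)
```

Because the suggestion says observed records "can/should" carry the assay fields, the sketch does not flag an observed record that leaves them empty.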

javh commented 7 months ago

I still hesitate to remove the reactivity_method, reactivity_readout, reactivity_value, reactivity_unit because of the use case above...

IIRC, our logic from the last call was there were certainly valid use cases for this information, but that we didn't want to hassle with enumerating them all because there's quite a few experimental approaches we'd need to accommodate if we went down that road. So we're trading schema completeness for tractability.

My suggestion is to consider reactivity_method having: observed, annotated, inferred as possible string [...]

I think reactivity_method = annotated has semantic problems. Umm, can you accomplish what you want with reactivity_ref? If there's an IEDB (or other ref) in that field, then that implies "annotated".

Could reactivity_ref link to a repertoire_id? Is that something we do?

bcorrie commented 7 months ago

I think reactivity_method = annotated has semantic problems. Umm, can you accomplish what you want with reactivity_ref? If there's an IEDB (or other ref) in that field, then that implies "annotated".

Based on discussion at the meeting today, I think we are leaning toward having observed and inferred. Referencing a reactivity from an external resource like IEDB would be considered an inferred reactivity, and reactivity_ref: IEDB_RECEPTOR:182992 would state that the source of the inference is an IEDB entity. This makes sense to me. The Reactivity being captured in this case is an inference based on data from an external repository (IEDB), with the inference determined by matching the ADC entity (Cell or Rearrangement) to the entity in IEDB according to some matching criteria.

Just like if using an inference algorithm that works on some closeness criteria of the CDR3 to predict reactivity, we are using an algorithm with some matching criteria to a Receptor in IEDB to predict reactivity based on the Receptor/Reactivity data in IEDB.

I think this works.
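
As a rough illustration of how a consumer might read this convention, the sketch below derives the source of an inference from reactivity_ref. The helper names `parse_ref` and `describe_evidence` are hypothetical, and the only reference format assumed is the `PREFIX:id` form used in the examples in this thread.

```python
def parse_ref(ref):
    """Split a 'PREFIX:local_id' reference into its two parts."""
    prefix, sep, local_id = ref.partition(":")
    if not (sep and prefix and local_id):
        raise ValueError(f"not a PREFIX:id reference: {ref!r}")
    return prefix, local_id


def describe_evidence(reactivity):
    """Summarise where a Reactivity record's evidence comes from."""
    method = reactivity.get("reactivity_method")
    refs = reactivity.get("reactivity_ref") or []
    sources = sorted({parse_ref(r)[0] for r in refs})
    if method == "inferred" and sources:
        return f"inferred from {', '.join(sources)}"
    return method


iedb_inferred = {
    "reactivity_method": "inferred",
    "reactivity_ref": ["IEDB_RECEPTOR:182992"],
}
study_observed = {"reactivity_method": "observed", "reactivity_ref": None}

summary_inferred = describe_evidence(iedb_inferred)
summary_observed = describe_evidence(study_observed)
```

The same function would cover an inference tool or another repository, as long as its references follow the same prefix convention.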

bcorrie commented 7 months ago

Could reactivity_ref link to a repertoire_id? Is that something we do?

Haven't thought of that to date, but...

Since we are documenting inference of a Cell or Rearrangement reactivity by linking it to data in an external repository, this could be any repository (IEDB, VDJdb, MCPAS, etc). If the only place that this Reactivity is documented is in a repository in the ADC, I suppose it would not be out of the question. We are then linking to an internally observed reactivity. I am not sure exactly what object to link to. I am not sure it would be the repertoire_id. It would almost be another Reactivity wouldn't it?

I think this is a pretty extreme edge case?

I suppose that my hope would be that any Receptor/Reactivity evidence found in a study in the ADC would be curated into a repository like IEDB, where other evidence for that Receptor/Reactivity would reside. There is a level of curation process and methodology around that, where evidence is gathered through multiple studies and assays to provide evidence that supports that Receptor/Reactivity.

When storing an inferred reactivity in the ADC, by using such an external resource such as IEDB, we get stronger evidence to support the inference. With that said, since the ADC is storing Reactivity, your use case should not be considered out of the question I suppose.

bcorrie commented 7 months ago

I would almost suggest that the flow might go the other way. If a study stored in the ADC detects a Reactivity for a Cell or Rearrangement that is directly observed through that study (e.g. the dextramer barcode example above), then this is a primary observation of the phenomenon. I could see how IEDB could use the ADC to look for observed reactivity and use that to help them find data to curate into IEDB. All of the metadata for the study would be in the ADC, so presumably this would make it easy(ier) for them to curate that data in IEDB than curating the data from the paper itself. @rvita @bpeters42 would that make sense?

Once it is curated in IEDB, then any inferred reactivity of a Cell or Rearrangement in the ADC would include the "self-evidence" but would also include any other evidence for that reactivity that is stored in IEDB. IEDB in this case would not be interested in inferred Reactivity in the ADC but they would be interested in observed Reactivity in the ADC.

bcorrie commented 7 months ago

IIRC, our logic from the last call was there were certainly valid use cases for this information, but that we didn't want to hassle with enumerating them all because there's quite a few experimental approaches we'd need to accommodate if we went down that road. So we're trading schema completeness for tractability.

Small steps I suppose. 8-)

We have in the past used a string to keep such fields flexible, with documentation on recommended or possible examples to guide users in the right direction. That way we don't need to be complete, but do enable the ability to capture the data. One possibility... 8-)

If we don't have something like this, when we (iReceptor) load these data we will probably just do this ourselves by storing the data in custom internal iReceptor fields. We don't want to not load this data when we are loading everything else, and then try and load it piecemeal after we come up with a mechanism to do it properly 8-) Easier for us to store these fields in our best guess and then map/convert them later if necessary...

So look for the fields `ir_reactivity_method`, `ir_reactivity_measure`, `ir_reactivity_value`, `ir_reactivity_unit` in our repositories 8-)
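
The "map/convert them later" step could be as small as a field rename, sketched below. The mapping is an assumption: in particular, pairing `ir_reactivity_measure` with `reactivity_readout` is a guess based on the field names in this thread, and `convert_custom_fields` is a hypothetical helper, not existing iReceptor code.

```python
# Assumed correspondence between the iReceptor custom fields and the Reactivity
# fields discussed in this thread; the measure -> readout pairing is a guess.
IR_TO_AIRR = {
    "ir_reactivity_method": "reactivity_method",
    "ir_reactivity_measure": "reactivity_readout",
    "ir_reactivity_value": "reactivity_value",
    "ir_reactivity_unit": "reactivity_unit",
}


def convert_custom_fields(record, mapping=IR_TO_AIRR):
    """Return a copy of `record` with custom field names replaced by mapped names."""
    return {mapping.get(key, key): value for key, value in record.items()}


# Values reuse the dextramer example from earlier in the thread.
stored = {
    "reactivity_id": "YYYYYYY",
    "ir_reactivity_method": "dextramer barcoding",
    "ir_reactivity_measure": "barcode count",
    "ir_reactivity_value": 23,
    "ir_reactivity_unit": "absolute count",
}
converted = convert_custom_fields(stored)
```

Fields that are not in the mapping, such as `reactivity_id`, pass through unchanged.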

rvita commented 6 months ago

If I understand correctly, you are asking whether, if the ADC had published epitope-specific experimental data that was not yet in the IEDB, we (the IEDB) could use the ADC to identify and curate it. If so, yes, as long as the data is connected to its PMID.

bcorrie commented 3 months ago

This is what we would have stored previously, but with the suggestion that we don't store these things any more, we lose this information.

reactivity_method: dextramer barcoding
reactivity_readout: barcode count
reactivity_value: 23
reactivity_unit: absolute count

This seems important to me, are we sure we want to drop this?

I think the crux of what remains is what we do with these fields. The discussion above seems to imply that our meetings came to the conclusion that these should be removed from the Reactivity object - but as I say in one of my comments above this seems really important. The main reason why we want to remove it is that it became "complicated" to capture this information in the context of linking a Rearrangement to Reactivity. In our earlier discussions around Cell and Reactivity (https://github.com/airr-community/airr-standards/issues/704#issuecomment-1891063546) we agreed that these fields would work well and that this was important.

So it seems wrong to me to remove the fields we decided we needed for Cell->Reactivity interactions because we are adding Rearrangement->Reactivity interactions. If this is the case, then I would suggest that we need a different way to link Rearrangements to Reactivity rather than dumb down our Reactivity object.

bcorrie commented 3 months ago

This pull request is currently adding a minor field to Rearrangement but removing important functionality from Reactivity - which was not what was intended for this issue and the related pull request.

The original intent for Reactivity was to capture observed reactivity (observed in the AIRR-seq study) between a Cell and an epitope. We are muddying the waters by trying to use this object to capture inferred reactivity rather than observed reactivity.

If we can't come up with another solution, I would rather drop reactivity_id from Rearrangement, drop the idea of capturing inferred reactivity with the Reactivity object, and not remove the above fields from Reactivity. Having those fields was a significant part of the original intent of the object. Whenever we try to dual-purpose an object in a way that doesn't seem natural, we always regret it.

If we need to come up with a different way of capturing inferred reactivity then I would prefer some other mechanism over changing Reactivity as we are discussing here.