Closed javh closed 2 months ago
Looks fine. Will need docs updates, at minimum for the name of Expression
and Reactivity
. But I think we should also add explicit guidance that says either "put RNAseq data in Expression
" or explain how external data (eg GEO) can be linked at the Sample
level.
Yes, looks good to me as well... Will hold off on review check until final is ready, but I think safe to sync the specs.
Looks good! Wondering whether we should also introduce a cell_label
property into the CellProcessing
object...
From the call:
cell_label
to CellProcessing
.Cell.cell_subset
to miairr: important
. Verify this is clear in the documentation.Hate to say it, but I am trying to use the spec as it is now, and I see an issue...
Many of the Receptors in IEDB do not have the full variable domains for one chain, let along multiple chains. e.g. https://www.iedb.org/receptor/184739 has V, J, CDR3 for alpha and beta chains, but there is no full V domain.
The AIRR Spec has the V Domain for both chains as required and non-nullable. Is that really what we want. We would have no way of capturing IEDB:184739 in the AIRR Receptor schema if that was the case???
Some IEDB records do have the full domain (e.g. https://www.iedb.org/receptor/47) but this is not normal. Don't we want the Receptor object to be more flexible than that?
I am OK to make the above a separate issue so we can merge the current PR.
Asking the question another way. Lets say I analyze a single cell study and I find that some cells match (at some level of matching) a known Receptor in IEDB. Lets say the CDR3 and VJ genes are the same. So I create a Receptor
object and link that Cell
to that Receptor by adding the receptor_id
to the Cell.receptors
array.
But I am only allowed to do this if the IEDB entry is a paired chain with full V Domain for both chains. If you look at IEDB and search for paired chain receptors you are limited to 31,346 out of the 189,435 "Receptor like things" in IEDB. I am not 100% sure "paired chain == full v domain for both chains", so this might be limited even further.
This seems like a significant restriction. Should we be relaxing the constraints on the Receptor object?
Many of the Receptors in IEDB do not have the full variable domains for one chain, let along multiple chains. e.g. https://www.iedb.org/receptor/184739 has V, J, CDR3 for alpha and beta chains, but there is no full V domain.
It's a TCR, so the full V domain is implied. I think it's ok.
Many of the Receptors in IEDB do not have the full variable domains for one chain, let along multiple chains. e.g. https://www.iedb.org/receptor/184739 has V, J, CDR3 for alpha and beta chains, but there is no full V domain.
It's a TCR, so the full V domain is implied. I think it's ok.
Problem is that the receptor_variable_domain_1_aa
field is required (as is the domain 2 field), non-nullable, and says:
receptor_variable_domain_1_aa:
type: string
description: >
Complete amino acid sequence of the mature variable domain of the Ig heavy, TCR beta or TCR delta chain.
The mature variable domain is defined as encompassing all AA from and including first AA after the the
signal peptide to and including the last AA that is completely encoded by the J gene.
example: >
QVQLQQPGAELVKPGASVKLSCKASGYTFTSYWMHWVKQRPGRGLEWIGRIDPNSGGTKYNEKFKSKATLTVDKPSSTAYMQLSSLTSEDSAVYYCARYDYYGSSYFDYWGQGTTLTVSS
x-airr:
nullable: false
adc-query-support: true
So because we don't have the AA variable domains, you can't create an AIRR standards compliant Receptor
object for https://www.iedb.org/receptor/184739
It seems to me that our requirements are to stringent.
The stringency in the schema is deliberate, as the purpose of Receptor
was always to be complementary to the potentially incomplete receptor information in IEDB. People are free to download the IEDB database and all Receptor
objects from the ADC and use whatever algorithm and arbitrary threshold they want to make guesses whether they are sufficiently similar. But I don't think that we need to degrade our schema for that.
I understand that use case, but if that is indeed the case, then we do not have a mechanism to link an AIRR entity to a known entity in another repository (e.g. IEDB) unless it meets those strict criteria.
There are 150,000+ such entities in IEDB that contain Receptor -> antigen/epitope reactivity information that we are unable to link to given the above constraints. My understanding of the AIRR Receptor
object was that it would help us make such a link. You aren't telling me I have been wrong all this time are you 8-)
People are free to download the IEDB database and all
Receptor
objects from the ADC and use whatever algorithm and arbitrary threshold they want to make guesses whether they are sufficiently similar. But I don't think that we need to degrade our schema for that.
But shouldn't our schema have a mechanism to capture when such an algorithm is used to annotate an AIRR entity (Cell
or Rearrangement
) with such specificity/reactivity information.
Suggest we move this discussion to #781 and merge this PR.
There's currently no Cell
documentation to verify w.r.t. CL importance, so we'll come back to it.
Initial changes to
Cell
,CellExpression
andCellReactivity
per March 4, 2024 call:Rearrangement
and expression links inCell
.CellExpression
toExpression
.CellReactivity
toReactivity
.Also:
cell_subset
,cell_phenotype
, andcell_label
for CL term, marker definition, and free text annotation, respectively.CellProcessing
object.cell_phenotype
, but I thought I'd include it for discussion purposes.I only modified the main v2 spec. I will propagate once we agree on the changes.
Closes #768 Closes #477