Restructuring of Cell - Githubissues

javh commented 4 months ago

Initial changes to Cell, CellExpression and CellReactivity per March 4, 2024 call:

Remove Rearrangement and expression links in Cell.
Renamed CellExpression to Expression.
Renamed CellReactivity to Reactivity.

Also:

Added cell_subset, cell_phenotype, and cell_label for CL term, marker definition, and free text annotation, respectively.
Not sure I like these names, but I stuck with what was already in the CellProcessing object.
Not sure we need cell_phenotype, but I thought I'd include it for discussion purposes.

I only modified the main v2 spec. I will propagate once we agree on the changes.

Closes #768 Closes #477

scharch commented 4 months ago

Looks fine. Will need docs updates, at minimum for the name of Expression and Reactivity. But I think we should also add explicit guidance that says either "put RNAseq data in Expression" or explain how external data (eg GEO) can be linked at the Sample level.

bcorrie commented 4 months ago

Yes, looks good to me as well... Will hold off on review check until final is ready, but I think safe to sync the specs.

bussec commented 3 months ago

Looks good! Wondering whether we should also introduce a cell_label property into the CellProcessing object...

javh commented 3 months ago

From the call:

Add cell_label to CellProcessing.
Set Cell.cell_subset to miairr: important. Verify this is clear in the documentation.
Sync specs.
Merge.

bcorrie commented 3 months ago

Hate to say it, but I am trying to use the spec as it is now, and I see an issue...

Many of the Receptors in IEDB do not have the full variable domains for one chain, let along multiple chains. e.g. https://www.iedb.org/receptor/184739 has V, J, CDR3 for alpha and beta chains, but there is no full V domain.

The AIRR Spec has the V Domain for both chains as required and non-nullable. Is that really what we want. We would have no way of capturing IEDB:184739 in the AIRR Receptor schema if that was the case???

Some IEDB records do have the full domain (e.g. https://www.iedb.org/receptor/47) but this is not normal. Don't we want the Receptor object to be more flexible than that?

bcorrie commented 3 months ago

I am OK to make the above a separate issue so we can merge the current PR.

bcorrie commented 3 months ago

Asking the question another way. Lets say I analyze a single cell study and I find that some cells match (at some level of matching) a known Receptor in IEDB. Lets say the CDR3 and VJ genes are the same. So I create a Receptor object and link that Cell to that Receptor by adding the receptor_id to the Cell.receptors array.

But I am only allowed to do this if the IEDB entry is a paired chain with full V Domain for both chains. If you look at IEDB and search for paired chain receptors you are limited to 31,346 out of the 189,435 "Receptor like things" in IEDB. I am not 100% sure "paired chain == full v domain for both chains", so this might be limited even further.

This seems like a significant restriction. Should we be relaxing the constraints on the Receptor object?

scharch commented 3 months ago

Many of the Receptors in IEDB do not have the full variable domains for one chain, let along multiple chains. e.g. https://www.iedb.org/receptor/184739 has V, J, CDR3 for alpha and beta chains, but there is no full V domain.

It's a TCR, so the full V domain is implied. I think it's ok.

bcorrie commented 3 months ago

Many of the Receptors in IEDB do not have the full variable domains for one chain, let along multiple chains. e.g. https://www.iedb.org/receptor/184739 has V, J, CDR3 for alpha and beta chains, but there is no full V domain.

It's a TCR, so the full V domain is implied. I think it's ok.

Problem is that the receptor_variable_domain_1_aa field is required (as is the domain 2 field), non-nullable, and says:

        receptor_variable_domain_1_aa:
            type: string
            description: >
                Complete amino acid sequence of the mature variable domain of the Ig heavy, TCR beta or TCR delta chain.
                The mature variable domain is defined as encompassing all AA from and including first AA after the the
                signal peptide to and including the last AA that is completely encoded by the J gene.
            example: >
              QVQLQQPGAELVKPGASVKLSCKASGYTFTSYWMHWVKQRPGRGLEWIGRIDPNSGGTKYNEKFKSKATLTVDKPSSTAYMQLSSLTSEDSAVYYCARYDYYGSSYFDYWGQGTTLTVSS
            x-airr:
                nullable: false
                adc-query-support: true

So because we don't have the AA variable domains, you can't create an AIRR standards compliant Receptor object for https://www.iedb.org/receptor/184739

It seems to me that our requirements are to stringent.

bussec commented 3 months ago

The stringency in the schema is deliberate, as the purpose of Receptor was always to be complementary to the potentially incomplete receptor information in IEDB. People are free to download the IEDB database and all Receptor objects from the ADC and use whatever algorithm and arbitrary threshold they want to make guesses whether they are sufficiently similar. But I don't think that we need to degrade our schema for that.

bcorrie commented 3 months ago

I understand that use case, but if that is indeed the case, then we do not have a mechanism to link an AIRR entity to a known entity in another repository (e.g. IEDB) unless it meets those strict criteria.

There are 150,000+ such entities in IEDB that contain Receptor -> antigen/epitope reactivity information that we are unable to link to given the above constraints. My understanding of the AIRR Receptor object was that it would help us make such a link. You aren't telling me I have been wrong all this time are you 8-)

bcorrie commented 3 months ago

People are free to download the IEDB database and all Receptor objects from the ADC and use whatever algorithm and arbitrary threshold they want to make guesses whether they are sufficiently similar. But I don't think that we need to degrade our schema for that.

But shouldn't our schema have a mechanism to capture when such an algorithm is used to annotate an AIRR entity (Cell or Rearrangement) with such specificity/reactivity information.

bcorrie commented 3 months ago

Suggest we move this discussion to #781 and merge this PR.

javh commented 2 months ago

There's currently no Cell documentation to verify w.r.t. CL importance, so we'll come back to it.

airr-community / airr-standards

Restructuring of Cell #776