Which terms do we need as MiAIRR keywords

bussec commented 5 years ago

As discussed in #174, introducing one or more keyword fields in MiAIRR will increase the findability of data sets and reduce abuse of fields intended for experimental annotation. As discussed in today's MiniStd call:

The keyword fields will hold a comma separated list from a predefined vocabulary, free-text will not be allowed.
We will start with a small and flat list of terms. Over time this is expected to evolve and include some hierarchies.
As it is unlikely that there is a single ontology to provide us with all terms, we can potentially cross-link/map our terms to other ontologies.

For a start, we are now collecting suggestions for 10-15 top-level terms that would be frequently used (please focus on content for now):

contains_ig
contains_tcr
contains_single_cell
contains_paired_chain

Pinging @bcorrie @emilyvcbarr @lgcowell @schristley

schristley commented 5 years ago

The keyword fields will hold a comma separated list from a predefined vocabulary, free-text will not be allowed.

We want to avoid encoding lists with specific separators into free-text. My suggestion is that this field be made an explicit JSON/YAML array whose items are enums.

schristley commented 5 years ago

I haven't been able to think of any additional keywords to add.

bussec commented 5 years ago

OK, we'll start with these four terms then. I will put this up for a decision in the MiniStd call next week.

bcorrie commented 5 years ago

FYI we talked about it in our group meeting, and for the terms themselves, they sound good as a starting point. A couple of questions came up around clarification:

Are just the keys (contains_ig, contains_tcr) a controlled vocabulary, or are the values a controlled vocabulary as well (true/false in these cases)? I think Scott's implementation with enum implies keys and values are controlled vocabularies, but that is something we should discuss... It clearly makes sense for these as they are boolean, but I wonder about the more general case??? I think the intent would probably be yes, but wanted to confirm...
If we use contains_tcr, should we be symmetrical and use contains_bcr? In our discussion, it was thought that this could go either way? Is there a specific meaning that _ig communicates that _bcr doesn't?

javh commented 5 years ago

Or contains_tr, if the reasoning for contains_ig is gene nomenclature.

schristley commented 5 years ago

Are just the keys (contains_ig, contains_tcr) a controlled vocabulary, or are the values a controlled vocabulary as well (true/false in these cases)? I think Scott's implementation with enum implies keys and values are controlled vocabularies, but that is something we should discuss... It clearly makes sense for these as they are boolean, but I wonder about the more general case??? I think the intent would probably be yes, but wanted to confirm...

Actually, keyword is the key, while contains_ig, contains_tcr, etc. are values. Here the presence of a value in the list implies true for that flag.

keyword: [ contains_ig, contains_paired_chain ]
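To illustrate the "presence implies true" reading, here is a minimal Python sketch. The function name `has_keyword` and the `ALLOWED_KEYWORDS` set are hypothetical and only mirror the four terms proposed above; nothing here is part of the AIRR schema or library.

```python
# Hypothetical controlled vocabulary: the four terms proposed in this thread.
ALLOWED_KEYWORDS = frozenset({
    "contains_ig",
    "contains_tcr",
    "contains_single_cell",
    "contains_paired_chain",
})


def has_keyword(study: dict, term: str) -> bool:
    """Return True if `term` is listed under the study's `keyword` field.

    A term that is absent from the list is read as False. Asking about a
    term outside the vocabulary is treated as an error, not as False.
    """
    if term not in ALLOWED_KEYWORDS:
        raise ValueError(f"{term!r} is not in the controlled vocabulary")
    return term in study.get("keyword", [])


# The example record from the comment above.
study = {"keyword": ["contains_ig", "contains_paired_chain"]}
```

With this record, `has_keyword(study, "contains_ig")` is true and `has_keyword(study, "contains_tcr")` is false, without any explicit boolean being stored.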

schristley commented 5 years ago

The other implementation idea is that contains_ig, contains_tcr, etc. are keys with boolean values.

contains_ig: true
contains_paired_chain: true

Though I don't think this was the original idea, as this makes them individual independent attributes, and creating a bunch of them "pollutes" the schema.
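The two representations carry the same information, which a small Python sketch can make concrete. The helper names `list_to_flags` and `flags_to_list` and the `TERMS` tuple are hypothetical, chosen for illustration only.

```python
# Hypothetical vocabulary, in a fixed order so conversions are deterministic.
TERMS = (
    "contains_ig",
    "contains_tcr",
    "contains_single_cell",
    "contains_paired_chain",
)


def list_to_flags(keywords):
    """Expand a keyword list into one boolean per vocabulary term."""
    return {term: term in keywords for term in TERMS}


def flags_to_list(flags):
    """Collapse boolean flags back into a keyword list (missing keys = False)."""
    return [term for term in TERMS if flags.get(term, False)]


flags = list_to_flags(["contains_ig", "contains_paired_chain"])
keywords = flags_to_list({"contains_ig": True, "contains_paired_chain": True})
```

The list form needs a single schema field no matter how many terms are added later, whereas the flag form needs one schema attribute per term, which is the "pollution" concern raised above.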

bussec commented 5 years ago

Are just the keys (contains_ig, contains_tcr) a controlled vocabulary [...]

As @schristley already mentioned, we thought of keywords_study as a single field, which then contains 0 to n elements from a list of terms. We did not discuss how this is represented in the schema, but Scott's first option seems leaner to me than the explicit boolean approach.

If we use contains_tcr, should we be symmetrical and use contains_bcr? In our discussion, it was thought that this could go either way? Is there a specific meaning that _ig communicates that _bcr doesn't?

[ NO!, No, YES! ]

Using "BCR" when referring to "Ig" is simply wrong. It is like referring to an engine as "car". I get nosebleeds when I am forced to do this and my desk looked like a slaughterhouse after we submitted the iR+ proposal (just in case anyone here wanted to say "Yes, but in the iR+ proposal....").

Now - in less dramatic words - who is who:

BCR refers exclusively to the surface-bound protein complex on B cells, containing Ig heavy, Ig light and at least the signal transducers Ig[alpha] and Ig[beta], although additional components of the signaling machinery can be present.
Antibody refers exclusively to the soluble variant of n Ig heavy / Ig light heterotetramers (with n typically being 1, 2 or 5, depending on the isotype) plus the optional J chain (IgM & IgA) and SC (IgA). All of these are proteins.
Ig is a general abbreviation for "immunoglobulin" and can be used in a number of contexts, including Ig loci (DNA), transcripts (RNA) and various polypeptide chains.
TCR on the one hand refers to the BCR counterpart on T cells. However, as the individual chains of the complex are also "TCRx", it is kind of a shorthand. IMGT refers to most of the components as "TR"; however, outside of the actual locus (@javh's comment) this is not often used.

The funny thing is that most people actually agree with these definitions; they just don't care. This then leads to weird situations in which protein sequencing (by mass spec) is termed "Ig-seq" because the term "antibody sequencing" was already used for NGS-based approaches.

Long story short: happy to discuss contains_tr, but contains_bcr is a no-go from my side.

bcorrie commented 5 years ago

WRT list of flags - Ahhh, I was thinking in the latter sense... Hadn't picked up that the intent was to create a set of boolean flags...

bcorrie commented 5 years ago

WRT Ig VS BCR

There is nothing like a @bussec comment to start your day... 8-)

Sorry about the nose bleeds!

bussec commented 5 years ago

MiniStd WG agreed during the last call that this field should be implemented. @schristley, how does the "list of controlled terms" work in OpenAPI? enum seems to be 1-of-n, but we need n-of-m (https://swagger.io/docs/specification/data-models/enums/).

schristley commented 5 years ago

An array where the items are enums should be sufficient. The schema definition would look something like this:

study_keywords:
    type: array
    items:
        type: string
        enum:
            - contains_tr
            - contains_ig
            - contains_single_cell
            - contains_paired_chain
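As a rough illustration of how "array whose items are enums" yields n-of-m selection, the following Python sketch checks a value against a dict that mirrors the YAML above. The function `validate_study_keywords` is hypothetical and hand-rolled for illustration; it is not an OpenAPI or AIRR library call, and a real implementation would more likely rely on a schema validator.

```python
# Python mirror of the YAML schema fragment above (same terms, same order).
STUDY_KEYWORDS_SCHEMA = {
    "type": "array",
    "items": {
        "type": "string",
        "enum": [
            "contains_tr",
            "contains_ig",
            "contains_single_cell",
            "contains_paired_chain",
        ],
    },
}


def validate_study_keywords(value, schema=STUDY_KEYWORDS_SCHEMA):
    """Return a list of error messages; an empty list means the value is valid."""
    if not isinstance(value, list):
        return ["study_keywords must be an array"]
    allowed = schema["items"]["enum"]
    errors = []
    for index, item in enumerate(value):
        if not isinstance(item, str):
            errors.append(f"item {index} is not a string")
        elif item not in allowed:
            errors.append(f"item {index} ({item!r}) is not an allowed term")
    return errors


ok = validate_study_keywords(["contains_ig", "contains_tr"])
bad = validate_study_keywords(["contains_ig", "contains_bcr"])
```

Each item is individually a 1-of-n enum choice, and the surrounding array allows any number of them, including none. Note that the schema as written does not forbid listing the same term twice; whether duplicates should be rejected is not settled in this thread.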

schristley commented 5 years ago

How about a quantitative flag which indicates whether the experimental protocol is such that quantities correspond to the actual abundance of receptors/cells in the biology? I think for bulk AIRR-seq, UMI and Adaptive's primer mix are the only two quantitative protocols?

scharch commented 5 years ago

The distinction between abundance of receptors and cells is an important one that would presumably end up requiring separate fields...

airr-community / airr-standards

Which terms do we need as MiAIRR keywords #185