Closed: bussec closed this issue 5 years ago
The keyword fields will hold a comma separated list from a predefined vocabulary, free-text will not be allowed.
We want to avoid encoding lists with specific separators into free-text. My suggestion is that this field is made an explicit JSON/YAML array whose items are enums.
I haven't been able to think of any additional keywords to add.
OK, we'll then start with these four terms. I will put this up for a decision in the MiniStd call next week.
FYI we talked about it in our group meeting, and for the terms themselves, they sound good as a starting point. A couple of questions came up around clarification:
Or contains_tr, if the reasoning for contains_ig is gene nomenclature.
- Are just the keys (contains_ig, contains_tcr) a controlled vocabulary, or are the values a controlled vocabulary as well (true/false in these cases)? I think Scott's implementation with enum implies keys and values are controlled vocabularies, but that is something we should discuss... It clearly makes sense for these as they are boolean, but I wonder about the more general case??? I think the intent would probably be yes, but wanted to confirm...
Actually keyword is the key, while contains_ig, contains_tcr, etc. are values. Here the existence of the value within the list implies true for those flags.
keyword: [ contains_ig, contains_paired_chain ]
The other implementation idea is that contains_ig, contains_tcr, etc. are keys with boolean values.
contains_ig: true
contains_paired_chain: true
Though I don't think this was the original idea as this makes them individual independent attributes and creating a bunch of them "pollutes" the schema.
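To make the difference between the two representations concrete, here is a minimal Python sketch that converts between the list form and the boolean-key form. The names VOCABULARY, list_to_flags and flags_to_list are hypothetical helpers for illustration only, and the vocabulary is simply the four terms proposed in this thread, not a final list.

```python
# Hypothetical vocabulary: the four starting terms proposed in this thread.
VOCABULARY = (
    "contains_ig",
    "contains_tcr",
    "contains_single_cell",
    "contains_paired_chain",
)


def list_to_flags(keywords):
    """List form -> boolean-key form: presence in the list implies True."""
    unknown = set(keywords) - set(VOCABULARY)
    if unknown:
        # Free text is not allowed, so anything outside the vocabulary is an error.
        raise ValueError(f"terms not in vocabulary: {sorted(unknown)}")
    return {term: term in keywords for term in VOCABULARY}


def flags_to_list(flags):
    """Boolean-key form -> list form: keep only the terms set to True."""
    return [term for term in VOCABULARY if flags.get(term, False)]


if __name__ == "__main__":
    flags = list_to_flags(["contains_ig", "contains_paired_chain"])
    print(flags)
    print(flags_to_list(flags))
```

Both forms carry the same information. The list form needs one schema field regardless of how many terms are added later, whereas the boolean form needs one attribute per term.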
- Are just the keys (contains_ig, contains_tcr) a controlled vocabulary [...]
As @schristley already mentioned, we thought of keywords_study as a single field, which then contains 0 to n elements from a list of terms. We did not discuss how this is represented in the schema, but Scott's first option seems leaner to me than the explicit boolean approach.
- If we use contains_tcr, should we be symmetrical and use contains_bcr? In our discussion, it was thought that this could go either way? Is there a specific meaning that _ig communicates that _bcr doesn't?
[ NO!, No, YES! ]
Using "BCR" when referring to "Ig" is simply wrong. It is like referring to an engine as "car". I get nosebleeds when I am forced to do this and my desk looked like a slaughterhouse after we submitted the iR+ proposal (just in case anyone here wanted to say "Yes, but in the iR+ proposal....").
Now - in less dramatic words - who is who:
BCR refers exclusively to the surface-bound protein complex on B cells, containing Ig heavy, Ig light and at least the signal transducers Ig[alpha] and Ig[beta], although additional components of the signaling machinery can be present.
Antibody refers exclusively to the soluble variant of n Ig heavy / Ig light heterotetramers (with n typically being 1, 2 or 5, depending on the isotype) plus the optional J chain (IgM & IgA) and SC (IgA). All of these are proteins.
Ig is a general abbreviation for "immunoglobulin" and can be used in a number of contexts, including Ig loci (DNA), transcripts (RNA) and various polypeptide chains.
TCR on the one hand refers to the BCR counterpart on T cells. However, as the individual chains of the complex are also "TCRx", it is kind of a shorthand. IMGT refers to most of the components as "TR"; however, outside of the actual locus (@jvdh's comment) this is not often used.
The funny thing is that most people actually agree with these definitions, they just don't care. This then leads to weird situations in which protein sequencing (by mass spec) is termed "Ig-seq" because the term "antibody sequencing" was already used for NGS-based approaches.
Long story short: Happy to discuss contains_tr, but contains_bcr is a no-go from my side.
WRT list of flags - Ahhh, I was thinking in the latter sense... Hadn't picked up that the intent was to create a set of boolean flags...
WRT Ig VS BCR
There is nothing like a @bussec comment to start your day... 8-)
Sorry about the nose bleeds!
MiniStd WG agreed during the last call that this field should be implemented. @schristley, how does the "list of controlled terms" work in OpenAPI? enum seems to be 1-of-n, but we need n-of-m (https://swagger.io/docs/specification/data-models/enums/).
An array where the items are enums should be sufficient. The schema definition would look something like this:
study_keywords:
  type: array
  items:
    type: string
    enum:
      - contains_tr
      - contains_ig
      - contains_single_cell
      - contains_paired_chain
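As a rough illustration of how an array-of-enums definition gives n-of-m semantics, the following Python sketch checks a value against the schema above using only the standard library. The function name validate_study_keywords and the error message wording are assumptions made for this example; a real deployment would more likely rely on an OpenAPI or JSON Schema validator.

```python
# Enum values copied from the schema definition above.
STUDY_KEYWORDS_ENUM = frozenset({
    "contains_tr",
    "contains_ig",
    "contains_single_cell",
    "contains_paired_chain",
})


def validate_study_keywords(value):
    """Return a list of error messages; an empty list means the value is valid.

    Hypothetical helper mirroring `type: array` with `items: {type: string, enum: [...]}`.
    """
    if not isinstance(value, list):
        # A comma separated string is rejected: the field must be an explicit array.
        return [f"expected an array, got {type(value).__name__}"]
    errors = []
    for index, item in enumerate(value):
        if not isinstance(item, str):
            errors.append(f"item {index}: expected a string, got {type(item).__name__}")
        elif item not in STUDY_KEYWORDS_ENUM:
            errors.append(f"item {index}: {item!r} is not in the enum")
    return errors


if __name__ == "__main__":
    print(validate_study_keywords(["contains_ig", "contains_paired_chain"]))
    print(validate_study_keywords("contains_ig,contains_paired_chain"))
    print(validate_study_keywords(["contains_bcr"]))
```

Each item is checked against the enum on its own, so any subset of the terms (including the empty list) is accepted. The schema as written does not forbid duplicates; whether to add a constraint such as uniqueItems was not discussed here.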
How about a quantitative flag which indicates if the experimental protocol is such that quantities correspond to actual abundance of receptors/cells in the biology? I think for bulk AIRR-seq only UMI and Adaptive's primer mix are the two quantitative protocols?
The distinction between abundance of receptors and cells is an important one that would presumably end up requiring separate fields...
As discussed in #174, introduction of one or more keyword fields in MiAIRR will increase the findability of data sets and reduce abuse of fields intended for experimental annotation.
As discussed in today's MiniStd call: keyword fields will hold a comma separated list from a predefined vocabulary, free-text will not be allowed.
For a start, we are now collecting suggestions for 10-15 top-level terms that would be frequently used (please focus on content for now):
contains_ig
contains_tcr
contains_single_cell
contains_paired_chain
Pinging @bcorrie @emilyvcbarr @lgcowell @schristley