Finalizing Rearrangement fields

javh commented 6 years ago

I just finished making all the (relevant) Change-O tools work with the AIRR format. Here's the set of spec changes I think are necessary after going through that as well as input from a few other discussions. Some of those discussions were offline, but some of it's covered in the issues: #84, #83, #82, #75, #58.

Before making a pull request with changes, I want to hear others' thoughts as some of these things require naming decisions which I'm undecided on. For all fields listed, I've proposed a non-exhaustive list of alternative names. I tend to prefer short names, but there's a good argument for longer, more explicit names as well.

Fields to modify

rearrangement_id (rearrangment)

Remove from required.
Add the tag x-crwg: true to denote the field as required in the CRWG context.
Update the definition to clarify that the field serves as a unique identifier for the Rearrangement object and is not necessarily derived from the input sequence or alignment tool result.

rearrangement_set_id (rearrangement_set)

Remove from required.
Add the tag x-crwg: true to denote the field as required in the CRWG context.
Update the definition to clarify that the field serves as a unique identifier for an arbitrary collection of Rearrangement object and is not necessarily a meaningful cross-study, biological classification.

cell_index

Remove from required.
Change type to string.

sequence (query_sequence, sequence_input)

Clarify in the definition that this field contains the unmodified sequence that was input into the tool. Alternatively, this may contain an intermediate "output" sequence in the sense of a consensus sequence, unique sequence, assembled mate-pair sequence, or other results of pre-V(D)J alignment processing embedded in the alignment tool. Basically, whatever the alignment field (below) was based on that meets the NCBI GenBank submission criteria: "an uncorrected sequence, which may have gone through assembly or consensus, but not indel correction or truncation".

functional

Rename to productive.

Necessary fields to add

The following fields are missing, but necessary:

sequence_id (query_id)

String. The identifier for the sequence.
This replaces rearrangement_id as the row key for the Rearragenment object, within the context of an analysis file (the default context). The default expectation is that this would be equal to the input sequence header, but that is not strictly required, especially in cases where sequence corresponds to such an above described intermediate sequence.
The TL;DR of why this is needed is because rearrangment_id and sequence_id have conflicting semantics, depending upon whether we are in an analysis file or database context.

alignment (aligned_sequence / sequence_alignment)

Nucleotide string. The aligned output sequence (typically just the V(D)J region).
This differs from sequence in that in may include gaps, deletions, masking or trimming so that it is aligned against the reference sequence in the germline field (below). Some examples of what this field looks like are SEQUENCE_VDJ/SEQUENCE_IMGT from changeo, V-D-J-REGION in the 2_IMGT-gapped-nt-sequences and 3_Nt-sequences files from IMGT/HighV-QUEST, the mature_seq field in partis.
My opinion is that this should be a required field because it's a primary analysis field. However, it's not universal output. For example, IgBLAST doesn't (presently) output this field for the full V(D)J. Instead, it has individual alignment output in the query seq column of the V/D/J hit table section.
We could also support the above via v_alignment, d_alignment, and j_alignment fields.

germline (germline_sequence)

Nucleotide string. The aligned full germline sequence (stitched V(D)J germlines).
Should be aligned with the sequence in the alignment field. Some examples of what this field looks like are GERMLINE_VDJ/GERMLINE_IMGT from changeo and the naive_seq field in partis.
As with alignment, IgBLAST doesn't provide a full length alignment for the germline, but does provide individual V/D/J aligned germlines via the subject seq column which we could support via v_germline, d_germline, and j_germline fields.
I also think this should be required to compel tool writers to provide it.

Optional fields to add

The following missing fields are not necessarily needed, but I think we should add them as they are common aligner output.

chain (chain_type)

String. Specifier for the chain. Eg, VH, VL.

stop

Boolean. Whether the V(D)J sequence contains a stop codon.

vj_in_frame (v_j_in_frame)

Boolean. If the V and J are in frame.

mutated_invariant

Boolean. If an invariant codon is mutated or missing.

Additional aligned sequence fields to add

These fields are not required, but they appear in the output of several tools, so I think we should add them as official optional fields.

np1 (np1_nt, np1_sequence, np1_seq, np1_alignment)
np2 (np2_nt, np2_sequence, np2_seq, np2_alignment)
cdr1 (cdr1_nt, cdr1_sequence, cdr1_seq, cdr1_alignment)
cdr2 (cdr2_nt, cdr2_sequence, cdr2_seq, cdr3_alignment)
cdr3 (cdr3_nt, cdr3_sequence, cdr3_seq, cdr3_alignment)
fwr1 (fwr1_nt, fwr1_sequence, fwr1_seq, fwr1_alignment)
fwr2 (fwr2_nt, fwr2_sequence, fwr2_seq, fwr2_alignment)
fwr3 (fwr3_nt, fwr3_sequence, fwr3_seq, fwr3_alignment)
fwr4 (fwr4_nt, fwr4_sequence, fwr4_seq, fwr4_alignment)

The big issues here are what to call them, whether this implies a name change to the junction_nt field, and how this defines the names of positional fields (below).

By default, we have been assuming all sequences are nucleotides, so that sequence implies sequence_nt. I think this is easier than having an _nt and _aa field for everything.

There's also an implication that cdr1_sequence is the cdr1 sequence as it appears in the input sequence (sequence field) and cdr1_alignment is the cdr1 as it appears in the output sequence (alignment field). Whereas, cdr1 implies neither, but we can define it as extracted from the alignment field in the description.

Clarify naming of positional fields

Related to the above, maybe we should modify the names of the positional fields so they imply a direct relationship to the sequence field that they correspond to? For example:

v_start to v_sequence_start or v_alignment_start
v_end to v_sequence_end or v_alignment_end
v_germ_start to v_germline_start
v_germ_end to v_germline_end
d_start to d_sequence_start or d_alignment_start
d_end to d_sequence_end or d_alignment_end
d_germ_start to d_germline_start
d_germ_end to d_germline_end
j_start to j_sequence_start or j_alignment_start
j_end to j_sequence_end or j_alignment_end
j_germ_start to j_germline_start
j_germ_end to j_germline_end

Note, I think we actually need both *_sequence_start and *_alignment_start - one for NCBI GenBank submission (*_sequence_start) and one for analysis purposes (*_alignment_start), but I don't think we need to include both in the spec, because we can derive one from the other using the CIGAR string.

schristley commented 6 years ago

Thanks for summarizing this for us @javh.

I'm in the camp of having more fields specified is better. Even a field that can be derived from another, if the field is useful and/or being used by a tool, then I prefer it to be specified. That means I'm basically good with all the additional fields and would also advocate for both *_sequence_start and *_alignment_start fields to be added.

For naming, I also prefer the names to be more specific, the exception being some "core" fields like sequence is fine, it doesn't need to be sequence_input. Especially if a field might have different contexts, like you point out where fields may refer to either the input sequence or the output alignment.

It seems reasonable to assume all sequences are nt and thus use _aa for when it's amino acids. However, I also think it's reasonable to put _nt on the field if it may not necessarily be obvious that it's a sequence. Here I side on being more specific. I'm okay with junction_nt, I also prefer fwr3_nt over just fwr3 or fwr3_sequence. Is there any reason to provide (say) fwr3 as amino acids? Maybe, so fwr3 and fwr3_sequence might be ambiguous while fwr3_nt and fwr3_aa are less so. But I agree with your point that if fwr3 might refer to either the sequence or the alignment, then we need to handle both, in that case fwr3_sequence and fwr3_alignment seems appropriate with the _nt implied.

I would eliminate the x-crwg tag, 1) because it isn't useful to the CRWG API spec nor to the AIRR library, and 2) crwg is probably not the best name. I'm fine with rearrangement_set and rearrangement_set_id not being required for now. It's probably premature to specify these technical details anyways until CRWG is further along.

Lastly, for sequence_id, it might be useful for the AIRR library to generate one if one isn't provided, which could be helpful for tools that depend upon it. Of course, tools could just as easily do it themselves.

jianye00 commented 6 years ago

Here is what I thought.

Under "Fields to modify", everything look fine except for the "sequence" field . I thought we should be strict about names and definitions for required fields so that we have a consistency in what they represent. I'd use "query_sequence" to avoid any ambiguity. As for definition, I thought it should only mean the original query sequence itself (or reversed and complemented as is defined in the original specs). If a tool wants to provide a sequence string of some other kind, it can be provided by a custom field.
"Necessary fields to add". 1) query_id seems to be a better name as reasoned above. 2) alignment (aligned_sequence / sequence_alignment). I am not sure about the purpose of this. Isn't alignment already represented by the existing CIGAR alignment field? 3) germline (germline_sequence). Same as 2) above.
"Optional fields to add" 1) chain (chain_type). If you took this from igblast, this means VH, VK, VL for B cell recepotor, VB, VA, VG, or VD for TCR. I'd prefer "chain_type" since it's a little more clear but I am fine with alternative names.
2) I'd use stop_codon to be clear. 3) vj_in_frame (v_j_in_frame). vj_in_frame is better since it's more consistent with existing names like np1_seq, etc. 4) mutated_invariant. I am not what it refers to since we have a few different invariant amino acid positions.
"Additional aligned sequence fields to add".
Since the current specs have CIGAR string to represent alignment already, can't these additional alignment fields could be treated as custom fields and let the tools themselves to define?
"Clarify naming of positional fields".
I'd use "v_query_start" and "v_germline_start", etc to be clear. v_query_start indicates the first position in the query sequence that is aligned to the germline V sequence. I am not sure what you meant by referring to both sequence_start and alignment_start.

javh commented 6 years ago

Just wanted to address @jianye00's questions. (No opinions to offer yet.)

alignment, germline

This is partially addressed by the CIGAR string, yes, but not entirely because the CIGAR strings are by segment. Though, vdj_cigar could cover it, with masking for N/P regions. Not sure if that would conflict with how the alignment process works for some tools.

A simple example of what these two fields look like would be:

 sequence> CCCCCCACTGGNAGTTGACGAAGAGGGGCTGATTATTAGCTTCAGTCGTATGACGGGGACCAACGTCAGGGGGG
alignment>      CACTGGNAGTTGACGAAGAGGGGCTGATTATTA---GCTTCAGTCGTATGACGGGGACCAACGTCA
 germline>      CACTGGGAGCTGAGGATGA--GGCTGATTATTAAGTGGTTNNNNCGTGTGANNGGGACCAAGGTCA
   region> LLLLLVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVNNNNDDDDDDDNNJJJJJJJJJJJJJCCCCCC

Basically, the alignment field is just the concatenated top V hit query seq, N/P1, D hit query seq, N/P2, and J hit query seq in the IgBLAST output (to make a full length aligned, observed sequence). The germline field is then the corresponding full length inferred, aligned, germline sequence made using the subject seq fields instead.

These two fields are key analysis fields for us, because we do a lot of direct string comparisons between the full length observed and germline V(D)J sequences. While we should be able to make these from the CIGAR strings and N/P positions, that just adds an extra processing step before we can actually use the file for analysis.

mutated_invariant

This comes from the IMGT/HighV-QUEST output and how they define productive. If any invariant residue is either mutated or absent, then they note it. It's kind of like the stop field, which says a stop codon is found, but not where or how many. Though, maybe it would be better as string specifying the invariant. Eg, C104P being cystine 104 to proline.

I don't think it's a necessary field. Certainly not something I think IgBLAST needs to output. It'd just be present in the spec as an optional field to support IMGT's output.

Additional aligned sequence fields to add

Definitely. I was just thinking it'd be nice to have a set of official names for these optional fields because they are so common. They wouldn't be required.

x_sequence_start, x_alignment_start.

The difference is just whether it refers to the position in the unmodified input sequence or the aligned sequence. In the example above, j_alignment_start would be 53, but j_sequence_start would be 55, because the leader sequence (L in regions) has been removed (+5) and gaps have been inserted (-3) in the alignment sequence.

ssnn-airr commented 6 years ago

I basically agree with @schristley

I am for having useful fields specified even if they can be derived
I also prefer to use specific names
I advocate for both _sequence_start and _alignment_start fields to be added
I am ok with the assumption of nt by default and add *_nt wherever needed (juncion_nt)
I would make the _alignments required and maybe use _aln instead of _alignment (sequence_aln, np1_aln, cdr1_aln), to make it shorter?
vj_in_frame better than v_j_in_frame
either chain or chain_type
either stop or stop_codon
add mutated_invariant

I like sequence_input (and sequence_id) better than just sequence, query (and query_id) or query_sequence. As I see it, query_sequence is any sequence that I use to do some comparison to a reference, and it could be the whole input sequence, or maybe a part of the input sequence, or I may be using the aligned sequence as query for some analysis... I see query_sequence as something that can change with the context and sequence input as the 'raw' sequence data. But my opinion is probably biased due to the fact that we currently use SEQUENCE_INPUT. So if you go with query, that's fine.

jianye00 commented 6 years ago

@javh Ok you are essentially saying you want to add additional alignment fields (other than the existing CIGAR field). Since an alignment involves two sequences, there will be two alignment string fields (i.e. AGCT--CGTAGC) for each alignment (One is for the query and the other is for the germline). For each alignment, you also want to add start and stop coordinates (and in both sequence and alignment coordinates), for both query and germline. And of course we have 4 alignment categories (whole vdj, v, d, j). Is that correct? That's lots of fields but I am fine with it if people think that's not too many and such information is useful.

I'd say we could use "v_query_start" and "v_germline_start" to indicate sequence coordinates and "v_query_alignment_start" and "v_germline_alignment_start" to indicate alignment coordinates.

Mutated_invariant could stay but we better indicate which one it is. Stop codon is a little different, it does not matter where it is, you have the same result (no protein). The effect of mutation at different invariant position could be different.

javh commented 6 years ago

@jianye00, yeah, that's the idea. Though, a lot of these fields will just be name reservations for tools that need them and not mandatory output. The list of expected output would just be the things under the required tag - the following:

sequence_id
sequence
rev_comp
productive
alignment (debatable)
germline (debatable)
v_call
d_call
j_call
c_call (debatable)
v_score
d_score
j_score
c_score (debatable)
v_cigar
d_cigar
j_cigar
c_cigar (debatable)
junction_nt
junction_aa (required by minimal standards)

If a tool didn't output j_query_start or j_germline_start then you could just get it from the N/S operators at the front of j_cigar if you needed it, but if a tool did want to output it, we'd have a reserved field name for it.

javh commented 6 years ago

Oh, one thing I didn't notice before... We have germline_database in the required fields.

Isn't this more of a metadata item?

jianye00 commented 6 years ago

Sure that should not be here...typically the databases are fixed for a given run and therefore the values are the same for all queries.

I noticed this two fields: alignment (debatable) germline (debatable)

what are they?

javh commented 6 years ago

I noticed this two fields: alignment (debatable) germline (debatable) what are they?

The full length, aligned query and germline sequences from the "Necessary fields to add" section in my original post.

Edit: Names aren't final. I'm just using what I originally used for simplicity.

jianye00 commented 6 years ago

Ok, then the names need to be a little more informative, something lilke query_alignment, germline_alignment.

jianye00 commented 6 years ago

I'd also want to suggest an additional field. We have seen some user requests to output the translated amino acid sequence for the query (the translation would be from start of V gene in the query) . Currently the translation is available in the standard output but that's not easy to parse. Could this be appropriate for an optional field in AIRR format? Of course we could add it as a custom field but it seems to be of general interest.

javh commented 6 years ago

Yeah, I think that makes sense. It'd just be adding more _aa fields.

javh commented 6 years ago

Pinging @laserson, @williamdlees and @psathyrella for any input.

williamdlees commented 6 years ago

For 'chain type' I suggest adding locus and region as defined in the draft germline schema at https://github.com/airr-community/gldb-wg/blob/master/germline_schema/schema/receptor_germline_schema.yaml

locus: type: [Heavy, Light-Kappa or Light-Lambda for B-cell sequences, or Alpha, Beta or Gamma for T-cell sequences] description: Gene locus region: type: [V,D,J,C] description: Gene region

For the other fields - I agree that it is helpful to define 'useful' fields even if they can be derived. I think quite a few people will find the cigar string difficult to work with. There is no Excel implementation, for example. Having the cdrs and fwrs explicit will be helpful in applications that are less computationally oriented and it makes sense to have reserved field names and explicit definitions.

schristley commented 6 years ago

gene_locus (or v_gene_locus, v_call_locus ?) seems appropriate as mentioned by @williamdlees but maybe not gene_region as the [V,D,J,C] regions are implicitly defined (v_call, etc.). Also gene_locus seems to infer B-cell vs T-cell. Is that sufficient? Accurate for all species?

bussec commented 6 years ago

Sorry I somehow missed this before. A couple of comments:

@schristley : Yes, there are VLR loci in Agnatha, which are rearranged and expressed in various lymphocyte populations in these animals. While these cells seem to exhibit convergent functions, they are not classical B or T cells. IMO we should restrict ourselves to Gnathostoma for now, as the inclusion of VLR will require substantially more changes than the locus.
All the alternative IGH constant regions in non-mammalian species are already covered by locus:heavy. However, the options for locus should also include the Ig sigma and iota light chain loci present in fishes and frogs. Also see this post on b-t.cr.
I think that even though TRD does hide within the TRA locus it should be mentioned in VC (either separate or as TRAD).
Finally, I would strongly opt for renaming region to segment. The usual definition is that the "V region" is the stretch of DNA/RNA that encodes the "V domain" protein structure and is created by the rearrangement of a genomic V, D and J segment. This definition would then also be in line with the current keys used in the INSDC feature table.

jianye00 commented 6 years ago

@williamdlees

It seems to me that the already-defined "locus" is an appropriate replacement of "chain type" and basically convey the same thing. But why the Delta type is missing for TCR (probably also mentioned by @bussec)? I don't seem to see the reason for adding "region" since we are dealing with rearranged sequence, not individual v, d or j element.

javh commented 6 years ago

I'm working on combining everyone's responses into a PR right now... quick question:

Is rev_comp suppose to be True if the content of sequence is reverse complemented w.r.t. the input read, or is sequence suppose to remain in its original orientation and rev_comp indicates the alignment is reversed w.r.t. sequence?

Ie, does rev_comp=True indicate that sequence and junction are in the same or opposite orientations?

schristley commented 6 years ago

or is sequence suppose to remain in its original orientation and rev_comp indicates the alignment is reversed w.r.t. sequence?

That's how I interpret it. The input sequence had to be reverse complemented to be aligned, so sequence and junction (plus all other output fields) are in opposite orientations.

jianye00 commented 6 years ago

The rev_comp should be clearly defined. We have always used a simple approach. If the input read contains vdj sequence in opposite orientation, we first reverse and complement the input read and then use the reversed/complemented read for alignment. Therefore, all output data (i.e., alignment coordinates, nucleotide sequence, translated protein sequence, etc) are based on the reversed/complemented sequence of the original read.

Another thing is that if we want to add alignment fields for the entire vdj alignment, I thought we should also add the corresponding fields for individual v, d, j and c elements since some tools only deal with individual alignments.

javh commented 6 years ago

I added @jianye00's definition for rev_comp to the description after merging #89.

Maybe let's address fields like v_sequence_alignment, v_germline_alignment, etc, that are exact derivatives of CIGAR when we get to the hit table format? For now, I think maybe keep them as custom fields? If they become a need, we can always add them later.

I'm also having mixed feelings about including optional explicit IMGT numbered alignment fields like sequence_imgt and germline_imgt that we use, so I left them out of the current pull request in favor of just using sequence_alignment and germline_alignment for those tools that need that. (Also, to avoid dragging out the discussion, so I could merge #89.)

Closing this for now. If you notice something that I didn't address, please reopen.

airr-community / airr-standards