airr-community / airr-standards

AIRR Community Data Standards
https://docs.airr-community.org
Creative Commons Attribution 4.0 International
35 stars 23 forks source link

Finalizing Rearrangement fields #85

Closed javh closed 6 years ago

javh commented 6 years ago

I just finished making all the (relevant) Change-O tools work with the AIRR format. Here's the set of spec changes I think are necessary after going through that as well as input from a few other discussions. Some of those discussions were offline, but some of it's covered in the issues: #84, #83, #82, #75, #58.

Before making a pull request with changes, I want to hear others' thoughts as some of these things require naming decisions which I'm undecided on. For all fields listed, I've proposed a non-exhaustive list of alternative names. I tend to prefer short names, but there's a good argument for longer, more explicit names as well.

Fields to modify

rearrangement_id (rearrangment)

rearrangement_set_id (rearrangement_set)

cell_index

sequence (query_sequence, sequence_input)

functional

Necessary fields to add

The following fields are missing, but necessary:

sequence_id (query_id)

alignment (aligned_sequence / sequence_alignment)

germline (germline_sequence)

Optional fields to add

The following missing fields are not necessarily needed, but I think we should add them as they are common aligner output.

chain (chain_type)

stop

vj_in_frame (v_j_in_frame)

mutated_invariant

Additional aligned sequence fields to add

These fields are not required, but they appear in the output of several tools, so I think we should add them as official optional fields.

The big issues here are what to call them, whether this implies a name change to the junction_nt field, and how this defines the names of positional fields (below).

By default, we have been assuming all sequences are nucleotides, so that sequence implies sequence_nt. I think this is easier than having an _nt and _aa field for everything.

There's also an implication that cdr1_sequence is the cdr1 sequence as it appears in the input sequence (sequence field) and cdr1_alignment is the cdr1 as it appears in the output sequence (alignment field). Whereas, cdr1 implies neither, but we can define it as extracted from the alignment field in the description.

Clarify naming of positional fields

Related to the above, maybe we should modify the names of the positional fields so they imply a direct relationship to the sequence field that they correspond to? For example:

Note, I think we actually need both *_sequence_start and *_alignment_start - one for NCBI GenBank submission (*_sequence_start) and one for analysis purposes (*_alignment_start), but I don't think we need to include both in the spec, because we can derive one from the other using the CIGAR string.

schristley commented 6 years ago

Thanks for summarizing this for us @javh.

I'm in the camp of having more fields specified is better. Even a field that can be derived from another, if the field is useful and/or being used by a tool, then I prefer it to be specified. That means I'm basically good with all the additional fields and would also advocate for both *_sequence_start and *_alignment_start fields to be added.

For naming, I also prefer the names to be more specific, the exception being some "core" fields like sequence is fine, it doesn't need to be sequence_input. Especially if a field might have different contexts, like you point out where fields may refer to either the input sequence or the output alignment.

It seems reasonable to assume all sequences are nt and thus use _aa for when it's amino acids. However, I also think it's reasonable to put _nt on the field if it may not necessarily be obvious that it's a sequence. Here I side on being more specific. I'm okay with junction_nt, I also prefer fwr3_nt over just fwr3 or fwr3_sequence. Is there any reason to provide (say) fwr3 as amino acids? Maybe, so fwr3 and fwr3_sequence might be ambiguous while fwr3_nt and fwr3_aa are less so. But I agree with your point that if fwr3 might refer to either the sequence or the alignment, then we need to handle both, in that case fwr3_sequence and fwr3_alignment seems appropriate with the _nt implied.

I would eliminate the x-crwg tag, 1) because it isn't useful to the CRWG API spec nor to the AIRR library, and 2) crwg is probably not the best name. I'm fine with rearrangement_set and rearrangement_set_id not being required for now. It's probably premature to specify these technical details anyways until CRWG is further along.

Lastly, for sequence_id, it might be useful for the AIRR library to generate one if one isn't provided, which could be helpful for tools that depend upon it. Of course, tools could just as easily do it themselves.

jianye00 commented 6 years ago

Here is what I thought.

  1. Under "Fields to modify", everything look fine except for the "sequence" field . I thought we should be strict about names and definitions for required fields so that we have a consistency in what they represent. I'd use "query_sequence" to avoid any ambiguity. As for definition, I thought it should only mean the original query sequence itself (or reversed and complemented as is defined in the original specs). If a tool wants to provide a sequence string of some other kind, it can be provided by a custom field.

  2. "Necessary fields to add". 1) query_id seems to be a better name as reasoned above. 2) alignment (aligned_sequence / sequence_alignment). I am not sure about the purpose of this. Isn't alignment already represented by the existing CIGAR alignment field? 3) germline (germline_sequence). Same as 2) above.

  3. "Optional fields to add" 1) chain (chain_type). If you took this from igblast, this means VH, VK, VL for B cell recepotor, VB, VA, VG, or VD for TCR. I'd prefer "chain_type" since it's a little more clear but I am fine with alternative names.
    2) I'd use stop_codon to be clear. 3) vj_in_frame (v_j_in_frame). vj_in_frame is better since it's more consistent with existing names like np1_seq, etc. 4) mutated_invariant. I am not what it refers to since we have a few different invariant amino acid positions.

  4. "Additional aligned sequence fields to add".
    Since the current specs have CIGAR string to represent alignment already, can't these additional alignment fields could be treated as custom fields and let the tools themselves to define?

  5. "Clarify naming of positional fields".
    I'd use "v_query_start" and "v_germline_start", etc to be clear. v_query_start indicates the first position in the query sequence that is aligned to the germline V sequence. I am not sure what you meant by referring to both sequence_start and alignment_start.

javh commented 6 years ago

Just wanted to address @jianye00's questions. (No opinions to offer yet.)

alignment, germline

This is partially addressed by the CIGAR string, yes, but not entirely because the CIGAR strings are by segment. Though, vdj_cigar could cover it, with masking for N/P regions. Not sure if that would conflict with how the alignment process works for some tools.

A simple example of what these two fields look like would be:

 sequence> CCCCCCACTGGNAGTTGACGAAGAGGGGCTGATTATTAGCTTCAGTCGTATGACGGGGACCAACGTCAGGGGGG
alignment>      CACTGGNAGTTGACGAAGAGGGGCTGATTATTA---GCTTCAGTCGTATGACGGGGACCAACGTCA
 germline>      CACTGGGAGCTGAGGATGA--GGCTGATTATTAAGTGGTTNNNNCGTGTGANNGGGACCAAGGTCA
   region> LLLLLVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVNNNNDDDDDDDNNJJJJJJJJJJJJJCCCCCC

Basically, the alignment field is just the concatenated top V hit query seq, N/P1, D hit query seq, N/P2, and J hit query seq in the IgBLAST output (to make a full length aligned, observed sequence). The germline field is then the corresponding full length inferred, aligned, germline sequence made using the subject seq fields instead.

These two fields are key analysis fields for us, because we do a lot of direct string comparisons between the full length observed and germline V(D)J sequences. While we should be able to make these from the CIGAR strings and N/P positions, that just adds an extra processing step before we can actually use the file for analysis.

mutated_invariant

This comes from the IMGT/HighV-QUEST output and how they define productive. If any invariant residue is either mutated or absent, then they note it. It's kind of like the stop field, which says a stop codon is found, but not where or how many. Though, maybe it would be better as string specifying the invariant. Eg, C104P being cystine 104 to proline.

I don't think it's a necessary field. Certainly not something I think IgBLAST needs to output. It'd just be present in the spec as an optional field to support IMGT's output.

Additional aligned sequence fields to add

Definitely. I was just thinking it'd be nice to have a set of official names for these optional fields because they are so common. They wouldn't be required.

x_sequence_start, x_alignment_start.

The difference is just whether it refers to the position in the unmodified input sequence or the aligned sequence. In the example above, j_alignment_start would be 53, but j_sequence_start would be 55, because the leader sequence (L in regions) has been removed (+5) and gaps have been inserted (-3) in the alignment sequence.

ssnn-airr commented 6 years ago

I basically agree with @schristley

I like sequence_input (and sequence_id) better than just sequence, query (and query_id) or query_sequence. As I see it, query_sequence is any sequence that I use to do some comparison to a reference, and it could be the whole input sequence, or maybe a part of the input sequence, or I may be using the aligned sequence as query for some analysis... I see query_sequence as something that can change with the context and sequence input as the 'raw' sequence data. But my opinion is probably biased due to the fact that we currently use SEQUENCE_INPUT. So if you go with query, that's fine.

jianye00 commented 6 years ago

@javh Ok you are essentially saying you want to add additional alignment fields (other than the existing CIGAR field). Since an alignment involves two sequences, there will be two alignment string fields (i.e. AGCT--CGTAGC) for each alignment (One is for the query and the other is for the germline). For each alignment, you also want to add start and stop coordinates (and in both sequence and alignment coordinates), for both query and germline. And of course we have 4 alignment categories (whole vdj, v, d, j). Is that correct? That's lots of fields but I am fine with it if people think that's not too many and such information is useful.

I'd say we could use "v_query_start" and "v_germline_start" to indicate sequence coordinates and "v_query_alignment_start" and "v_germline_alignment_start" to indicate alignment coordinates.

Mutated_invariant could stay but we better indicate which one it is. Stop codon is a little different, it does not matter where it is, you have the same result (no protein). The effect of mutation at different invariant position could be different.

javh commented 6 years ago

@jianye00, yeah, that's the idea. Though, a lot of these fields will just be name reservations for tools that need them and not mandatory output. The list of expected output would just be the things under the required tag - the following:

If a tool didn't output j_query_start or j_germline_start then you could just get it from the N/S operators at the front of j_cigar if you needed it, but if a tool did want to output it, we'd have a reserved field name for it.

javh commented 6 years ago

Oh, one thing I didn't notice before... We have germline_database in the required fields.

Isn't this more of a metadata item?

jianye00 commented 6 years ago

Sure that should not be here...typically the databases are fixed for a given run and therefore the values are the same for all queries.

I noticed this two fields: alignment (debatable) germline (debatable)

what are they?

javh commented 6 years ago

I noticed this two fields: alignment (debatable) germline (debatable) what are they?

The full length, aligned query and germline sequences from the "Necessary fields to add" section in my original post.

Edit: Names aren't final. I'm just using what I originally used for simplicity.

jianye00 commented 6 years ago

Ok, then the names need to be a little more informative, something lilke query_alignment, germline_alignment.

jianye00 commented 6 years ago

I'd also want to suggest an additional field. We have seen some user requests to output the translated amino acid sequence for the query (the translation would be from start of V gene in the query) . Currently the translation is available in the standard output but that's not easy to parse. Could this be appropriate for an optional field in AIRR format? Of course we could add it as a custom field but it seems to be of general interest.

javh commented 6 years ago

Yeah, I think that makes sense. It'd just be adding more _aa fields.

javh commented 6 years ago

Pinging @laserson, @williamdlees and @psathyrella for any input.

williamdlees commented 6 years ago

For 'chain type' I suggest adding locus and region as defined in the draft germline schema at https://github.com/airr-community/gldb-wg/blob/master/germline_schema/schema/receptor_germline_schema.yaml

locus: type: [Heavy, Light-Kappa or Light-Lambda for B-cell sequences, or Alpha, Beta or Gamma for T-cell sequences] description: Gene locus region: type: [V,D,J,C] description: Gene region

For the other fields - I agree that it is helpful to define 'useful' fields even if they can be derived. I think quite a few people will find the cigar string difficult to work with. There is no Excel implementation, for example. Having the cdrs and fwrs explicit will be helpful in applications that are less computationally oriented and it makes sense to have reserved field names and explicit definitions.

schristley commented 6 years ago

gene_locus (or v_gene_locus, v_call_locus ?) seems appropriate as mentioned by @williamdlees but maybe not gene_region as the [V,D,J,C] regions are implicitly defined (v_call, etc.). Also gene_locus seems to infer B-cell vs T-cell. Is that sufficient? Accurate for all species?

bussec commented 6 years ago

Sorry I somehow missed this before. A couple of comments:

jianye00 commented 6 years ago

@williamdlees

It seems to me that the already-defined "locus" is an appropriate replacement of "chain type" and basically convey the same thing. But why the Delta type is missing for TCR (probably also mentioned by @bussec)? I don't seem to see the reason for adding "region" since we are dealing with rearranged sequence, not individual v, d or j element.

javh commented 6 years ago

I'm working on combining everyone's responses into a PR right now... quick question:

Is rev_comp suppose to be True if the content of sequence is reverse complemented w.r.t. the input read, or is sequence suppose to remain in its original orientation and rev_comp indicates the alignment is reversed w.r.t. sequence?

Ie, does rev_comp=True indicate that sequence and junction are in the same or opposite orientations?

schristley commented 6 years ago

or is sequence suppose to remain in its original orientation and rev_comp indicates the alignment is reversed w.r.t. sequence?

That's how I interpret it. The input sequence had to be reverse complemented to be aligned, so sequence and junction (plus all other output fields) are in opposite orientations.

jianye00 commented 6 years ago

The rev_comp should be clearly defined. We have always used a simple approach. If the input read contains vdj sequence in opposite orientation, we first reverse and complement the input read and then use the reversed/complemented read for alignment. Therefore, all output data (i.e., alignment coordinates, nucleotide sequence, translated protein sequence, etc) are based on the reversed/complemented sequence of the original read.

Another thing is that if we want to add alignment fields for the entire vdj alignment, I thought we should also add the corresponding fields for individual v, d, j and c elements since some tools only deal with individual alignments.

javh commented 6 years ago

I added @jianye00's definition for rev_comp to the description after merging #89.

Maybe let's address fields like v_sequence_alignment, v_germline_alignment, etc, that are exact derivatives of CIGAR when we get to the hit table format? For now, I think maybe keep them as custom fields? If they become a need, we can always add them later.

I'm also having mixed feelings about including optional explicit IMGT numbered alignment fields like sequence_imgt and germline_imgt that we use, so I left them out of the current pull request in favor of just using sequence_alignment and germline_alignment for those tools that need that. (Also, to avoid dragging out the discussion, so I could merge #89.)

Closing this for now. If you notice something that I didn't address, please reopen.