Closed javh closed 6 years ago
Thanks for summarizing this for us @javh.
I'm in the camp of having more fields specified is better. Even a field that can be derived from another, if the field is useful and/or being used by a tool, then I prefer it to be specified. That means I'm basically good with all the additional fields and would also advocate for both *_sequence_start
and *_alignment_start
fields to be added.
For naming, I also prefer the names to be more specific, the exception being some "core" fields like sequence
is fine, it doesn't need to be sequence_input
. Especially if a field might have different contexts, like you point out where fields may refer to either the input sequence or the output alignment.
It seems reasonable to assume all sequences are nt and thus use _aa
for when it's amino acids. However, I also think it's reasonable to put _nt
on the field if it may not necessarily be obvious that it's a sequence. Here I side on being more specific. I'm okay with junction_nt
, I also prefer fwr3_nt
over just fwr3
or fwr3_sequence
. Is there any reason to provide (say) fwr3 as amino acids? Maybe, so fwr3
and fwr3_sequence
might be ambiguous while fwr3_nt
and fwr3_aa
are less so. But I agree with your point that if fwr3
might refer to either the sequence or the alignment, then we need to handle both, in that case fwr3_sequence
and fwr3_alignment
seems appropriate with the _nt
implied.
I would eliminate the x-crwg
tag, 1) because it isn't useful to the CRWG API spec nor to the AIRR library, and 2) crwg
is probably not the best name. I'm fine with rearrangement_set
and rearrangement_set_id
not being required
for now. It's probably premature to specify these technical details anyways until CRWG is further along.
Lastly, for sequence_id
, it might be useful for the AIRR library to generate one if one isn't provided, which could be helpful for tools that depend upon it. Of course, tools could just as easily do it themselves.
Here is what I thought.
Under "Fields to modify", everything look fine except for the "sequence" field . I thought we should be strict about names and definitions for required fields so that we have a consistency in what they represent. I'd use "query_sequence" to avoid any ambiguity. As for definition, I thought it should only mean the original query sequence itself (or reversed and complemented as is defined in the original specs). If a tool wants to provide a sequence string of some other kind, it can be provided by a custom field.
"Necessary fields to add". 1) query_id seems to be a better name as reasoned above. 2) alignment (aligned_sequence / sequence_alignment). I am not sure about the purpose of this. Isn't alignment already represented by the existing CIGAR alignment field? 3) germline (germline_sequence). Same as 2) above.
"Optional fields to add"
1) chain (chain_type). If you took this from igblast, this means VH, VK, VL for B cell recepotor, VB, VA, VG, or VD for TCR. I'd prefer "chain_type" since it's a little more clear but I am fine with alternative names.
2) I'd use stop_codon to be clear.
3) vj_in_frame (v_j_in_frame). vj_in_frame is better since it's more consistent with existing names like np1_seq, etc.
4) mutated_invariant. I am not what it refers to since we have a few different invariant amino acid positions.
"Additional aligned sequence fields to add".
Since the current specs have CIGAR string to represent alignment already, can't these additional alignment fields could be treated as custom fields and let the tools themselves to define?
"Clarify naming of positional fields".
I'd use "v_query_start" and "v_germline_start", etc to be clear. v_query_start indicates the first position in the query sequence that is aligned to the germline V sequence. I am not sure what you meant by referring to both sequence_start and alignment_start.
Just wanted to address @jianye00's questions. (No opinions to offer yet.)
alignment, germline
This is partially addressed by the CIGAR string, yes, but not entirely because the CIGAR strings are by segment. Though, vdj_cigar
could cover it, with masking for N/P regions. Not sure if that would conflict with how the alignment process works for some tools.
A simple example of what these two fields look like would be:
sequence> CCCCCCACTGGNAGTTGACGAAGAGGGGCTGATTATTAGCTTCAGTCGTATGACGGGGACCAACGTCAGGGGGG
alignment> CACTGGNAGTTGACGAAGAGGGGCTGATTATTA---GCTTCAGTCGTATGACGGGGACCAACGTCA
germline> CACTGGGAGCTGAGGATGA--GGCTGATTATTAAGTGGTTNNNNCGTGTGANNGGGACCAAGGTCA
region> LLLLLVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVNNNNDDDDDDDNNJJJJJJJJJJJJJCCCCCC
Basically, the alignment
field is just the concatenated top V hit query seq
, N/P1, D hit query seq
, N/P2, and J hit query seq
in the IgBLAST output (to make a full length aligned, observed sequence). The germline
field is then the corresponding full length inferred, aligned, germline sequence made using the subject seq
fields instead.
These two fields are key analysis fields for us, because we do a lot of direct string comparisons between the full length observed and germline V(D)J sequences. While we should be able to make these from the CIGAR strings and N/P positions, that just adds an extra processing step before we can actually use the file for analysis.
mutated_invariant
This comes from the IMGT/HighV-QUEST output and how they define productive. If any invariant residue is either mutated or absent, then they note it. It's kind of like the stop
field, which says a stop codon is found, but not where or how many. Though, maybe it would be better as string specifying the invariant. Eg, C104P
being cystine 104 to proline.
I don't think it's a necessary field. Certainly not something I think IgBLAST needs to output. It'd just be present in the spec as an optional field to support IMGT's output.
Additional aligned sequence fields to add
Definitely. I was just thinking it'd be nice to have a set of official names for these optional fields because they are so common. They wouldn't be required.
x_sequence_start, x_alignment_start.
The difference is just whether it refers to the position in the unmodified input sequence or the aligned sequence. In the example above, j_alignment_start
would be 53, but j_sequence_start
would be 55, because the leader sequence (L in regions) has been removed (+5) and gaps have been inserted (-3) in the alignment
sequence.
I basically agree with @schristley
I like sequence_input (and sequence_id) better than just sequence, query (and query_id) or query_sequence. As I see it, query_sequence is any sequence that I use to do some comparison to a reference, and it could be the whole input sequence, or maybe a part of the input sequence, or I may be using the aligned sequence as query for some analysis... I see query_sequence as something that can change with the context and sequence input as the 'raw' sequence data. But my opinion is probably biased due to the fact that we currently use SEQUENCE_INPUT. So if you go with query, that's fine.
@javh Ok you are essentially saying you want to add additional alignment fields (other than the existing CIGAR field). Since an alignment involves two sequences, there will be two alignment string fields (i.e. AGCT--CGTAGC) for each alignment (One is for the query and the other is for the germline). For each alignment, you also want to add start and stop coordinates (and in both sequence and alignment coordinates), for both query and germline. And of course we have 4 alignment categories (whole vdj, v, d, j). Is that correct? That's lots of fields but I am fine with it if people think that's not too many and such information is useful.
I'd say we could use "v_query_start" and "v_germline_start" to indicate sequence coordinates and "v_query_alignment_start" and "v_germline_alignment_start" to indicate alignment coordinates.
Mutated_invariant could stay but we better indicate which one it is. Stop codon is a little different, it does not matter where it is, you have the same result (no protein). The effect of mutation at different invariant position could be different.
@jianye00, yeah, that's the idea. Though, a lot of these fields will just be name reservations for tools that need them and not mandatory output. The list of expected output would just be the things under the required
tag - the following:
If a tool didn't output j_query_start
or j_germline_start
then you could just get it from the N/S operators at the front of j_cigar
if you needed it, but if a tool did want to output it, we'd have a reserved field name for it.
Oh, one thing I didn't notice before... We have germline_database
in the required
fields.
Isn't this more of a metadata item?
Sure that should not be here...typically the databases are fixed for a given run and therefore the values are the same for all queries.
I noticed this two fields: alignment (debatable) germline (debatable)
what are they?
I noticed this two fields: alignment (debatable) germline (debatable) what are they?
The full length, aligned query and germline sequences from the "Necessary fields to add" section in my original post.
Edit: Names aren't final. I'm just using what I originally used for simplicity.
Ok, then the names need to be a little more informative, something lilke query_alignment, germline_alignment.
I'd also want to suggest an additional field. We have seen some user requests to output the translated amino acid sequence for the query (the translation would be from start of V gene in the query) . Currently the translation is available in the standard output but that's not easy to parse. Could this be appropriate for an optional field in AIRR format? Of course we could add it as a custom field but it seems to be of general interest.
Yeah, I think that makes sense. It'd just be adding more _aa
fields.
Pinging @laserson, @williamdlees and @psathyrella for any input.
For 'chain type' I suggest adding locus and region as defined in the draft germline schema at https://github.com/airr-community/gldb-wg/blob/master/germline_schema/schema/receptor_germline_schema.yaml
locus: type: [Heavy, Light-Kappa or Light-Lambda for B-cell sequences, or Alpha, Beta or Gamma for T-cell sequences] description: Gene locus region: type: [V,D,J,C] description: Gene region
For the other fields - I agree that it is helpful to define 'useful' fields even if they can be derived. I think quite a few people will find the cigar string difficult to work with. There is no Excel implementation, for example. Having the cdrs and fwrs explicit will be helpful in applications that are less computationally oriented and it makes sense to have reserved field names and explicit definitions.
gene_locus
(or v_gene_locus
, v_call_locus
?) seems appropriate as mentioned by @williamdlees but maybe not gene_region
as the [V,D,J,C] regions are implicitly defined (v_call
, etc.). Also gene_locus
seems to infer B-cell vs T-cell. Is that sufficient? Accurate for all species?
Sorry I somehow missed this before. A couple of comments:
locus
:heavy
. However, the options for locus
should also include the Ig sigma and iota light chain loci present in fishes and frogs. Also see this post on b-t.cr.region
to segment
. The usual definition is that the "V region" is the stretch of DNA/RNA that encodes the "V domain" protein structure and is created by the rearrangement of a genomic V, D and J segment. This definition would then also be in line with the current keys used in the INSDC feature table.@williamdlees
It seems to me that the already-defined "locus" is an appropriate replacement of "chain type" and basically convey the same thing. But why the Delta type is missing for TCR (probably also mentioned by @bussec)? I don't seem to see the reason for adding "region" since we are dealing with rearranged sequence, not individual v, d or j element.
I'm working on combining everyone's responses into a PR right now... quick question:
Is rev_comp
suppose to be True
if the content of sequence
is reverse complemented w.r.t. the input read, or is sequence
suppose to remain in its original orientation and rev_comp
indicates the alignment is reversed w.r.t. sequence
?
Ie, does rev_comp=True
indicate that sequence
and junction
are in the same or opposite orientations?
or is sequence suppose to remain in its original orientation and rev_comp indicates the alignment is reversed w.r.t. sequence?
That's how I interpret it. The input sequence had to be reverse complemented to be aligned, so sequence
and junction
(plus all other output fields) are in opposite orientations.
The rev_comp should be clearly defined. We have always used a simple approach. If the input read contains vdj sequence in opposite orientation, we first reverse and complement the input read and then use the reversed/complemented read for alignment. Therefore, all output data (i.e., alignment coordinates, nucleotide sequence, translated protein sequence, etc) are based on the reversed/complemented sequence of the original read.
Another thing is that if we want to add alignment fields for the entire vdj alignment, I thought we should also add the corresponding fields for individual v, d, j and c elements since some tools only deal with individual alignments.
I added @jianye00's definition for rev_comp
to the description after merging #89.
Maybe let's address fields like v_sequence_alignment
, v_germline_alignment
, etc, that are exact derivatives of CIGAR when we get to the hit table format? For now, I think maybe keep them as custom fields? If they become a need, we can always add them later.
I'm also having mixed feelings about including optional explicit IMGT numbered alignment fields like sequence_imgt
and germline_imgt
that we use, so I left them out of the current pull request in favor of just using sequence_alignment
and germline_alignment
for those tools that need that. (Also, to avoid dragging out the discussion, so I could merge #89.)
Closing this for now. If you notice something that I didn't address, please reopen.
I just finished making all the (relevant) Change-O tools work with the AIRR format. Here's the set of spec changes I think are necessary after going through that as well as input from a few other discussions. Some of those discussions were offline, but some of it's covered in the issues: #84, #83, #82, #75, #58.
Before making a pull request with changes, I want to hear others' thoughts as some of these things require naming decisions which I'm undecided on. For all fields listed, I've proposed a non-exhaustive list of alternative names. I tend to prefer short names, but there's a good argument for longer, more explicit names as well.
Fields to modify
rearrangement_id (rearrangment)
required
.x-crwg: true
to denote the field as required in the CRWG context.rearrangement_set_id (rearrangement_set)
required
.x-crwg: true
to denote the field as required in the CRWG context.cell_index
required
.type
tostring
.sequence (query_sequence, sequence_input)
alignment
field (below) was based on that meets the NCBI GenBank submission criteria: "an uncorrected sequence, which may have gone through assembly or consensus, but not indel correction or truncation".functional
productive
.Necessary fields to add
The following fields are missing, but necessary:
sequence_id (query_id)
sequence
.rearrangement_id
as the row key for the Rearragenment object, within the context of an analysis file (the default context). The default expectation is that this would be equal to the input sequence header, but that is not strictly required, especially in cases wheresequence
corresponds to such an above described intermediate sequence.rearrangment_id
andsequence_id
have conflicting semantics, depending upon whether we are in an analysis file or database context.alignment (aligned_sequence / sequence_alignment)
sequence
in that in may include gaps, deletions, masking or trimming so that it is aligned against the reference sequence in thegermline
field (below). Some examples of what this field looks like areSEQUENCE_VDJ
/SEQUENCE_IMGT
from changeo,V-D-J-REGION
in the 2_IMGT-gapped-nt-sequences and 3_Nt-sequences files from IMGT/HighV-QUEST, themature_seq
field in partis.required
field because it's a primary analysis field. However, it's not universal output. For example, IgBLAST doesn't (presently) output this field for the full V(D)J. Instead, it has individual alignment output in thequery seq
column of the V/D/J hit table section.v_alignment
,d_alignment
, andj_alignment
fields.germline (germline_sequence)
alignment
field. Some examples of what this field looks like areGERMLINE_VDJ
/GERMLINE_IMGT
from changeo and thenaive_seq
field in partis.alignment
, IgBLAST doesn't provide a full length alignment for the germline, but does provide individual V/D/J aligned germlines via thesubject seq
column which we could support viav_germline
,d_germline
, andj_germline
fields.required
to compel tool writers to provide it.Optional fields to add
The following missing fields are not necessarily needed, but I think we should add them as they are common aligner output.
chain (chain_type)
stop
vj_in_frame (v_j_in_frame)
mutated_invariant
Additional aligned sequence fields to add
These fields are not required, but they appear in the output of several tools, so I think we should add them as official optional fields.
The big issues here are what to call them, whether this implies a name change to the
junction_nt
field, and how this defines the names of positional fields (below).By default, we have been assuming all sequences are nucleotides, so that
sequence
impliessequence_nt
. I think this is easier than having an_nt
and_aa
field for everything.There's also an implication that
cdr1_sequence
is the cdr1 sequence as it appears in the input sequence (sequence
field) andcdr1_alignment
is the cdr1 as it appears in the output sequence (alignment
field). Whereas,cdr1
implies neither, but we can define it as extracted from thealignment
field in the description.Clarify naming of positional fields
Related to the above, maybe we should modify the names of the positional fields so they imply a direct relationship to the sequence field that they correspond to? For example:
v_start
tov_sequence_start
orv_alignment_start
v_end
tov_sequence_end
orv_alignment_end
v_germ_start
tov_germline_start
v_germ_end
tov_germline_end
d_start
tod_sequence_start
ord_alignment_start
d_end
tod_sequence_end
ord_alignment_end
d_germ_start
tod_germline_start
d_germ_end
tod_germline_end
j_start
toj_sequence_start
orj_alignment_start
j_end
toj_sequence_end
orj_alignment_end
j_germ_start
toj_germline_start
j_germ_end
toj_germline_end
Note, I think we actually need both
*_sequence_start
and*_alignment_start
- one for NCBI GenBank submission (*_sequence_start
) and one for analysis purposes (*_alignment_start
), but I don't think we need to include both in the spec, because we can derive one from the other using the CIGAR string.