@matsen Starting with the assumption that we want a simple logic for basic analysis tools to perform calculations on the "primary" annotations and ignore everything else. Let's say that basic logic is:
if rank != 1 then ignore row
with the further assumption that all rows with rank==1 will have unique read identifiers, essentially corresponding to the input read sequences (duplicate sequences may be collapsed).
This is a valid file:
id     annot_grp  rank  V_call
read1  annotA     1     v3-23
read2  annotB     1     V4-301
read2  annotB     2     V4-302
read1  annotC     2     v3-23
read2  annotC     3     V1-69
read1  annotC     3     v3-23
read2  annotC     4     V1-69
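For concreteness, that logic is a one-liner against a dataframe. A minimal sketch in pandas (the filename and tab delimiter are assumptions):

```python
import pandas as pd

# Hypothetical filename; columns as in the toy table above.
df = pd.read_csv("annotations.tsv", sep="\t")

# A group-unaware tool keeps only the rank-1 rows.
primary = df[df["rank"] == 1]
```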
Now, how do we handle tools that want to group reads together and provide alternative annotations? An initial file like the one in Uri's post above is problematic, because the rank==1 logic no longer holds. Tools would then have to consider the annotation group, but the issue is that "basic analysis tools" won't know which group to use.
This file, however, is valid, with the idea that read1 actually has its "primary" annotation in one of the groups:
id     annot_grp  rank  V_call
read1  annotA     2     v3-23
read2  annotB     1     V4-301
read2  annotB     2     V4-302
read1  annotC     1     v3-23
read2  annotC     3     V1-69
read1  annotC     3     v3-23
read2  annotC     4     V1-69
Is this a valid concept for the tools that group reads? Maybe those groups are "disjoint" from each other and they want to reuse the rank numbers within each group.
Another idea is to extend the logic slightly by checking a "special" annotation group which is essentially the null group:
if rank != 1 or annot_grp is not null then ignore row
id     annot_grp  rank  V_call
read1  null       1     v3-23
read2  null       1     V4-301
read2  annotB     2     V4-302
read1  annotC     1     v3-23
read2  annotC     3     V1-69
read1  annotC     3     v3-23
read2  annotC     4     V1-69
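As a sketch, the extended predicate adds one condition (assuming empty annot_grp fields are read as NA):

```python
import pandas as pd

df = pd.read_csv("annotations.tsv", sep="\t")  # hypothetical filename

# Keep a row only if it is rank 1 AND its annot_grp is the null group;
# pandas reads an empty TSV field as NaN, hence isna().
primary = df[(df["rank"] == 1) & (df["annot_grp"].isna())]
```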
Another idea is for ranks to be unique across groups but ordered within each group; then the basic rank==1 logic still applies:
id     annot_grp  rank  V_call
read1  annotA     1     v3-23
read2  annotB     1     V4-301
read2  annotB     2     V4-302
read1  annotC     2     v3-23
read2  annotC     3     V1-69
read1  annotC     3     v3-23
read2  annotC     4     V1-69
An open question: for a tool that produces an output file with read groups, what would be the expected correct output of a "basic analysis tool" that processes that file but has no concept of groups?
Erick's asked me to respond, and unfortunately I was travelling and missed the last call, so hopefully that's why I'm spectacularly confused. I'm guessing that a "read group" is a grouping of different reads that are simultaneously aligned? But then it looks like we're still indexing them by a single read's id? So I must be missing something.
Sorry for the confusion.
This may not be relevant, since I understand that there were other reasons folks didn't like it, but the format I use (the pull request I submitted) that handles multi-sequence annotations is transparent for non-multi-seq users: there's actually no way to tell it handles multi-seq annotations if you're looking at a file that only has single-sequence annotations.
What was discussed on the call is that a VDJ assignment tool might group a set of reads together and annotate that group; partis was given as an example. It was suggested that the "id" would go from a single id to a list of ids, but this was rejected as it would break the semantics of the standard. Instead, we discussed keeping a single id per row, but then adding a column which defines an "annot_grp" that could be used to group reads. An id could then appear in multiple rows, each belonging to a different group. The question becomes how we can let tools, which don't have the concept of read groups, still be able to operate on the file with some simple logic and achieve the same results.
ah, great, thanks for the explanation. I guess I was understanding it, then.
I think the problem we're encountering is that taking a single-sequence annotation line and adding a column describing other sequences with which it is grouped doesn't really do very much to describe the underlying biology.
The problem is that a number of annotation variables (gene calls, deletion and insertion lengths) are, biologically, associated with a rearrangement event, and thus all sequences that we think stem from that event, and not with a single sequence. A number of other annotation variables (mature sequence, functionality info, mutation info, shm indel info... ) are instead associated with individual sequences. Any wet lab biologist that's spending time looking at bcr alignments is already taking account of this in their head just by taking the rough consensus of several sequences that they think are clonal, because they know that their annotation will be much more accurate, i.e. in effect they're doing multi-sequence annotation.
In other words, I don't think adding a column like this really accomplishes the goals of supporting multi-sequence annotation. As you correctly point out, we're just smooshing a little multi-sequence info onto what is fundamentally a single-sequence format, which results in kludges and would require overly complex code.
Hrm. I'm inclined to agree with @psathyrella. I feel like we've regressed back to the concept of trying to cram a hit table into the analysis table, which I don't see working well.
Instead, we can have these rank and annotation group columns in a separate hit table file, as this information is either derived from, or the basis of, annotation scores/probabilities.
That's a good point. It gets at the difference between the actual underlying biological event and an observation (of which there could be many) of that event. I think the WG standard has been focused on the latter, not the former, as the latter is more well-defined, being a sequence coming off a sequencing machine, while the former is an inference which might be defined in numerous ways.
Does it seem reasonable to you then that the WG standard focuses purely on the latter, or do you think the standard really should be annotation of a "biological rearrangement" with all of its associated evidence?
In my opinion, it would be best if we endorse task-specific formats sparingly, but focus on a few common cases. Like:
I see (2) and (3) more like log files. Something I'd look at once, but wouldn't use for analysis unless there was a problem with (1).
> Does it seem reasonable to you then that the WG standard focuses purely on the latter, or do you think the standard really should be annotation of a "biological rearrangement" with all of its associated evidence?
I don't think that we need to choose one or the other. Representing biological rearrangements in this case is strictly more general than representing sequences. Single-sequence annotation is just biological-rearrangement annotation where we don't leverage the information that can be gained by considering multiple sequences at once.
> taking a single-sequence annotation line and adding a column describing other sequences with which it is grouped doesn't really do very much to describe the underlying biology
I'm not sure I understand why this is the case. Taking a set of sequences and saying they all belong to the same group is exactly like adding a column called rearrangement_id and giving all those sequences a unique identifier in that column, no?
> we're just smooshing a little multi-sequence info onto what is fundamentally a single-sequence format
This seems perfectly valid to me, as long as the real underlying object is not a tree, in which case we really will lose information. The "kludgy" part is that any analysis that wants to account for the joint annotation must first perform a group-by operation on this data set, which could in theory be expensive. (But for non-distributed systems, it would in practice be cheap because all the reads in a read group would be physically next to each other.)
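E.g., in pandas the group-by really is a one-liner. A sketch, with the filename and column names assumed from the examples above:

```python
import pandas as pd

df = pd.read_csv("annotations.tsv", sep="\t")  # hypothetical filename

# Recover the jointly annotated read sets: one group per annotation group id.
for group_id, group in df.groupby("annot_grp"):
    # 'group' holds every row sharing this joint annotation, so an analysis
    # can see all member reads at once; replace print with real work.
    print(group_id, list(group["id"]))
```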
> we've regressed back to the concept of trying to cram a hit table into the analysis table, which I don't see working well.
This is a compromise we made as an alternative to endorsing 2 files: the "truth"/"primary" file, and the "secondary" file with all output alignments. I'm personally not against endorsing the use of multiple files. As you describe later, the different types of files really do correspond to different "types", so perhaps that makes more sense.
> Representing biological rearrangements in this case is strictly more general than representing sequences.
+1
I think @javh summarized some of our proposed "types" pretty well. It seems we were trying to go down a path that combined his truth and hit table into one, but that may be too complicated if we want to support things like joint calling (and I think that we do). Since we're already kind of agreeing that we need a separate file format for representing clonality, perhaps it is natural to have a "hit" format as well? Ideally, it should be easy to go from such a hit table to the truth table unambiguously and reproducibly.
Using IgBLAST as an example, it does separate alignments for the V, D and J segments. If you just wanted the top 5 hits of each in the file, it could get unwieldy quickly. Do you put the top V, D and J hits all in one row and assign them rank 1? Then what do you do with rank 2? Have a rank-2 row for each of V, D and J with NA in the other columns? Or do you enumerate all possible V, D, J combinations and come up with some weighting scheme over the per-segment alignment scores to rank each combination somehow?
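For concreteness, here's what that last option might look like: a toy sketch with made-up hits and a naive sum-of-scores weighting, not what IgBLAST or any real tool actually does:

```python
from itertools import product

# Hypothetical per-segment hits as (gene, score) pairs, best first.
v_hits = [("IGHV4-3*01", 120.0), ("IGHV4-3*02", 115.5)]
d_hits = [("IGHD1-26*01", 35.2), ("IGHD2-2*01", 30.1)]
j_hits = [("IGHJ6*02", 80.3), ("IGHJ4*02", 75.0)]

# Score each V/D/J combination as the sum of its per-segment scores.
combos = sorted(
    product(v_hits, d_hits, j_hits),
    key=lambda combo: sum(score for _, score in combo),
    reverse=True,
)
for rank, ((v, _), (d, _), (j, _)) in enumerate(combos, start=1):
    print(rank, v, d, j)
```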
Anyway, that's why I'm thinking the one file to rule them all won't work.
@laserson the things I was saying get missed if we just add a column to a single-sequence annotation row are all the inherently per-sequence variables (off the top of my head: mature/input sequence, mutation info, and functionality info). I.e. if it's a single-sequence row it only has one mature/input sequence, but then if the extra column says it's an annotation for several other ids as well, where does the parser go to get those other sequences?
And it seems like the problem Scott pointed out at the top is a symptom of this -- the fact that it's basically a single-sequence row means that the row is telling us "I'm basically about this single id in the first column", but then later it tells us "oh, wait, there's the other ids I'm about, too, and you might have to look down another few rows to decide which one to use".
@psathyrella So do you have a potential solution? Up above I ran through examples of what the content might look like. What is wrong, and how would you change it?
I mean, I have the format that I use, i.e. the pull request, but I don't have a solution for your specific problem up top. What I meant by saying that the paradigm of adding a column to a single-seq format doesn't accomplish the goals of a multi-seq format was that it didn't seem like a good idea to significantly complicate the single-seq format in order to have a multi-seq format that doesn't really work.
In other words, what are the constraints? All I know is what I'm using is off the table, but I don't know which bits of it I'm allowed to keep.
Okay, I think I understand your viewpoint now, and it's valid. The proposal for multi-seq using multiple rows does have the issue that many annotation fields are de-normalized and duplicated across those rows, which can lead to inconsistencies.
But one point about @psathyrella's pull request proposal: we debated these "multi-entry" columns a lot back at the beginning of the WG, and there was a pretty strong consensus that we needed to avoid them. They introduce so many problems with delimiter usage, escaping, quoting, etc. Any defaults we pick will not work for some sequence somewhere, so we don't want to put that in the standard.
Now that said, we cannot prevent some tool from doing that if they want, and it is actually reasonable when there is a controlled vocabulary for the values (e.g. gene names, computer generated indexes).
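To make that trade-off concrete, a multi-entry cell under a controlled vocabulary might look like this (a sketch; the comma as inner delimiter is an arbitrary choice):

```python
# Tolerable when values come from a controlled vocabulary (gene names never
# contain commas), but every consumer still has to know about and apply the
# inner delimiter on top of the TSV parsing.
cell = "IGHV4-3*01,IGHV4-3*02"  # hypothetical multi-entry V_call cell
calls = cell.split(",")
```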
@javh brings up two really good points. IgBlast provides individual alignments for V, D and J, and multiple of them as well. The VDJML spec is a more proper representation in this regard because it stores all of those alignments. Our repsum tool picks the highest-scoring combo, and that's what it reports as the rank=1 rearrangement. ChangeO's MakeDb probably does something similar. The other point is that there isn't one file to rule them all: VDJML is for alignment, it isn't a rearrangements file.
So it sounds like I've come to the same conclusion that everybody else has already reached: we are trying to munge different types into a single file, and it's not gonna work. Since we've pretty much ruled out a file format that could support combining multiple types by sticking with TSV, it seems like our only choice is to go the multiple-file route?
> if it's a single-sequence row it only has one mature/input sequence, but then if the extra column says it's an annotation for several other ids as well, where does the parser go to get those other sequences?
That's what the "group-by" operation is for. Then you have access to all the sequences that were jointly annotated. Does this not precisely solve your issue?
> there was a pretty strong consensus that we needed to avoid [multiple IDs per field, i.e., nested data]
I think this may be the most crucial feature to make this format easy to use and scalable to larger data sets and multiple query engines.
@javh: in your example, our current view was that there would be null values in the V, D, or J columns as appropriate, and any per-sequence annotations would/could be replicated in each row. But I agree that this can be janky, esp. in light of joint-annotation.
@laserson oh, I see what you mean. Yes, that addresses what I was saying -- so we'd be enforcing that there'd be a line somewhere further down for every sequence that we wanted to be clonal. I had been assuming that we would maintain some level of independence between lines. But, yeah, if we're starting from the assumption that we will limit our special/forbidden character set to the two common non-printable characters (\t and \n), then pretty clearly if we want to represent any kind of structure we're going to be leaking information between lines.
I think that increasing the forbidden character set is a reasonable price to pay for maintaining independence between lines. Introducing significant line-interdependence makes parsing more complicated, and thus error-prone. Forbidding characters, otoh, is trivial to implement -- you just provide some translation -- but if you're the person with the data set that happens to have the forbidden characters in the ids, it's definitely annoying. But, anyway, I definitely accept that judgements of the relative merits of the two choices are entirely normative.
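The translation really is trivial to provide, e.g. (a sketch; mapping both characters to spaces is an arbitrary choice):

```python
# Map the forbidden characters (tab, newline) to spaces before writing a
# free-text field into a TSV cell.
FORBIDDEN = str.maketrans({"\t": " ", "\n": " "})

def sanitize(field: str) -> str:
    return field.translate(FORBIDDEN)
```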
I understand the attraction of collapsing records using nested delimiters, but I really think this will do more harm than good in the long run.
Limiting yourself purely to rows/columns essentially corresponds to the decades-old relational data model, for which there exist tons of database engines with many different features, along with already-implemented parsers in pretty much every single language. You can almost trivially load the data into multiple SQL engines, along with many Hadoop tools without doing any additional processing. You can trivially load the data into dataframes in R or Python, which are the most commonly used structures for analytics. Performing data flow operations on relational data is very well understood, including operations like joins, filters, groupings, aggregations. There exist lots of visualization tools that can automatically generate tons of useful figures from relational data.
In contrast, allowing additional delimiters essentially creates nested records, which breaks the relational model and for which there is no obvious standard for querying. It means that some fields will be strings that must be parsed, meaning that it's harder to use predicates/aggregations/filters against them. It also means that parsers must be implemented manually for the file format, which is a huge source of errors. It also means that it's harder to perform type checking since more fields will be strings. Finally, some people don't get the memo about forbidden characters and then your whole parser is broken.
Many cases in which you want to have nested records really means that you have multiple tables with some kind of join relationship in a relational model. This is exactly what @javh has pointed out, where we may be better off with, for example, an alignments file and an annotated-rearrangement file, or something like that.
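E.g., with a hypothetical two-file layout, the combined view is one join away. A pandas sketch, where file and column names are assumptions:

```python
import pandas as pd

# Hypothetical two-file layout: one row per rearrangement, many hits per read.
rearrangements = pd.read_csv("rearrangements.tsv", sep="\t")
hits = pd.read_csv("hits.tsv", sep="\t")

# Standard relational operations recover whatever combined view is needed.
best_hits = hits[hits["rank"] == 1]
combined = rearrangements.merge(best_hits, on="id", how="left")
```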
That's why I think "restricting" ourselves to the most vanilla formats (TSV and YAML) will actually be the most future-proof and flexible thing we can do. The hard part for us is then purely in the data modeling, rather than the data modeling AND the parsing/implementation.
I agree with @laserson. The only thing I'd add is that I think the notion of annotation groups will fit fairly easily into a hit table format. As I see it, the annotation group is just another scoring field (like e-value, p, etc). Eg:
id     segment  call        p    group  rank
read1  V        IGHV4-3*01  0.8  A      1
read1  V        IGHV4-3*02  0.7  B      2
read2  V        IGHV4-3*01  0.8  A      1
read2  V        IGHV4-3*02  0.6  C      2
In the above, the group column adds information, but removing it doesn't break the format at all.
Seeing as it's been quite a few posts since we've actually disagreed on anything, I'll just say that I, too, like jason's multiple-file idea. Although maybe a yaml for the inherently structured/nested info like alternative annotations and multi-sequence alignments, and a tsv for a single-sequence only-the-best-annotation summary of that?
Hit table example: http://scikit-bio.org/docs/latest/generated/skbio.io.format.blast7.html
Not sure if this is exactly what folks were looking for, but what follows is roughly a yaml translation (by hand! I'm sure I screwed up/made stuff up) of a partition output file and then an annotation output file. It's important to be able to represent many alternative partitions (for instance, in the most recent project with a wet lab, they wanted us to keep clustering as far as possible to get anything remotely related -- i.e. just scroll down the partition file to a less-likely but more-clustered partition).
Logically speaking, an annotation file could correspond to each partition in the partition file, but it's much easier to just keep the concepts separate so you can write/not write whatever you need to.
Lots of this is just shorthand/ideas, e.g. the comments:
meta_info: {[...]}
partitions:
  - # first partition
    log_prob: -100039391.39841
    clusters:
      - # first cluster
        sequence_ids:
          - 98913
          - bacdd
          - kjalsjdf;als
      - # second cluster
        sequence_ids:
          - AAA
          - BBB
      - # third cluster
        sequence_ids:
          - zzz
  - # second partition
    log_prob: -107388149391.39841
    clusters:
      - # first cluster
        sequence_ids:
          - 98913
          - bacdd
          - kjalsjdf;als
      - # second cluster
        sequence_ids:
          - AAA
          - BBB
          - zzz
[...]
meta_info: {germline_set_pointer: [...]}
annotations:
  - # first annotation
    sequence_ids:
      - 0001
      - 0037
      - 0073
    input_seqs:
      - NNNGAGGTGCAGCTGGTGGAGTCTGGGGGAGGCCTGGTCAAGCCTGGGGGGTCCCTGAGACTCTCCTGTGCAGCCTCTGGATTCACCTTCAGTAGCTATAGCATGAACTGGGTCCGCCAGGCTCCAGGGAAGGGGCTGGAGTGGGTCTCATCCATTAGTAGTAGTAGTAGTTACATATACTACGCAGACTCAGTGAAGGGCCGATTCACCATCTCCAGAGACAACGCCAAGAACTCACTGTATCTGCAAATGAACAGCCTGAGAGCCGAGGACACGGCCGTGTATTACTGTGCGAGAGGGGGGTATAGCAGTTGCTGGTTCGACCCCTGGGGCCAGGGAACCCTGGTCACCGTCTCCTCAGN
      - NNNGAGCTGCAGTTGTTGCAGTCCGGAGCAGTGGTGAAAAAGTCCAGGGAGTCTCTGAAGATCTCTTGTAAGGGTTCTGGATCCAGCTTTACCACCTCCTGGACCGGCTGGGTGAGCCAGATGCCCGGGAAAGGCTTTGAGTTGGTGAGGATCATCTATCCTGGTGACGCTGATACCAGACACAGCCCGTCCTTCCATTGTTGGGTCACCTTCTCAGCCGACGCGTCCATTACCGCCGCCTATCTGCAGTGGAGTAGCCTGAAGGCCTCGGACACCGCCATGTATTGCTGTACGAGACACCACGGATCTTGTAGTAGTAGCAGCTGCTATGGCTTCCGCTTTGGTTCTTGGGGCCGGGGTACCTTGGTCACCGTCTCATCAG
      - NNNGAGGTGCAGCTGATGCACTCCGCCCCAGAGGTGAAAAAACCCGGGCAGTCTCTCAAGATCTCATGTAAGGGTTCTGGATTTAGTTTTACCAGCTACTGGATCGGCTGGGTAAGCCAGATGCCTGGGAAAGGCTTCGTGTGGATGGGGATCATCTATCCTGGTCACTCTCATACCAGATACAGCCCGTCCTTCTGTGGACAAATCACGATTTCAGCTGACAAGTCCCTTAAGACCGCCTACCTGCAGTGGAGTAGCCTGAAGGCCTCGGACACTGCCATATATTACTGTGCGAGACACCGCGGATCTTGAGGTAGTAACATCTGCTATCTCTTCTTCCTTGGCTTATGGGGCCAGGGAACCCTGGTCACCGCCTCCTCAG
    v_gene: IGHV5-51*02
    d_gene: IGHD1-26*01
    j_gene: IGHJ6*02
    v_5p_del: 0
    v_3p_del: 3
    d_5p_del: 1
    d_3p_del: 4
    j_5p_del: 0
    vd_insertion: 'ACGGTC'
    dj_insertion: ''
    cdr3_length: 54
    naive_seq: NNNCAGGTTCAGCTGGTGCAGTCTGGAGCTGAGGTGAAGAAGCCTGGGGCCTCAGTGAAGGTCTCCTGCAAGGCTTCTGGTTACACCTTTACCAGCTATGGTATCAGCTGGGTGCGACAGGCCCCTGGACAAGGGCTTGAGTGGATGGGATGGATCAGCGCTTACAATGGTAACACAAACTATGCACAGAAGCTCCAGGGCAGAGTCACCATGACCACAGACACATCCACGAGCACAGCCTACATGGAGCTGAGGAGCCTGAGATCTGACGACACGGCCGTGTATTACTGTGCGAGAGCGGTGGGAGCTACTACGCATTACTACTACTACTACGGTATGGACGTCTGGGGCCAAGGGACCACGGTCACCGTCTCCTCANN
    mut_freqs:
      - 0.08
      - 0.02
      - 0.15
    mutated_invariants:
      - False
      - True
      - True
    stop_codons:
      - False
      - False
      - True
    in_frames:
      - True
      - False
      - False
    v_per_gene_support:
      - IGHV5-51*02: 0.9
      - IGHV5-51*07: 0.05
      - IGHV1-18*01: 0.02
      - IGHV3-3*08: 0.03
    d_per_gene_support:
      - IGHD1-26*01: 0.973204843503
      - IGHD2-2*01: 0.0181903564411
      - IGHD2-2*02: 0.00782888099897
      - IGHD2-2*03: 0.000414480059791
      - IGHD2-21*02: 0.000312917590508
      - IGHD6-19*01: 1.6672436832e-05
    j_per_gene_support:
      - IGHJ6*02: 1.0
      - IGHJ6*01: 5.05824238798e-33
  - # second annotation
    sequence_ids:
      - 099099
[...]
And perhaps it's easier to visualize if I include screenshots of the output of the ascii-art printer reading these kinds of files (the view-partitions and view-annotations partis args).
So if I understand @psathyrella, these are the actual contents (partitions and annotations) that are produced and used by partis. VDJML can support your annotation file (and partition file) with minor changes, and they could be combined or kept in separate files. Don't like XML? Well, I think all the VDJML stuff can be specified with YAML, just like you showed.
Now this brings us back full circle. At the beginning of the WG, @laserson and others argued that formats like JSON, YAML, XML, etc. provide greater flexibility than TSV. Some members strongly felt that these formats placed too high a usage barrier, while TSV has essentially no barrier.
I would argue that we now have two cases, partis and IgBlast, that indicate a structured format is required to sufficiently and efficiently represent the output of the "VDJ alignment/assignment" step in an analysis. I'm not sure about IMGT VQUEST, maybe @javh knows?
My thought is that we actually have two primary use cases: 1) the full "VDJ alignment/assignment" information is required, and 2) only the summary best annotation is needed. We can support both of these use cases, with YAML for 1) and TSV for 2). The TSV can be generated from the YAML by picking the "best annotation".
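Generating the TSV from the YAML would be mechanical, something like this sketch (hypothetical filenames and column choices, assuming the annotation structure above and PyYAML for parsing):

```python
import csv
import yaml

with open("annotations.yaml") as f:
    data = yaml.safe_load(f)

with open("summary.tsv", "w", newline="") as out:
    writer = csv.writer(out, delimiter="\t")
    writer.writerow(["sequence_id", "v_gene", "d_gene", "j_gene"])
    for ann in data["annotations"]:
        # One summary row per sequence; the per-rearrangement gene calls
        # are simply repeated on every member row.
        for seq_id in ann["sequence_ids"]:
            writer.writerow([seq_id, ann["v_gene"], ann["d_gene"], ann["j_gene"]])
```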
nah, I'm using csvs internally -- this is a rough/by-hand translation to yaml of the info in my csvs. Although, after writing it down and seeing how much cleaner it is, I will likely switch.
Also, while I was arguing before that we should have a tsv summary of the yaml information, after seeing the yamls, I don't think there's really any reason for this. While many people will be unfamiliar with the .yaml filename extension, I think we can make the format so blindingly obvious when you open the file that it's not worth introducing a whole other file to "simplify" things.
YAML is nice in certain ways, but if we could achieve what we want with 2 CSV files I think it will still be beneficial. Using CSV means that the data can be loaded into a dataframe trivially for analysis. And it will have broader programming language/environment support.
The other thing to note is that there is a large advantage to having one record per line. It ensures that large data files are trivially "splittable", i.e., if you start reading at some random byte offset into the file, it is unambiguous and easy to find the start of the next record. This would also be true for single-line JSON data (which technically would be valid YAML), but it certainly complicates matters somewhat.
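Concretely, the splittability trick is: seek to an arbitrary byte offset, discard the partial line you landed in, and every subsequent line is a complete record. A sketch:

```python
def records_from_offset(path, offset):
    """Yield complete tab-delimited records starting at or after a byte offset."""
    with open(path, "rb") as f:
        f.seek(offset)
        if offset > 0:
            f.readline()  # discard the (possibly partial) line we landed in
        for line in f:
            yield line.rstrip(b"\n").split(b"\t")
```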
Also, YAML/JSON is far more verbose than CSV, since you have to keep replicating the field names into each record.
I don't think language support is really an issue -- as far as I know every major language reads yaml trivially.
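E.g., reading the partition file sketched above and pulling out the most likely partition takes a few lines of Python with PyYAML (assuming that structure):

```python
import yaml

with open("partitions.yaml") as f:  # hypothetical filename
    data = yaml.safe_load(f)

# Pick the most likely partition: the one with the highest log probability.
best = max(data["partitions"], key=lambda p: p["log_prob"])
for cluster in best["clusters"]:
    print(cluster["sequence_ids"])
```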
I totally agree that there is a huge benefit to having one record per line. But I think what we've spent the last few months establishing is that all the tsv formats we can think of have so much inter-line dependency that, while perhaps technically we could read one line at a time, in order to extract useful annotation information in practice we have to read many surrounding lines.
Yes, yaml is verbose, but that verbosity provides the human readability that makes it faster for debugging, and faster for people not familiar with the standard to start using it (no one reads manuals). Large files are anyway typically compressed so it wouldn't really cause problems.
> But I think what we've spent the last few months establishing is that all the tsv formats we can think of have so much inter-line dependency that, while perhaps technically we could read one line at a time, in order to extract useful annotation information in practice we have to read many surrounding lines.
I don't think this is true in general. It seems to be an issue in particular for joint-annotation, which I think we should work to support. But mostly the inter-line dependencies are not very complex.
> verbosity provides the human readability that makes it faster for debugging, and faster for people not familiar with the standard to start using it (no one reads manuals).
I don't think either of these points is more true for YAML than for tab-delimited. Also, YAML's flexibility could potentially lead people who don't read the manual to do their own custom stuff in a way that's not compatible with the community.
It seems that this issue turns out to be moot, at least as practiced by partis. Partis, for example, reports multiple possible clusterings for a set of reads, which should induce a separate rearrangements file for each clustering. These sets of rearrangements can be represented with our spec.
Some example toy data: