broadinstitute / gatk

Official code repository for GATK versions 4 and up
https://software.broadinstitute.org/gatk

Standardize formats for tabular data. #4717

Open sooheelee opened 6 years ago

sooheelee commented 6 years ago

This ticket opens a discussion between @samuelklee, @davidbenjamin and anyone else who is interested towards deciding if this is useful.

I think it would be useful (i) for the outputs of CollectAllelicCounts, GetPileupSummaries, and other similar tools to be in VCF format, and (ii) for the downstream tools that take these results to also accept VCF. I think it's good to adhere to standard formats for results that already nearly resemble variant calls, for the sake of versatility.

At the least, it would be great if the format that is output by the two tools and accepted by downstream tools is unified.

There is also discussion on whether these two particular tools should be merged; there may be other tickets covering that.

samuelklee commented 6 years ago

As I mentioned in our Slack chat, there is increasingly a need to standardize annotated-interval data (e.g., GC content, mappability) in the GATK. If we decide to co-opt the VCF format to do so for these tools, we might as well do it GATK-wide. The CNV data-collection/TSV classes are already designed in such a way that it should be easy to add VCF output.

However, I don't know if this is the right answer, since there are many required but generally non-applicable VCF fields (ID, REF, ALT, QUAL, FILTER) that will have to be filled in with arbitrary conventions. File sizes will probably also balloon.
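For example, GC content over a single interval might end up looking something like this hypothetical record, where ID/QUAL/FILTER are dots, REF is a dummy base, and the ALT is a made-up symbolic allele:

chr1    10001   .   N   <ANNOTATION>    .   .   END=11000;GC_CONTENT=0.42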

@jonn-smith @TedBrookings might have thoughts.

heuermh commented 6 years ago

Please don't.

samuelklee commented 6 years ago

I tend to agree. VCF is perhaps a special case of the format we actually need, so no point in shoehorning output that doesn’t fit just for the sake of standardization. @sooheelee may feel more strongly otherwise.

TedBrookings commented 6 years ago

Standard disclaimer up front that I'm still a bit new and can't speak to the needs of the GATK outside the small bits I've looked at.

The data I've wanted so far has all come from the UCSC genome browser, which has a reasonable format and good documentation. My personal feeling is that there's little reason to reformat that data, or to choose a different format for data we may create. For me the real problem is establishing fast and reliable lookup of that data. So far I've just thrown a couple of lightweight tracks into the resources on my git branch and loaded them into an interval tree. That storage strategy will not scale to all of the UCSC data, and in any case might antagonize them.


samuelklee commented 6 years ago

@TedBrookings which formats are you using, in particular? If there is a format out there that nicely fits our needs, we should adopt it.

In the CNV package, I've taken pains to unify how tabular data are represented in Java, depending on whether each record is Locatable and on whether the collection of records can be associated with a sample name or sequence dictionary. This allows us to represent records that extend Locatable with multidimensional numerical or non-numerical annotations, along with some metadata (sample name and sequence dictionary), with a minimum of boilerplate. There are also base methods for producing interval trees, etc. This pretty much satisfies all of the CNV team's needs (and is, in my opinion, a necessary improvement over the horror show of read/write utility methods for each file format that we had previously...)
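As a rough sketch of the pattern (illustrative names only, not the actual CNV classes), a record type just implements Locatable and carries its annotations as typed fields:

import htsjdk.samtools.util.Locatable;

// Illustrative sketch, not the actual GATK CNV classes: a locatable
// record carrying typed annotation columns alongside its interval.
public final class AnnotatedIntervalRecord implements Locatable {
    private final String contig;
    private final int start;
    private final int end;
    private final double gcContent;   // example numerical annotation
    private final String mappability; // example non-numerical annotation

    public AnnotatedIntervalRecord(final String contig, final int start, final int end,
                                   final double gcContent, final String mappability) {
        this.contig = contig;
        this.start = start;
        this.end = end;
        this.gcContent = gcContent;
        this.mappability = mappability;
    }

    @Override public String getContig() { return contig; }
    @Override public int getStart() { return start; }
    @Override public int getEnd() { return end; }
    public double getGcContent() { return gcContent; }
    public String getMappability() { return mappability; }
}

A collection class can then hold a list of such records plus the metadata, and provide the generic TSV read/write and interval-tree methods once.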

However, this unification effort was a quick push I made before release, so some polishing or redesigning may be warranted. We may also want to add more forms of metadata, etc. if other teams would require more features. Another downside is that this code lacks the indexing, NIO support, etc. that some of the other standardized/Tribble formats enjoy. For CNV data, this isn't a huge issue, but I think it would be nice to unify how we represent such data GATK-wide. As I said above, I don't think VCF is the correct answer, but certainly it could fit into whatever framework we adopt or come up with.

TedBrookings commented 6 years ago

I'm still not very familiar with the way people have used extensions of Locatable, or with how indexing can be used to boost performance. The stuff I've worked on is a bit of a corner case, and I didn't write much of the infrastructure; I've been tacking on features.

For now I've just been keeping the gzipped text files from the UCSC browser. They're tab-delimited with two header lines: the first basically gives info about the context of the data (e.g., that it's genome-wide data for hg38), and the second is a description of the columns (each of the form track_name.column_name). There's nothing at all sophisticated about this format, but it's pretty generalizable and easy to parse (and create). An example:

hgIntegrator: database=hg38 region=genome Wed Apr 18 11:15:34 2018
gap.chrom   gap.chromStart  gap.chromEnd    gap.type
chr1    0   10000   telomere
chr1    207666  257666  contig
chr1    297968  347968  contig
chr1    535988  585988  contig
chr1    2702781 2746290 scaffold

For what it's worth, your description of your approach sounds like a sensible one to me. I am concerned about the size of the data and how we'd access it. I've chosen the tracks I have because they are small enough to jam into resources.
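Concretely, loading one of these tracks into per-contig interval trees takes only a few lines. Here's a minimal sketch, assuming htsjdk's IntervalTree and the two-header-line format above (the class name and column positions are made up for illustration):

import htsjdk.samtools.util.IntervalTree;
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.HashMap;
import java.util.Map;
import java.util.zip.GZIPInputStream;

// Sketch: load a gzipped hgIntegrator-style TSV (two header lines, then
// chrom/start/end/type rows) into one IntervalTree per contig.
public final class GapTrackLoader {
    public static Map<String, IntervalTree<String>> load(final String path) throws IOException {
        final Map<String, IntervalTree<String>> treesByContig = new HashMap<>();
        try (final BufferedReader reader = new BufferedReader(new InputStreamReader(
                new GZIPInputStream(Files.newInputStream(Paths.get(path)))))) {
            reader.readLine(); // skip the "hgIntegrator: ..." context line
            reader.readLine(); // skip the column-description line
            String line;
            while ((line = reader.readLine()) != null) {
                final String[] fields = line.split("\t");
                // UCSC coordinates are 0-based half-open; htsjdk is 1-based closed.
                final int start = Integer.parseInt(fields[1]) + 1;
                final int end = Integer.parseInt(fields[2]);
                treesByContig.computeIfAbsent(fields[0], c -> new IntervalTree<>())
                             .put(start, end, fields[3]); // annotation value, e.g. "telomere"
            }
        }
        return treesByContig;
    }
}

Lookup is then just treesByContig.get("chr1").overlappers(start, end).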


sooheelee commented 6 years ago

Again, I feel strongly that the outputs of CollectAllelicCounts and GetPileupSummaries, being of similar data type, should be unified.

Also, it would be great if their formats adhered to one of those accepted by IGV, listed at http://software.broadinstitute.org/software/igv/FileFormats, of which VCF is one.

If another format is preferred (it seems @samuelklee is opposed to VCF), then having one unified format allows us to make a stronger case for getting visualization support.

mwalker174 commented 6 years ago

I'm more familiar with working with interval data files, so I can't speak much to variant call formats, except that VCF tends to be a bit wonky with interval data (such as SV calls) because, as @samuelklee mentioned, most of the fields are likely to be irrelevant and thus waste space.

I feel strongly that BED-like formats are the way to go for interval data, i.e.

contig  start   end x1  x2  x3  x4 ...
chr1    2938    3949    3.9 0   +   cat ... ...

where x1, x2, x3, x4 ... are columns for variables of different types (which could be specified by additional header lines like in VCF).
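To make that concrete, the column types might be declared up front with VCF-style metadata lines (purely hypothetical syntax, just to illustrate the idea):

##COLUMN=<ID=x1,Type=Float>
##COLUMN=<ID=x2,Type=Integer>
##COLUMN=<ID=x3,Type=Character>
##COLUMN=<ID=x4,Type=String>
contig  start   end x1  x2  x3  x4
chr1    2938    3949    3.9 0   +   cat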

jonn-smith commented 6 years ago

Annotations in VCF are a nightmare due to format requirements.

I'd recommend against using VCF to store annotations unless it's absolutely necessary.

The mechanism to do so is too unwieldy: either you add annotations by name per allele as real annotations (i.e., accounted for in the header and then added to the INFO field as applicable, with per-allele annotations separated by commas), or you add them to the INFO field as a single pipe-delimited annotation field, with commas separating this long annotation for each allele (this is currently what Funcotator does).

Both are kind of gross, with the former taking a LOT of extra space and the latter being basically unreadable by eye.

You also need to make sure annotations are sanitized for illegal characters (such as commas). Funcotator has an open issue for this.

A tabular format for annotations makes more sense, and, as much as it pains me to suggest it, MAF may be a quick answer here.

sooheelee commented 6 years ago

First, to clarify: GetPileupSummaries and CollectAllelicCounts tabulate ALT vs. REF counts at SNP sites. These tools do not consider anything larger than 1 bp at a time and are basically akin to pileup callers. File size is a sacrifice I would make in exchange for easy parsing of the data, which the plethora of available VCF tools enables.

@mwalker174's comment gives me an idea. If we need a format that stores intervals, or blocks, then we already have one--the GVCF format. This format is already compatible with IGV. Here's an example record:

chr1    58999897    .   T   <NON_REF>   .   .   END=58999902    GT:DP:GQ:MIN_DP:PL  0/0:31:90:31:0,90,1350

The END value is interpreted by IGV as the end of the block, and the GVCF format is recognized as valid VCF, one that uses a symbolic allele demarcated with <...>. Symbolic allele representations are described as early as the VCF v4.1 spec and expanded upon in the v4.3 spec.

@jonn-smith I think you are referring to MAF-format functional annotations, yes? Here we are discussing much less complex variant-call-level annotations, like genotype, that would have a few associated numerical values, e.g. the 10th, 50th, and 90th percentile allele fractions and the copy-ratio value for each CN segment.

One last thing of interest towards visualization. I've been reminded recently of some open-source software that is a standard for visualizing somatic data for biologists in the cancer genetics field--cBioPortal (I plugged in some data--try clicking on the different tabs). It is Dockerized, and the repo is at https://github.com/cBioPortal/cbioportal. They do a good job bridging the gap between genomics data and what the biologist is interested in. Their CNA visualization currently takes GISTIC-type data, and for segmented CNs they use a JavaScript version of IGV. It's worth checking out, and perhaps worth enabling GATK users to visualize their somatic data with cBioPortal.

I'd like for us to save our time for the interesting questions and leverage the hard work of others on questions such as standardization and visualization.

sooheelee commented 6 years ago

@heuermh, please give your reasons why not VCF. We'd like to hear.

samuelklee commented 6 years ago

I think it's best not to co-opt existing formats for storing variant calls and mutations if we want to store generic annotations. Furthermore, many of the drawbacks of VCF (e.g., wasted space from repeated tags and unused fields) are really not worth dealing with if our data is strictly tabular and well structured.

I think if we can settle on a format internally that satisfies all of our needs, then it'd probably take a very small amount of effort on the part of external developers to write adapters to consume it. After all, we are only talking about metadata (hopefully in a standardized but suitably flexible format, e.g. a SAM/VCF-style header) plus tabular data. It may also be that there is a format out there that already fits the bill, in which case we just need to do some more research and discussion.

I think this would be better than causing confusion and setting a bad example by co-opting unsuitable formats, even if this would require no additional effort for external developers.

sooheelee commented 6 years ago

What I'd ultimately like is for the CNV workflow to accept a Mutect2 callset, which is in VCF format, for the allelic counts. This sort of thing would be more feasible if ModelSegments could take in VCF and not just the TSV output of CollectAllelicCounts.

samuelklee commented 6 years ago

That is a separate matter altogether from both 1) unifying the allele-count collection tools, and 2) standardizing the format of tabular data. The most appropriate place for integration of Mutect2 SNV calls would be as input to the tumor-heterogeneity tool (along with the ModelSegments output) further downstream. This is because it is unlikely that including the SNVs as input to ModelSegments would significantly improve either segmentation or modeling there.

If the allele-count collection tools are unified, I think that the only redundant work done across both pipelines would be the calling of hets from the pileups, which is extremely cheap. However, we should certainly also unify the code to do this (which I've spoken to @davidbenjamin about as well).

droazen commented 6 years ago

Agree with those above arguing that VCF isn't appropriate for this purpose and would be a very bad fit. I certainly support the goal of adopting a single unified, standard format for tabular data throughout the GATK, however, and would ideally favor a solution at the HTSJDK/Tribble level to get all the benefits that provides, such as support for NIO and indexing (even if it means that the engine team has to extend Tribble to support records that are not Locatable).

We're happy to help with any efforts in the direction of unification, and would be eager to participate in any methods-wide discussions on this topic.

davidbenjamin commented 5 years ago

Here's an M2 wishlist:

tedsharpe commented 5 years ago

It seems to me that bedGraph is the best fit for this data type. Whatever we decide, it's critically important to be able to distinguish data from metadata. No unmarked column header, please.

SHuang-Broad commented 5 years ago

Let me play devil's advocate here (not saying VCF or anyone is evil):

If the final standard doesn't restrict what types of annotations are allowed or intended to be supported, I strongly feel it will turn into another VCF-like monster.

BED? Look at how many people call their files BED while the columns beyond the fourth are complete free-style (sure, there's no international standard). VCF? Try getting people to agree on how to unambiguously specify SVs in VCF....

Both are technically TSV files, offering a heck of a lot of flexibility.

samuelklee commented 5 years ago

Thanks for chiming in!

@davidbenjamin can you give a little more detail on the kind of merging operations you'd need?

@tedsharpe I believe bedGraph only allows a single annotation/track, and I'm not sure the track definition line is intended to hold any metadata other than display parameters: https://genome.ucsc.edu/goldenPath/help/bedgraph.html

As for the unmarked column-header line, the reason I decided this would be useful in the CNV TSV formats is that it's very easy to throw the table into a pandas or R dataframe for quick analysis, where you can then use the column names to manipulate the table. Typically, pandas/R TSV-loading methods let you specify @ as the comment character to strip the SAM-style header (although we recently ran into some trouble with this in https://github.com/broadinstitute/gatk/pull/581). Note that we require a single unmarked column header, which is easy enough to skip (in case you don't want to use it) if you know it's there.

On the other hand, one could argue that if we store the type of each column in the metadata, then any analysis code should technically use that to parse the table (rather than letting pandas/R automatically infer the type of each column). So a marked column header line would make quick analyses a bit more difficult (as users would need to write parsing code), but could encourage more careful downstream code practices.

@SHuang-Broad Just to be clear, the way I originally used "annotation" refers to any quantity that could be represented by a single type in a column (not in the sense of variant annotation). If string types are allowed, this is indeed pretty flexible! All I care about extracting is the common functionality related to the fact that we have locatable columns.

I think the concerns you raise about e.g. SV representation in VCF are a separate matter, but happy to discuss further.

I think once we decide what the header needs to be able to represent and what it should look like, this problem is mostly solved. There may be some things to decide about e.g. representation of doubles, NaNs, etc. but I don't think we need to be too rigid here.

tedsharpe commented 5 years ago

Thanks for helping me to understand why you didn't mark the metadata.

This may seem like quibbling, but I'd suggest that we mark the metadata with a comment character, and let the pandas/R users remove it. They'll notice if they forget to do that, because the columns won't be named as expected, and they'll have to fix it up.

Whereas the risk for automated programs is that they'll simply delete the first row, which might be real data if the file has been reordered for some reason, or if the tool implementing the standard is non-compliant. The resulting bugs will be subtle and might easily go undetected. Building in behavior to delete lines starting with "CHROM\t" seems odd and fraught with peril in a way that building in behavior to strip comments doesn't.
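A minimal sketch of a reader under that convention (assuming '#' marks both metadata and the column-header line, as in VCF; names here are illustrative):

import java.io.BufferedReader;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

// Sketch: with all metadata (including the column header) marked by '#',
// a reader only ever strips comment lines; no unmarked data row can be
// mistaken for a header, even if the file has been reordered.
public final class CommentedTable {
    public String[] columnNames;                        // last '#' line seen
    public final List<String[]> rows = new ArrayList<>();

    public static CommentedTable read(final BufferedReader reader) throws IOException {
        final CommentedTable table = new CommentedTable();
        String line;
        while ((line = reader.readLine()) != null) {
            if (line.startsWith("#")) {
                table.columnNames = line.substring(1).split("\t");
            } else {
                table.rows.add(line.split("\t"));
            }
        }
        return table;
    }
}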

That's my take, anyway.

davidbenjamin commented 5 years ago

@davidbenjamin can you give a little more detail on the kind of merging operations you'd need?

Just trivial stuff that amounts to concatenation, e.g. scattering GetPileupSummaries over chromosomes and then merging the results into a single table. Inputs would be internally sorted and non-overlapping.

samuelklee commented 5 years ago

@davidbenjamin OK, we have a utility method in AbstractLocatableCollection that essentially enables this for our particular use case (sorting and concatenating sharded tables, which are themselves non-overlapping and sorted, and returning the sort order; there's no reason why we couldn't just return the resulting concatenated table, either). These sorts of methods can easily be made generic if we have good base classes for individual locatable records that implement the appropriate interfaces.
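In generic form, that utility amounts to something like the following sketch (assuming records implement Locatable, shards are internally sorted and non-overlapping, and contig order comes from the sequence dictionary; class and parameter names are illustrative):

import htsjdk.samtools.util.Locatable;
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Map;

// Sketch: merge per-chromosome shards into one table by sorting on a
// contig order taken from the sequence dictionary, then by start position.
public final class ShardConcatenator {
    public static <T extends Locatable> List<T> concatenate(
            final List<List<T>> shards,
            final Map<String, Integer> contigIndexFromDictionary) {
        final List<T> merged = new ArrayList<>();
        shards.forEach(merged::addAll);
        merged.sort(Comparator
                .comparing((T record) -> contigIndexFromDictionary.get(record.getContig()))
                .thenComparingInt(Locatable::getStart));
        return merged;
    }
}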

heuermh commented 5 years ago

Including the schema in a header in text format is broken, and makes sharding error-prone and inefficient. I'd suggest defining schema using Apache Avro, and storing the data in Apache Parquet format on disk.
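For instance, a minimal annotated-interval schema built with Avro's Java SchemaBuilder might look like this sketch (field names are illustrative); Parquet can then store records of this schema in a typed, splittable, column-compressed form:

import org.apache.avro.Schema;
import org.apache.avro.SchemaBuilder;

// Sketch: an annotated-interval record schema defined with Avro, which
// Parquet can use for typed, splittable on-disk storage.
public final class AnnotatedIntervalSchema {
    public static final Schema SCHEMA = SchemaBuilder
            .record("AnnotatedInterval").namespace("org.example.cnv")
            .fields()
            .requiredString("contig")
            .requiredInt("start")
            .requiredInt("end")
            .optionalDouble("gcContent")   // example numerical annotation
            .optionalString("mappability") // example non-numerical annotation
            .endRecord();
}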

SHuang-Broad commented 5 years ago

@samuelklee I agree with your comments, but if arbitrary types of annotations are allowed, especially when multiple values (lists) are allowed, then it is going to look very much like VCF. I was simply forewarning.

samuelklee commented 5 years ago

@heuermh Thanks for your input, we'll look into those options. I think the use case most of us are imagining is for narrow tables with ~1M-1B records or fewer (sometimes far fewer). For the CNV pipeline, Tribble support is a much higher priority than Hadoop integration, efficient compression, etc., and it's very unlikely we'll need to do out-of-core computations that won't be enabled by Tribble. However, that might not be true of more general use cases, so it's certainly worth investigating more robust formats.

@SHuang-Broad For all current CNV use cases, we use separate columns for each annotation. Not sure how much VCF code actually needs to be shared if all we care about is parsing and Tribble indexing.

SHuang-Broad commented 5 years ago

@samuelklee My understanding: the code that can (and I think should) be borrowed from VCF handles CHROM, POS, ID, and INFO, with END extracted from INFO to be its own column. Then

Recap: only REF and ALT are missing, which is not much code, I believe.

I think VCF just happens to have a name that starts with V. Stripping out REF and ALT, it is quite flexible for describing any annotated interval (OK, zero-length is up for debate) on a piecewise-linear coordinate system.

And I just made myself sound like a VCF-lover. I simply think much of it can be reused.