EBISPOT / gwas-summary-statistics-standard

Documentation on new GWAS Summary Statistics Standard

Standard paper does not discuss reasons for selection of tsv/yaml #5

Open ianknowles opened 2 years ago

ianknowles commented 2 years ago

The underlying file types and formats are important for interoperability with data sources beyond the GWAS community. The paper does not discuss this, or any of the selection criteria for the file types, beyond some stakeholders preferring a text file format.

I do not anticipate a big issue with parsers. Most delimited-data parsers support TSV, CSV, or any other choice of delimiter character. YAML is a superset of the JSON standard, and most implementations supporting JSON will use a YAML parser.
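As a rough illustration of the delimiter point, here is a minimal sketch using only Python's standard `csv` module. The sample rows and column names are made up for the example and are not taken from the standard.

```python
import csv
import io

# Hypothetical two-line excerpt; the column names are illustrative only.
TSV_TEXT = "chromosome\tbase_pair_location\tp_value\n1\t12345\t0.05\n"
CSV_TEXT = TSV_TEXT.replace("\t", ",")


def read_rows(text, delimiter):
    """Parse delimited text into a list of dicts keyed by the header row."""
    return list(csv.DictReader(io.StringIO(text), delimiter=delimiter))


# The same parser handles both files; only the delimiter argument changes.
tsv_rows = read_rows(TSV_TEXT, "\t")
csv_rows = read_rows(CSV_TEXT, ",")
```

Both calls should yield the same rows, which is why the choice between TSV and CSV is rarely a problem at the parser level.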

However, when it comes to interacting with REST APIs and other online services and applications, YAML/TSV support is uncommon. While the underlying service parsers may support parsing the files, the endpoints may simply not have been programmed to accept them. This is firstly because JSON is the typical data format used in JavaScript applications, and secondly because the W3C standards body published a standard from the Open Data Institute in 2016 that describes a CSV/JSON standard for all web applications to follow when handling tabular data (CSV with a JSON metadata file).

A detailed rationale for diverging from the primary web standards body's only international standard on tabular data files should be given if the GWAS community is fixed on YAML/TSV. My recommendation, however, would be to amend the GWAS standard to use the W3C standard for tabular data as a foundation, and to define the GWAS required fields as an extension. The tabular data standard provides an exhaustive description of the data file and metadata file.

The tabular data standard does allow for TSV files: https://www.w3.org/TR/tabular-data-primer/#dialects. However, the metadata file format is defined as JSON: https://www.w3.org/TR/2015/REC-tabular-metadata-20151217/#metadata-format. YAML is not permitted under the standard, and its use would likely introduce a conversion step from YAML to JSON when interacting with a standards-compliant service.
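To make this concrete, here is a minimal sketch of what a W3C tabular-data (CSVW) metadata file for a TSV might look like, built and serialised with Python's standard library. The file name, column names and datatypes are assumptions chosen for illustration; they are not taken from the GWAS standard, and `build_csvw_metadata` is a hypothetical helper.

```python
import json


def build_csvw_metadata(table_url, columns):
    """Build a CSVW-style metadata document for a tab-delimited file.

    `columns` is a list of (name, datatype) pairs.
    """
    return {
        "@context": "http://www.w3.org/ns/csvw",
        "url": table_url,
        # The dialect description is where the tab delimiter is declared.
        "dialect": {"delimiter": "\t", "header": True},
        "tableSchema": {
            "columns": [
                {"name": name, "datatype": datatype}
                for name, datatype in columns
            ]
        },
    }


# Hypothetical file name and columns, for illustration only.
metadata = build_csvw_metadata(
    "example_sumstats.tsv",
    [("chromosome", "string"), ("base_pair_location", "integer"), ("p_value", "number")],
)
metadata_json = json.dumps(metadata, indent=2)
```

The data file stays a TSV. Only the metadata is expressed as JSON, so GWAS-specific fields could in principle be layered on top of a document like this.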

The underlying issue with YAML/TSV is the use of whitespace as markup. While proponents often suggest this makes it easier for non-technical users to edit the files, in practice non-technical and even expert technical users are confused by errors within the whitespace. The default behaviour of text editors is to display all whitespace characters as blanks, which makes it difficult to locate the error without a syntax-aware editor, and some editors will introduce additional whitespace automatically. This is why JSON/CSV are the preferred formats: they are conceptually and syntactically identical, with none of the drawbacks of using whitespace characters.
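A small sketch of the whitespace problem, using made-up values: the two lines below look nearly identical in most editors, but one uses a tab and the other uses spaces, and a tab-delimited parser sees them very differently.

```python
import csv
import io

# Illustrative values only. The first line is separated by a real tab,
# the second by four spaces (as an editor might silently substitute).
TAB_LINE = "rs123\t0.05\n"
SPACE_LINE = "rs123    0.05\n"


def count_fields(line):
    """Count the fields a tab-delimited parser finds in a single line."""
    return len(next(csv.reader(io.StringIO(line), delimiter="\t")))


tab_fields = count_fields(TAB_LINE)
space_fields = count_fields(SPACE_LINE)
```

The tab-separated line parses into two fields, while the space-separated line collapses into a single field with no parse error raised, which is the kind of silent failure that is hard to spot by eye.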

The overview of the standard can also be found on github https://github.com/w3c/csvw.

jdhayhurst commented 2 years ago

Hi @ianknowles,

Thanks so much for your feedback. This is an excellent point. However, I'm not 100% sure if the GWAS community would put much value on it if the standard diverged from the W3C recommendation. Bear in mind that historically the vast majority of GWAS summary statistics files are TSVs, and the nearest neighbouring format is probably VCF, which is also tab-delimited. This is largely why we're proposing it in the standard.

This is too significant of a change to implement right away without GWAS community input. However, we are in the process of taking this format to GA4GH, and once we've got into their work stream we will continue to iterate on the standard with the GWAS community and make new releases of the standard that incorporate requirements such as this one. I'd be very interested to hear thoughts of the community on this. If you have any other supporting information/arguments please drop them here, otherwise I'll leave this issue open so that we can review it in due course.

ianknowles commented 2 years ago

TSVs themselves are not my main point of concern. As I mentioned, generally speaking a parser that supports CSV will support TSV, and a web service can reasonably be expected to support any delimiter (though in some cases only CSV). The W3C standard also permits TSV, so it is expected that a standards-compliant application will consider TSV support, and it's a reasonable request if not. The issue with editing tab files remains, but it is not significant enough to justify reprocessing old summary files. The standard could support TSV only, or CSV/TSV, and still be compliant with the W3C standard with minimal issues.

YAML is the biggest difficulty. It's the most likely to introduce conversion steps and difficulties interoperating with other services, and it's not considered a standard format by many languages and web services, compared with JSON, where support is almost ubiquitous. YAML is also the format not permitted under the W3C tabular data standard. Most data APIs will only process XML or JSON (with XML being somewhat legacy at this point), e.g. https://fred.stlouisfed.org/docs/api/fred/category.html

ianknowles commented 2 years ago

There is a summary of the issues with the YAML format posted here: https://www.arp242.net/yaml-config.html

These considerations are, imo, largely why the format is not widely supported, as JSON became the preferred format (for data interchange; the post notes in its conclusion that JSON is also a poor choice for local config files).

There are some alternatives to JSON that allow comments and are considered more usable, JSON5 or HOCON for example, but these have the same issue for data interchange as YAML: they are not supported as standard.

Another option you could consider is describing in the standard how a YAML metadata file will be converted to JSON, but this would be a complicated addition to the standard. YAML, being a superset of JSON, supports additional syntax that would either need to be restricted or mapped to JSON syntax (in the inverse direction, as JSON is a subset of YAML, any valid JSON is valid YAML). This is extra work that could be avoided by standardising on JSON, but it needs to be considered by the standard. If the standard does not define how to perform this conversion, you are leaving it to individual users and applications to define how the data should be transformed to JSON for interoperation, and this somewhat undermines the reasons for developing a standard (at least when considering interoperation beyond GWAS stats with the largest number of tools and services).
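As a sketch of what "restricting" could mean in practice, the check below walks an already-parsed metadata structure and reports anything with no direct JSON representation. The input dictionary is a hand-written stand-in for what a YAML parser might return (some parsers, PyYAML for example, resolve unquoted dates to date objects and allow non-string keys). The field names and `find_non_json_values` are hypothetical.

```python
import datetime

# Python types that map directly onto JSON scalars.
JSON_SCALARS = (str, int, float, bool, type(None))


def find_non_json_values(value, path="$"):
    """Return descriptions of values that have no direct JSON equivalent."""
    problems = []
    if isinstance(value, dict):
        for key, item in value.items():
            if not isinstance(key, str):
                problems.append(f"{path}: non-string key {key!r}")
            problems.extend(find_non_json_values(item, f"{path}.{key}"))
    elif isinstance(value, list):
        for index, item in enumerate(value):
            problems.extend(find_non_json_values(item, f"{path}[{index}]"))
    elif not isinstance(value, JSON_SCALARS):
        problems.append(f"{path}: {type(value).__name__} is not a JSON type")
    return problems


# Stand-in for parsed YAML metadata; the field names are illustrative only.
parsed_metadata = {
    "genome_assembly": "GRCh38",
    "date_last_modified": datetime.date(2023, 1, 1),  # YAML date, no JSON equivalent
    1: "integer key",  # YAML permits non-string keys, JSON does not
}
problems = find_non_json_values(parsed_metadata)
```

A standard that kept YAML would need to either forbid the constructs such a check flags or say exactly how each one maps to JSON.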

kmhernan commented 1 year ago

Related: @jdhayhurst TSV etc. doesn't scale easily, and has other issues. For me the standardized components are the most important, but I'd likely be interested in figuring out more scalable and self-documenting formats in the future that still comply with the various components. There could be tooling to transform into different formats. TSV/CSV is obviously one that a lot of users would want, but should we be limited to it? A format that includes the metadata and the summary statistics, especially if compressed/efficient, would be wonderful for cross-source ETL and sharing. I'm not saying that format needs to be defined right now, just that it should be proposed in a way that separates format from standardized components, if that makes sense.
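A rough sketch of that separation, assuming a made-up three-field record: the fields are defined once, independently of any file format, and each serialisation is just a small writer on top. The field names, `Association`, `to_tsv` and `to_json_lines` are all hypothetical, and a binary columnar writer could be added in the same way.

```python
import csv
import io
import json
from dataclasses import asdict, dataclass, fields


@dataclass
class Association:
    """Format-independent definition of one row (illustrative fields only)."""
    chromosome: str
    base_pair_location: int
    p_value: float


def to_tsv(records):
    """Serialise records as tab-delimited text with a header row."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer,
        fieldnames=[f.name for f in fields(Association)],
        delimiter="\t",
        lineterminator="\n",
    )
    writer.writeheader()
    for record in records:
        writer.writerow(asdict(record))
    return buffer.getvalue()


def to_json_lines(records):
    """Serialise the same records as one JSON object per line."""
    return "\n".join(json.dumps(asdict(record)) for record in records)


records = [Association("1", 12345, 0.05)]
tsv_text = to_tsv(records)
json_lines_text = to_json_lines(records)
```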

ramiromagno commented 1 year ago

Indeed, formats such as feather, parquet or HDF5.

jdhayhurst commented 1 year ago

Hi all, thanks for your feedback. These are great suggestions, and they are concerns that we share ourselves. The main goal of GWAS-SSF was to identify key mandatory and non-mandatory data and metadata fields for usability. Feedback from our working group was that the actual format was secondary to the data content at this point, as long as interoperability could be achieved. The precise format and the requirements for the format (beyond being easy to share with minimal bioinformatics skill) are something we are keen to explore in future. TSV and an optionally read metadata file (in a human- and computer-readable format) were deemed to be a universal option for being able to read, interpret and convert the data, and that was fundamental to their choice here for version 1. I'll leave this issue open for any further discussions.