EBISPOT / gwas-summary-statistics-standard

Documentation on new GWAS Summary Statistics Standard

Ease-of-use/readability vs data integrity #6

Open marcora opened 2 years ago

marcora commented 2 years ago

I fully understand the need for a format with low bioinformatics requirement for data consumers, but sacrificing data integrity (e.g., by separating sumstats and metadata) to achieve that goal seems dangerous to me.

Providing a simple tool (or a "download format" option on the GWAS Catalog website) that can convert a format with superior bioinformatics qualities but inferior readability and ease of use (e.g., GWAS-VCF) to a format for the general population (e.g., MS Excel-compatible CSV or XLSX) would be a better solution in my opinion.

seandavi commented 1 year ago

I have to agree with this comment and, in particular, with the idea of repurposing VCF. Excellent, performant tooling exists for VCF formats. Integration with existing annotation sources, including other VCFs, seems a common use case that is quickly and easily doable using VCF tooling. Conversion from VCF to TSV is straightforward when needed. Creating a tab-delimited format as a "standard" seems like a step backward, though the information content described in the spec document is clearly very well thought out.
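To make the "conversion is straightforward" point concrete, here is a minimal Python sketch that flattens a single-sample VCF into TSV using only the standard library. It is an illustration under assumptions, not a reference converter: the function name `vcf_to_tsv` is hypothetical, the FORMAT keys in the example (ES, SE, LP, AF) are what I understand GWAS-VCF to use and should be checked against that spec, and the sketch assumes every record carries the same FORMAT keys and exactly one sample column.

```python
import csv
import io


def vcf_to_tsv(vcf_text: str) -> str:
    """Flatten a single-sample VCF into TSV (hypothetical helper).

    Assumes one sample column and identical FORMAT keys on every record.
    """
    out = io.StringIO()
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    header_written = False
    for line in vcf_text.splitlines():
        # Skip blank lines, "##" meta lines, and the "#CHROM" column header.
        if not line or line.startswith("#"):
            continue
        fields = line.split("\t")
        chrom, pos, variant_id, ref, alt = fields[:5]
        keys = fields[8].split(":")    # FORMAT column, e.g. ES:SE:LP:AF
        values = fields[9].split(":")  # first (only) sample column
        if not header_written:
            writer.writerow(["CHROM", "POS", "ID", "REF", "ALT"] + keys)
            header_written = True
        writer.writerow([chrom, pos, variant_id, ref, alt] + values)
    return out.getvalue()


example = (
    "##fileformat=VCFv4.2\n"
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tstudy1\n"
    "1\t1000\trs1\tA\tG\t.\tPASS\t.\tES:SE:LP:AF\t0.05:0.01:6.3:0.21\n"
)
print(vcf_to_tsv(example), end="")
```

Note that the "##" meta lines are dropped on the way out, which is exactly the metadata-separation concern raised above: the flattened TSV is easier to open but no longer self-describing.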

ljwh2 commented 1 year ago

Thanks for the comments.

The current state of the field is that many summary statistics files either lack key information (particularly effect allele and EAF), which hinders downstream use of the data, or are not shared at all. The main goal of GWAS-SSF is to identify key mandatory and non-mandatory data and metadata fields for usability and to encourage data sharing. We believe that, at this point in time, the community will benefit from a definition of these data fields, which can be applied to the simple TSV format described here, or GWAS-VCF, or any other file format. We are updating the manuscript to focus on the data content and make this clearer.
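As a rough illustration of what format-independent field definitions make possible, the Python sketch below checks a TSV header for required columns. The column names `effect_allele` and `effect_allele_frequency`, and the helper `missing_fields`, are placeholders chosen for this example; the authoritative names and the full mandatory list are whatever the spec document defines.

```python
import csv
import io

# Placeholder column names for illustration only; consult the spec
# for the actual mandatory fields.
REQUIRED = ("effect_allele", "effect_allele_frequency")


def missing_fields(tsv_text: str, required=REQUIRED) -> list:
    """Return the required column names absent from the TSV header row."""
    reader = csv.reader(io.StringIO(tsv_text), delimiter="\t")
    header = next(reader, [])  # empty input yields an empty header
    present = {name.strip().lower() for name in header}
    return [name for name in required if name not in present]


example = (
    "chromosome\tbase_pair_location\teffect_allele\tbeta\n"
    "1\t1000\tA\t0.05\n"
)
print(missing_fields(example))
```

The same check could be run against the FORMAT keys of a GWAS-VCF file or any other container, since it depends only on which fields are present and not on how the file is laid out.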

It’s clear that including metadata in the header is an optimal choice for data integrity. With respect to the GWAS Catalog, we heard in our working groups that it could be a big stretch for some users to use this format, presenting an additional overhead and barrier to sharing and/or use of the data, which would be counterproductive. We believe that the risks of separating the data and metadata are already limited by sharing data via a FAIR resource. Therefore, we don’t feel it’s appropriate to commit resources to changing our ingest pipelines to adopt a file format with metadata in the header at the current time. However, we will continue to monitor the situation as the field evolves and more tooling becomes available.

marcora commented 1 year ago

It takes one command to convert GWAS-VCF to a format that is friendlier to non-bioinformatician users. In my opinion, GWAS Catalog should offer summary stats in various formats for various users (since it seems you are aiming to satisfy non-expert users, I would recommend Excel with an additional tab for metadata, and GWAS-VCF for bioinformaticians). But whatever you propose as the "standard" is going to become the de facto standard in the community of bioinformaticians and tool developers, and in my opinion that should be the format with data integrity (and therefore reproducibility) as the foremost priority.

ljwh2 commented 1 year ago

Yes, we would love to provide different formats for different users, and this could be a future goal. For now, OpenGWAS (as I’m sure you know) are providing GWAS Catalog summary statistics in GWAS-VCF format, and the new mandatory fields should increase the number of data files that are suitable for MR (Mendelian randomization) and hence for them to ingest.