Open delocalizer opened 2 years ago
Thanks for the detailed and clear comments.
The current state of the field is that many summary statistics files either lack key information (particularly effect allele and EAF), which hinders downstream use of the data, or are not shared at all. The main goal of GWAS-SSF is to identify key mandatory and non-mandatory data and metadata fields for usability and to encourage data sharing. We believe that, at this point in time, the community will benefit from a definition of these data fields, which can be applied to the simple tsv format described here, to GWAS-VCF, or to any other file format. We are updating the manuscript to focus on the data content and make this clearer.
It’s clear that including metadata in the header is an optimal choice for data integrity. With respect to the GWAS Catalog, we believe that the risks in separating the data and metadata are already limited by sharing data via a FAIR resource (i.e. fully accessioned and controlled with respect to update). We heard in our working groups that it could be a big stretch for some users to use a format with metadata in the header, presenting an additional overhead and barrier to sharing and/or use of the data, which would be counterproductive. Therefore, we don’t feel it’s appropriate to commit resources to changing our ingest pipelines to adopt such a file format at the current time. However, we will continue to monitor the situation as the field evolves and more tooling becomes available.
Fair enough; you know your userbase.
It still worries me that there is no explicit reference genome info in the main data file. Perhaps as a compromise, allow and encourage the "chromosome" column to be a RefSeq accession instead of just a number. That way you implicitly but unambiguously specify the species and reference build, e.g. NC_000001.11 for human chromosome 1, GRCh38 patch 14.
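To make the suggestion concrete, here is a minimal sketch of how a reader of the data file might recover the build from a RefSeq accession in the "chromosome" column. The function name and the lookup table are hypothetical, and the table is deliberately incomplete: it lists only the two human chromosome 1 accessions, where a real one would need every chromosome and build.

```python
import re

# Illustrative, incomplete lookup (an assumption for this sketch): only the
# human chromosome 1 accessions are listed.
ACCESSION_TO_BUILD = {
    "NC_000001.10": "GRCh37",
    "NC_000001.11": "GRCh38",
}

# "NC_" followed by a six-digit serial number, a dot, and a version number.
REFSEQ_CHROMOSOME = re.compile(r"^NC_(\d{6})\.(\d+)$")


def parse_refseq_chromosome(value):
    """Return (serial, version, build) for a versioned NC_ accession.

    build is None when the accession is well-formed but missing from the
    lookup. A bare value such as "1" raises ValueError, because it carries
    no species or build information.
    """
    match = REFSEQ_CHROMOSOME.match(value)
    if match is None:
        raise ValueError(f"not a versioned RefSeq accession: {value!r}")
    serial = int(match.group(1))
    version = int(match.group(2))
    return serial, version, ACCESSION_TO_BUILD.get(value)


serial, version, build = parse_refseq_chromosome("NC_000001.11")
print(serial, version, build)
```

The point of the design is that the build travels with every row, so a data file separated from its metadata can still be interpreted.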
Why not have something like GWAS-VCF for submission and storage and have Excel (with one tab for associations and one tab for metadata) as an additional download option from the website? I assume that users who have difficulty using a format with metadata in the header are those who use Excel as their main "bioinformatics" tool and won't be happy with YAML either.
Whatever GWAS Catalog picks as a format is going to become the de facto standard for bioinformaticians and tool developers. In my humble opinion, picking it based on the needs and skills of "Excel bioinformaticians" is not the best approach, nor a solution to the problems that have afflicted the field so far.
Having two separate files seems like a recipe for confusion and error.
pvalueIsNegLog10, sortedByGenomicLocation, and vitally, genomeAssembly. You're not actually allowed to update them independently.
dataFileMd5sum guarantees integrity in one direction only; if I update the metadata file to refer to a new reference but forget to update the summary file to actually do the coordinate liftover then no error will be raised.
dataFileName in the metadata is redundant, fragile, and cumbersome. People rename files all the time. It would be unexpected behaviour to most users that if they rename the summary file on the filesystem they need to update an internal field value in the metadata file.
I'm reluctant to argue by anecdote but "everyone I know" hates two-part file formats, and from my experience, for good reason.
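The one-directional checksum problem described above can be sketched in a few lines. This is only an illustration under stated assumptions: the field names dataFileMd5sum, dataFileName and genomeAssembly come from the discussion, while the file contents, column names, values and helper functions are made up for the example.

```python
import hashlib


def md5_hex(data: bytes) -> str:
    """Hex MD5 digest of the raw bytes of the data file."""
    return hashlib.md5(data).hexdigest()


def checksum_matches(data: bytes, metadata: dict) -> bool:
    """Check the only link from metadata to data: the stored digest."""
    return md5_hex(data) == metadata["dataFileMd5sum"]


# Hypothetical one-row summary file whose position is on GRCh37.
data = b"chromosome\tposition\tp_value\n1\t12345\t0.01\n"

# Hypothetical metadata file contents, shown here as a dict.
metadata = {
    "dataFileName": "study.tsv",
    "dataFileMd5sum": md5_hex(data),
    "genomeAssembly": "GRCh37",
}

# Direction 1: the data changes but the metadata does not. This is caught.
edited_data = data.replace(b"12345", b"12399")
data_edit_detected = not checksum_matches(edited_data, metadata)

# Direction 2: the metadata changes but the data does not (no liftover was
# done). The digest still matches, so nothing flags the inconsistency.
metadata["genomeAssembly"] = "GRCh38"
metadata_edit_detected = not checksum_matches(data, metadata)

print(data_edit_detected, metadata_edit_detected)
```

The same check is equally blind to a renamed data file: the bytes are unchanged, so the digest matches even though dataFileName no longer points at anything.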