EBISPOT / gwas-sumstats-validator

GWAS Summary Statistics File Validator
Apache License 2.0

NOTE: This repository has been deprecated as of 12th Apr 2023. The GWAS Catalog summary statistics validator can be found here

Summary Statistics TSV file Validator

A validator for GWAS summary statistics TSV files, for use both before and after harmonisation. It is built on pandas_schema. The purpose is to validate files before their conversion to HDF5.

Installation

Python package:

Alternatively, use the Docker image:

Running the validator

To run the validator on a file:

Information and errors are logged to the console, and errors are also logged to the specified file. The console output might look like:

(INFO): Filename is good!
(INFO): Validating file...
(ERROR): Length of row 7 is: 16 instead of 15
(ERROR): Please fix the table. Some rows have different numbers of columns to the header
(INFO): Rows with different numbers of columns to the header are not validated
(ERROR): {row: 1, column: "p_value"}: "-99" was not in the range [0, 1)

The errors in the output tell us that row 7 has too many columns (16 instead of the 15 in the header) and that row 1 does not have a valid p-value.
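To make the two kinds of error above concrete, here is a minimal sketch of a squareness check and a p-value range check using only the Python standard library. It is not the validator's own implementation (which uses pandas_schema); the function name check_rows, the sample data, and the choice to count rows from 1 after the header are assumptions made for illustration.

```python
import csv
import io


def check_rows(tsv_text, pvalue_column="p_value"):
    """Return a list of error strings for a small TSV supplied as text.

    Hypothetical helper for illustration only; not part of ss-validate.
    """
    reader = csv.reader(io.StringIO(tsv_text), delimiter="\t")
    header = next(reader)
    p_index = header.index(pvalue_column)
    errors = []
    # Assumption: rows are counted from 1, starting after the header.
    for row_number, row in enumerate(reader, start=1):
        # Squareness: every row must have as many fields as the header.
        if len(row) != len(header):
            errors.append(
                f"Length of row {row_number} is: {len(row)} instead of {len(header)}"
            )
            continue  # rows of the wrong length are not validated further
        # Range check: the value must parse as a number in [0, 1).
        try:
            in_range = 0 <= float(row[p_index]) < 1
        except ValueError:
            in_range = False
        if not in_range:
            errors.append(
                f'row {row_number}, column "{pvalue_column}": '
                f'"{row[p_index]}" was not in the range [0, 1)'
            )
    return errors


sample = (
    "variant_id\tp_value\n"
    "rs1\t-99\n"
    "rs2\t0.05\n"
    "rs3\t0.2\textra\n"
)
for message in check_rows(sample):
    print(message)
```

On this sample, the first data row fails the range check and the third fails the squareness check, mirroring the two error types in the log above.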

Additional options

Import ss-validate into another Python script

Initialise a validator object for your summary statistics and settings (ssv is the alias under which the validator module has been imported):

validator = ssv.Validator(file='sumstats.tsv.gz', filetype='gwas-upload', error_limit=1, logfile='logfile.log')

Validate the headers:

validator.validate_headers()

Validate the squareness (every row has the same number of columns as the header):

validator.validate_file_squareness()

Validate the data:

validator.validate_data()
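The three calls above can be chained so that later checks are skipped once an earlier one fails. The sketch below shows one way to do that. It assumes each method returns a truthy value on success and a falsy value on failure, which this README does not state, so check the return values before relying on it. The run_checks helper and the FakeValidator stand-in are hypothetical and exist only to demonstrate the control flow without a real file.

```python
def run_checks(validator):
    """Run the three checks in order.

    Returns the name of the first failing step, or None if all pass.
    Assumption: a falsy return value from a step means that step failed.
    """
    steps = ("validate_headers", "validate_file_squareness", "validate_data")
    for name in steps:
        if not getattr(validator, name)():
            return name
    return None


class FakeValidator:
    """Hypothetical stand-in for ssv.Validator, used only for this demo."""

    def validate_headers(self):
        return True

    def validate_file_squareness(self):
        return False  # simulate a file with ragged rows

    def validate_data(self):
        return True


print(run_checks(FakeValidator()))  # prints "validate_file_squareness"
```

With a real validator object, run_checks(validator) would be called in place of the three separate calls, again under the return-value assumption stated above.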