Input file row values not read correctly if missing values

juhaa commented 3 years ago

If a row in the input summary stats has missing values, wrong columns can be read in.

Example: Input summary stats is missing an rsid:

#chrom	pos	ref	alt	rsids	nearest_genes	pval	mlogp	beta	sebeta	af_alt	af_alt_cases	af_alt_controls	n_hom_cases	n_hom_ref_cases	n_het_cases	n_hom_controls	n_hom_ref_controls	n_het_controls
5	44887686	AAAC	AAACAACAAC		MRPS30	6.43e-12	11.1916	-0.10338	0.01505	0.558	0.534	0.56	3354.58	2556.47	5661.95	42627.59	26442.58	66417.83

Meta-analysis output (because of the missing `rsids` field, column `mlogp` was used as the p-value, and other fields are wrong as well):

#CHR	POS	REF	ALT	SNP	FINNGEN_beta	FINNGEN_sebeta	FINNGEN_pval	FINNGEN_af_alt	FINNGEN_af_alt_cases	FINNGEN_af_alt_controls	FINNGEN_rsids	UKBB_beta	UKBB_sebeta	UKBB_pval	UKBB_low_confidence_EUR	all_meta_N	all_inv_var_meta_beta	all_inv_var_meta_sebeta	all_inv_var_meta_p	all_inv_var_het_p
5	44887686	AAAC	AAACAACAAC	5:44887686:AAAC:AAACAACAAC	1.50e-02	5.58e-01	1.12e+01	0.534	0.56	3354.58	MRPS30	NA	NA	NA	NA	1	1.50e-02	5.58e-01	1.12e+01	NA

Solution: When parsing rows, handle missing values

juhaa commented 3 years ago

The problem seems to be this part: https://github.com/FINNGEN/META_ANALYSIS/blob/1e3f24ced531574b7f9814837dd3abc5486d7c16/scripts/meta_analysis.py#L396-L403

When running with the `--dont_allow_space` param, missing values are parsed correctly. Without it, the empty field is swallowed as whitespace, so the row has fewer fields than the header and the indices read from the header point to the wrong columns:

>>> l
'1\t123\tAAA\tC\tABBA\t\t0.01\n'
>>> l.rstrip()
'1\t123\tAAA\tC\tABBA\t\t0.01'
>>> l.rstrip().split()
['1', '123', 'AAA', 'C', 'ABBA', '0.01']
>>> l.rstrip().split('\t')
['1', '123', 'AAA', 'C', 'ABBA', '', '0.01']
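To connect this to the example at the top of the issue, here is a minimal sketch of the shift, using a shortened version of that header and row. It is only an illustration and not the code in `meta_analysis.py`; the variable names are made up for this example.

```python
# Shortened header and row from the example at the top of this issue.
# The rsids field (5th column) is empty in the row.
header = "#chrom\tpos\tref\talt\trsids\tnearest_genes\tpval\tmlogp\tbeta\tsebeta"
row = "5\t44887686\tAAAC\tAAACAACAAC\t\tMRPS30\t6.43e-12\t11.1916\t-0.10338\t0.01505\n"

cols = header.split("\t")
pval_idx = cols.index("pval")  # column index taken from the header

# Splitting on any whitespace collapses the two adjacent tabs into one
# separator, so the empty rsids field disappears and the row is one short.
whitespace_fields = row.rstrip().split()

# Splitting on tabs only keeps the empty field. Stripping just the line
# ending also keeps empty fields at the end of a row.
tab_fields = row.rstrip("\r\n").split("\t")

print(len(cols), len(whitespace_fields), len(tab_fields))  # 10 9 10
print(whitespace_fields[pval_idx])  # 11.1916 (this is mlogp, not pval)
print(tab_fields[pval_idx])         # 6.43e-12
```

This matches the output above, where `FINNGEN_pval` is `1.12e+01`.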

Fedja commented 3 years ago

yea by default don't allow space exactly for this reason! If space separated then missing value needs to be coded differently. Actually can you check that the number of columns equals the number of header items and error out if not?

Fedja commented 3 years ago

all our data is tab separated right?

juhaa commented 3 years ago

> yea by default don't allow space exactly for this reason! If space separated then missing value needs to be coded differently. Actually can you check that the number of columns equals the number of header items and error out if not?
>
> all our data is tab separated right?

I guess it would be good to disallow spaces by default, since space-separated data is probably very rare; at least I haven't seen anything other than tab-separated data. But let's definitely add the check for the columns. It's a quick addition and helps avoid these little mishaps.
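A rough sketch of what such a check could look like, assuming rows are split one line at a time. The function name `split_row` and its parameters are hypothetical and not taken from `meta_analysis.py`; the real script may structure this differently.

```python
def split_row(line, n_header_fields, allow_space=False, line_no=None):
    """Split one data line and raise if the field count differs from the header.

    Hypothetical helper for illustration only.
    """
    # Strip only the line ending so that empty trailing fields survive.
    stripped = line.rstrip("\r\n")
    # Tab-only splitting keeps empty fields; whitespace splitting drops them.
    fields = stripped.split() if allow_space else stripped.split("\t")
    if len(fields) != n_header_fields:
        where = "" if line_no is None else f" on line {line_no}"
        raise ValueError(
            f"Expected {n_header_fields} fields but found {len(fields)}{where}"
        )
    return fields


header = "#chrom\tpos\tref\talt\trsids\tpval".split("\t")
line = "1\t123\tAAA\tC\t\t0.01\n"

# Tab-separated parsing keeps the empty rsids field, so the check passes.
fields = split_row(line, len(header), line_no=2)

# Whitespace parsing drops the empty field, so the check raises.
try:
    split_row(line, len(header), allow_space=True, line_no=2)
    error_message = None
except ValueError as err:
    error_message = str(err)
    print(error_message)
```

With a check like this, the row from this issue would stop the run with an error instead of silently producing shifted values.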

FINNGEN / META_ANALYSIS

Input file row values not read correctly if missing values #32