FINNGEN / META_ANALYSIS

Tools for doing x way meta-analysis
MIT License

Input file row values not read correctly if missing values #32

Closed juhaa closed 2 years ago

juhaa commented 3 years ago

If a row in the input summary stats has missing values, wrong columns can be read in.

Example: Input summary stats is missing an rsid:

#chrom pos ref alt rsids nearest_genes pval mlogp beta sebeta af_alt af_alt_cases af_alt_controls n_hom_cases n_hom_ref_cases n_het_cases n_hom_controls n_hom_ref_controls n_het_controls
5 44887686 AAAC AAACAACAAC MRPS30 6.43e-12 11.1916 -0.10338 0.01505 0.558 0.534 0.56 3354.58 2556.47 5661.95 42627.59 26442.58 66417.83

Meta-analysis output (column mlogp was used as p-value because of the missing rsids field (and other wrong fields as well)):

#CHR POS REF ALT SNP FINNGEN_beta FINNGEN_sebeta FINNGEN_pval FINNGEN_af_alt FINNGEN_af_alt_cases FINNGEN_af_alt_controls FINNGEN_rsids UKBB_beta UKBB_sebeta UKBB_pval UKBB_low_confidence_EUR all_meta_N all_inv_var_meta_beta all_inv_var_meta_sebeta all_inv_var_meta_p all_inv_var_het_p
5 44887686 AAAC AAACAACAAC 5:44887686:AAAC:AAACAACAAC 1.50e-02 5.58e-01 1.12e+01 0.534 0.56 3354.58 MRPS30 NA NA NA NA 1 1.50e-02 5.58e-01 1.12e+01 NA

Solution: When parsing rows, handle missing values

juhaa commented 3 years ago

The problem seems to be this part: https://github.com/FINNGEN/META_ANALYSIS/blob/1e3f24ced531574b7f9814837dd3abc5486d7c16/scripts/meta_analysis.py#L396-L403

When running with the param --dont_allow_space, the missing values are parsed correctly. Without it, the row is split on any whitespace, so the empty field between two consecutive tabs is dropped, and the indices read from the header point to the wrong columns because the row has fewer fields:

>>> l
'1\t123\tAAA\tC\tABBA\t\t0.01\n'
>>> l.rstrip()
'1\t123\tAAA\tC\tABBA\t\t0.01'
>>> l.rstrip().split()
['1', '123', 'AAA', 'C', 'ABBA', '0.01']
>>> l.rstrip().split('\t')
['1', '123', 'AAA', 'C', 'ABBA', '', '0.01']

Fedja commented 3 years ago

yea by default don't allow space exactly for this reason! If space separated then missing value needs to be coded differently. Actually can you check that the number of columns equals the number of header items and error out if not?

Fedja commented 3 years ago

all our data is tab separated right?

juhaa commented 3 years ago

> yea by default don't allow space exactly for this reason! If space separated then missing value needs to be coded differently. Actually can you check that the number of columns equals the number of header items and error out if not?
>
> all our data is tab separated right?

I guess it would be good to make not allowing spaces the default, since space-separated data is probably very rare; at least I haven't seen anything other than tab-separated data. But definitely let's add the check for the columns. It's a quick add and helps avoid these little mishaps.
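
For reference, a rough sketch of what such a column-count check could look like. This is not the code in scripts/meta_analysis.py: the helper name `split_row`, its `allow_space` parameter, and the seven-column header are made up for illustration, and the row is the one from the REPL session above.

```python
def split_row(line, n_header_fields, allow_space=False):
    """Split one summary-stats row and verify its field count.

    Hypothetical helper for illustration; not the actual parser
    in scripts/meta_analysis.py.
    """
    # Strip only the line ending so trailing empty fields survive.
    line = line.rstrip("\r\n")
    # str.split() without an argument collapses runs of whitespace,
    # which silently drops empty fields; split("\t") keeps them.
    fields = line.split() if allow_space else line.split("\t")
    if len(fields) != n_header_fields:
        raise ValueError(
            f"expected {n_header_fields} fields, got {len(fields)}: {line!r}"
        )
    return fields


# Assumed header, shortened to match the 7-field example row.
header = "#chrom\tpos\tref\talt\trsids\tnearest_genes\tpval".split("\t")
row = "1\t123\tAAA\tC\tABBA\t\t0.01\n"

# Tab splitting keeps the empty field, so the count matches the header.
print(split_row(row, len(header)))

# Whitespace splitting loses the empty field, so the check errors out
# instead of letting the header indices point to the wrong columns.
try:
    split_row(row, len(header), allow_space=True)
except ValueError as err:
    print("rejected:", err)
```

One detail in the sketch: it strips only `\r\n` instead of calling a bare `rstrip()`, because a bare `rstrip()` also removes trailing tabs, so a row whose last columns are empty would come up short and fail the check.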