Closed juhaa closed 2 years ago
Problem seems to be this part:
https://github.com/FINNGEN/META_ANALYSIS/blob/1e3f24ced531574b7f9814837dd3abc5486d7c16/scripts/meta_analysis.py#L396-L403
If running with the --dont_allow_space param, the missing values are parsed correctly. Without it, the empty field is stripped out as whitespace, so the row has fewer fields and the indices read from the header point to the wrong columns:
>>> l
'1\t123\tAAA\tC\tABBA\t\t0.01\n'
>>> l.rstrip()
'1\t123\tAAA\tC\tABBA\t\t0.01'
>>> l.rstrip().split()
['1', '123', 'AAA', 'C', 'ABBA', '0.01']
>>> l.rstrip().split('\t')
['1', '123', 'AAA', 'C', 'ABBA', '', '0.01']
yea by default don't allow space exactly for this reason! If space separated then missing value needs to be coded differently. Actually can you check that the number of columns equals the number of header items and error out if not?
all our data is tab separated right?
I guess it would be good to not allow spaces by default, since space-separated data is probably very rare; at least I haven't seen anything other than tab-separated data. But definitely let's add the check for the columns. It's a quick add and helps avoid these little mishaps.
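The column check suggested above could look roughly like this. It is only a sketch, not the actual code in scripts/meta_analysis.py: `split_row` and its `allow_space` argument are hypothetical names, and the header passed in is whatever list of column names was read from the first line of the file.

```python
def split_row(line, header, allow_space=False):
    """Split one data line and fail loudly if the field count differs from the header.

    Hypothetical helper for illustration; not taken from meta_analysis.py.
    """
    # Strip only the line terminator. A bare rstrip() would also remove a
    # trailing tab, i.e. an empty last column.
    stripped = line.rstrip("\r\n")
    # split() with no argument collapses runs of whitespace, so empty fields
    # disappear; split("\t") keeps them as empty strings.
    fields = stripped.split() if allow_space else stripped.split("\t")
    if len(fields) != len(header):
        raise ValueError(
            f"expected {len(header)} fields but got {len(fields)}: {line!r}"
        )
    return fields
```

With the row from the example above and a seven-column header, the tab-only split returns all seven fields (including the empty one), while the whitespace split raises instead of silently shifting columns.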
If a row in the input summary stats has missing values, wrong columns can be read in.
Example: Input summary stats is missing an rsid:
mlogp was used as p-value because of the missing rsids field (and other wrong fields were read as well).
Solution: When parsing rows, handle missing values.
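One way to handle missing values while parsing is to look fields up by header name and turn empty strings into an explicit missing marker. The sketch below assumes tab-separated input; `parse_row`, the `missing` argument, and the column names in the usage note are hypothetical and not taken from the repository.

```python
def parse_row(line, header, missing=("",)):
    """Return a dict mapping column name to value, with missing values as None.

    Hypothetical helper for illustration; assumes tab-separated input.
    """
    # Keep empty fields by splitting on tabs only, and strip just the newline.
    fields = line.rstrip("\r\n").split("\t")
    if len(fields) != len(header):
        raise ValueError(
            f"expected {len(header)} fields but got {len(fields)}: {line!r}"
        )
    # Pair each value with its column name so a missing value can never
    # shift the columns that follow it.
    return {
        name: (None if value in missing else value)
        for name, value in zip(header, fields)
    }
```

For example, with a placeholder header such as ["chrom", "pos", "ref", "alt", "gene", "rsid", "pval"], the row '1\t123\tAAA\tC\tABBA\t\t0.01\n' yields None for "rsid" and '0.01' for "pval", rather than reading the p-value from the wrong column.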