cozygene / glint


"replace_missing_values.py" excludes samples from the output data when 'maxi'==1 #7

Closed (hdraisma closed this issue 3 years ago)

hdraisma commented 3 years ago

Hi there

I’m wondering whether I have encountered a bug in the "replace_missing_values.py" routine of version 1.0.4. When I set the "--maxi" parameter to 1, which should permit any "fraction of missing values allowed per sample" (as per https://glint-epigenetics.readthedocs.io/en/latest/input.html), the routine still excludes samples from the output data ('xx samples were not replaced because they have more than 1.0 missing values'). Example:

$ python "<path_to_GLINT1.0.4>/replace_missing_values.py" --datafile <datafile.txt> --chr NA --maxs 0 --maxi 1 --sep " "
<path_to_GLINT1.0.4>/utils/common.py:121: FutureWarning: Method .as_matrix will be removed in a future version. Use .values instead.
  first_col = DataFrame.as_matrix(read_csv(filepath, dtype=str, delimiter=delimiter, usecols=[0], header=None))
<path_to_GLINT1.0.4>/utils/common.py:177: FutureWarning: Method .as_matrix will be removed in a future version. Use .values instead.
  data = DataFrame.as_matrix(data)
Replacing missing values by mean...
54 samples were not replaced because they have more than 1.0 missing values
57044 sites were not replaced because they have more than 0.0 missing values
replacing each missing value by it's site mean...
Output is saved to <datafile.txt>.no_missing_values
$
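
For reference, the behaviour I expected from "--maxi" can be sketched as below. This is only a minimal illustration, not GLINT's actual code: the function names and the sites-by-samples list layout are my own assumptions, and I am assuming a sample should be dropped only when its fraction of missing values strictly exceeds the threshold.

```python
import math


def missing_fraction_per_sample(rows):
    """Return, for each sample (column), the fraction of sites (rows) that are NaN.

    Hypothetical helper: `rows` is assumed to be a list of per-site lists of floats.
    """
    n_sites = len(rows)
    n_samples = len(rows[0])
    return [
        sum(math.isnan(row[j]) for row in rows) / n_sites
        for j in range(n_samples)
    ]


def samples_to_keep(rows, maxi):
    """Keep a sample unless its missing fraction is strictly greater than `maxi`.

    With maxi == 1 no fraction can exceed the threshold, so every sample is kept.
    """
    return [frac <= maxi for frac in missing_fraction_per_sample(rows)]


nan = float("nan")
# Two sites, three samples: sample 0 is complete, sample 1 is entirely
# missing, and sample 2 is missing at one of the two sites.
data = [
    [1.0, nan, nan],
    [2.0, nan, 3.0],
]

print(samples_to_keep(data, maxi=1))    # [True, True, True]
print(samples_to_keep(data, maxi=0.5))  # [True, False, True]
```

Under this reading, "--maxi 1" would never exclude a sample, which is why the "54 samples were not replaced" line above surprised me.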

Thanks in advance for your consideration and help,

B.w.

E-R commented 3 years ago

You are right. I fixed this issue in replace_missing_values.py on master; this will appear in GLINT's next release. Thanks!