HK3-Lab-Team / pytrousse

PyTrousse collects into one toolbox a set of data wrangling procedures tailored for composing reproducible analytics pipelines.
Apache License 2.0

least_nan_cols method of DataFrameWithInfo accepts an external threshold parameter #27

Closed alessiamarcolini closed 4 years ago

alessiamarcolini commented 4 years ago

least_nan_cols method of DataFrameWithInfo accepts an external threshold parameter to get the features with a count of NaN values lower than the threshold argument.

DataFrameWithInfo has a nan_percentage_threshold attribute (default 0.999) that is used only in the many_nan_columns method, which returns the names of the columns containing many NaN values (over the threshold).

Based on my understanding, least_nan_cols should return the complement of the columns returned by many_nan_columns (given the same threshold value). Is that right?

Is there any reason to use an external parameter for the threshold in the least_nan_cols method?
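To make the question concrete, here is a minimal sketch of the two selections in plain pandas. It is not the PyTrousse implementation: the standalone functions below only borrow the method names, the NaN measure is assumed to be a per-column ratio, and the boundary handling (strictly above versus at or below the threshold) is an assumption chosen so that the two results partition the columns.

```python
import numpy as np
import pandas as pd


def many_nan_columns(df: pd.DataFrame, threshold: float = 0.999) -> set:
    """Hypothetical stand-in: columns whose NaN ratio is strictly above the threshold."""
    nan_ratio = df.isna().mean()  # fraction of NaN values per column
    return set(nan_ratio[nan_ratio > threshold].index)


def least_nan_cols(df: pd.DataFrame, threshold: float = 0.999) -> set:
    """Hypothetical stand-in: columns whose NaN ratio is at or below the threshold."""
    nan_ratio = df.isna().mean()
    return set(nan_ratio[nan_ratio <= threshold].index)


df = pd.DataFrame(
    {
        "a": [1.0, 2.0, 3.0, 4.0],               # NaN ratio 0.0
        "b": [np.nan, np.nan, np.nan, 1.0],      # NaN ratio 0.75
        "c": [np.nan, np.nan, np.nan, np.nan],   # NaN ratio 1.0
    }
)

# With the same threshold the two sets are disjoint and cover every column.
many = many_nan_columns(df, threshold=0.5)
least = least_nan_cols(df, threshold=0.5)
```

Under these assumptions, calling both functions with the same threshold always splits the columns into two complementary groups, which is the relationship asked about above.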

lorenz-gorini commented 4 years ago

least_nan_cols should return the complement of the columns returned by many_nan_columns (given the same threshold value). Is that right?

Yes, that is correct. But in contrast to least_nan_cols, many_nan_columns should be used only to identify trivial columns that usually should not be considered when developing models. The idea behind least_nan_cols, instead, is to have a method that returns the columns below a given ratio of NaN values. For example, it was useful when I needed rows with no NaN values (as for the UMAP algorithm) and had to discard columns with a certain ratio of NaNs to avoid losing too many rows.

Anyway, the function is quite simple, and for these special situations the same code could be rewritten in a script, or the function could be moved to the UMAP scripts where it is required.
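As a rough illustration of the use case described above, the following sketch rewrites the logic as a small script in plain pandas. The function name drop_sparse_then_incomplete and the max_nan_ratio parameter are hypothetical and not part of PyTrousse, and the toy data is made up for the example. It first discards columns at or above a chosen NaN ratio and only then drops the rows that still contain NaN values.

```python
import numpy as np
import pandas as pd


def drop_sparse_then_incomplete(df: pd.DataFrame, max_nan_ratio: float) -> pd.DataFrame:
    """Keep columns with a NaN ratio below max_nan_ratio, then drop incomplete rows."""
    nan_ratio = df.isna().mean()  # fraction of NaN values per column
    kept_cols = nan_ratio[nan_ratio < max_nan_ratio].index
    return df[kept_cols].dropna()


df = pd.DataFrame(
    {
        "x": [1.0, 2.0, 3.0, 4.0, 5.0],                  # NaN ratio 0.0
        "y": [0.1, np.nan, 0.3, 0.4, 0.5],               # NaN ratio 0.2
        "z": [np.nan, np.nan, np.nan, 7.0, np.nan],      # NaN ratio 0.8
    }
)

# Dropping incomplete rows directly is dominated by the sparse column "z".
naive = df.dropna()

# Discarding "z" first keeps most rows for algorithms that need NaN-free input.
filtered = drop_sparse_then_incomplete(df, max_nan_ratio=0.5)
```

On this toy frame, dropping rows directly leaves one complete row, while filtering the sparse column first leaves four rows with two columns.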

alessiamarcolini commented 4 years ago

fixed by #66