ProjectSidewalk / sidewalk-quality-analysis

An analysis of Project Sidewalk user quality based on interaction logs

What to do with NaN cells? #62

Open jonfroehlich opened 2 years ago

jonfroehlich commented 2 years ago

Currently, 32 of our 135 columns have at least one empty (NaN) cell. We need to decide on a good method for filling them in.

The following 32/135 column(s) have NaN data (23.70% of columns)
high_quality_manual                1444
label_severity_min                    1
label_severity_max                    1
label_severity_mean                   1
label_severity_sd                     1
curb_ramp_severity_min               99
curb_ramp_severity_max               99
curb_ramp_severity_mean              99
curb_ramp_severity_sd               152
missing_curb_ramp_severity_min      214
missing_curb_ramp_severity_max      214
missing_curb_ramp_severity_mean     214
missing_curb_ramp_severity_sd       326
obstacle_severity_min               345
obstacle_severity_max               345
obstacle_severity_mean              345
obstacle_severity_sd                512
surface_problem_severity_min        398
surface_problem_severity_max        398
surface_problem_severity_mean       398
surface_problem_severity_sd         567
no_sidewalk_severity_min            639
no_sidewalk_severity_max            639
no_sidewalk_severity_mean           639
no_sidewalk_severity_sd             790
tutorial_minutes                    251
tutorial_error_count                251
accuracy                              2
curb_ramp_accuracy                  106
missing_curb_ramp_accuracy          229
obstacle_accuracy                   404
surface_problem_accuracy            425
dtype: int64
Empty cells in these 32 column(s) will be replaced by the mean of their respective columns
<ipython-input-39-6f8a379c8883>:95: FutureWarning: Dropping of nuisance columns in DataFrame reductions (with 'numeric_only=None') is deprecated; in a future version this will raise TypeError.  Select only valid columns before calling the reduction.
  df_users.fillna(df_users.mean(), inplace=True)
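The FutureWarning above suggests that `df_users.mean()` is being computed over columns that are not valid for the reduction. One possible way to keep the current mean-fill behavior while avoiding the warning is to select the numeric columns first. This is only a sketch: the helper name `fill_numeric_nans_with_mean` and the toy data are made up for illustration, and it assumes the columns to be filled are stored with a numeric dtype.

```python
import numpy as np
import pandas as pd


def fill_numeric_nans_with_mean(df):
    """Return a copy of df with NaNs in numeric columns replaced by the column mean.

    Hypothetical helper for illustration; non-numeric columns are left untouched.
    """
    filled = df.copy()
    # Select only numeric columns so the mean is never taken over strings/objects.
    numeric_cols = filled.select_dtypes(include="number").columns
    col_means = filled[numeric_cols].mean()
    filled[numeric_cols] = filled[numeric_cols].fillna(col_means)
    return filled


# Toy stand-in for df_users (values are invented, not real Project Sidewalk data).
df_users = pd.DataFrame(
    {
        "user_id": ["a", "b", "c"],
        "accuracy": [0.5, np.nan, 1.0],
        "tutorial_minutes": [2.0, 4.0, np.nan],
    }
)

df_filled = fill_numeric_nans_with_mean(df_users)
print(df_filled)
```

Returning a copy instead of using `inplace=True` keeps the original NaN pattern available for comparison. Whether the column mean is a sensible fill value for every column is a separate question; a column like `high_quality_manual`, with 1444 empty cells, may need to be handled differently.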