Labeled Data Preprocessing

VasLem commented 3 years ago

Working with @antosalerno

antosalerno commented 3 years ago

About the risk dataset (shape: 608 x 15):

Transform the c40 column into a boolean type.
Remove the access column: its only unique value is "public".
Column _accountno has duplicates: different rows refer to the same city. How do we choose between them?
Columns with null values: _risks_to_city_s_watersupply (40), timescale (11), magnitude (159). How do we fill them?
Convert the timescale and magnitude columns into ordinal scales.
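
A minimal pandas sketch of the steps above, on a toy frame rather than the real file. The cell values, the category labels in `TIMESCALE` and `MAGNITUDE`, and the rule "any non-null c40 marker means True" are all assumptions for illustration; only the column names come from the list.

```python
import pandas as pd

# Toy stand-in for the risk dataset; every value here is invented.
df = pd.DataFrame({
    "_accountno": [1, 1, 2, 3],
    "c40": ["C40", None, "C40", None],
    "access": ["public"] * 4,
    "timescale": ["Short-term", "Current", None, "Long-term"],
    "magnitude": ["Serious", None, "Less Serious", "Extremely serious"],
})

# c40 -> boolean (assumption: any non-null marker means C40 membership).
df["c40"] = df["c40"].notna()

# Drop columns that hold a single unique value (here: access == "public").
constant_cols = [c for c in df.columns if df[c].nunique(dropna=False) == 1]
df = df.drop(columns=constant_cols)

# Ordered scales (assumed labels); missing or unknown labels stay NaN
# so they can be imputed later.
TIMESCALE = {"Current": 0, "Short-term": 1, "Medium-term": 2, "Long-term": 3}
MAGNITUDE = {"Less Serious": 0, "Serious": 1, "Extremely serious": 2}
df["timescale"] = df["timescale"].map(TIMESCALE)
df["magnitude"] = df["magnitude"].map(MAGNITUDE)

# Rows per city, to see how many duplicates still need resolving.
dup_counts = df["_accountno"].value_counts()
```

This leaves the two open questions (duplicates and null filling) untouched on purpose; it only prepares the columns.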

VasLem commented 3 years ago

Use the c40 rows to fill the NA rows. To do that, we need to join the two datasets (2018_-_Cities_WaterActions + 2018-_Cities_Water_Risks) to enlarge the feature space. Duplicate rows need to be summed after text vectorization. For columns that contain descriptions, I propose either the word2vec or the GloVe approach (after removing stopwords). TF-IDF may also come in handy if the pretrained models show significant discrepancies.
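
To make "sum duplicate rows after vectorization" concrete, here is a rough sketch of the aggregation step only. It swaps in plain term counts for the word2vec/GloVe embeddings, so it shows the shape of the operation and not the proposed model. The stopword list, the `vectorize` helper and the sample descriptions are hypothetical.

```python
from collections import Counter, defaultdict

# Tiny illustrative stopword list; a real one would be much longer.
STOPWORDS = {"the", "of", "to", "and", "a", "in", "is"}

def vectorize(text):
    """Bag-of-words counts after lowercasing and stopword removal."""
    tokens = [t.strip(".,;:").lower() for t in text.split()]
    return Counter(t for t in tokens if t and t not in STOPWORDS)

# (account number, risk description) pairs; values are invented.
rows = [
    (1, "Declining water quality"),
    (1, "Increased water stress and drought"),
    (2, "Flooding of the river"),
]

# Sum the vectors of all rows that share an account number,
# which yields one vector per city.
city_vectors = defaultdict(Counter)
for account, description in rows:
    city_vectors[account] += vectorize(description)
```

With dense embeddings the same loop would add fixed-length arrays instead of counters; the grouping by account number stays the same.
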
After imputation, we need to visualize the data and make sure that the features are correlated with the labels. If they are not, we need to iterate over the weak rows, treating them as NA, until we have a coherent dataset.
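
One possible way to run the correlation check numerically before plotting, as a sketch on invented data. The `risk_label` column name, the `noise` feature and the 0.3 threshold are assumptions, and Pearson correlation is just the pandas default, not a choice made in this thread.

```python
import pandas as pd

# Invented, already-imputed numeric features plus a numeric label.
df = pd.DataFrame({
    "magnitude": [0, 1, 2, 2, 1, 0],
    "timescale": [3, 1, 0, 1, 2, 3],
    "noise": [1, 0, 1, 0, 1, 0],
    "risk_label": [0, 1, 2, 2, 1, 0],
})

# Pearson correlation of each feature with the label.
features = df.drop(columns="risk_label")
corr = features.corrwith(df["risk_label"])

# Features whose absolute correlation falls below the (assumed) threshold.
weak = corr[corr.abs() < 0.3].index.tolist()
```

This flags weak features, not weak rows; finding the rows to reset to NA would need a separate per-row criterion.
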
In fact, the only things we need to keep from this dataset are the country, the coordinates and the column risks_to_city_s_water_supply; all the others are needed only for the imputation.

MDAIceland / WaterSecurity