MDAIceland / WaterSecurity

Feature selection #55

Closed VasLem closed 3 years ago

VasLem commented 3 years ago

Added the Feature Generation and Selection part. @antosalerno @adriana-madi @bajo1207 @OlympiaG, the feature selection step can be removed if you pick a model that performs feature selection itself; removing it is relatively easy. @ekaan, can you add documentation to the Python file when you have time later today?
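The `Feat[A] * Feat[B]` and `Feat[A]^2` labels in the output are consistent with degree-2 polynomial feature generation. A minimal sketch of that idea, assuming this is how the features were built; the helper `degree2_features`, the default prefix, and the sample values are hypothetical and do not come from the project's Python file:

```python
from itertools import combinations_with_replacement


def degree2_features(row, prefix="perc_poly__"):
    """Return degree-2 terms (squares and pairwise products) for one sample.

    `row` maps a base feature name to its numeric value. The naming scheme
    imitates the labels in the log output; it is an assumption, not the
    project's actual code.
    """
    out = {}
    for a, b in combinations_with_replacement(list(row), 2):
        if a == b:
            name = f"{prefix}Feat[{a}]^2"
        else:
            name = f"{prefix}Feat[{a}] * Feat[{b}]"
        out[name] = row[a] * row[b]
    return out


# Hypothetical sample with made-up values.
sample = {"Urban population (%)": 50.0, "Exports and imports (% of GDP)": 80.0}
generated = degree2_features(sample)
for name, value in generated.items():
    print(name, value)
```

With n base features this produces n(n+1)/2 degree-2 terms, so the generated set grows quickly as base features are added.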

VasLem commented 3 years ago

Risk: Higher water prices

Explained variation percentage per principal component: [61.611560540676514, 14.392035205828138, 6.978655586311352, 3.6765889379404935, 2.5773972654688033, 2.1456148918369906, 1.2565741167344016, 1.1335668356170254, 1.0067086430354297, 0.8534679314685039]
Total percentage of the explained data by 10 components is: 95.63
Percentage of the information that is lost for using 10 components is: 4.37
Picked variable number: 10
Features select
Specs Score
439 perc_poly__Feat[Share of seats in parliament (% held by women)] * Feat[Infants lacking immunization, DTP (% of one-year-olds)] 7.76825
1455 perc_poly__Feat[Unemployment, total (% of labour force)] * Feat[Employment in agriculture (% of total employment)] 6.13585
2639 perc_poly__Feat[Gross enrolment ratio, upper secondary, both sexes (%)] * Feat[Prevalence of HIV, total (% of population ages 15-49)] 6.13184
295 perc_poly__Feat[Population with at least some secondary education, female (% ages 25 and older)] * Feat[Unemployment, total (% of labour force)] 6.11164
1456 perc_poly__Feat[Unemployment, total (% of labour force)] * Feat[Employment in services (% of total employment)] 6.03654
1740 perc_poly__Feat[Employment in agriculture (% of total employment)] * Feat[Municipal water withdrawal as % of total withdrawal] 6.02999
2005 perc_poly__Feat[Overall loss in HDI due to inequality (%)] * Feat[Municipal water withdrawal as % of total withdrawal] 5.83638
218 perc_poly__Feat[Population with at least some secondary education (% ages 25 and older)] * Feat[Unemployment, total (% of labour force)] 5.64476
520 perc_poly__Feat[Vulnerable employment (% of total employment)] * Feat[Unemployment, total (% of labour force)] 5.5885
1308 perc_poly__Feat[Unemployment, youth (% ages 15?24)] * Feat[Percentage of students in primary education who are female (%)] 5.52659
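The two summary figures in this block follow directly from the per-component percentages. A small check using the numbers printed for "Higher water prices"; only the variable names are mine:

```python
# Percentages copied from the "Higher water prices" output above.
explained = [
    61.611560540676514, 14.392035205828138, 6.978655586311352,
    3.6765889379404935, 2.5773972654688033, 2.1456148918369906,
    1.2565741167344016, 1.1335668356170254, 1.0067086430354297,
    0.8534679314685039,
]

total = sum(explained)   # share of variance kept by the 10 components
lost = 100.0 - total     # share discarded by dropping the rest

print(f"kept: {total:.2f}%  lost: {lost:.2f}%")
# prints "kept: 95.63%  lost: 4.37%"
```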

Risk: Inadequate or aging infrastructure

Explained variation percentage per principal component: [74.87709658177496, 14.602277912843489, 3.190126961316718, 1.4447942853728544, 1.08658233938848, 0.7879161541089548, 0.6610118837477541, 0.561494345043655, 0.4302827039175066, 0.40604084295520415]
Total percentage of the explained data by 10 components is: 98.05
Percentage of the information that is lost for using 10 components is: 1.95
Picked variable number: 15
Features select
Specs Score
449 perc_poly__Feat[Share of seats in parliament (% held by women)] * Feat[Employment to population ratio (% ages 15 and older)] 6.04373
2911 perc_poly__Feat[Percentage of female students enrolled in primary education who are over-age, female (%)] * Feat[Percentage of students in secondary education who are female (%)] 5.72702
1369 perc_poly__Feat[Private capital flows (% of GDP)] * Feat[Percentage of students in pre-primary education who are female (%)] 5.54383
3031 perc_poly__Feat[Percentage of students enrolled in primary education who are over-age, both sexes (%)] * Feat[Percentage of students in secondary education who are female (%)] 5.1383
1847 perc_poly__Feat[Working poor at PPP$3.20 a day (% of total employment)] * Feat[Industrial water withdrawal as % of total water withdrawal] 4.97675
1556 perc_poly__Feat[Youth not in school or employment (% ages 15-24)] * Feat[Population, female (% of total)] 4.86128
1557 perc_poly__Feat[Youth not in school or employment (% ages 15-24)] * Feat[Population, male (% of total)] 4.86128
1100 perc_poly__Feat[Gross fixed capital formation (% of GDP)] * Feat[Gross enrolment ratio, upper secondary, both sexes (%)] 4.73771
2937 perc_poly__Feat[Percentage of male students enrolled in primary education who are over-age, male (%)] * Feat[Percentage of students in secondary education who are female (%)] 4.57644
443 perc_poly__Feat[Share of seats in parliament (% held by women)] * Feat[Unemployment, youth (% ages 15?24)] 4.52001
465 perc_poly__Feat[Share of seats in parliament (% held by women)] * Feat[Gross enrolment ratio, primary, female (%)] 4.43906
1912 perc_poly__Feat[Gross capital formation (% of GDP)] * Feat[Gross enrolment ratio, pre-primary, male (%)] 4.31925
446 perc_poly__Feat[Share of seats in parliament (% held by women)] * Feat[Unemployment, total (% of labour force)] 4.29959
1910 perc_poly__Feat[Gross capital formation (% of GDP)] * Feat[Gross enrolment ratio, pre-primary, both sexes (%)] 4.24168
1911 perc_poly__Feat[Gross capital formation (% of GDP)] * Feat[Gross enrolment ratio, pre-primary, female (%)] 4.15378

Risk: Increased water stress or scarcity

Explained variation percentage per principal component: [56.49588784008235, 29.775887904318964, 3.192067775029895, 2.215176452112342, 1.430524570369619, 1.0529946852886123, 1.0070090560125087, 0.6617911411911048, 0.568967410519917, 0.5257576244921963]
Total percentage of the explained data by 10 components is: 96.93
Percentage of the information that is lost for using 10 components is: 3.07
Picked variable number: 27
Features select
Specs Score
575 perc_poly__Feat[Vulnerable employment (% of total employment)] * Feat[Agricultural water withdrawal as % of total renewable water resources] 21.6239
1791 perc_poly__Feat[Employment in services (% of total employment)] * Feat[Agricultural water withdrawal as % of total renewable water resources] 19.7533
350 perc_poly__Feat[Population with at least some secondary education, female (% ages 25 and older)] * Feat[Agricultural water withdrawal as % of total renewable water resources] 18.2396
3067 perc_poly__Feat[Percentage of students in pre-primary education who are female (%)] * Feat[Agricultural water withdrawal as % of total water withdrawal] 18.0407
1196 perc_poly__Feat[Inequality in education (%)] * Feat[Agricultural water withdrawal as % of total renewable water resources] 17.9179
1736 perc_poly__Feat[Employment in agriculture (% of total employment)] * Feat[Agricultural water withdrawal as % of total renewable water resources] 17.8274
273 perc_poly__Feat[Population with at least some secondary education (% ages 25 and older)] * Feat[Agricultural water withdrawal as % of total renewable water resources] 16.8634
1845 perc_poly__Feat[Working poor at PPP$3.20 a day (% of total employment)] * Feat[Agricultural water withdrawal as % of total renewable water resources] 16.1346
536 perc_poly__Feat[Vulnerable employment (% of total employment)] * Feat[Gross enrolment ratio, pre-primary, female (%)] 16.0199
3068 perc_poly__Feat[Percentage of students in pre-primary education who are female (%)] * Feat[Industrial water withdrawal as % of total water withdrawal] 15.5643
1600 perc_poly__Feat[Labour force participation rate (% ages 15 and older)] * Feat[Percentage of enrolment in secondary education in private institutions (%)] 15.2258
535 perc_poly__Feat[Vulnerable employment (% of total employment)] * Feat[Gross enrolment ratio, pre-primary, both sexes (%)] 14.9113
537 perc_poly__Feat[Vulnerable employment (% of total employment)] * Feat[Gross enrolment ratio, pre-primary, male (%)] 13.8573
184 pop_poly__Feat[Total population (millions)] * Feat[population_1k_density] 13.8306
578 perc_poly__Feat[Vulnerable employment (% of total employment)] * Feat[MDG 7.5. Freshwater withdrawal as % of total renewable water resources] 13.6625
1390 perc_poly__Feat[Exports and imports (% of GDP)]^2 13.5843
1599 perc_poly__Feat[Labour force participation rate (% ages 15 and older)] * Feat[Percentage of enrolment in primary education in private institutions (%)] 13.5199
775 perc_poly__Feat[Labour force participation rate (% ages 15 and older), male] * Feat[Percentage of students in pre-primary education who are female (%)] 13.4784
2303 perc_poly__Feat[Gross enrolment ratio, pre-primary, female (%)] * Feat[Labor force, female (% of total labor force)] 13.2982
156 pop_poly__Feat[population] * Feat[Urban population (%)] 13.2529
2259 perc_poly__Feat[Gross enrolment ratio, pre-primary, both sexes (%)] * Feat[Labor force, female (% of total labor force)] 13.2224
426 perc_poly__Feat[Population with at least some secondary education, male (% ages 25 and older)] * Feat[Agricultural water withdrawal as % of total renewable water resources] 13.2134
1197 perc_poly__Feat[Inequality in education (%)] * Feat[Agricultural water withdrawal as % of total water withdrawal] 13.1529
2346 perc_poly__Feat[Gross enrolment ratio, pre-primary, male (%)] * Feat[Labor force, female (% of total labor force)] 13.1356
1512 perc_poly__Feat[Youth not in school or employment (% ages 15-24)] * Feat[Labour force participation rate (% ages 15 and older)] 13.1342
1697 perc_poly__Feat[Employment in agriculture (% of total employment)] * Feat[Gross enrolment ratio, pre-primary, female (%)] 12.978
648 perc_poly__Feat[Urban population (%)] * Feat[Agricultural water withdrawal as % of total renewable water resources] 12.9201

Risk: Declining water quality

Explained variation percentage per principal component: [78.27390879136566, 6.166052091257242, 3.282279668960054, 3.1163840169810704, 2.380639356658543, 1.097313877635352, 0.9093861859235939, 0.7804899479485267, 0.5492017083013799, 0.48874242531712797]
Total percentage of the explained data by 10 components is: 97.04
Percentage of the information that is lost for using 10 components is: 2.96
Picked variable number: 19
Features select
Specs Score
1565 perc_poly__Feat[Youth not in school or employment (% ages 15-24)] * Feat[Agricultural water withdrawal as % of total renewable water resources] 13.7088
1568 perc_poly__Feat[Youth not in school or employment (% ages 15-24)] * Feat[MDG 7.5. Freshwater withdrawal as % of total renewable water resources] 13.5287
746 perc_poly__Feat[Labour force participation rate (% ages 15 and older), male] * Feat[Inequality in income (%)] 13.2073
777 perc_poly__Feat[Labour force participation rate (% ages 15 and older), male] * Feat[Percentage of students in secondary education who are female (%)] 12.116
2004 perc_poly__Feat[Overall loss in HDI due to inequality (%)] * Feat[MDG 7.5. Freshwater withdrawal as % of total renewable water resources] 12.0001
2038 perc_poly__Feat[Inequality in income (%)] * Feat[Percentage of students in secondary general education who are female (%)] 11.7702
2001 perc_poly__Feat[Overall loss in HDI due to inequality (%)] * Feat[Agricultural water withdrawal as % of total renewable water resources] 11.02
1308 perc_poly__Feat[Unemployment, youth (% ages 15?24)] * Feat[Percentage of students in primary education who are female (%)] 10.9603
1263 perc_poly__Feat[Inequality in life expectancy (%)] * Feat[MDG 7.5. Freshwater withdrawal as % of total renewable water resources] 10.5991
530 perc_poly__Feat[Vulnerable employment (% of total employment)] * Feat[Inequality in income (%)] 10.4985
137 scaled__SDG 6.4.2. Water Stress 10.3154
3058 perc_poly__Feat[Percentage of students in pre-primary education who are female (%)] * Feat[Population, male (% of total)] 10.1918
3057 perc_poly__Feat[Percentage of students in pre-primary education who are female (%)] * Feat[Population, female (% of total)] 10.1918
720 perc_poly__Feat[Labour force participation rate (% ages 15 and older), female] * Feat[Agricultural water withdrawal as % of total renewable water resources] 10.0034
3140 perc_poly__Feat[Population ages 0-14 (% of total)] * Feat[Agricultural water withdrawal as % of total renewable water resources] 9.99864
2780 perc_poly__Feat[Labor force, female (% of total labor force)] * Feat[Agricultural water withdrawal as % of total renewable water resources] 9.90631
1898 perc_poly__Feat[Share of employment in nonagriculture, female (% of total employment in nonagriculture)] * Feat[Agricultural water withdrawal as % of total renewable water resources] 9.67468
3143 perc_poly__Feat[Population ages 0-14 (% of total)] * Feat[MDG 7.5. Freshwater withdrawal as % of total renewable water resources] 9.66008
1068 perc_poly__Feat[Infants lacking immunization, DTP (% of one-year-olds)] * Feat[MDG 7.5. Freshwater withdrawal as % of total renewable water resources] 9.36312

Risk: Increased water demand

Explained variation percentage per principal component: [74.35979959531807, 16.23639131615468, 3.052385317234916, 1.2490249413135015, 0.8597006844862726, 0.8058804595900894, 0.6070225334257546, 0.5299944687415014, 0.4672292781172945, 0.3178953000586926]
Total percentage of the explained data by 10 components is: 98.49
Percentage of the information that is lost for using 10 components is: 1.51
Picked variable number: 10
Features select
Specs Score
487 perc_poly__Feat[Share of seats in parliament (% held by women)] * Feat[Percentage of students in secondary education who are female (%)] 12.9493
447 perc_poly__Feat[Share of seats in parliament (% held by women)] * Feat[Youth not in school or employment (% ages 15-24)] 8.64105
2874 perc_poly__Feat[Percentage of enrolment in primary education in private institutions (%)] * Feat[Municipal water withdrawal as % of total withdrawal] 7.82076
1482 perc_poly__Feat[Unemployment, total (% of labour force)] * Feat[Percentage of enrolment in primary education in private institutions (%)] 7.6997
594 perc_poly__Feat[Urban population (%)] * Feat[Youth not in school or employment (% ages 15-24)] 7.58774
0 scaled__population 7.47841
1483 perc_poly__Feat[Unemployment, total (% of labour force)] * Feat[Percentage of enrolment in secondary education in private institutions (%)] 7.30762
1299 perc_poly__Feat[Unemployment, youth (% ages 15?24)] * Feat[Percentage of enrolment in primary education in private institutions (%)] 7.00005
2896 perc_poly__Feat[Percentage of enrolment in secondary education in private institutions (%)] * Feat[Unemployment, male (% of male labor force) (modeled ILO estimate)] 6.94724
486 perc_poly__Feat[Share of seats in parliament (% held by women)] * Feat[Percentage of students in primary education who are female (%)] 6.81146

Risk: Regulatory

Explained variation percentage per principal component: [60.84877916778261, 12.932677773264626, 6.030719106399093, 4.0221021556904875, 3.0887947680546803, 2.4727109196783488, 1.7777044845148402, 1.6435185695388081, 1.1437258163666977, 1.0042779571148468]
Total percentage of the explained data by 10 components is: 94.97
Percentage of the information that is lost for using 10 components is: 5.03
Picked variable number: 10
Features select
Specs Score
1397 perc_poly__Feat[Exports and imports (% of GDP)] * Feat[Working poor at PPP$3.20 a day (% of total employment)] 12.0832
277 perc_poly__Feat[Population with at least some secondary education (% ages 25 and older)] * Feat[Municipal water withdrawal as % of total withdrawal] 11.6903
354 perc_poly__Feat[Population with at least some secondary education, female (% ages 25 and older)] * Feat[Municipal water withdrawal as % of total withdrawal] 11.6719
430 perc_poly__Feat[Population with at least some secondary education, male (% ages 25 and older)] * Feat[Municipal water withdrawal as % of total withdrawal] 11.4424
1740 perc_poly__Feat[Employment in agriculture (% of total employment)] * Feat[Municipal water withdrawal as % of total withdrawal] 10.9302
1204 perc_poly__Feat[Inequality in life expectancy (%)] * Feat[Exports and imports (% of GDP)] 10.5559
579 perc_poly__Feat[Vulnerable employment (% of total employment)] * Feat[Municipal water withdrawal as % of total withdrawal] 10.3693
1400 perc_poly__Feat[Exports and imports (% of GDP)] * Feat[Overall loss in HDI due to inequality (%)] 9.13338
652 perc_poly__Feat[Urban population (%)] * Feat[Municipal water withdrawal as % of total withdrawal] 9.1217
1140 perc_poly__Feat[Inequality in education (%)] * Feat[Exports and imports (% of GDP)] 8.62565

Risk: Energy supply issues

Explained variation percentage per principal component: [56.591930731505336, 17.827250608775376, 5.099604567700036, 4.331467067036899, 3.0868188081393892, 2.699519986543184, 1.882391125355395, 1.8205347861215964, 1.2734765640436936, 0.8998715916627013]
Total percentage of the explained data by 10 components is: 95.51
Percentage of the information that is lost for using 10 components is: 4.49
Picked variable number: 10
Features select
Specs Score
3183 perc_poly__Feat[Population, female (% of total)] * Feat[Unemployment, male (% of male labor force) (modeled ILO estimate)] 19.9001
3196 perc_poly__Feat[Population, male (% of total)] * Feat[Unemployment, male (% of male labor force) (modeled ILO estimate)] 19.9001
2600 perc_poly__Feat[Gross enrolment ratio, secondary, male (%)] * Feat[Population growth (annual %)] 18.3182
3184 perc_poly__Feat[Population, female (% of total)] * Feat[Unemployment, total (% of total labor force) (modeled ILO estimate)] 17.5992
3197 perc_poly__Feat[Population, male (% of total)] * Feat[Unemployment, total (% of total labor force) (modeled ILO estimate)] 17.5992
2525 perc_poly__Feat[Gross enrolment ratio, secondary, both sexes (%)] * Feat[Population growth (annual %)] 16.5388
2636 perc_poly__Feat[Gross enrolment ratio, upper secondary, both sexes (%)] * Feat[Population growth (annual %)] 15.1723
2563 perc_poly__Feat[Gross enrolment ratio, secondary, female (%)] * Feat[Population growth (annual %)] 14.3133
889 perc_poly__Feat[Foreign direct investment, net inflows (% of GDP)] * Feat[Gross enrolment ratio, lower secondary, male (%)] 13.6433
114 scaled__Population, male (% of total) 13.1978
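
The "Specs / Score" tables above read like a univariate ranking: each generated feature gets a score against the risk label and the top k are kept. The output does not say which score function was used, so the sketch below assumes a one-way ANOVA F-statistic purely for illustration; `anova_f`, `top_k`, and the toy data are hypothetical.

```python
from statistics import mean


def anova_f(values, labels):
    """One-way ANOVA F-statistic of a single feature against class labels."""
    groups = {}
    for v, y in zip(values, labels):
        groups.setdefault(y, []).append(v)
    grand = mean(values)
    # Variation of the group means around the grand mean.
    between = sum(len(g) * (mean(g) - grand) ** 2 for g in groups.values())
    # Variation of the samples around their own group mean.
    within = sum((v - mean(g)) ** 2 for g in groups.values() for v in g)
    df_between = len(groups) - 1
    df_within = len(values) - len(groups)
    return (between / df_between) / (within / df_within)


def top_k(features, labels, k):
    """Rank features by F-score and keep the k best as (name, score) pairs."""
    scores = {name: anova_f(col, labels) for name, col in features.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)[:k]


# Toy data (made up): feature "a" separates the classes, "b" does not.
labels = [0, 0, 0, 1, 1, 1]
features = {
    "a": [1.0, 1.1, 0.9, 3.0, 3.1, 2.9],
    "b": [1.0, 2.0, 3.0, 1.5, 2.5, 2.0],
}
best = top_k(features, labels, k=1)
print(best)
```

A feature that is constant within every class would make the within-group term zero; real code would need to guard against that division.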

VasLem commented 3 years ago

Crazy outcomes, particularly those involving "Share of seats in parliament (% held by women)". It seems that some of the selected features relate to differences between civilizations, standards, etc. We can get some pretty nice insights by looking at this data.

VasLem commented 3 years ago

@ekaan also needs to add documentation, then we can merge it

bajo1207 commented 3 years ago

> Crazy outcomes, particularly those involving "Share of seats in parliament (% held by women)". It seems that some of the selected features relate to differences between civilizations, standards, etc. We can get some pretty nice insights by looking at this data.

@adriana-madi This needs to go in the report somehow. Maybe we could make a tag cloud or some other representation?
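
A tag cloud would need a count of how often each base indicator appears among the selected features. A rough sketch of that counting step, using a handful of names copied from the output earlier in this thread; the regular expression and the helper `base_features` are my own assumptions about how to split the names:

```python
import re
from collections import Counter

# A few selected feature names copied from the output earlier in this thread.
selected = [
    "perc_poly__Feat[Share of seats in parliament (% held by women)] * Feat[Infants lacking immunization, DTP (% of one-year-olds)]",
    "perc_poly__Feat[Unemployment, total (% of labour force)] * Feat[Employment in agriculture (% of total employment)]",
    "perc_poly__Feat[Share of seats in parliament (% held by women)] * Feat[Unemployment, total (% of labour force)]",
    "perc_poly__Feat[Exports and imports (% of GDP)]^2",
    "scaled__population",
]

FEAT = re.compile(r"Feat\[(.*?)\]")


def base_features(name):
    """Split a generated feature name into the base indicators it is built from."""
    found = FEAT.findall(name)
    if found:
        return found
    # Names such as "scaled__population" have no Feat[...] wrapper.
    return [name.split("__", 1)[-1]]


counts = Counter(base for name in selected for base in base_features(name))
for base, n in counts.most_common():
    print(n, base)
```

The resulting counts could be fed to any tag-cloud or bar-chart tool; running it over the full lists per risk would show which indicators dominate.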

VasLem commented 3 years ago

@antosalerno had asked me to upload a CSV of the augmented dataset, but it is "huge" (60 MB). Instead, I uploaded a notebook called ClassificationOnAugmentedFeatures.ipynb, which can be used as a template: run the code inside to get the augmented features. The process is deterministic, so I assume there is no reason to save the result as a CSV and pollute GitHub.
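
If anyone wants to confirm that a regenerated dataset matches someone else's, comparing a checksum is cheaper than sharing the file. A minimal sketch, assuming the features can be serialized to JSON; `build_features` is a hypothetical stand-in for the notebook's augmentation step, not its real code:

```python
import hashlib
import json


def fingerprint(rows):
    """SHA-256 of a canonical JSON dump, so equal data gives an equal hash."""
    payload = json.dumps(rows, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


def build_features(raw):
    """Hypothetical stand-in for the notebook's deterministic augmentation."""
    return [{"x": r["x"], "x^2": r["x"] ** 2} for r in raw]


raw = [{"x": 1.5}, {"x": 2.0}]
first = fingerprint(build_features(raw))
second = fingerprint(build_features(raw))
print(first == second)
```

Two collaborators who get the same fingerprint can be reasonably sure they are working on the same augmented features without committing the CSV.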