georgian-io-archive / foreshadow

An automatic machine learning system
https://foreshadow.readthedocs.io
Apache License 2.0

Add Foreshadow.get_data_summary() public API #217

Closed jichaoz closed 4 years ago

jichaoz commented 4 years ago

Description

Add a public API, Foreshadow.get_data_summary(), that returns a summary of the training data as a DataFrame. Also fix or suppress a couple of warnings. The data summary looks like the following:

                  Pclass             Sex        SibSp        Parch              Cabin     Embarked PassengerId           Age                                             Ticket  \
intent       Categorical     Categorical  Categorical  Categorical        Categorical  Categorical   Droppable       Numeric                                            Numeric   
count                712             712          712          712                712          712         712           712                                                712   
nan_pct                0               0            0            0            77.6685     0.280899           0       19.6629                                            25.8427   
unique                 3               2            7            7                117            3         712            83                                                425   
#1_value        3 55.90%     male 65.59%     0 67.98%     0 75.98%  C23 C25 C27 0.56%     S 73.74%   891 0.14%    24.0 3.65%                                       1601.0 0.84%   
#2_value        1 78.79%  female 100.00%     1 91.01%     1 89.19%      B96 B98 0.98%     C 91.29%   277 0.28%    22.0 6.88%                                     347082.0 1.69%   
#3_value       2 100.00%                     2 94.24%     2 98.60%           G6 1.40%     Q 99.72%   308 0.42%    25.0 9.83%                                    3101295.0 2.39%   
#4_value                                     4 96.49%     5 99.02%      C22 C26 1.83%                306 0.56%   28.0 12.78%                                      19950.0 2.95%   
#5_value                                     3 98.31%     4 99.44%           F2 2.25%                305 0.70%   18.0 15.59%                                     113781.0 3.51%   
#6_value                                     8 99.30%     3 99.86%         E101 2.67%                304 0.84%   30.0 18.40%                                     349909.0 4.07%   
#7_value                                    5 100.00%    6 100.00%          B28 2.95%                303 0.98%   21.0 21.07%                                     382652.0 4.63%   
#8_value                                                                    D26 3.23%                302 1.12%   19.0 23.74%                                      29106.0 5.06%   
#9_value                                                                    B35 3.51%                299 1.26%   29.0 25.98%                                       4133.0 5.48%   
#10_value                                                                   C78 3.79%                298 1.40%   27.0 28.09%                                     110152.0 5.90%   
invalid_pct                                                                                                                0                                                  0   
mean                                                                                                                 29.4988                                             274769   
std                                                                                                                  14.5001                                             505561   
min                                                                                                                     0.42                                                695   
25%                                                                                                                       21                                            27703.5   
50%                                                                                                                       28                                             236852   
75%                                                                                                                       38                                             348123   
max                                                                                                                       80                                         3.1013e+06   
5_outliers                                                                                                      [80.0, 74.0]  [3101298.0, 3101296.0, 3101295.0, 3101295.0, 3...
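
For readers who want to reproduce the top rows of this table outside Foreshadow, here is a minimal sketch in plain pandas. It is not the implementation in this PR: the helper names `summarize_column` and `summarize_frame` are hypothetical, and the sketch assumes, judging from the Pclass column above (55.90%, 78.79%, 100.00%), that each `#N_value` cell shows a value followed by its cumulative share of all rows. It covers only `count`, `nan_pct`, `unique`, and the `#N_value` rows, not the intent or numeric statistics.

```python
import pandas as pd


def summarize_column(series: pd.Series, top_n: int = 10) -> pd.Series:
    """Hypothetical helper: basic counts plus the top_n most frequent values."""
    total = len(series)
    summary = {
        "count": total,                          # number of rows, NaN included
        "nan_pct": series.isna().mean() * 100,   # percentage of missing entries
        "unique": series.nunique(dropna=True),   # distinct non-missing values
    }
    # Most frequent values first; shares are cumulative and relative to all rows
    # (an assumption inferred from the table in the PR description).
    counts = series.value_counts(dropna=True).head(top_n)
    values = counts.index.tolist()
    cumulative = (counts.cumsum() / total * 100).tolist()
    for rank in range(1, top_n + 1):
        if rank <= len(values):
            cell = f"{values[rank - 1]} {cumulative[rank - 1]:.2f}%"
        else:
            cell = ""  # fewer than top_n distinct values: leave the cell blank
        summary[f"#{rank}_value"] = cell
    return pd.Series(summary, dtype=object)


def summarize_frame(df: pd.DataFrame, top_n: int = 10) -> pd.DataFrame:
    """Hypothetical helper: one summary column per input column."""
    return pd.DataFrame(
        {col: summarize_column(df[col], top_n) for col in df.columns}
    )
```

Calling `summarize_frame(train_df)` on a training DataFrame (here `train_df` is a placeholder name) returns a frame with one column per feature and the statistics as rows, the same orientation as the output above.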