Pandas-esque `describe` method

SebastianSzturo commented 2 years ago

Are there any plans for a pandas-like describe method that would combine some of the aggregate methods (count, mean, standard deviation, peak) into a table to give you a rough understanding of a dataset, or is that out of scope of this library?
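For readers who have not used pandas, the kind of table being asked about can be sketched roughly as follows. This is only an illustration using Python's standard library: the `describe` helper and the sample data are hypothetical, not the pandas or Explorer API, and the quartiles assume linear interpolation between data points and at least two non-missing values per column.

```python
import statistics


def describe(columns):
    """Summarise each numeric column (hypothetical helper, not a library API)."""
    summary = {}
    for name, values in columns.items():
        # Drop missing values before computing any statistic.
        present = [v for v in values if v is not None]
        # Quartiles via linear interpolation between data points.
        q1, q2, q3 = statistics.quantiles(present, n=4, method="inclusive")
        summary[name] = {
            "count": len(present),
            "mean": statistics.mean(present),
            "std": statistics.stdev(present),  # sample standard deviation
            "min": min(present),
            "25%": q1,
            "50%": q2,
            "75%": q3,
            "max": max(present),
        }
    return summary


# Made-up sample data: column name -> list of values (None marks a missing value).
data = {"a": [1, 2, 3, 4, 5], "b": [10.0, None, 30.0, 50.0]}
for name, stats in describe(data).items():
    print(name, stats)
```

Each column ends up with one row of statistics, which is the "rough understanding of a dataset" the question refers to.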

halian-vilela commented 2 years ago

Hey @SebastianSzturo,

If I understood correctly, it will be a feature of Kino and Livebook in the near future. There's already a PR on Kino where this feature is being worked on (almost ready tbh) and waiting to be released.

That's definitely something I also miss, and I think it will help a lot in getting a broad view of datasets.

josevalim commented 2 years ago

I think we still need describe in here. Does dplyr have something similar?

halian-vilela commented 2 years ago

Oh, I see... I was too obsessed about the table view! Hahaha.

I don't think dplyr has a summarization other than glimpse.

We do have summary in base R, which gives a not so nice summary of each column:

> mtcars %>% summary()
      mpg             cyl             disp             hp             drat      
 Min.   :10.40   Min.   :4.000   Min.   : 71.1   Min.   : 52.0   Min.   :2.760  
 1st Qu.:15.43   1st Qu.:4.000   1st Qu.:120.8   1st Qu.: 96.5   1st Qu.:3.080  
 Median :19.20   Median :6.000   Median :196.3   Median :123.0   Median :3.695  
 Mean   :20.09   Mean   :6.188   Mean   :230.7   Mean   :146.7   Mean   :3.597  
 3rd Qu.:22.80   3rd Qu.:8.000   3rd Qu.:326.0   3rd Qu.:180.0   3rd Qu.:3.920  
 Max.   :33.90   Max.   :8.000   Max.   :472.0   Max.   :335.0   Max.   :4.930  
       wt             qsec             vs               am              gear      
 Min.   :1.513   Min.   :14.50   Min.   :0.0000   Min.   :0.0000   Min.   :3.000  
 1st Qu.:2.581   1st Qu.:16.89   1st Qu.:0.0000   1st Qu.:0.0000   1st Qu.:3.000  
 Median :3.325   Median :17.71   Median :0.0000   Median :0.0000   Median :4.000  
 Mean   :3.217   Mean   :17.85   Mean   :0.4375   Mean   :0.4062   Mean   :3.688  
 3rd Qu.:3.610   3rd Qu.:18.90   3rd Qu.:1.0000   3rd Qu.:1.0000   3rd Qu.:4.000  
 Max.   :5.424   Max.   :22.90   Max.   :1.0000   Max.   :1.0000   Max.   :5.000  
      carb      
 Min.   :1.000  
 1st Qu.:2.000  
 Median :2.000  
 Mean   :2.812  
 3rd Qu.:4.000  
 Max.   :8.000

But we do have - in another package called skimr - the function skim that returns a very nice table similar to describe:

> mtcars %>% skim()
── Data Summary ────────────────────────
                           Values    
Name                       Piped data
Number of rows             32        
Number of columns          11        
_______________________              
Column type frequency:               
  numeric                  11        
________________________             
Group variables            None      

── Variable type: numeric ─────────────────────────────────────────────────────────────────────────────────────────────────────────────
   skim_variable n_missing complete_rate    mean      sd    p0    p25    p50    p75   p100 hist 
 1 mpg                   0             1  20.1     6.03  10.4   15.4   19.2   22.8   33.9  ▃▇▅▁▂
 2 cyl                   0             1   6.19    1.79   4      4      6      8      8    ▆▁▃▁▇
 3 disp                  0             1 231.    124.    71.1  121.   196.   326    472    ▇▃▃▃▂
 4 hp                    0             1 147.     68.6   52     96.5  123    180    335    ▇▇▆▃▁
 5 drat                  0             1   3.60    0.535  2.76   3.08   3.70   3.92   4.93 ▇▃▇▅▁
 6 wt                    0             1   3.22    0.978  1.51   2.58   3.32   3.61   5.42 ▃▃▇▁▂
 7 qsec                  0             1  17.8     1.79  14.5   16.9   17.7   18.9   22.9  ▃▇▇▂▁
 8 vs                    0             1   0.438   0.504  0      0      0      1      1    ▇▁▁▁▆
 9 am                    0             1   0.406   0.499  0      0      0      1      1    ▇▁▁▁▆
10 gear                  0             1   3.69    0.738  3      3      4      4      5    ▇▁▆▁▂
11 carb                  0             1   2.81    1.62   1      2      2      4      8    ▇▂▅▁▁

Maybe we should follow this path... 😄

kimjoaoun commented 2 years ago

> I think we still need describe in here. Does dplyr have something similar?

It doesn't, but I think that's because, as @halian-vilela said, base R already has a feature for summary statistics.

And Wickham thought that this feature shouldn't be in dplyr. In our case, I do agree that we should have a describe/1, since we do not have an Elixir built-in function to do so. I prefer describe() over summary() because the latter is too similar to the already existing summarise/2.

cigrainger commented 2 years ago

This should totally be in Explorer :). It doesn't require new dependencies and it would be obnoxious for it to be a dedicated library. Not only is it useful, but I think it's a good opportunity for a blog post showing what you can do with Explorer. I'd like to jump on this one if that's okay :grin:.

halian-vilela commented 2 years ago

Hey @cigrainger!

Nice, I would suggest that you take a look at the mental model {skimr} uses to implement it. I really like the way they separate the summary by column type. I agree that, for Explorer and Elixir, it's overkill to have a dedicated library just for that, but precisely because it is dedicated, {skimr} has a lot of small features and allows for small customizations.

I would say that, at least as a first step, taking a broader look at those features and implementing a cherry-picked set of them would give a great result!
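The "separate the summary by column type" idea can be sketched loosely as below. This is an assumption-laden illustration in plain Python, not how {skimr} or Explorer implement anything: the `skim` helper, its two type buckets, the chosen statistics, and the sample values are all hypothetical, and numeric columns are assumed to have at least two non-missing values.

```python
from collections import Counter
import statistics


def skim(columns):
    """Group per-column summaries by a coarse column type (hypothetical helper)."""
    report = {"numeric": {}, "character": {}}
    for name, values in columns.items():
        present = [v for v in values if v is not None]
        n_missing = len(values) - len(present)
        if all(isinstance(v, (int, float)) for v in present):
            # Numeric columns get distribution statistics.
            report["numeric"][name] = {
                "n_missing": n_missing,
                "mean": statistics.mean(present),
                "sd": statistics.stdev(present),
            }
        else:
            # Text columns get cardinality and the most frequent value instead.
            top, _count = Counter(present).most_common(1)[0]
            report["character"][name] = {
                "n_missing": n_missing,
                "n_unique": len(set(present)),
                "top": top,
            }
    return report


# Small illustrative sample; None marks a missing value.
sample = {
    "mpg": [21.0, 22.8, None, 18.7],
    "make": ["Mazda", "Datsun", "Mazda", None],
}
for kind, cols in skim(sample).items():
    print(kind, cols)
```

Keeping one block per type means each block can carry only the statistics that make sense for it, which is what makes a cherry-picked first version feasible.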

jc00ke commented 2 years ago

I'm excited to see an open issue for this. We may be in need of profiling in the near term. I have no experience with Python or pandas, but I did just find pandas-profiling which is maybe the 2.0 of this describe/1 feature.

josephmachado commented 2 years ago

Excited to see Explorer catching up to pandas. I'm a DE, and pandas describe is always the first thing I run when I am exploring data. Would be great to see this in Explorer. 🎉

elixir-explorer / explorer

Pandas-esque `describe` method #157