Closed SebastianSzturo closed 1 year ago
Hey @SebastianSzturo,
If I understood correctly, it will be a feature of Kino and Livebook in the near future. There's already a PR on Kino where this feature is being worked on (almost ready, tbh) and waiting to be released.
That's definitely something I also miss, and I think it will help a lot in getting a broad view of datasets.
I think we still need describe in here. Does dplyr have something similar?
Oh, I see... I was too obsessed about the table view! Hahaha.
I don't think dplyr has a summarization other than glimpse.
We do have summary in base R, which gives a not-so-nice summary of each column:
> mtcars %>% summary()
     mpg             cyl             disp            hp              drat
Min.   :10.40   Min.   :4.000   Min.   : 71.1   Min.   : 52.0   Min.   :2.760
1st Qu.:15.43   1st Qu.:4.000   1st Qu.:120.8   1st Qu.: 96.5   1st Qu.:3.080
Median :19.20   Median :6.000   Median :196.3   Median :123.0   Median :3.695
Mean   :20.09   Mean   :6.188   Mean   :230.7   Mean   :146.7   Mean   :3.597
3rd Qu.:22.80   3rd Qu.:8.000   3rd Qu.:326.0   3rd Qu.:180.0   3rd Qu.:3.920
Max.   :33.90   Max.   :8.000   Max.   :472.0   Max.   :335.0   Max.   :4.930
     wt              qsec            vs               am               gear
Min.   :1.513   Min.   :14.50   Min.   :0.0000   Min.   :0.0000   Min.   :3.000
1st Qu.:2.581   1st Qu.:16.89   1st Qu.:0.0000   1st Qu.:0.0000   1st Qu.:3.000
Median :3.325   Median :17.71   Median :0.0000   Median :0.0000   Median :4.000
Mean   :3.217   Mean   :17.85   Mean   :0.4375   Mean   :0.4062   Mean   :3.688
3rd Qu.:3.610   3rd Qu.:18.90   3rd Qu.:1.0000   3rd Qu.:1.0000   3rd Qu.:4.000
Max.   :5.424   Max.   :22.90   Max.   :1.0000   Max.   :1.0000   Max.   :5.000
     carb
Min.   :1.000
1st Qu.:2.000
Median :2.000
Mean   :2.812
3rd Qu.:4.000
Max.   :8.000
But we do have, in another package called skimr, the function skim, which returns a very nice table similar to describe:
mtcars %>% skim()
── Data Summary ────────────────────────
Values
Name Piped data
Number of rows 32
Number of columns 11
_______________________
Column type frequency:
numeric 11
________________________
Group variables None
── Variable type: numeric ─────────────────────────────────────────────────────────────────────────────────────────────────────────────
   skim_variable  n_missing  complete_rate   mean     sd    p0   p25   p50   p75  p100  hist
 1 mpg                    0              1   20.1   6.03  10.4  15.4  19.2  22.8  33.9  ▃▇▅▁▂
 2 cyl                    0              1   6.19   1.79     4     4     6     8     8  ▆▁▃▁▇
 3 disp                   0              1   231.   124.  71.1  121.  196.   326   472  ▇▃▃▃▂
 4 hp                     0              1   147.   68.6    52  96.5   123   180   335  ▇▇▆▃▁
 5 drat                   0              1   3.60  0.535  2.76  3.08  3.70  3.92  4.93  ▇▃▇▅▁
 6 wt                     0              1   3.22  0.978  1.51  2.58  3.32  3.61  5.42  ▃▃▇▁▂
 7 qsec                   0              1   17.8   1.79  14.5  16.9  17.7  18.9  22.9  ▃▇▇▂▁
 8 vs                     0              1  0.438  0.504     0     0     0     1     1  ▇▁▁▁▆
 9 am                     0              1  0.406  0.499     0     0     0     1     1  ▇▁▁▁▆
10 gear                   0              1   3.69  0.738     3     3     4     4     5  ▇▁▆▁▂
11 carb                   0              1   2.81   1.62     1     2     2     4     8  ▇▂▅▁▁
Maybe we should follow this path... 😄
I think we still need describe in here. Does dplyr have something similar?
It doesn't, but I think that's because, as @halian-vilela said, base R already has a function for summary statistics, and Wickham thought that this feature shouldn't be in dplyr. In our case, I do agree that we should have a describe/1, since we do not have an Elixir built-in function to do so. I prefer describe() over summary() because the latter is too similar to the already existing summarise/2.
This should totally be in Explorer :). It doesn't require new dependencies and it would be obnoxious for it to be a dedicated library. Not only is it useful, but I think it's a good opportunity for a blog post showing what you can do with Explorer. I'd like to jump on this one if that's okay :grin:.
Hey @cigrainger!
Nice, I would suggest that you take a look at the mental model {skimr} uses to implement it. I really like the way they separate the summary by column type. I agree that, for Explorer and Elixir, it's overkill to have a dedicated library just for that, but precisely because it is dedicated, {skimr} has a lot of small features and allows for little customizations.
I would say that, at least at first, taking a broader look at those features and implementing a cherry-picked set of them would give a great result!
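To make the "separate the summary by column type" idea concrete, here is a rough sketch in plain Python (standard library only, chosen just so it runs anywhere). It is not how {skimr} or Explorer implement anything: the function name `summarize_by_type`, the two groups, and the statistics picked for each group are assumptions for illustration. Only `n_missing` and `complete_rate` are taken from the skim output above.

```python
import statistics
from collections import Counter


def summarize_by_type(columns):
    """Group per-column summaries by column type (hypothetical helper).

    `columns` maps a column name to a non-empty list of values, where
    None marks a missing value. Each column is assumed to have at least
    one non-missing value.
    """
    groups = {"numeric": {}, "string": {}}
    for name, values in columns.items():
        present = [v for v in values if v is not None]
        # Statistics shared by every column type.
        base = {
            "n_missing": len(values) - len(present),
            "complete_rate": len(present) / len(values),
        }
        if all(isinstance(v, (int, float)) for v in present):
            # Numeric columns get location statistics.
            groups["numeric"][name] = {
                **base,
                "mean": statistics.fmean(present),
                "min": min(present),
                "max": max(present),
            }
        else:
            # Everything else is treated as a string column here.
            groups["string"][name] = {
                **base,
                "n_unique": len(set(present)),
                "top": Counter(present).most_common(1)[0][0],
            }
    return groups


if __name__ == "__main__":
    data = {
        "mpg": [21.0, 22.8, None, 18.7],
        "name": ["a", "b", "a", None],
    }
    for type_name, summaries in summarize_by_type(data).items():
        print(type_name, summaries)
```

The point of the split is that each type gets only the statistics that make sense for it, while the shared columns stay comparable across groups.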
I'm excited to see an open issue for this. We may be in need of profiling in the near term. I have no experience with Python or pandas, but I did just find pandas-profiling, which is maybe the 2.0 of this describe/1 feature.
Excited to see Explorer catching up to pandas. I'm a DE, and pandas describe is always the first thing I do when I am exploring the data. Would be great to see this in Explorer. 🎉
Are there any plans for a pandas-like describe method that would combine some of the aggregate methods (count, mean, standard deviation, peak) into a table to give you a rough understanding of a dataset, or is that out of scope for this library?
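For readers who have not used pandas, the kind of table being asked for can be sketched in plain Python as below. This is only an illustration of which aggregates get combined, not pandas' or Explorer's implementation: the function name `describe` and the input shape (a dict of lists with None for missing values) are assumptions, and each column is assumed to hold at least two non-missing numbers.

```python
import statistics


def describe(columns):
    """Return {column_name: {stat_name: value}} for numeric columns.

    Hypothetical helper: `columns` maps a column name to a list of
    numbers, with None marking a missing value.
    """
    summary = {}
    for name, values in columns.items():
        present = [v for v in values if v is not None]
        # quantiles(n=4) returns the three quartile cut points. The
        # "inclusive" method interpolates linearly between the sample
        # minimum and maximum.
        q1, median, q3 = statistics.quantiles(present, n=4, method="inclusive")
        summary[name] = {
            "count": len(present),
            "mean": statistics.fmean(present),
            "std": statistics.stdev(present),  # sample standard deviation
            "min": min(present),
            "25%": q1,
            "50%": median,
            "75%": q3,
            "max": max(present),
        }
    return summary


if __name__ == "__main__":
    table = describe({"x": [1, 2, 3, 4, 5, None]})
    for stat, value in table["x"].items():
        print(f"{stat:>5}  {value}")
```

Each column becomes one row of statistics, so a whole dataset can be scanned at a glance before any real analysis starts.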