Open zsz00 opened 2 years ago
Here is a way to get this information (and more or less, depending on which statistics you want; see the describe docstring for details):
julia> using DataFrames
julia> df = DataFrame(a=1:3, b='a':'c', c=rand(3))
3×3 DataFrame
 Row │ a      b     c
     │ Int64  Char  Float64
─────┼───────────────────────
   1 │     1  a     0.820353
   2 │     2  b     0.186675
   3 │     3  c     0.419596
julia> describe(df)
3×7 DataFrame
 Row │ variable  mean      min       median    max       nmissing  eltype
     │ Symbol    Union…    Any       Union…    Any       Int64     DataType
─────┼──────────────────────────────────────────────────────────────────────
   1 │ a         2.0       1         2.0       3                0  Int64
   2 │ b                   a                   c                0  Char
   3 │ c         0.475541  0.186675  0.419596  0.820353         0  Float64
julia> Base.summarysize(df)
804
info(df) would be an aggregation of the information from: size(df), describe(df), summarysize(df).
Let us wait for others to comment.
I have never needed this information in combination, and the downside is that info just prints data (if I understand correctly what it does), while size, describe, and summarysize return objects that can be worked with programmatically.
Also, providing a function that just prints something is quite challenging to design, as you then need to define how it should behave in the text, HTML, and LaTeX backends.
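To illustrate the difference between printing and returning an object, here is a minimal sketch of a helper that gathers the same pieces into a NamedTuple. The name frame_info and its field names are hypothetical and not part of DataFrames.jl; the sketch only combines nrow, ncol, describe, and Base.summarysize.

```julia
using DataFrames

# Hypothetical helper, not part of DataFrames.jl: returns frame-level
# metadata as a NamedTuple instead of printing it.
function frame_info(df::AbstractDataFrame)
    return (nrow = nrow(df),
            ncol = ncol(df),
            # Requesting only these two statistics skips mean, median, min, and max.
            schema = describe(df, :eltype, :nmissing),
            bytes = Base.summarysize(df))
end

df = DataFrame(a = 1:3, b = 'a':'c', c = [0.5, missing, 1.5])
meta = frame_info(df)

meta.nrow               # number of rows
meta.schema.nmissing    # missing count per column
```

Because the result is an ordinary object, each display backend can render it as it sees fit, and callers can still use the individual fields.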
Something like info might be relevant in the context of Pluto.jl notebooks; I could imagine an HTML widget/output with this being quite useful. One downside of DataFrames.jl's describe function in a report/document context is that without knowing the number of rows, the descriptive statistics (and nmissing especially) are more difficult to interpret correctly.
"without knowing the number of rows, the descriptive statistics"
This is a good point.
@zsz00 - can you please propose a contract for info?
info() and describe() have different priorities: info is metadata about the data frame [size, schema, memory_used]; describe is statistics on the columns.
@zsz00 - thank you for linking the docstring of pandas info.
What I would like to decide is the contract for the DataFrames.jl info function that we would add.
If you are unsure how to specify such a contract, then can you please comment on what information you would like to see.
In other words - if we decide to add such a function we need to write down exactly how we want this function to work in DataFrames.jl before it gets implemented; having a pandas reference is nice, but there is no 1 to 1 correspondence between pandas and DataFrames.jl data frame object specifications.
Also - do you want info to work only on AbstractDataFrame or also on DataFrameRow and GroupedDataFrame?
From the legal perspective I do not want to use what pandas does verbatim, as it has a different license (BSD-3) than DataFrames.jl (MIT), but this is a minor thing. I would rather work out what makes sense in the Julia ecosystem and implement it (even if it is very similar to pandas).
I would like to see information like this:
julia> info(df[, args])
4×3 DataFrame
Data columns (total 3 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 int_col 5 non-null int64
1 text_col 5 non-null object
2 float_col 5 non-null float64
Index info
memory usage: 248.0+ bytes
I hope it works on AbstractDataFrame, DataFrameRow, and GroupedDataFrame.
I'm not convinced we should add this. DataFrames tries to keep a simple and composable API so that users can master it; we don't want to add convenience functions just to mirror pandas or other implementations.
Regarding Pluto, I don't really see the advantage of having a dedicated function: wouldn't a widget be able to call describe and summarysize and concatenate the outputs?
The point that nmissing is hard to interpret without knowing the number of rows is interesting, but separate, I think. I'd rather print the proportion of missing values if that's the case.
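As a rough sketch of that idea, the proportion can already be derived from the existing describe output by dividing nmissing by nrow(df). The column name pmissing is only an illustrative choice, not an existing describe statistic.

```julia
using DataFrames

df = DataFrame(a = [1, missing, 3, missing], b = ["x", "y", "z", "w"])

# Ask describe only for the missing-value count of each column.
stats = describe(df, :nmissing)

# Share of missing values per column; `pmissing` is an illustrative name.
stats.pmissing = stats.nmissing ./ nrow(df)
```

The division is done outside describe because custom statistics passed to describe receive columns with missing values already skipped, so they cannot count them.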
info() and describe() have different functions: info() returns metadata about the data frame immediately, while describe() computes statistics on the data frame's columns, which takes some computation time.
Using describe() + summarysize to get this info does unnecessary statistical work on the columns.
df.info() in pandas is a useful function!
I hope DataFrames.jl gets an equivalent function.