JuliaData / DataFrames.jl

In-memory tabular data in Julia
https://dataframes.juliadata.org/stable/

function request: info(df) as in pandas #2986

Open zsz00 opened 2 years ago

zsz00 commented 2 years ago

df.info() in pandas is a useful function!

I hope DataFrames.jl can have a similar function.

bkamins commented 2 years ago

Here is a way to get this information (and more or less, depending on which statistics you want; see the describe docstring for details):

julia> using DataFrames

julia> df = DataFrame(a=1:3, b='a':'c', c=rand(3))
3×3 DataFrame
 Row │ a      b     c
     │ Int64  Char  Float64
─────┼───────────────────────
   1 │     1  a     0.820353
   2 │     2  b     0.186675
   3 │     3  c     0.419596

julia> describe(df)
3×7 DataFrame
 Row │ variable  mean      min       median    max       nmissing  eltype
     │ Symbol    Union…    Any       Union…    Any       Int64     DataType
─────┼──────────────────────────────────────────────────────────────────────
   1 │ a         2.0       1         2.0       3                0  Int64
   2 │ b                   a                   c                0  Char
   3 │ c         0.475541  0.186675  0.419596  0.820353         0  Float64

julia> Base.summarysize(df)
804

zsz00 commented 2 years ago

info(df) would aggregate the information from size(df), describe(df), and summarysize(df).

bkamins commented 2 years ago

Let us wait for others to comment.

I never needed such information in combination and the downside is that info is just printing data (if I understand what it does correctly), while size, describe and summarysize return objects that can be programmatically worked with.

Also providing a function that just prints something is quite challenging in design as then you need to define how it should behave in text, html and LaTeX backends.
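To illustrate the trade-off between printing and returning objects, here is a minimal sketch of a helper that returns the metadata as a NamedTuple instead of printing it. The name `frame_info` and the field names are hypothetical and not part of DataFrames.jl; the sketch only uses `nrow`, `ncol`, `eachcol`, `propertynames`, and `Base.summarysize`.

```julia
using DataFrames

# Hypothetical helper (not part of DataFrames.jl): gathers size, schema,
# missing counts, and memory use into a NamedTuple that callers can inspect.
function frame_info(df::AbstractDataFrame)
    return (
        nrows = nrow(df),
        ncols = ncol(df),
        columns = DataFrame(
            variable = propertynames(df),
            eltype = eltype.(eachcol(df)),
            nmissing = [count(ismissing, col) for col in eachcol(df)],
        ),
        bytes = Base.summarysize(df),
    )
end

df = DataFrame(a = 1:3, b = 'a':'c', c = [0.5, missing, 1.5])
meta = frame_info(df)

meta.nrows      # number of rows
meta.columns    # one row per column: name, element type, missing count
```

Because the result is an ordinary object, how it is displayed in text, HTML, or LaTeX is left to the existing `show` methods for its parts rather than to a new printing function.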

jeremiahpslewis commented 2 years ago

Something like info might be relevant in the context of Pluto.jl notebooks, I could imagine an html widget/output with this would be quite useful. One downside to DataFrames.jl’s describe function in a report/document context is that without knowing the number of rows, the descriptive statistics (and nmissing especially) are more difficult to correctly interpret.

bkamins commented 2 years ago

"without knowing the number of rows, the descriptive statistics"

This is a good point.

@zsz00 - can you please propose a contract for info?

zsz00 commented 2 years ago

https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.info.html

zsz00 commented 2 years ago

info() and describe() have different purposes: info gives metadata about the data frame (size, schema, memory used), while describe gives per-column statistics.

bkamins commented 2 years ago

@zsz00 - thank you for linking the docstring of pandas info.

What I would like to decide is the contract for the DataFrames.jl info function that we would add.

If you are unsure how to specify such a contract, can you please comment on what information you would like to see?

In other words - if we decide to add such a function we need to write down exactly how we want this function to work in DataFrames.jl before it gets implemented; having a pandas reference is nice, but there is no 1 to 1 correspondence between pandas and DataFrames.jl data frame object specifications.

Also - do you want info to work only on AbstractDataFrame or also on DataFrameRow and GroupedDataFrame?

From the legal perspective I do not want to use what pandas does verbatim, as it has a different license (BSD-3) than DataFrames.jl (MIT), but this is a minor thing. I would rather work out what makes sense in the Julia ecosystem and implement that (even if it ends up very similar to pandas).

zsz00 commented 2 years ago

I would like to see information like this:

julia> info(df[, args])

4×3 DataFrame

Data columns (total 3 columns):
 #   Column     Non-Null Count  Dtype
---  ------     --------------  -----
 0   int_col    5 non-null      int64
 1   text_col   5 non-null      object
 2   float_col  5 non-null      float64

Index info 

memory usage: 248.0+ bytes

I hope it works on AbstractDataFrame, DataFrameRow, and GroupedDataFrame.

nalimilan commented 2 years ago

I'm not convinced we should add this. DataFrames tries to keep a simple and composable API so that users can master it; we don't want to add convenience functions just to mirror pandas or other implementations.

Regarding Pluto, I don't really see the advantage of having a dedicated function: wouldn't a widget be able to call describe and summarysize and concatenate the outputs?

The point that nmissing is hard to interpret without knowing the number of rows is interesting, but separate I think. I'd rather print the proportion of missing values if that's the case.
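As a rough sketch of what a proportion of missing values could look like with the current API, the share can be computed from the columns directly and attached to the `describe` output. The column name `pmissing` is an assumption made for this example, not an existing `describe` statistic.

```julia
using DataFrames

df = DataFrame(x = [1, missing, 3, missing], y = ["a", "b", "c", "d"])

# Ask describe only for the missing count and the element type.
stats = describe(df, :nmissing, :eltype)

# Hypothetical extra column: the share of missing values in each column,
# computed directly from the data so it does not depend on describe internals.
stats.pmissing = [count(ismissing, col) / nrow(df) for col in eachcol(df)]

stats
```

With the proportion next to `nmissing`, the table can be read without knowing the number of rows.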

zsz00 commented 2 years ago

info() and describe() serve different purposes: info() returns metadata about the data frame immediately, while describe() computes statistics on the data frame's columns, which takes some computation time.

Using describe() plus summarysize to get this information does unnecessary statistical work on the columns.
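One possible middle ground is to pass `describe` only the statistics that are wanted. The sketch below requests just `:nmissing` and `:eltype`, so the mean, median, and similar columns are not part of the result; whether this avoids all extra computation is an assumption about the implementation, and counting missing values still scans every column.

```julia
using DataFrames

df = DataFrame(a = 1:3, b = 'a':'c', c = rand(3))

# Metadata-style summary: no mean, min, median, or max requested.
meta = describe(df, :nmissing, :eltype)

dims = size(df)                 # (rows, columns)
bytes = Base.summarysize(df)    # approximate memory footprint in bytes
```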