JuliaData / DataFrames.jl

In-memory tabular data in Julia
https://dataframes.juliadata.org/stable/

Corner case for show of a data frame in REPL #1779

Closed bkamins closed 4 years ago

bkamins commented 5 years ago

@nalimilan - I understand that you are working on making show of a data frame in the REPL slicker. Here is a corner case to test against (it is currently not rendered very nicely):

julia> df = DataFrame(x="a"^1000)
1×1 DataFrame. Omitted printing of 1 columns
│ Row │ │     │ ├─────┼
│ 1   │

julia> show(stdout, df, splitcols=true)
1×1 DataFrame
│ Row │ │     │ ├─────┼
│ 1   │

│ Row │ x

              │
│     │ String

              │
├─────┼────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
──────────────┤
│ 1   │ aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa │

dgkf commented 4 years ago

I'd be happy to take this on to get introduced to the codebase. Just to round out some test cases before jumping in, it appears to be an issue primarily with extra-long cell contents (as in the original example):

julia> io = IOContext(IOBuffer(), :displaysize=>(100,20), :limit=>true)
julia> function fake_show(io, args...; kwargs...)
           show(io, args...; kwargs...)
           print(String(take!(io.io)))
       end

julia> fake_show(io, DataFrame(x = "a" ^ 10), allcols=true)  # limit of width
1×1 DataFrame
│ Row │ x          │
│     │ String     │
├─────┼────────────┤
│ 1   │ aaaaaaaaaa │

julia> fake_show(io, DataFrame(x = "a" ^ 11), allcols=true)  # overflows width
1×1 DataFrame
│ Row │ │     │ ├─────┼
│ 1   │ 

│ Row │ x           │
│     │ String      │
├─────┼─────────────┤
│ 1   │ aaaaaaaaaaa │

but it can also be caused by the Type in the header:

julia> io = IOContext(IOBuffer(), :displaysize=>(100,15), :limit=>true)
julia> fake_show(io, DataFrame(x = Union{Real,String}[1, "a"]), allcols=true)
2×1 DataFrame
│ Row │ │     │ ├─────┼
│ 1   │ │ 2   │ 

│ Row │ x      │
│     │ Union… │
├─────┼────────┤
│ 1   │ 1      │
│ 2   │ a      │

or by long column names:

julia> fake_show(io, DataFrame(abcdefg = [1, 2]), allcols = true)
2×1 DataFrame
│ Row │ │     │ ├─────┼
│ 1   │ │ 2   │ 

│ Row │ abcdefg │
│     │ Int64   │
├─────┼─────────┤
│ 1   │ 1       │
│ 2   │ 2       │

Option 1: truncation

julia> <fixed show option 1>(io, DataFrame(x = "a" ^ 11), allcols=true) 
1×1 DataFrame
│ Row │ x           │
│     │ String      │  # ideally with "\e[90m…\e[39m" colored text
├─────┼─────────────┤  # similar to header type, to disambiguate from an 
│ 1   │ aaaaaaaaaa… │  # actual part of the string

Option 2: wrapping output that can't fit within display width

julia> <fixed show option 2>(io, DataFrame(x = "a" ^ 11), allcols=true) 
1×1 DataFrame
│ Row │ x           │
│     │ String      │
├─────┼─────────────┤
│ 1   │ aaaaaaaaaaa │  # overflow to multi-line rows
│     │ a           │
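
A minimal sketch of the wrapping step behind option 2, assuming plain character-based wrapping with no word-boundary handling. `wrap_cell` is a hypothetical helper written for illustration, not an existing DataFrames.jl function; it only uses `textwidth` and `IOBuffer` from Base.

```julia
# Hypothetical helper (not part of DataFrames.jl): split `s` into pieces
# that each occupy at most `width` terminal columns.
function wrap_cell(s::AbstractString, width::Integer)
    width >= 1 || throw(ArgumentError("width must be positive"))
    lines = String[]
    io = IOBuffer()
    used = 0
    for c in s
        w = textwidth(c)
        # Start a new line when the next character would not fit.
        if used > 0 && used + w > width
            push!(lines, String(take!(io)))
            used = 0
        end
        print(io, c)
        used += w
    end
    push!(lines, String(take!(io)))
    return lines
end

wrap_cell("a"^12, 11)  # two pieces: eleven 'a's, then one 'a'
```

Each returned piece would become one physical line of a multi-line row, with the row label printed only on the first of them.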

Handling extreme cases:

It should probably just print a message that the display width needs to increase if there is at most enough space to print a single character from a single column. Something like:

1x1   | # 6 char display size
DataFr|
ame   |
displa| # use "\e[90m…\e[39m" colored text for "display too narrow" text
y     |
too   |
narrow|

If the goal is to get something cleaner worked out as quickly as possible, I'd be inclined to do option 1. If showing the full show output of every cell is a priority, truncation would conflict with that goal and option 2 would be preferred. I could definitely see text wrapping being helpful for things like paragraphs of text in a single cell, but it seems like a pretty niche use case.

Proposal:

Go with option 1, truncate column names, header Types and cell text down to a minimum of 1 visible character and an ellipsis (always displayed in faded font), displaying a message to increase display size for anything narrower than that.
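
The truncation rule in this proposal could look roughly like the sketch below. `truncate_cell` is a hypothetical name used only for illustration, and the sketch assumes the ellipsis occupies one terminal column; the faded styling of the ellipsis is left out and would be applied at print time.

```julia
# Hypothetical helper (not part of DataFrames.jl): shorten `s` so that it
# occupies at most `width` terminal columns, marking the cut with an ellipsis.
function truncate_cell(s::AbstractString, width::Integer)
    # The proposal keeps at least one visible character plus the ellipsis.
    width >= 2 || throw(ArgumentError("need room for one character plus an ellipsis"))
    textwidth(s) <= width && return String(s)
    io = IOBuffer()
    used = 0
    for c in s
        w = textwidth(c)
        # Reserve one column for the trailing ellipsis (assumed width 1).
        used + w > width - 1 && break
        print(io, c)
        used += w
    end
    print(io, '…')
    return String(take!(io))
end

truncate_cell("a"^11, 11)  # fits, returned unchanged
truncate_cell("a"^12, 11)  # returns "aaaaaaaaaa…"
```

The same function would apply to column names and header types, since all three are just strings competing for the same column width.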

bkamins commented 4 years ago

Thank you for a detailed analysis.

My initial thought was just to special-case the situation where we decide not to display any columns and print something similar to what is printed when DataFrame() is shown. This should be the simplest to do. Please keep in mind that we support MIME types other than plain text (LaTeX and HTML) that probably should be fixed too (I have not checked whether we have a problem there).

Now, if you wanted to invest more time in the issue, then either option 1 or option 2 (where one could use some symbol to signal that the output was wrapped) would be really nice. However, I am afraid that option 2 can be very difficult to implement fully correctly. Note, for example, the problem we currently have with:

julia> DataFrame(a=[DataFrame(a=1), DataFrame(b=2)])
2×1 DataFrame
│ Row │ a         │
│     │ DataFrame │
├─────┼───────────┤
│ 1   │ 1×1 DataFrame
│ Row │ a     │
│     │ Int64 │
├─────┼───────┤
│ 1   │ 1     │   │
│ 2   │ 1×1 DataFrame
│ Row │ b     │
│     │ Int64 │
├─────┼───────┤
│ 1   │ 2     │   │

So my intuition - to get a balance between usefulness and simplicity would be Option 1, where we would:

then mincolwidth and maxcolwidth could have some default sensible values (fit for typical terminal sizes), that could be modified by IOContext.
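
As a rough illustration of how such limits could be read from an IOContext: the keys `:mincolwidth` and `:maxcolwidth`, the helper name `colwidth_limits`, and the default values below are all assumptions made for this sketch, not settings that DataFrames.jl defines. The only real API used is `get(io, key, default)` from Base.

```julia
# `:mincolwidth` and `:maxcolwidth` are hypothetical IOContext keys used
# only for illustration; the defaults here are arbitrary placeholders.
function colwidth_limits(io::IO; mindefault::Int = 2, maxdefault::Int = 32)
    lo = get(io, :mincolwidth, mindefault)
    hi = get(io, :maxcolwidth, maxdefault)
    lo <= hi || throw(ArgumentError("mincolwidth must not exceed maxcolwidth"))
    return (lo, hi)
end

io = IOContext(IOBuffer(), :maxcolwidth => 12)
colwidth_limits(io)  # maximum taken from the context, minimum from the default
```

A plain IO without any context falls back to both defaults, so existing callers would not need to change.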

(again - some thought should be given to what to do with HTML and LaTeX)

This is what comes to mind when I consider a possible "better" solution. But option "0", which you describe in "Handling extreme cases", is needed in any case, so maybe it is best to start with it and then build additional features on top once it is merged.
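
A sketch of what option "0" might check, under a layout assumption read off the tables printed in this thread: the row-label column takes `labelwidth + 4` characters and each data column takes `colwidth + 3`. The function names and the message text are hypothetical; `displaysize(io)` is the Base function that honors the `:displaysize` key of an IOContext.

```julia
# Layout assumption taken from the rendered examples in this thread:
# "│ Row │" uses labelwidth + 4 characters, " x … │" uses colwidth + 3.
first_column_fits(displaywidth, labelwidth, colwidth) =
    labelwidth + 4 + colwidth + 3 <= displaywidth

# Hypothetical entry point: print the summary line, then either report that
# the display is too narrow or tell the caller to go on printing the table.
function show_or_warn(io::IO, header::AbstractString,
                      labelwidth::Integer, colwidth::Integer)
    displaywidth = displaysize(io)[2]
    println(io, header)
    if !first_column_fits(displaywidth, labelwidth, colwidth)
        println(io, "display too narrow")
        return false
    end
    return true
end

io = IOContext(IOBuffer(), :displaysize => (100, 20), :limit => true)
show_or_warn(io, "1×1 DataFrame", 3, 10)  # a 10-character column fits in 20
show_or_warn(io, "1×1 DataFrame", 3, 11)  # an 11-character column does not
```

With a width of 20 this reproduces the boundary from the earlier test cases, where a 10-character cell fits and an 11-character cell overflows.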

dgkf commented 4 years ago

@bkamins option "0" which you describe in "Handling extreme cases"

I'm all for getting something simple and robust in there first and then trying to handle it more rigorously. I'm glad you brought up the other MIME types - I had overlooked those.

Just to add a reference to some prior art, in my R background I've used a very nice package called pillar that defines an API for developers to control formatting of their outputs in tibbles (tidyverse data.frames), with sensible defaults for when the API hasn't been implemented for specific classes. It also defines ways to parameterize cell output based on how "squeezed" the columns get. If it's on the roadmap to rework printing it might be worth just skimming that work to see how that ecosystem has broken up the display of data frames and what APIs they expose.

bkamins commented 4 years ago

I'm glad you brought up the other MIME types - I had overlooked those.

This is relevant as both HTML and LaTeX are needed for Jupyter Notebook interop.

If it's on the roadmap to rework printing

Actually the roadmap is to decouple printing from the DataFrames.jl package (exactly following your initial comments about what should be included) and rely on an external package that would be imported internally. The reason is that table printing is generic for any Tables.jl-compliant type.

The current most likely candidate is https://github.com/ronisbr/PrettyTables.jl. It seems mature enough and well maintained enough to justify the switch. Simply, no one has yet put in enough effort to check how it could work. It does not have all the functionality (e.g. it does not print the number of columns omitted, and we would have to support all the types we have with it: DataFrame, SubDataFrame, DataFrameRow, GroupedDataFrame, DataFrameRows, DataFrameColumns). Your pillar-related ideas could probably be discussed with the https://github.com/ronisbr/PrettyTables.jl devs.

Because of this roadmap we do not want to invest too much time into printing (but e.g. a PR providing the integration with PrettyTables.jl would be very interesting to see).

BTW. Just one more example to make sure it is fixed:

julia> df = DataFrame(a=["a"^200 for i in 1:200], b="b"^200)
200×2 DataFrame. Omitted printing of 2 columns
│ Row │ │     │ ├─────┼
│ 1   │ │ 2   │ │ 3   │ │ 4   │ │ 5   │ │ 6   │ │ 7   │ │ 8   │ │ 9   │ │ 10  │ │ 11  │ │ 12  │ │ 13  │ │ 14  │ │ 15  │ │ 16  │ │ 17  │ │ 18  │ │ 19  │ │ 20  │ │ 21
│ │ 22  │ │ 23  │
⋮
│ 177 │ │ 178 │ │ 179 │ │ 180 │ │ 181 │ │ 182 │ │ 183 │ │ 184 │ │ 185 │ │ 186 │ │ 187 │ │ 188 │ │ 189 │ │ 190 │ │ 191 │ │ 192 │ │ 193 │ │ 194 │ │ 195 │ │ 196 │ │ 197
│ │ 198 │ │ 199 │ │ 200 │