Rdatatable / data.table

R's data.table package extends data.frame:
http://r-datatable.com
Mozilla Public License 2.0

Various enhancements to print.data.table #1523

Open MichaelChirico opened 8 years ago

MichaelChirico commented 8 years ago

Current task list:


Some Notes

3 (tabled pending clarification)

As I understand it, this issue is a request to prevent the console output from wrapping around (i.e., to force all columns to appear parallel, regardless of how wide the table is).

If that's the case, this is (AFAICT) impossible, since that's something done by RStudio/R itself. I for one certainly don't know of any way to alter this behavior.

If someone does know of a way to affect this, or if they think I'm mis-interpreting, please pipe up and we can have this taken care of.

7

As I see it there are two options here. One is to treat all key columns the same; the other is to treat secondary, tertiary, etc. keys separately.

Example output:

library(data.table)
set.seed(01394)
DT <- data.table(key1 = rep(c("A","B"), each = 4),
                 key2 = rep(c("a","b"), 4),
                 V1 = rnorm(8), key = c("key1","key2"))

# Only demarcate key columns
DT
#    | key1 | | key2 |         V1
#1: |    A | |    a |  0.5994579
#2: |    A | |    a | -1.0898775
#3: |    A | |    b | -0.2285326
#4: |    A | |    b | -1.7858472
#5: |    B | |    a | -0.6269875
#6: |    B | |    a | -0.6633084
#7: |    B | |    b |  1.0367084
#8: |    B | |    b |  0.7364276

# Separately "emboss" keys based on key order
DT
#    | key1 | || key2 ||         V1
#1: |    A | ||    a ||  0.5994579
#2: |    A | ||    a || -1.0898775
#3: |    A | ||    b || -0.2285326
#4: |    A | ||    b || -1.7858472
#5: |    B | ||    a || -0.6269875
#6: |    B | ||    a || -0.6633084
#7: |    B | ||    b ||  1.0367084
#8: |    B | ||    b ||  0.7364276

And of course, add an option for deciding whether to demarcate with | or some other user's-choice character (*, +, etc.)

9 [DONE]

Some feedback from a closed PR that was a first stab at solving this:

From Arun regarding preferred options:

col.names = c("auto", "top", "none")

"auto": current behaviour

"top": only on top, data.frame-like

"none": no column names -- exclude rows in which column names would have been printed.

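A minimal sketch of how the three `col.names` settings differ, assuming a data.table version in which this argument has landed in `print.data.table`. The table and the column name `value` are made up for illustration, and `class = FALSE` is passed explicitly so that the result does not depend on the session's default for printing classes.

```r
library(data.table)

# 30 rows: long enough that "auto" is expected to repeat the header at the
# bottom (an assumption about the current row threshold)
DT <- data.table(id = 1:30, value = rnorm(30))

out_auto <- capture.output(print(DT, col.names = "auto", class = FALSE))
out_top  <- capture.output(print(DT, col.names = "top",  class = FALSE))
out_none <- capture.output(print(DT, col.names = "none", class = FALSE))

# count the printed lines in each variant that contain the column name "value"
sapply(list(auto = out_auto, top = out_top, none = out_none),
       function(out) sum(grepl("value", out)))
```

Capturing the output like this is also a convenient way to write tests for print options.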
10 [DONE]

It would be nice to have an option to print a row under the row of column names which gives each column's stored type, as is currently (I understand) the default for the output of dplyr operations.

Example from dplyr:

library(dplyr)
DF <- data.frame(n = numeric(1), c1 = complex(1), i = integer(1),
                 f = factor(1), D = as.Date("2016-02-06"), c2 = character(1),
                 stringsAsFactors = FALSE)
tbl_df(DF)
# Source: local data frame [1 x 6]
#
#       n     c1     i      f          D    c2
#   (dbl) (cmpl) (int) (fctr)     (date) (chr) # <- this row
#1     0   0+0i     0      1 2016-02-06      

Current best alternative is to do sapply(DF, class), but it's nice to have a preview of the data with this extra information.
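For reference, the `sapply(DF, class)` workaround can be sketched in base R as follows. One caveat: `sapply` returns a list rather than a character vector as soon as any column carries more than one class, so this sketch keeps only the first class of each column with `vapply`. The name `col_classes` is just for illustration.

```r
DF <- data.frame(n = numeric(1), i = integer(1),
                 D = as.Date("2016-02-06"), c2 = character(1),
                 stringsAsFactors = FALSE)

# first class of every column, always returned as a named character vector
col_classes <- vapply(DF, function(col) class(col)[1L], character(1))
print(col_classes)
```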

11

This seems closely related to 3. Current plan is to implement this as an alternative to 3 since it seems more tangible/doable.

Via @nverno:

Would it be useful for head.data.table to have an option to print only the head of columns that fit the screen width, and summarise the rest? I was imagining something like the printed output from the head of a tbl_df in dplyr. I think it is nice for tables with many columns.

and the guiding example from Arun:

require(data.table)
dt = setDT(lapply(1:100, function(x) 1:3))
dt
dplyr::tbl_dt(dt)
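One way to think about "columns that fit the screen width" is sketched below in base R. `cols_that_fit` is a hypothetical helper, not part of data.table: it assumes every column is printed at the width of its widest formatted entry (or its name, if longer), preceded by one space, after a row-label gutter.

```r
# Hypothetical helper: how many leading columns fit into `width` characters?
cols_that_fit <- function(x, width = getOption("width"),
                          row_label_width = nchar(nrow(x)) + 1L) {
  # printed width of each column: the longer of its name and its widest entry
  col_width <- vapply(names(x), function(nm) {
    max(nchar(nm), nchar(format(x[[nm]])))
  }, integer(1))
  # running total: row labels first, then one separating space per column
  used <- row_label_width + cumsum(col_width + 1L)
  sum(used <= width)
}

x <- setNames(as.data.frame(lapply(1:100, function(i) 1:3)),
              paste0("V", 1:100))
k <- cols_that_fit(x, width = 20)
print(x[, seq_len(k)])
```

The remaining `ncol(x) - k` columns could then be summarised in a single trailing line, in the spirit of the dplyr output mentioned above.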

12

Currently covered by @jangorecki's PR #1448; Jan, assuming #1529 is merged first, could you edit the print.data.table man page for your PR?

arunsrinivasan commented 8 years ago

Just brilliant!

arunsrinivasan commented 8 years ago

No idea about 3 and 5 (as to what they mean). I think a PR for 6 would be nice (seems straightforward from what Jan wrote there). Perhaps ?print.data.table is the time consuming part? Do you think you'd be up for this, @MichaelChirico ? No idea as to what 7 means either.. 8 is another great idea. PR would be great!

arunsrinivasan commented 8 years ago

It'd be really nice if GitHub would allow assigning tasks to people who aren't necessarily project members :-(.

arunsrinivasan commented 8 years ago

There's also https://github.com/Rdatatable/data.table/issues/1497

MichaelChirico commented 8 years ago

@arunsrinivasan should I try and PR this one issue at a time? Or in one fell swoop? I've got 8 basically taken care of, just need to add tests.

arunsrinivasan commented 8 years ago

Michael, separate PRs.

nverno commented 8 years ago

Very nice! Sorry to get back to you late on this, but Arun provided a nice example. It is just a nice convenience when interactively looking at tables with lots of columns, so your console isn't engulfed by a huge data dump when you take a look at the head. I'll close that other one.

arunsrinivasan commented 8 years ago

It'd be also nice to print:

primary key: secondary indices: , etc..

by default. It's definitely informative to know what the keys and secondary indices are..

arunsrinivasan commented 8 years ago

Also, I think this is better output for:

print(DT, class=TRUE)
   <char> <int> <num>
     site  date     x
1:      A     1    10
2:      A     2    20
3:      A     3    30
4:      B     1    10
5:      B     2    20
6:      B     3    30

It's easier to copy/paste the data.table without the classes in the way. If we can do that, we can turn on printing classes by default.

Thoughts?

MichaelChirico commented 8 years ago

@arunsrinivasan about printing keys:

About class:

This can be done, but will require a step of wrangling -- basically toprint <- rbind(rownames(toprint), toprint); rownames(toprint) <- abbs. Which is fine, I'm just curious why you're thinking of easier copy-pasting as a clear advantage? I'm not sure what the cost of including class info in copy-pasted output is. Happy to hear feedback.
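The wrangling step can be sketched in base R on a plain character matrix. This is only an illustration under assumptions: it swaps the header (column names) rather than the row names, the objects `toprint` and `abbs` mirror the names used above but are built by hand here, and only three classes are mapped to abbreviations.

```r
DF <- data.frame(site = c("A", "B"), date = 1:2, x = c(10, 20),
                 stringsAsFactors = FALSE)

# character matrix of the formatted values, as a print method would build it
toprint <- as.matrix(format(DF))

# illustrative lookup from class to abbreviation (only three classes covered)
lookup <- c(character = "<char>", integer = "<int>", numeric = "<num>")
abbs <- unname(lookup[vapply(DF, function(col) class(col)[1L], character(1))])

# push the column names down into the first row, then put the classes on top
toprint <- rbind(colnames(toprint), toprint)
colnames(toprint) <- abbs
rownames(toprint) <- c("", paste0(seq_len(nrow(DF)), ":"))

print(toprint, quote = FALSE, right = TRUE)
```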

arunsrinivasan commented 8 years ago

About class: -- copy pasting from SO, for example to provide input to fread(). I also find it easier without the separation between column name and value (just used to seeing it).

On printing keys:

primary key: <a, b>

clearly tells the first key column is "a", then "b"..

Does this clarify things a bit?

arunsrinivasan commented 8 years ago

I agree tables() could use an update.

MichaelChirico commented 8 years ago

@arunsrinivasan OK, I think I can get on board with that. Can ditch point # 7 then. I agree distinguishing key order at a glance was going to be tough. So how about:

Lastly, I propose sending this output through message to help distinguish it from the data.table itself visually.

arunsrinivasan commented 8 years ago

My suggestion would be this:

  1. If either of these attributes are not present, don't print them. I think people will quickly learn that no keys are set (if it isn't displayed).
  2. Since there can be more than 1 secondary index, I suggest the format be:

Keys: <col1, col2> (only one)
Secondary Indices: , , <col1, col2>, ...

If there are more than 'x' (=5 to begin with?) indices, use a "...". They can always access it using key2().

I don't mind "<>" being replaced with "" if that'd be more aesthetically pleasing.. e.g., "col1,col2", "col1" etc..

Last proposal: seems nice, but I wonder if it might create issues with knitr when people suppress 'messages' in a chunk.. and print the output?
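A rough sketch of the format suggested above, written as a standalone helper rather than as part of `print.data.table`. `format_keys` and its `max_indices` argument are hypothetical names; the sketch relies on `haskey()`, `key()`, `setindex()` and `indices()` from data.table, and assumes that `indices()` joins the columns of a multi-column index with `"__"`. It returns the lines instead of sending them through `message`, which sidesteps the knitr question.

```r
library(data.table)

# Hypothetical helper: describe keys and secondary indices as text lines
format_keys <- function(DT, max_indices = 5L) {
  out <- character()
  # print nothing at all for an attribute that is not set
  if (haskey(DT)) out <- c(out, sprintf("Key: <%s>", toString(key(DT))))
  idx <- indices(DT)
  if (length(idx)) {
    shown <- sprintf("<%s>",
                     gsub("__", ", ", head(idx, max_indices), fixed = TRUE))
    # abbreviate when there are more indices than we are willing to list
    if (length(idx) > max_indices) shown <- c(shown, "...")
    out <- c(out, paste("Secondary Indices:", toString(shown)))
  }
  out
}

DT <- data.table(site = rep(c("A", "B"), each = 3),
                 date = rep(1:3, 2), x = rep(c(10, 20, 30), 2))
setkey(DT, site, date)
setindex(DT, x)
writeLines(format_keys(DT))
```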

arunsrinivasan commented 8 years ago

It'd be great to have this and class=TRUE default for v1.9.8 already.. we'll see.

arunsrinivasan commented 8 years ago

One other thought:

Many people use "numeric" type when an integer type would suffice, and when "integer64" would fit the bill better. How about marking those columns somehow while printing?

instead of <num>, perhaps >num< ?? that'll allow people to be aware of such optimisations as well..

arunsrinivasan commented 8 years ago

OR "!num!"? There's a function isReallyReal that checks this. But this'll perhaps be too time consuming to run on all rows every time..

MichaelChirico commented 8 years ago

@arunsrinivasan Hmm I think it's definitely not something to be used as a part of print.data.table default.

Some initial musings:

Are you thinking of pushing 1.9.8 soon?

Oh, one more thing, what do you think about porting print.data.table to its own .R file?

arunsrinivasan commented 8 years ago

Hm, yes, let's forget the marking of columns for now.

On pushing 1.9.8: trying as much as possible to wrap up the other marked issues as quickly as possible. I'd like to work on non-equi joins for this release.

On print.data.table to separate file, sure, sounds good.

MichaelChirico commented 8 years ago

@arunsrinivasan just a heads up that setting class = TRUE as the default is causing 100s of errors in the tests

arunsrinivasan commented 8 years ago

Okay thanks, will take a look.

MichaelChirico commented 8 years ago

@arunsrinivasan nvm, on second glance, it's a lot, but manageable. Have to fix ~ 25 tests. Working now...

arunsrinivasan commented 8 years ago

Great! No hurry. Take your time.

jangorecki commented 8 years ago

I'm not really convinced about changing the default to print classes. I don't find it useful in print; I use str to see classes (in dplyr, for some reason, they have a glimpse function for that purpose). Isn't it better for print to just print the data by default, and to use str to print classes and keys/indexes?

franknarf1 commented 8 years ago

I agree with @jangorecki that class=FALSE default is preferable. I value my screen real estate and usually don't need reminders about columns' classes. Ditto for keys and indices. I like these features, but would expect them to be off by default.

arunsrinivasan commented 8 years ago

Thanks for your input. I do think it's useful. Unless there's a strong reason (+ vote) against this, I'd like to give it a go. Maybe a lot of others might prefer it.

Perhaps we can put the keys / indices on hold. But I don't think 1 row for class types is taking away your screen's real estate.

arunsrinivasan commented 8 years ago

@MichaelChirico can we make the 'keys' argument FALSE for this release? Perhaps we can turn it on in the next one seeing how this one goes.

MichaelChirico commented 8 years ago

@arunsrinivasan sure. Will handle this after we iron out the update to class.

I agree with Frank that having it by default may be somewhat information overload... perhaps there's a middle ground (only print class if there's been a change in class for some column, e.g.).

Anyway happy to give setting class = TRUE as default a whirl.

jangorecki commented 8 years ago

Do we have any script that can be run to check packages that depend on data.table? I'm asking because potentially any package that tests output with Rout - Rout.save (or capture.output - I have 2 such non-CRAN pkgs) could be broken after changing the default print. It would be valuable to run such tests before and after to see the impact precisely. Then, depending on the percentage of affected CRAN packages, we could decide what's best.

arunsrinivasan commented 8 years ago

@jangorecki, good point. class=FALSE then for now. I'll come back to these issues later. Not important for now.

jangorecki commented 8 years ago

Any plans for minimalistic version of print key with * star prefix? or other nice ascii symbol? something like:

setkey(DT, site, date)
options("datatable.key.note"=TRUE)
print(DT)
#    *site *date     x
#1:      A     1    10
#2:      A     2    20

It would be my preferred one.

MichaelChirico commented 8 years ago

@jangorecki I'm fine with any way, but the resistance that cropped up with an approach like that is some people preferred to see key order as well, e.g.:

#    *site **date     x

In any case, if implemented, I would: set * as the default, and leave an option for making it whatever you want.
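The star-prefix idea, including the variant that encodes key order, could be prototyped outside of `print.data.table` along these lines. `mark_key_names` is a hypothetical helper; `mark` and `by_order` are illustrative argument names, not existing data.table options.

```r
library(data.table)

# Hypothetical helper: column names with key columns prefixed by `mark`
mark_key_names <- function(DT, mark = "*", by_order = FALSE) {
  nm <- names(DT)
  pos <- match(key(DT), nm)  # positions of the key columns, in key order
  # one mark per key column, or 1, 2, 3, ... marks to show the key order
  reps <- if (by_order) seq_along(pos) else rep(1L, length(pos))
  nm[pos] <- paste0(strrep(mark, reps), nm[pos])
  nm
}

DT <- data.table(site = rep(c("A", "B"), each = 3),
                 date = rep(1:3, 2), x = rep(c(10, 20, 30), 2))
setkey(DT, site, date)

mark_key_names(DT)
mark_key_names(DT, by_order = TRUE)

# print a renamed copy so the original table is left untouched
print(setnames(copy(DT), mark_key_names(DT, by_order = TRUE)))
```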

jangorecki commented 8 years ago

@MichaelChirico On one hand multiple stars are OK, but what if you had 20 columns in the key? Maybe a single star only if the order of key columns is the same as the data columns; for me that would be ~99% of cases.

up to 3 elements there are ascii numbers:

#    ¹*site ²*date     x

mbacou commented 8 years ago

@MichaelChirico about 3) above, one can use R global options:

width.user <- getOption("width") # remember the user's current width
options(width=as.integer(howWideIsDT)) # temporarily resize the output console
print(DT)
options(width=width.user) # reset to user's preferences

MichaelChirico commented 8 years ago

@mbacou thanks for the input!

In RStudio, at least, I don't see a difference in output having done that.

franknarf1 commented 8 years ago

@MichaelChirico You should see a difference. Try

library(data.table)
options(width=500)
(DT = data.table(matrix(1:1e3,1)))

RStudio wraps console output and offers no option to disable this "feature"; while base R console overflows with no wrapping until options()$width. Either way you should see a difference. Try resizing your console window to see the wrapping in action.
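The temporary-width approach can be wrapped so that the user's setting is restored even if printing fails. This is a base R sketch; `print_wide` is a hypothetical name, and whether the wider output is actually shown unwrapped still depends on the console, as noted above.

```r
# Hypothetical wrapper: print with a temporary console width
print_wide <- function(x, width = 500L, ...) {
  old <- options(width = width)      # options() returns the previous values
  on.exit(options(old), add = TRUE)  # restore them however we leave
  print(x, ...)
  invisible(x)
}

wide <- as.data.frame(matrix(1:100, nrow = 1))
print_wide(wide)
getOption("width")  # unchanged by the call above
```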

mbacou commented 8 years ago

Might be useful to add an optional format argument similar to knitr::kable() or type in ascii::print() to generate markdown, pandoc, rst, textile, (etc.) and org-mode compatible table formats?

I often use snippets like these to paste results into e-mails and org or markdown documents:

print(ascii(x, digits=2), type="org")
# |   | ISO3 | ADM0_NAME                   | ELEVATION     | whea_h   |
# |---+------+-----------------------------+---------------+----------|
# | 1 | TZA  | United Republic of Tanzania |               | 19.00    |
# | 2 | TZA  | United Republic of Tanzania | (3e+02,5e+02] | 0.00     |
# | 3 | TZA  | United Republic of Tanzania | (5e+02,9e+02] | 743.00   |
# | 4 | TZA  | United Republic of Tanzania | (9e+02,1e+03] | 9519.00  |
# | 5 | TZA  | United Republic of Tanzania | (1e+03,2e+03] | 29814.00 |
# | 6 | TZA  | United Republic of Tanzania | (2e+03,5e+03] | 894.00   |

knitr::kable(x, format="markdown")
# |ISO3 |ADM0_NAME                   |ELEVATION     | whea_h|
# |:----|:---------------------------|:-------------|------:|
# |TZA  |United Republic of Tanzania |NA            |     19|
# |TZA  |United Republic of Tanzania |(3e+02,5e+02] |      0|
# |TZA  |United Republic of Tanzania |(5e+02,9e+02] |    743|
# |TZA  |United Republic of Tanzania |(9e+02,1e+03] |   9519|
# |TZA  |United Republic of Tanzania |(1e+03,2e+03] |  29814|
# |TZA  |United Republic of Tanzania |(2e+03,5e+03] |    894|

MichaelChirico commented 8 years ago

@mbacou not quite convinced of the utility of adding this to print.data.table when ascii::print and knitr::kable already seem to do a fine job...

mbacou commented 8 years ago

Agreed. I'd vote for minimal output as well, but if you plan to provide more fancy printing options, then using a table format that pandoc can process would make sense.

franknarf1 commented 7 years ago

A minor thing, but it might be a good idea to export print.data.table. I only noticed it was hidden when typing args(print.data.table) just now.

MichaelChirico commented 7 years ago

@franknarf1 any other reason? we have ?print.data.table now and args(data.table:::print.data.table) have that covered. was just about to file the export in a PR, but stopped myself. i don't think it's uncommon for print methods to be hidden (see print.lm/print.glm in stats, e.g.)

franknarf1 commented 7 years ago

@MichaelChirico Nope. Not a problem unexported as you say; thanks for asking.

franknarf1 commented 7 years ago

Another idea: an option dput = TRUE, that will write reproducible code (since dput(DT) doesn't work). Something like

dtput = function(DT){
  d0 = capture.output(dput(setattr(data.table:::shallow(DT), ".internal.selfref", NULL)))
  cat("data.table::alloc.col(", d0, ")\n", sep="\n")
}

# example
library(data.table)
DT = as.data.table(as.list(1:10))
dtput(DT)
# which writes...
data.table::alloc.col(
structure(list(V1 = 1L, V2 = 2L, V3 = 3L, V4 = 4L, V5 = 5L, V6 = 6L, 
    V7 = 7L, V8 = 8L, V9 = 9L, V10 = 10L), .Names = c("V1", "V2", 
"V3", "V4", "V5", "V6", "V7", "V8", "V9", "V10"), row.names = c(NA, 
-1L), class = c("data.table", "data.frame"))
)

... except less hacky and embedded in print.data.table. I guess if dput = TRUE, all the others can be ignored. Getting fancy, maybe allow dput = "file.txt" like dput() does. I figure it makes enough sense to put it in print, and it's not worth it to add a new function.

franknarf1 commented 6 years ago

Another idea similar to those in #645 : turn off smart truncation of list column display: example from SO.

I see this truncation pretty frequently, and in some cases it'd be nice to see printing as if list column v was sapply(v, toString) instead.

MichaelChirico commented 6 years ago

@franknarf1 i think a very easy fix would be here:

paste(c(format(head(x,6), justify=justify, ...), if(length(x)>6)""),collapse=",")

change "" to "...". What do you think? I like toString, but should also come with a default width parameter, I'm not sure how to do that robustly.


actually, re-reading toString.default:

function (x, width = NULL, ...) 
{
    string <- paste(x, collapse = ", ")
    if (missing(width) || is.null(width) || width == 0) 
        return(string)
    if (width < 0) 
        stop("'width' must be positive")
    if (nchar(string, type = "w") > width) {
        width <- max(6, width)
        string <- paste0(strtrim(string, width - 4), "....")
    }
    string
}

It seems the default way of handling width is similar to what's currently implemented. I think limiting output based on on-screen width rather than truncating to the first few elements is better, no?

This approach also allows better user interaction since toString is S3-registered -- we (or end users) could write/customize toString.* methods for any use cases that arise. Perhaps add a colWidth parameter to print.data.table which would be dropped into width of toString.default...
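A small base R sketch of what width-based truncation of a list column might look like. `format_list_col` is a hypothetical helper and its `width` argument stands in for the `colWidth` idea; it simply passes `width` on to `toString` for each element.

```r
# Hypothetical helper: one display string per list element, capped at `width`
format_list_col <- function(v, width = 20L) {
  vapply(v, toString, character(1), width = width)
}

v <- list(1:3, 1:50, letters)
format_list_col(v, width = 12L)
```

Because `toString` is a generic, a class-specific `toString.*` method would be picked up here for elements of that class without any change to the helper.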

franknarf1 commented 6 years ago

@MichaelChirico One point in favor of the trailing "," over a ",..." is that it saves horizontal space. Nonetheless, that seems like a good change, since most users won't guess what the "," means.

Rather than that change, I was more interested in printing a higher number of entries in place of 6 in head(x, 6), like your colWidth idea.

Re methods, I'd find an argument like formatters = list(character = function(x) toString(x), lm = function(x) x$qr$tol) easy to use (to be used for list columns provided every element matches the named class or is NULL). Not sure if that's what you meant.

jsams commented 6 years ago

Thought I would drop a mention of #2893 here as the two seem closely related.

franknarf1 commented 6 years ago

(Similar to my last comment...) Having a data.table like...

library(data.table)
(DT <- data.table(id = 1:2, v = numeric_version("0.0.0")))
#   id                 v
# 1:  1 <numeric_version>
# 2:  2 <numeric_version>

I cannot really read the contents of my list column, even though there is a print method for it.

It would be nice to have a way to tell data.table how I want a list column of a certain class printed, like ...

library(magrittr)

formatters = list(numeric_version = as.character)

printDT = data.table:::shallow(DT)
left_cols = which(sapply(DT, is.list))
for (i in seq_along(formatters)){
    if (length(left_cols) == 0L) break 
    alt_cols = left_cols[ sapply(DT[, ..left_cols], inherits, names(formatters)[i]) ]    
    if (length(alt_cols)){
      printDT[, (alt_cols) := lapply(.SD, formatters[[i]]), .SDcols = alt_cols][]
      left_cols = setdiff(left_cols, alt_cols)
    }
}
print(printDT)

   id     v
1:  1 0.0.0
2:  2 0.0.0

Could have that list passed by the user in options(datatable.print.formatters = formatters). To reduce the computational burden, I guess this would be done after filtering with nrows= and topn=.

HughParsonage commented 5 years ago

(If I want to suggest an addition to this list, do I add it here or add it as a discrete issue?)

MichaelChirico commented 5 years ago

you can just add it here. feel free to edit initial post but also include a comment w some exposition please
