[R-Forge #2197] A simple labels attribute like in the Hmisc package for variable descriptions

Rdatatable / data.table

R's data.table package extends data.frame:

http://r-datatable.com

Mozilla Public License 2.0


[R-Forge #2197] A simple labels attribute like in the Hmisc package for variable descriptions #623

Open arunsrinivasan opened 10 years ago

arunsrinivasan commented 10 years ago

Submitted by: Griffith Rees; Assigned to: Nobody; R-Forge link

One data management feature of Stata that R lacks is variable descriptions within the standard data frame. The Hmisc package deals with this in a simple way: http://www.statmethods.net/input/variablelables.html. While this seems like a very trivial change, it allows large social science datasets with opaque variable names (have a look at the US Census) to actually be manageable within R without spending hours hand coding variable abbreviations to complicated variable names. If this were implemented, nicely written variable names (with spaces and special characters) could appear in tables and plots that are output straight to LaTeX, without post-processing.

An example of how this could be used with the existing stata importer:

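dta2data.table <- function(path) {
  dta <- foreign::read.dta(path)  # existing Stata importer
  d <- data.table(dta)
  # setlabel() is the function proposed in this request; it does not exist yet.
  # read.dta() stores variable descriptions in "var.labels"
  # ("val.labels" names the value-label sets instead).
  setlabel(d, attr(dta, "var.labels"))
  return(d)
}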

Thanks again for an excellent and supremely useful project :)

geponce commented 9 years ago

I second this feature and in relation to this, some of the functionality at Morpho (http://bit.ly/1Tzc7Nj) looks interesting and very related to what Griffith mentions above, I guess.

MichaelChirico commented 5 years ago

I think this issue needs a use case MRE.

> If this were implemented, nicely written variable names (with spaces and special characters) could appear in tables and plots that are output straight to LaTeX, without post-processing.

This seems to be quite a deep request and probably better suited to an add-on package as it will likely require S3 or S4 methods for columns to auto-replace their names with their labels.

> without spending hours hand coding variable abbreviations to complicated variable names

I'm not seeing why this is the case. In the example, there weren't "hours spent hand coding" (unless that was already done upstream, in which case it is moot anyway) -- we simply copy the labels attribute onto the data.table object -- either onto the object itself, or onto the columns individually.

This is and has always been possible (though, I agree, quite poorly documented) in base R and hence data.table. So, barring a more specific example of the anticipated workflow/API, I vote to close.
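For readers wondering what "already possible" might look like, here is a minimal sketch. It assumes the labels live in a per-column attribute named "label" (the name Hmisc uses; any name would do). `setattr()` is data.table's by-reference attribute setter; the `labels` vector and the `get_labels()` helper are hypothetical names introduced only for this illustration.

```r
library(data.table)

d <- data.table(celsius = 20, fahrenheit = 68)

# Hypothetical labels, keyed by column name
labels <- c(
  celsius    = "Temperature in degrees Celsius",
  fahrenheit = "Temperature in degrees Fahrenheit"
)

# Attach each label to its column by reference
for (nm in names(labels)) {
  setattr(d[[nm]], "label", labels[[nm]])
}

# Hypothetical helper: collect the labels into a named character vector,
# returning NA for columns that carry no "label" attribute
get_labels <- function(x) {
  vapply(x, function(col) {
    lab <- attr(col, "label", exact = TRUE)
    if (is.null(lab)) NA_character_ else as.character(lab)
  }, character(1))
}

get_labels(d)
```

Whether such attributes survive subsetting, joins, or grouping is a separate question that this sketch does not address.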

iago-pssjd commented 4 years ago

Nowadays, beyond Hmisc, other packages such as haven, labelled, and sjlabelled help manage labels in the tidyverse family of packages.

I am starting to learn data.table, but not having the possibility of managing labels could discourage me from continuing with it. This may be the case for many other people, since variable labels and value labels for categorical variables are very useful.

Thank you anyway for the great package.

jangorecki commented 4 years ago

It is even more important because non-native encoding in column names cannot be reliably handled everywhere, and it seems that we will have to force users to change their column names in some cases. In such cases, labels could still carry the required column names in any encoding. @iago-pssjd could you maybe link a manual page that describes the usage of labels in some of the mentioned packages?

iago-pssjd commented 4 years ago

Yes, here are two pages, one for labelled and one for sjlabelled, although the second overlaps with the first a bit:

Thank you!

jangorecki commented 4 years ago

I had a brief look at the links and it seems to be a much broader approach. AFAIU what we really need is just an extra attribute that has to be retained/handled during common operations:

d = data.table(celsius = 20, fahrenheit = 68)
setlabels = function(x, labels) {
  setattr(x, "labels", labels)  # set on the argument x, not the global d
}
setlabels(d, labels = c("°C","°F"))

and then handle that nicely in print.data.table, fwrite(yaml=TRUE)

print(d)
#        °C         °F  
#   celsius fahrenheit
#1:      20         68
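Until something like this exists in print.data.table, the idea can be approximated from user code. The sketch below is only an illustration: `with_label_row()` is a hypothetical helper, not part of data.table. It assumes the labels are stored in a "labels" attribute on the table, as in the snippet above, and it places them in the first data row rather than above the column names.

```r
library(data.table)

d <- data.table(celsius = 20, fahrenheit = 68)
setattr(d, "labels", c("°C", "°F"))

# Hypothetical helper: returns a character copy of x whose first row
# holds the labels, so they show up when the result is printed
with_label_row <- function(x) {
  labs <- attr(x, "labels", exact = TRUE)
  if (is.null(labs)) return(x)           # nothing to show
  stopifnot(length(labs) == ncol(x))     # one label per column
  label_row <- as.data.table(as.list(setNames(labs, names(x))))
  body <- x[, lapply(.SD, as.character)] # values converted to text
  rbindlist(list(label_row, body))
}

print(with_label_row(d))
```

Converting every column to character is a deliberate simplification: it keeps the helper short, at the cost of losing the column types in the printed copy. The original table `d` is left unchanged.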