JuliaStats / RDatasets.jl

Julia package for loading many of the data sets available in R
GNU General Public License v3.0

Column of row indices left in the iris csv file #6

Closed by dmbates 10 years ago

dmbates commented 11 years ago
julia> iris=data("datasets","iris")
150x6 DataFrame:
              Sepal.Length Sepal.Width Petal.Length Petal.Width     Species
[1,]        1          5.1         3.5          1.4         0.2    "setosa"
[2,]        2          4.9         3.0          1.4         0.2    "setosa"
[3,]        3          4.7         3.2          1.3         0.2    "setosa"
[4,]        4          4.6         3.1          1.5         0.2    "setosa"
[5,]        5          5.0         3.6          1.4         0.2    "setosa"
[6,]        6          5.4         3.9          1.7         0.4    "setosa"
[7,]        7          4.6         3.4          1.4         0.3    "setosa"
[8,]        8          5.0         3.4          1.5         0.2    "setosa"
[9,]        9          4.4         2.9          1.4         0.2    "setosa"
[10,]      10          4.9         3.1          1.5         0.1    "setosa"

The data set should not have the first (and unnamed) column.

I haven't checked other data sets yet. This may be a widespread "infelicity".

dmbates commented 11 years ago

It is widespread. When you write the CSV files from R, you should pass the optional argument row.names = FALSE to write.csv.
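A minimal sketch of the difference, using the built-in iris data frame and temporary files (the variable names with_rn and without_rn are only for illustration):

```r
# Write iris twice: once with write.csv's default, once with row.names = FALSE.
with_rn <- tempfile(fileext = ".csv")
without_rn <- tempfile(fileext = ".csv")

write.csv(iris, file = with_rn)                        # default is row.names = TRUE
write.csv(iris, file = without_rn, row.names = FALSE)  # no index column written

# Reading the files back shows the extra, unnamed leading column in the first one.
ncol(read.csv(with_rn))      # 6
ncol(read.csv(without_rn))   # 5
```

The six-column file corresponds to the 150x6 DataFrame shown in the report above.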

johnmyleswhite commented 11 years ago

This is a problem we inherited from the original repo that provided these files. I was hesitant to get out of sync with that repo, but I agree that the row index column is annoying. I'll go through and pull them out.

dmbates commented 11 years ago

I wrote an R script that dumps the data sets from a package, writing the row names only when they are useful:

#!/usr/bin/env Rscript

dump_pkg_datasets <- function(pkg_nms) {
    for (pnm in as.character(pkg_nms)) {
        # spell out character.only rather than relying on partial argument matching
        if (require(pnm, character.only=TRUE)) {
            pos = paste("package", pnm, sep=":")
            dnms = ls(pos=pos)
            # suppress the warning issued when the directory already exists
            suppressWarnings(dir.create(pnm))
            for (nm in dnms) {
                dd = get(nm, pos=pos)
                if (is.data.frame(dd)) {
                    print(nm)
                    rn = row.names(dd)
                    # default row names are just 1..n; seq_len also handles zero-row data frames
                    use_row_names = !(is.null(rn) || all(rn == seq_len(nrow(dd))))
                    write.csv(dd, quote=FALSE,
                              file=file.path(pnm, paste(nm, "csv", sep=".")),
                              row.names=use_row_names)
                }
            }
        }
    }
}

dump_pkg_datasets(commandArgs(trailingOnly=TRUE))
q("no")

Copy it to a file, make it executable with chmod +x, and run it from the shell with the names of one or more packages as arguments.
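To see what the row-name test in the script decides, here is a small sketch that pulls the check out into a helper and applies it to two built-in data sets. The helper name has_useful_row_names is hypothetical and not part of the script above.

```r
# Returns TRUE when the row names are something other than the default 1..n.
# (hypothetical helper, mirroring the use_row_names test in the script)
has_useful_row_names <- function(dd) {
    rn <- row.names(dd)
    !(is.null(rn) || all(rn == seq_len(nrow(dd))))
}

has_useful_row_names(iris)     # FALSE: row names are "1", "2", ..., "150"
has_useful_row_names(mtcars)   # TRUE: row names are the car models
```

So iris would be written without an index column, while mtcars would keep its row names as the first column.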

johnmyleswhite commented 11 years ago

Thanks, Doug. I'll get to this in a bit.