JuliaStats / RDatasets.jl

Julia package for loading many of the data sets available in R
GNU General Public License v3.0

Capitalization of variable names #19

Closed dmbates closed 10 years ago

dmbates commented 10 years ago

Is it intentional that the capitalization of variables' names differs from that in the R data sets?

I'm currently revising Bates and Watts (1988), Nonlinear Regression Analysis and Its Applications, including examples in R and in Julia. Admittedly, the capitalization of the variable names in R is wildly inconsistent, and the capitalization in the RDatasets package is more consistent, but it still becomes awkward to explain why the formulas differ between the two versions of an example.

For example, in R

> names(Puromycin)
[1] "conc"  "rate"  "state"

whereas in Julia,

julia> colnames(data("datasets","Puromycin"))
3-element Array{Union(ASCIIString,UTF8String),1}:
 "Conc" 
 "Rate" 
 "State"

johnmyleswhite commented 10 years ago

My goal in changing the case of names was to push for much higher consistency across datasets, but I'm happy to revert that decision if it's awkward.

If it's not too much trouble for you, I would like it if we were to still ensure that (1) no names are ever invalid Julia identifiers and (2) no names ever contain obvious misspellings.

dmbates commented 10 years ago

On thinking more about this I believe that consistency within the RDatasets package is more desirable than is consistency with the names in R which, as I mentioned, are not at all consistent.

I agree that the checking should include scanning for '.' embedded in a name - one of the unfortunate consequences of the age of the S language specification. (Originally '_' was an assignment operator, interchangeable with '<-' in S and in R, because on some Teletype machines the '_' was a left-pointing arrow. That convention still lives on in ESS, where typing a single '_' creates ' <- '. Because '_' was in use, the '.' was used as a separator in names. Then came the convention of using '.' in a function name to indicate an S3 method.)

johnmyleswhite commented 10 years ago

Ok. If you're happy with consistency, then I think we can leave the current changes in place. If you find any place where we haven't consistently made every column name into a valid name that uses initial-cap camelcase, please open an issue. There are a few datasets whose column names were sufficiently unclear to me that I didn't know how to fix them.
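
For anyone who wants to check names against that convention, here is a minimal sketch. It assumes the rule is simply "split on '.' and '_', capitalize each piece, and join". The helper camelcase_name is hypothetical and is not the code RDatasets actually uses; the sample names are only illustrations.

```julia
# Hypothetical helper, not part of RDatasets.jl: convert an R-style column
# name to initial-cap camelcase by splitting on '.' and '_'.
function camelcase_name(name::AbstractString)
    parts = split(name, ['.', '_']; keepempty=false)
    return join(uppercasefirst.(parts))
end

# Sample R-style names (illustrative input only).
r_names = ["conc", "rate", "state", "Sepal.Length"]
julia_names = camelcase_name.(r_names)

# Collect any converted name that is still not a valid Julia identifier.
invalid = filter(n -> !Base.isidentifier(n), julia_names)

println(julia_names)
println(invalid)
```

A name that comes back in invalid (for example, one that starts with a digit) would still need a manual fix, and this sketch does nothing about misspellings.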

Really interesting to hear the history of the . convention in R.

randyzwitch commented 10 years ago

Late to the party, but I concur with leaving the new naming format; I went through the Gadfly documentation and fixed every example for the new format, so I'd prefer it didn't change again :)

johnmyleswhite commented 10 years ago

I think the format should be stable, but there are a few data sets left in there that don't fully match the format yet.

garborg commented 10 years ago

I think the names are pretty consistent at this point, but definitely open an issue or PR if you come across any inconsistencies.