KentonWhite / ProjectTemplate

A template utility for R projects that provides a skeletal project.
http://projecttemplate.net
GNU General Public License v3.0
623 stars 159 forks source link

Added config option to treat the first row of a data file as header (current default) or not. #168

Closed crubb closed 7 years ago

crubb commented 7 years ago

Hey,

currently, all CSV, TSV, WSV and XLSX files are read with header = TRUE. In this pull request I propose to still treat this as a default, but also enable users to specify header = FALSE.

Best regards, crubb

connectedblue commented 7 years ago

Hi @crubb,

Nice patch.

I think the reason the tests have failed is because the new config parameter has to be configured in two files. You have made the change in defaults/config/default.dcf which is where defaults are applied if there is no config item specified during load.project().

However, you'll also want to update defaults/full/config/global.dcf so that new projects after create.project() have the new parameter automatically.

Not sure if there are also other reasons for the test failing, but this should catch some of them. There also appear to be problems with some of the reader tests, but that might be related.

On the patch itself, what's your opinion on how to set column names if none are specified in the raw data? Does R just assign defaults, and does the user then fix this in a munge script?

connectedblue commented 7 years ago

@crubb the other thing you might want to do is build the project and the push a commit after the build. That will generate the actual man pages from the .Rd files.

I do this using the devtools package and run the commands:

        setwd(<ProjectTemplateDir>)
        library(devtools)
        document()
        install(".")
crubb commented 7 years ago

Hey @connectedblue,

R gives default names (X1, X2, X3, ..., Xn). It's the default behavior of e.g. the read.csv() function. So I suppose that should be fine.

Apparently the tests are failing, because of Undefined global functions or variables: config, if I get that correctly. That's strange, I'm already using it in a project and it is working fine.

I had run roxygen2 on the package already, and there is a man page? What is missing?

Best regards, crubb

connectedblue commented 7 years ago

Oh, yes, sorry my mistake. I didn't scroll down far enough to see the generated man page.

A quick look at the tests confirms the problem: there is no load.project() run in any of the test cases, hence no config variable in the global env. You may be able to fix that by adding the standard boiler plate to the relevant tests:

  test_project <- tempfile('test_project')
  suppressMessages(create.project(test_project, minimal = FALSE))
  on.exit(unlink(test_project, recursive = TRUE), add = TRUE)

  oldwd <- setwd(test_project)
  on.exit(setwd(oldwd), add = TRUE)
  suppressMessages(load.project())

You should perhaps think about adding a new test case or two for the case where the config variable is set to FALSE, just to make sure things don't break in the future.

You can run the tests locally by loading the testthat library and running tests/run-all.R. It's quick and gives some useful output to help debug where the tests are failing.

KentonWhite commented 7 years ago

Hi @crub, Thanks for the PR. Any progress on getting the tests to pass?

crubb commented 7 years ago

Will have next week free to look further into this. Never authored a test before cough ;)

Am 28.11.2016 um 22:11 schrieb Kenton White notifications@github.com:

Hi @crub https://github.com/crub, Thanks for the PR. Any progress on getting the tests to pass?

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/johnmyleswhite/ProjectTemplate/pull/168#issuecomment-263395375, or mute the thread https://github.com/notifications/unsubscribe-auth/ABgEdR-mBP80Oysi2l9ok1p0EXAlJjz2ks5rC0NlgaJpZM4K8-q6.

connectedblue commented 7 years ago

It looks scarier than it actually is. If you run the tests locally, you'll get a readout of which ones ar efailing because the config variable doesn't exist. The boiler plate above should fix that and you can then see the wood for the trees.

Also, if you get a chance, could you look at the thread at the bottom of #169 and see if you can see what's going on there. I don't think it's related to the sqlite patch you posted earlier in the year, but you probably have more knowledge than me of odbc reading.

connectedblue commented 7 years ago

Hi @crubb, sorry to interject while you're in the middle of this, but I've had an additional thought on this feature.

I don't think the config file parameter is the best solution for this problem. The first reason is that it is a reader specific config and there are lots of those that live in reader files rather than globals.dcf. The second reason is that it won't help for a project where you have two csv data sources - one with a header and one without.

I think the better home for this config parameter is in the .file definition where type is set to csv and an additional parameter header is passed into the csv.reader as ... options or similar.

The file.reader.R would need to be re-architected to support this behaviour, but I think it would lead to more flexible specification of data files to load.

What do you think? I'm happy to chip in and propose a new structure for file.reader.R if it helps.

KentonWhite commented 7 years ago

@crubb @connectedblue I'm happy with a global variable for whether to include headers in csv and xslx files. Currently you can work around this by assigning a global environment default for the csv reader.

I understand that whether to include headers could be a file by file issue. I think the purpose of this PR and discussion is about making explicit the default behavior for these readers. Currently we default to reading headers and this default is not well documented. Making a global config variable makes this default explicit and hopefully easier for most ProjectTemplate users to work with.

connectedblue commented 7 years ago

Fair point, I'd forgotten about the xls case when I made my comment. This is in fact a global option across a number of reader types.

There is still the issue of projects that have both types of data (maybe a CSV file with a header and an excel without) but this is an edge case.

Sorry @crubb, I'll let you get back to your PR.

crubb commented 7 years ago

Hey guys,

I finally came back to the PR after a few months, sorry ;) The tests run fine locally now, but Travis fails on:

checking for sqlext.h... no
configure: error: "ODBC headers sql.h and sqlext.h not found"
ERROR: configuration failed for package ‘RODBC’

I don't really think that is related to my code changes?

Best regards, crubb

connectedblue commented 7 years ago

Hi @crubb

I'm sure I've seen this before from travis - it seems to have days when it doesn't like it's own build.

This sounds like crappy advice, but try again tomorrow. You can force a retest by closing and re-opening the PR.

I also have travis configured against my ProjectTemplate clone so you're welcome to raise a test PR against connectedblue/ProjectTemplate2:master to see if it passes.

But you're right - there's no obvious link between that error and your code

KentonWhite commented 7 years ago

This sometimes happens. Travis builds a new environment from scratch and must download all of the required packages and unix programs. Sometimes the site hosting the package or program is down, which causes an error. No worries as I can usually restart the build and then it will pass. I'll let you know if that works here.

KentonWhite commented 7 years ago

Ok I've fixed the problem with Travis. The fix requires that you merge the current master into your branch and then run the PR again.

crubb commented 7 years ago

That worked for both of my current PRs, thank you!

connectedblue commented 7 years ago

Hi @crubb - good news that the build works cleanly now.

Sorry - only just spotted something: Can you update website/configuring.markdown also with the new variable?

I have a vague plan of being able to generate the man page and website documentation from one source for config, but for now both places need to be manually kept in sync. It's pretty much the same format as the man page so should be a cut and paste job.

KentonWhite commented 7 years ago

@crubb I'm a bit worried about the behavior on an old project when migrate.project() has not been run. In that instance, config$data_loading_header will be NULL, which will through an error in read.csv():

Error in !header : invalid argument type`

This could cause an old project to fail suddenly. Of course the user will be warned that they should run migrate.project().

How do you think this edge case (where a user upgrades but does not run migrate.project() should be handled?

KentonWhite commented 7 years ago

@crubb never mind -- I've done some more testing and digging, and the .load.config() function will handle the edge case. Merging and closing. Thanks!