use of .R files in the data directory

connectedblue commented 7 years ago

I've run into an issue on one of my projects which @KentonWhite highlighted when #160 was being developed to auto cache data during load.project()

This file demonstrates the issues, and how I worked around them:

https://github.com/connectedblue/capstone/blob/master/capstone/data/corpus.R

In this project, I am trying to create a test data set from the original data during my exploratory analysis (I use a custom config variable flag to toggle between the full and testing set to save rewriting the data acquisition code).

The way .R files are treated in the data directory results in the following issues:

multiple variables can be created which may or may not be intended for project analysis. In my file, the variables corpus_dir, docs and doc are side effects used to create the intended variable corpus. I wrapped all the processing into a function so that these variables did not find their way into the global environment.
- unless a variable is called the same name as the .R file, the data file will always be processed in the data directory, This duplicates processing unnecessarily when the variables are already in memory or the cache. I got around this by making the generated variable the same name as the file, and it is skipped during load.project() as a result.

These same issues would also occur in other data loaders such as excel and sql where the datafile can generate multiple global environment variables.

I have a proposed solution for the .R problems above, but I want to get the community's view before submitting a PR.

Essentially I propose using convention over configuration, and enforce a rule that .R files in the data directory can only produce one global variable with the same name as the file. Any other global variables would be deleted which means that side effect variables like mine above don't have to be wrapped in a function to avoid polluting the global environment.

We'd have to put some migration logic in to warn the user about any legacy code that would need to be looked at.

I think it's good practice anyway to have one variable per .R file in the data directory, and enforcing it brings more benefits to users when they run load.project() making it faster to load by relying on memory and disk caching.

The change would be quite straightforward - I'm happy to do it, but I don't want to get distracted from my project so it'll be a couple of weeks before I raise a PR. In the meantime, I'd like to get feedback on the approach to this issue.

KentonWhite commented 7 years ago

@connectedblue I'm not quite following the problem. The convention of ProjectTemplate is that only variable names that match the file name will be checked prior to processing the file in the data directory.

In my opinion keeping the global namespace clean is the responsibility of the user. If the user wishes to create multiple global variables during the data step this is their choice. Sometimes the intended behavior is to create multiple global variables, as is the case when an excel workbook with multiple sheets is loaded.

This is why I'm not sure I'm following the issue! Perhaps I'm missing the larger picture here. In your attached example (thanks for that btw!) what did you originally have? How did your original solution create an unexpected behavior within ProjectTemplate? What was the expected behavior compared to what happened? That might help me better understand the larger context of your comment :)

connectedblue commented 7 years ago

Hi @KentonWhite, I agree the user has to be responsible - I just also wanted to be lazy and avoid having to wrap unwanted variables in a function :) As you say, multiple variables are also created in excel files, so that is a smaller point and perhaps a red herring.

However, the larger point is the unnecessary processing that occurs when the variable name is not the same. In my example, I originally had the variable called test_corpus (unfortunately I didn't push that code to git). The Corpus function can take 15 minutes to load (as can excel files sometimes) so I really only want to do it once and use the cached result in subsequent load.project().

What was happening is that test_corpus was being loaded first, correctly from the cache, but then it was loaded again when the corpus.R file was executed (because no corpus variable exists).

So, I think the bigger picture is that users who have slow loading data in .R files will always need to create a variable with the same name. Furthermore, if they create say two variables in the same file from different sources, then you would have to clear the cache and re-generate both even if only one has changed. In my example, if I created both test_corpus and corpus in the same corpus.R, then if I wanted to regenerate the test data, I could only do so by deleting the corpus variable from both global env and cache.

So, in a round about way, I'm saying that when users are using .R files to generate large data sources, it really only makes sense to have one variable per file which is also the same name as the file. Given that the creation of .R is under the control of the user (they probably haven't inherited it) it's painless to create multiple files for each dataload.

The reason I would make this a convention is to document it so that users don't have to go through the same thinking to arrive at the most efficient conclusion. We can add examples on the website to show how .R should be used for efficient data acquisition.

The excel case is different though. Usually the user has been given a file from another source with multiple tabs and they all have to be loaded from the same file. Here it is also possible that none of the tab names have the same name as the file, so it will be unnecessarily loaded every time, even though the underlying data has already been cached. Excel loads can also take quite a long time.

What I was going to suggest for the excel case was if, after loading the file there is no global env variable with the same name as the file, then have the reader create one with a value of TRUE. This will then get auto-cached. So, on subsequent invocations of load.project the excel file won't be re-read. If the user want to force a re-read, then all they have to do is delete the TRUE variable from the cache and memory.

Sorry for the long reply. Does that make more sense? I'm really keen to make sure that users run load.project() as often as possible and not have them think it will be painful.

By the way, I don't know if you listen to the Not So Standard Deviations podcast, but @hilaryparker made a throw away comment in the last episode that she only runs load.project() sparingly (and she's a massive ProjectTemplate fan!).

KentonWhite commented 7 years ago

@connectedblue Ok I see. just to make sure I understand, the proposal is to make it a convention that there can only be one global variable in a data/*.R file and that the global variable must be the same name as the data/*.R file. Any other global variables created will be deleted and not reproduced later. Have I got it right?

connectedblue commented 7 years ago

Yes, @KentonWhite , that's what I was thinking. The r.reader() enforces the convention on a per file basis.

Some appropriate messaging during data load would help guide the user if the convention is not followed, and the man page would outline the convention.

I can knock up a prototype to test the concept further and see if it makes sense.

KentonWhite / ProjectTemplate

use of .R files in the data directory #165