Closed connectedblue closed 7 years ago
@connectedblue I'm not quite following the problem. The convention of ProjectTemplate is that only variable names that match the file name will be checked prior to processing the file in the data
directory.
In my opinion keeping the global namespace clean is the responsibility of the user. If the user wishes to create multiple global variables during the data step this is their choice. Sometimes the intended behavior is to create multiple global variables, as is the case when an excel workbook with multiple sheets is loaded.
This is why I'm not sure I'm following the issue! Perhaps I'm missing the larger picture here. In your attached example (thanks for that btw!) what did you originally have? How did your original solution create an unexpected behavior within ProjectTemplate? What was the expected behavior compared to what happened? That might help me better understand the larger context of your comment :)
Hi @KentonWhite, I agree the user has to be responsible - I just also wanted to be lazy and avoid having to wrap unwanted variables in a function :) As you say, multiple variables are also created in excel files, so that is a smaller point and perhaps a red herring.
However, the larger point is the unnecessary processing that occurs when the variable name is not the same. In my example, I originally had the variable called test_corpus
(unfortunately I didn't push that code to git). The Corpus
function can take 15 minutes to load (as can excel files sometimes) so I really only want to do it once and use the cached result in subsequent load.project()
.
What was happening is that test_corpus
was being loaded first, correctly from the cache, but then it was loaded again when the corpus.R
file was executed (because no corpus
variable exists).
So, I think the bigger picture is that users who have slow loading data in .R
files will always need to create a variable with the same name. Furthermore, if they create say two variables in the same file from different sources, then you would have to clear the cache and re-generate both even if only one has changed. In my example, if I created both test_corpus
and corpus
in the same corpus.R
, then if I wanted to regenerate the test data, I could only do so by deleting the corpus
variable from both global env and cache.
So, in a round about way, I'm saying that when users are using .R
files to generate large data sources, it really only makes sense to have one variable per file which is also the same name as the file. Given that the creation of .R
is under the control of the user (they probably haven't inherited it) it's painless to create multiple files for each dataload.
The reason I would make this a convention is to document it so that users don't have to go through the same thinking to arrive at the most efficient conclusion. We can add examples on the website to show how .R
should be used for efficient data acquisition.
The excel case is different though. Usually the user has been given a file from another source with multiple tabs and they all have to be loaded from the same file. Here it is also possible that none of the tab names have the same name as the file, so it will be unnecessarily loaded every time, even though the underlying data has already been cached. Excel loads can also take quite a long time.
What I was going to suggest for the excel case was if, after loading the file there is no global env variable with the same name as the file, then have the reader create one with a value of TRUE
. This will then get auto-cached. So, on subsequent invocations of load.project
the excel file won't be re-read. If the user want to force a re-read, then all they have to do is delete the TRUE
variable from the cache and memory.
Sorry for the long reply. Does that make more sense? I'm really keen to make sure that users run load.project()
as often as possible and not have them think it will be painful.
By the way, I don't know if you listen to the Not So Standard Deviations podcast, but @hilaryparker made a throw away comment in the last episode that she only runs load.project()
sparingly (and she's a massive ProjectTemplate
fan!).
@connectedblue Ok I see. just to make sure I understand, the proposal is to make it a convention that there can only be one global variable in a data/*.R
file and that the global variable must be the same name as the data/*.R
file. Any other global variables created will be deleted and not reproduced later. Have I got it right?
Yes, @KentonWhite , that's what I was thinking. The r.reader()
enforces the convention on a per file basis.
Some appropriate messaging during data load would help guide the user if the convention is not followed, and the man page would outline the convention.
I can knock up a prototype to test the concept further and see if it makes sense.
I've run into an issue on one of my projects which @KentonWhite highlighted when #160 was being developed to auto cache data during load.project()
This file demonstrates the issues, and how I worked around them:
https://github.com/connectedblue/capstone/blob/master/capstone/data/corpus.R
In this project, I am trying to create a test data set from the original data during my exploratory analysis (I use a custom config variable flag to toggle between the full and testing set to save rewriting the data acquisition code).
The way
.R
files are treated in thedata
directory results in the following issues:corpus_dir
,docs
anddoc
are side effects used to create the intended variablecorpus
. I wrapped all the processing into a function so that these variables did not find their way into the global environment..R
file, the data file will always be processed in thedata
directory, This duplicates processing unnecessarily when the variables are already in memory or the cache. I got around this by making the generated variable the same name as the file, and it is skipped duringload.project()
as a result.These same issues would also occur in other data loaders such as excel and sql where the datafile can generate multiple global environment variables.
I have a proposed solution for the
.R
problems above, but I want to get the community's view before submitting a PR.Essentially I propose using convention over configuration, and enforce a rule that
.R
files in thedata
directory can only produce one global variable with the same name as the file. Any other global variables would be deleted which means that side effect variables like mine above don't have to be wrapped in a function to avoid polluting the global environment.We'd have to put some migration logic in to warn the user about any legacy code that would need to be looked at.
I think it's good practice anyway to have one variable per
.R
file in thedata
directory, and enforcing it brings more benefits to users when they runload.project()
making it faster to load by relying on memory and disk caching.The change would be quite straightforward - I'm happy to do it, but I don't want to get distracted from my project so it'll be a couple of weeks before I raise a PR. In the meantime, I'd like to get feedback on the approach to this issue.