Closed Hugovdberg closed 7 years ago
Oh, it looks like a monstrous list of commits, apparently GitHub didn't figure out that most of those were already present from the data ignore PR.
Hi @Hugovdberg, it's a large list of changes proposed. I'm having a bit of difficulty trying to understand what they are trying to achieve.
Could you summarise please what the problem is that the list.data()
solves for the user? I can see that it's a list of files that will be processed, but I'm not sure what I'd use it for. Does it help in munge scripts, or is it an interactive tool?
thanks
list.data
is mostly an interactive tool that helps test ignore patterns, custom readers, that kind of things. Mostly a debugging tool therefore as it lists exacly what load.project
will load into memory.
The .list.data
function is the true workhorse that is also called by my new version of load.project
, it builds the definitive list of all variables that should be loaded into memory. By building a list with both the cached and raw variables the ignore patterns are applied more consistently. Also, I noticed some performance improvement in the tests since a few of the original loops are vectorised now (for example the whole list of available readers is processed just once instead of for every variable separately).
I'm sorry it's such a long list of changes, but list.data
by itself doesn't make much sense (especially when it's listing different things from what's actually loaded), while it's essential to the rewritten data loading phase. The other changes to load.project
could perhaps have been moved to a separate PR but the original load.project
function did so many things within a single function call that it was sometimes hard to grasp what the main components were. By moving those components to separate functions the load.project
function itself became structured more clearly, while the new functions explain self-contained parts of the process.
I'm afraid I'm struggling to review this PR.
One the one hand, I can't visualise a problem that list.data
solves for me, so I have no reference point to know if what it does is correct or not. I am generally concerned that a lot of internal functions are changing to support it. It's not a problem to change internal implementation per se, but there needs to be strong reasons driven by user requirements (not just "it's better this way").
Another issue I have is that there a lot of inconsequential edits that are getting in the way. There shouldn't be a need to add and delete blank lines/tabs. Neither should code be re-written to do exactly the same thing (e.g. removing paste0
from message calls that result in exactly the same message being output). It all just gets in the way of a review.
Related to the last point, I can't tell whether cache.R
is fundamentally changing or not. That's a function I do care passionately about and would like to be satisfied I know what's going on inside it (beyond the fact that the tests still pass).
Is there a way you can break the PR into more manageable chunks for review?
The changes in whitespace are (a bit) annoying, I agree, but is the effect of the RStudio project settings. I think the latest version of RStudio started to change sourcefiles according to those settings on saving the file because I do not remember that occuring unless code was actually copy-pasted. It's not that I want to start a white space war, but we might want to consider a standard (the project's defaults are 2 spaces, I prefer 4 but use the 2, and I think @connectedblue uses 8). I'm not sure how best to resolve this issue that keeps popping up. I noticed though that the changes are visible more clearly in split view than in the default unified view.
Regarding cache.R
: The only real changes are the added .is.cached
function (which could/should at a later point be adapted for better cache invalidation) and the removal of .load.cache
as it is no longer a separate stage in load.project
. I very much respect the improvements of the caching mechanism you made, and the heart of the cache loading was simply copied into the new version of the .load.data
function. I do not mean to change the file just to have my name tagged on it :-)
list.data
could be removed without too much trouble, but I think the .list.data
is a large improvement of the .load.cache
/.load.data
. It's not that .list.data
/list.data
could not exist without changing the loading process but that there were inconsistencies in the loading process with previously cached data loaded if ignored later. That was an issue that was caught during the testing of #178. The only way around that, before the introduction of .list.data
, was to disable either auto caching or loading from cache at all.
With the new functionality the list of all available variables is compiled once, ignored where appropriate, and marked whether it is cached. That's all very nice, but if the actual loading of the data does something different there is no point in adding this function. The other way around, changing load.project
without .list.data
makes no sense either. Therefore I changed load.project
simultaneously, which as a bonus reduced some overhead as well.
Some of the minor changes to load.project
could perhaps be moved to another PR, but I'm not sure it's that easy or that the separate PR will be much clearer with the moved lines. I do think the separation of load.project
into functions representing the main components is better, even though it doesn't change user functionality. The code is much cleaner now and easier to maintain, and perhaps all the if statements that remain in load.project
should move to the separate functions as well.
I do not agree with you that changes that make the code cleaner without changing user functionality should be avoided. Consider being a new developer who wants to see what load.project
does. Previously it was a long list of code with some actions dispatched to other functions while other things were happening inline. Now it is a small list of functions that is applied, with rather descriptive names that show exactly what happens in which order.
As said, the list.data
is mostly a debugging tool that helps in cases like you experienced with the ignore patterns in reviewing #178. If you prefer to not export the functionality at all that's fine by me, but all tests have to be rewritten to either .list.data
or to point to load.project
. Changing to .list.data
is not much work at all and keeps the tests to a single unit of functionality (great!), but then we are also testing internal functions (not so great..). Changing to load.project
is much more involved and only tests user functions (great!), but then again it is more an integration test than a unit test (not so great..) where effects are tested indirectly without being sure if/how changes to .list.data
affect load.project
.
Again, I'm sorry it seems like a massive rewrite, which is made worse by the visualisation of the diffs, but I still think it's a consistent set of changes which isn't easily broken down into smaller pieces that are clearer. I would therefore prefer to keep the PR as it is unless you or @KentonWhite could point out some logical subsets of the changes that can be applied without manually rewriting everything as upstream changes were incorporated in the commits at least twice (apparently the link between the commits in a PR and the upstream master is broken when squashing commits once the PR is merged).
OK, I think I see what you're saying. I must admit I haven't delved into the new load.project
structure in the raw file - I was trying to understand from the diff files only to build up a picture of the changes. This is where it got confusing for me. I'll go and look at the clean source files.
I'm not sure either how it can be split up. The PR was about a list.data
function, but now I think you're saying that the change is really about refactoring load.project
to make it more maintainable.
I don't have a particular view of the maintainability or otherwise of the current structure. However, when developing #160, we agreed that checking variables before and after data load to see what needs caching could give rise to some issues. So there is always room for improvement. Does this PR solve that?
I'll have a review of the clean source code and see if that helps shed any light.
I prefer to raise a PR in a way that states a problem and then proposes a solution. That way, we can debate whether or not it is a generic problem faced by lots of users, and if so, does the proposed solution solve all aspects of the problem in a reasonable way.
At the moment, you're are stating the problem as "poor maintainability of load.project
" and proposing a solution. I'm not sold yet on this as being a problem, although I agree there are brittle issues with the way multiple variables are loaded from single data files (e.g. excel, db etc). If we could just find a way of being more precise with the problem being solved, it would make reviewing a little easier.
@connectedblue thanks for the suggestion, I tried to make the opening post more clear. I hope you'll find reviewing a little easier now.
Regarding the possible issues mentioned in #160, I'm not sure what problems might occur. The main difference between your previous edits and mine is that I wrapped the calls to .var.diff.from
directly around the call to the reader so there's no need to add some variables to project.config$data
and then later add the real variable names and extract the old. Also, the real variable names are passed to the data.table
conversion function.
I must say the issue of the xls.reader
being called after the contents were loaded from cache isn't fixed by this. Perhaps we should move the tests of variables existing in the workspace into the readers so that they test the final variable name?
OK, great :) Now I can see a bit more clearly.
Regarding your first sentence:
While testing PR #178 (to be able to ignore files) an issue surfaced that sometimes data was loaded from cache even though the source was ignored.
So this is saying that if Thumbs.db
should be ignored in data
then it should also be ignored in cache
?
This is a little more subtle - In normal operation, there would be no value Thumbs
in the cache because it was never loaded in the first place. If however, as a user I want to create a variable called Thumbs
in a munge script and cache it to help load faster, then I should be able to do that.
In other words, data_ignore
should only apply to files in the data
directory. Whatever the use wants to put in the cache, should be up to them.
I agree that variables created during munching should be loaded regardless of the data_ignore
settings. That is actually the reason for the cache_only
flag created by list.data
. Variables that were not derived from a file in data/
are always loaded, the only difference is in variables that were derived from a file.
If I have three large files and I want to ignore one temporarily, but I loaded it before I don't want it to pop into memory because it was loaded from cache anyway.
I do agree there might be edge cases where files are always ignored (such as Thumbs.db
) but a variable is cached during munging that matches that ignored filename. Those caches will indeed be ignored for now, as there is no way to determine that it didn't originate from that file (as far as I know). On the other hand I think those will be rarer than the other issue, but this behaviour should at least be documented.
Thanks for the read of the code and discussion -- answers quite a few questions that I might have had :)
I'll also have a read of this in the next few days.
Tried to merge this in today and getting many of the new tests failing. If you can, please have a look at the failing tests and let me know if they are easy fixes. I'm closing for now.
I noticed some renewed activity on the project and feared it might break some of this code. I will look into it to see what's changed.
I think I fixed the errors. After the merge .read.cache.info
did not return any value, .list.data
heavily relies on this function so load.project
failed in each and every test.
I also included a small change to give priority to the file.reader
. This was mentioned in #189, to allow per file override of configuration.
I guess the new commit doesn't show until this PR is reopened, could you do that @KentonWhite?
First of all, sorry for the rather largish PR.
Background
While testing PR #178 (to be able to ignore files) an issue surfaced that sometimes data was loaded from cache even though the source was ignored. This happens because the cache loading and data loading are two completely separate stages in
load.project
. Also, the older issues #137 / #140 point out that converting todata.table
doesn't work properly when the actual variable name is different from the variable name suggested by the file name. Furthermore, from a maintenance standpointload.project
in its current form is rather hard to read as some of the actions are implemented within the function, while other actions are dispatched to separate functions. Also, some optimisation is possible by not looping over the.extension.dispatch.table
for every file separately.General solution
To overcome the inconsistent behaviour of ignoring datasets all available data should be gathered in a single function, in which also should be determined whether the cached state is valid, what reader should be used if necessary, etc. In
load.project
the.load.cache
and.load.data
functions should be integrated into one function that loads the data from cache if possible and otherwise uses the predetermined reader to load the data from source. The problem of readers generating different variable names from the suggested variable name can be overcome by determining the difference in which variables exist right before and after the reader is called. That way only the new variables are processed in further steps. These new variables can then be converted todata.table
if necessary. While changing the functionality of the data loading/conversion/caching it is nice to extract this functionality into separate functions to make the logic behind theload.project
function more clear.Implementation
This PR provides the following new functionality:
Major changes:
.list.data
that shows a list of the files and variables that will be loaded byload.project
, including meta data such as its cache state and the configured reader. It can be called before loading the project as long as it's provided the configuration. All changes were implemented using test driven development, so there are more lines of tests than new code and every aspect of the function should be tested under different circumstances. (.list.data
is not exported)list.data
that wraps calls to.load.config
and.list.data
(list.data
is exported, but is useful mostly for debugging ignore patterns and determining which reader will be used).load.cache
into.load.data
by using the new.list.data
function. This makes the loading process much more efficient. The separate "Loading cache" stage is removed, and.load.data
now shows whether the data was loaded from source or cache. Variables that exist in the cache only are loaded first so readers that output more than one variable can skip existing variables.data.table
and auto caching into the.load.data
function for efficiencyMinor changes
.prepare.data.ignore.regex
function (introduced by the data ignore PR) tolist.data.R
as that's a more logical location.is.cached
that currently wraps.read.cache.info
but allows a vector of variables as input. Should be expanded in a separate PR to do more checking on cache validity.load.project
that more clearly show the program flow during project loading. They all have the same signaturefunction(config, my.project.info)
, and all returnmy.project.info
even though some of the functions don't actually modifymy.project.info
.