Hugovdberg commented 7 years ago

First of all, sorry for the rather largish PR.

Background

While testing PR #178 (to be able to ignore files) an issue surfaced that sometimes data was loaded from cache even though the source was ignored. This happens because the cache loading and data loading are two completely separate stages in load.project. Also, the older issues #137 / #140 point out that converting to data.table doesn't work properly when the actual variable name is different from the variable name suggested by the file name. Furthermore, from a maintenance standpoint load.project in its current form is rather hard to read as some of the actions are implemented within the function, while other actions are dispatched to separate functions. Also, some optimisation is possible by not looping over the .extension.dispatch.table for every file separately.

General solution

To overcome the inconsistent behaviour of ignoring datasets all available data should be gathered in a single function, in which also should be determined whether the cached state is valid, what reader should be used if necessary, etc. In load.project the .load.cache and .load.data functions should be integrated into one function that loads the data from cache if possible and otherwise uses the predetermined reader to load the data from source. The problem of readers generating different variable names from the suggested variable name can be overcome by determining the difference in which variables exist right before and after the reader is called. That way only the new variables are processed in further steps. These new variables can then be converted to data.table if necessary. While changing the functionality of the data loading/conversion/caching it is nice to extract this functionality into separate functions to make the logic behind the load.project function more clear.

Implementation

This PR provides the following new functionality:

Major changes:

All new function .list.data that shows a list of the files and variables that will be loaded by load.project, including meta data such as its cache state and the configured reader. It can be called before loading the project as long as it's provided the configuration. All changes were implemented using test driven development, so there are more lines of tests than new code and every aspect of the function should be tested under different circumstances. (.list.data is not exported)
All new function list.data that wraps calls to .load.config and .list.data (list.data is exported, but is useful mostly for debugging ignore patterns and determining which reader will be used)
Merged .load.cache into .load.data by using the new .list.data function. This makes the loading process much more efficient. The separate "Loading cache" stage is removed, and .load.data now shows whether the data was loaded from source or cache. Variables that exist in the cache only are loaded first so readers that output more than one variable can skip existing variables.
Merged converting to data.table and auto caching into the .load.data function for efficiency

Minor changes

Added more comments throughout these functions for clarity
Moved .prepare.data.ignore.regex function (introduced by the data ignore PR) to list.data.R as that's a more logical location
New function .is.cached that currently wraps .read.cache.info but allows a vector of variables as input. Should be expanded in a separate PR to do more checking on cache validity.
Several helper functions for load.project that more clearly show the program flow during project loading. They all have the same signature function(config, my.project.info), and all return my.project.info even though some of the functions don't actually modify my.project.info.

Hugovdberg commented 7 years ago

Oh, it looks like a monstrous list of commits, apparently GitHub didn't figure out that most of those were already present from the data ignore PR.

connectedblue commented 7 years ago

Hi @Hugovdberg, it's a large list of changes proposed. I'm having a bit of difficulty trying to understand what they are trying to achieve.

Could you summarise please what the problem is that the list.data() solves for the user? I can see that it's a list of files that will be processed, but I'm not sure what I'd use it for. Does it help in munge scripts, or is it an interactive tool?

thanks

Hugovdberg commented 7 years ago

list.data is mostly an interactive tool that helps test ignore patterns, custom readers, that kind of things. Mostly a debugging tool therefore as it lists exacly what load.project will load into memory. The .list.data function is the true workhorse that is also called by my new version of load.project, it builds the definitive list of all variables that should be loaded into memory. By building a list with both the cached and raw variables the ignore patterns are applied more consistently. Also, I noticed some performance improvement in the tests since a few of the original loops are vectorised now (for example the whole list of available readers is processed just once instead of for every variable separately).

I'm sorry it's such a long list of changes, but list.data by itself doesn't make much sense (especially when it's listing different things from what's actually loaded), while it's essential to the rewritten data loading phase. The other changes to load.project could perhaps have been moved to a separate PR but the original load.project function did so many things within a single function call that it was sometimes hard to grasp what the main components were. By moving those components to separate functions the load.project function itself became structured more clearly, while the new functions explain self-contained parts of the process.

connectedblue commented 7 years ago

I'm afraid I'm struggling to review this PR.

One the one hand, I can't visualise a problem that list.data solves for me, so I have no reference point to know if what it does is correct or not. I am generally concerned that a lot of internal functions are changing to support it. It's not a problem to change internal implementation per se, but there needs to be strong reasons driven by user requirements (not just "it's better this way").

Another issue I have is that there a lot of inconsequential edits that are getting in the way. There shouldn't be a need to add and delete blank lines/tabs. Neither should code be re-written to do exactly the same thing (e.g. removing paste0 from message calls that result in exactly the same message being output). It all just gets in the way of a review.

Related to the last point, I can't tell whether cache.R is fundamentally changing or not. That's a function I do care passionately about and would like to be satisfied I know what's going on inside it (beyond the fact that the tests still pass).

Is there a way you can break the PR into more manageable chunks for review?

Hugovdberg commented 7 years ago

The changes in whitespace are (a bit) annoying, I agree, but is the effect of the RStudio project settings. I think the latest version of RStudio started to change sourcefiles according to those settings on saving the file because I do not remember that occuring unless code was actually copy-pasted. It's not that I want to start a white space war, but we might want to consider a standard (the project's defaults are 2 spaces, I prefer 4 but use the 2, and I think @connectedblue uses 8). I'm not sure how best to resolve this issue that keeps popping up. I noticed though that the changes are visible more clearly in split view than in the default unified view.

Regarding cache.R: The only real changes are the added .is.cached function (which could/should at a later point be adapted for better cache invalidation) and the removal of .load.cache as it is no longer a separate stage in load.project. I very much respect the improvements of the caching mechanism you made, and the heart of the cache loading was simply copied into the new version of the .load.data function. I do not mean to change the file just to have my name tagged on it :-)

list.data could be removed without too much trouble, but I think the .list.data is a large improvement of the .load.cache/.load.data. It's not that .list.data/list.data could not exist without changing the loading process but that there were inconsistencies in the loading process with previously cached data loaded if ignored later. That was an issue that was caught during the testing of #178. The only way around that, before the introduction of .list.data, was to disable either auto caching or loading from cache at all. With the new functionality the list of all available variables is compiled once, ignored where appropriate, and marked whether it is cached. That's all very nice, but if the actual loading of the data does something different there is no point in adding this function. The other way around, changing load.project without .list.data makes no sense either. Therefore I changed load.project simultaneously, which as a bonus reduced some overhead as well.

Some of the minor changes to load.project could perhaps be moved to another PR, but I'm not sure it's that easy or that the separate PR will be much clearer with the moved lines. I do think the separation of load.project into functions representing the main components is better, even though it doesn't change user functionality. The code is much cleaner now and easier to maintain, and perhaps all the if statements that remain in load.project should move to the separate functions as well. I do not agree with you that changes that make the code cleaner without changing user functionality should be avoided. Consider being a new developer who wants to see what load.project does. Previously it was a long list of code with some actions dispatched to other functions while other things were happening inline. Now it is a small list of functions that is applied, with rather descriptive names that show exactly what happens in which order.

As said, the list.data is mostly a debugging tool that helps in cases like you experienced with the ignore patterns in reviewing #178. If you prefer to not export the functionality at all that's fine by me, but all tests have to be rewritten to either .list.data or to point to load.project. Changing to .list.data is not much work at all and keeps the tests to a single unit of functionality (great!), but then we are also testing internal functions (not so great..). Changing to load.project is much more involved and only tests user functions (great!), but then again it is more an integration test than a unit test (not so great..) where effects are tested indirectly without being sure if/how changes to .list.data affect load.project.

Again, I'm sorry it seems like a massive rewrite, which is made worse by the visualisation of the diffs, but I still think it's a consistent set of changes which isn't easily broken down into smaller pieces that are clearer. I would therefore prefer to keep the PR as it is unless you or @KentonWhite could point out some logical subsets of the changes that can be applied without manually rewriting everything as upstream changes were incorporated in the commits at least twice (apparently the link between the commits in a PR and the upstream master is broken when squashing commits once the PR is merged).

connectedblue commented 7 years ago

OK, I think I see what you're saying. I must admit I haven't delved into the new load.project structure in the raw file - I was trying to understand from the diff files only to build up a picture of the changes. This is where it got confusing for me. I'll go and look at the clean source files.

I'm not sure either how it can be split up. The PR was about a list.data function, but now I think you're saying that the change is really about refactoring load.project to make it more maintainable.

I don't have a particular view of the maintainability or otherwise of the current structure. However, when developing #160, we agreed that checking variables before and after data load to see what needs caching could give rise to some issues. So there is always room for improvement. Does this PR solve that?

I'll have a review of the clean source code and see if that helps shed any light.

I prefer to raise a PR in a way that states a problem and then proposes a solution. That way, we can debate whether or not it is a generic problem faced by lots of users, and if so, does the proposed solution solve all aspects of the problem in a reasonable way.

At the moment, you're are stating the problem as "poor maintainability of load.project" and proposing a solution. I'm not sold yet on this as being a problem, although I agree there are brittle issues with the way multiple variables are loaded from single data files (e.g. excel, db etc). If we could just find a way of being more precise with the problem being solved, it would make reviewing a little easier.

Hugovdberg commented 7 years ago

@connectedblue thanks for the suggestion, I tried to make the opening post more clear. I hope you'll find reviewing a little easier now.

Regarding the possible issues mentioned in #160, I'm not sure what problems might occur. The main difference between your previous edits and mine is that I wrapped the calls to .var.diff.from directly around the call to the reader so there's no need to add some variables to project.config$data and then later add the real variable names and extract the old. Also, the real variable names are passed to the data.table conversion function.

I must say the issue of the xls.reader being called after the contents were loaded from cache isn't fixed by this. Perhaps we should move the tests of variables existing in the workspace into the readers so that they test the final variable name?

connectedblue commented 7 years ago

OK, great :) Now I can see a bit more clearly.

Regarding your first sentence:

While testing PR #178 (to be able to ignore files) an issue surfaced that sometimes data was loaded from cache even though the source was ignored.

So this is saying that if Thumbs.db should be ignored in data then it should also be ignored in cache?

This is a little more subtle - In normal operation, there would be no value Thumbs in the cache because it was never loaded in the first place. If however, as a user I want to create a variable called Thumbs in a munge script and cache it to help load faster, then I should be able to do that.

In other words, data_ignore should only apply to files in the data directory. Whatever the use wants to put in the cache, should be up to them.

Hugovdberg commented 7 years ago

I agree that variables created during munching should be loaded regardless of the data_ignore settings. That is actually the reason for the cache_only flag created by list.data. Variables that were not derived from a file in data/ are always loaded, the only difference is in variables that were derived from a file. If I have three large files and I want to ignore one temporarily, but I loaded it before I don't want it to pop into memory because it was loaded from cache anyway.

I do agree there might be edge cases where files are always ignored (such as Thumbs.db) but a variable is cached during munging that matches that ignored filename. Those caches will indeed be ignored for now, as there is no way to determine that it didn't originate from that file (as far as I know). On the other hand I think those will be rarer than the other issue, but this behaviour should at least be documented.

KentonWhite commented 7 years ago

Thanks for the read of the code and discussion -- answers quite a few questions that I might have had :)

I'll also have a read of this in the next few days.

KentonWhite commented 7 years ago

Tried to merge this in today and getting many of the new tests failing. If you can, please have a look at the failing tests and let me know if they are easy fixes. I'm closing for now.

Hugovdberg commented 7 years ago

I noticed some renewed activity on the project and feared it might break some of this code. I will look into it to see what's changed.

Hugovdberg commented 7 years ago

I think I fixed the errors. After the merge .read.cache.info did not return any value, .list.data heavily relies on this function so load.project failed in each and every test. I also included a small change to give priority to the file.reader. This was mentioned in #189, to allow per file override of configuration.

I guess the new commit doesn't show until this PR is reopened, could you do that @KentonWhite?

KentonWhite / ProjectTemplate

Add list.data and incorporate in load.project #187

Background

General solution

Implementation

Major changes:

Minor changes