Ignore files in data directory

Hugovdberg commented 7 years ago

This is an implementation of the feature discussed in issue #153. It adds a function that parses the data_ignore option in global.dcf and creates a regular expression that is evaluated for all files in the data directory. Some basic information about the functionality is on the configuring page of the website, mastering.markdown explains the feature in more detail. Besides the changes in content a lot of trailing whitespace was removed from the files I edited by RStudio, sorry for the long list of deletions.

Hugovdberg commented 7 years ago

Sorry, apparently I forgot a reload. The merge of my changes to load.project with the upstream changes isn't complete. Will fix it now.

connectedblue commented 7 years ago

It's always nice to get a green tick :)

It would be relatively straightforward to add a test into test-config.R (you could write a new one based on those already there). I would suggest adding *.csv to the config$data_ignore parameter, copy in inst/example_data/example_01.csv into the data directory, run a load.project() and test that the variable example_01 doesn't exist.

For belt and braces, you could then reset config$data_ignore back to NULL, run a load.project() and test that the variable example_01 now exists.

(BTW - what does actually happen if someone leaves the data_ignore parameter blank? )

Hugovdberg commented 7 years ago

An empty parameter is considered a literal filename as it doesn't start and end with a /. Literal filenames are always matched from start to end by prepending ^ and appending $, resulting in the regular expression ^$. So a blank option will actually match only a zero-length filename directly under data/, which can never exist as far as I know. Therefore leaving the parameter blank will match none of the files under data/ which is exactly as expected. This even works when accidentally adding two commas.

I don't have any experience with testthat, but I'll see how far I can get!

connectedblue commented 7 years ago

You'll find it quite therapeutic I think - always good to know that future developments won't unwittingly break your change. Let me know if you have any questions.

I would have found this feature very useful in my last project when I had a lot of corpus data floating around data that I wanted to ignore most of the time.

I'll install your branch this evening and test it out a bit.

Hugovdberg commented 7 years ago

[offtopic]Sorry I wasn't able to push this earlier, first I was involved in the same capstone project as you were, and later the dayjob needed my full attention. I switched to the next session starting February as I just couldn't find the time to get my model to produce any sensible results..[/offtopic]

connectedblue commented 7 years ago

Yes, same here... It has been such a fabulous course that I didn't want to rush a piece of rubbish for the capstone. So I also delayed to give myself more time.

Perhaps we should collaborate on ideas to get a better model. If we take a look at each others repo, we might get some ideas ...

connectedblue commented 7 years ago

Hi @Hugovdberg

I've installed your branch but am having difficulty in getting it to work. I have a file named fix_locations.csv in the data directory and I set the following in globals.dcf:

data_ignore: /*\\.csv/

I tried various other combinations but couldn't get it to work. What's the format I should use to ignore .csv files?

Also, are the slashes around the regular expression really necessary? Couldn't each expression be treated as a regexp, so I would write Thumbs\\.db for example?

Hugovdberg commented 7 years ago

There is a note in the documentation, you shouldn't escape your backslashes in the global.dcf unless you mean a literal \. The pattern should thus be /.*\.csv/, or simply *.csv, depending on whether you want to exclude all csv files including those in subdirectories (the former), or just the csv's directly under data/ (the latter).

I made the decision to treat the strings as literals by default to make adding a single file as simple as possible. As written in the documentation, wildcards are allowed for both filenames and directory names. Regular expressions are quite advanced to my opinion and they require some more in depth knowledge to make sure you exclude exactly the files you want to exclude (as you just discovered yourself ;-) ).

Hugovdberg commented 7 years ago

Actually, during writing tests I realised *.csv will ignore all csv files, including those in subdirectories, since the wildcard also matches the path separator. I think this behaviour is actually to be expected, and should be updated in the documentation.

connectedblue commented 7 years ago

@Hugovdberg yes, it worked for me. It's nice and clean to just specify *.csv.

I realised that running it several times, the data was being loaded from the cache - i switched off cache_loading and cache_loaded_data and it all worked repeatably well. I didn't look at the test cases you just posted, but I guess you reached the same conclusion.

connectedblue commented 7 years ago

Nice set of tests @Hugovdberg :).

One more thing to keep up to date: The man page for ?project.config needs updating so that the documentation is available right there in R. You can use the same text as for the website, but the formatting is a little different (you'll see when you look at project.config.R). You'll have to rebuild before committing and the man pages get generated automatically.

There's no need to include information about the default settings in project.config because those values are picked up auto-magically from the package files.

Hugovdberg commented 7 years ago

I noticed that cached data was loaded as well while writing the tests. I "fixed" the test for now by disabling caching during the test, but I think we can do better than that by building the list of ignored files and the corresponding cachefiles from those files. The only problem I can see is in R files which potentially generate other variables than the filename implies, I'm not sure if and how these are written to cache. Then again I don't expect people to create huge datasets from an R file, and if the R file isn't an R file that can be loaded (just as the Thumbs.db can't be loaded by R) the ignore pattern will be added before it is cached. Those edge cases should be documented though.

Regarding project.config, I believe that file was created after I started developing the feature originally. I missed that while merging my changes into the most recent upstream master as there were no conflicts there. I will add the feature documentation there as well.

Hugovdberg commented 7 years ago

I'm thinking .load.cache and .load.data should be integrated. That way it is very easy to get the list of variables to be imported, check whether it's cached and if so load from cache, otherwise load from data. We should also add a function list.data that checks all files in data/, tests if they should be ignored, whether they are cached and which reader will be used for the file. It returns a data.frame that can be used directly by load.data, but which can also be used by the user to see if the correct files are ignored. If you guys agree we should put this PR on hold to refactor the loading process.

connectedblue commented 7 years ago

@Hugovdberg I think that your test is fine - disabling caching was important to see the effect of the feature during the test. In real life, you don't need to combine ignore files and cache files because if they are ignored, they won't be in the cache by definition (if someone adds the data_ignore flag after a file has already been cached, all they need to do is a clear.cache(var) ).

Your point about R files is a good one (I raised a similar issue in #165). I think there is a separate solution to that problem which doesn't interfere with this PR.

I would suggest keeping this PR as it is, and looking at any potential refactor as a separate PR - otherwise too many things will be packed into one change and it'll be hard to see the wood for the trees.

Hugovdberg commented 7 years ago

Ok, I'll write a note in the documentation about ignoring already cached files, and create a new PR for the refactored load.project.

connectedblue commented 7 years ago

Hi @Hugovdberg I think you're trying to do too many things at once in this PR. There's no need to amend other tests here, especially if they are already passing. The comment about using clear() was really for your new test.

It's probably not a good idea to update unrelated tests otherwise it might seem as though a regression test is being rewritten to fit the new feature (which I know is not the case here)

Hugovdberg commented 7 years ago

Well, the other tests did need the call to clear as the tests assumed an empty environment. Apparently the tests are not run in order, so those tests failed after I added my test. The tests didn't break because of my changes but because of previously unstated assumptions.

Also, a lot of files contain unnecessary white space, and those tests sometimes contained comments that were almost identical to the next line instead of explaining the purpose. So I figured that those could use a little updating as well. Next time I'll do it in a separate PR. (Also git diff makes the changes look far more drastically than for example meld does)

connectedblue commented 7 years ago

I think the last couple of commits have got the purpose of the PR tangled.

The tests at 5a49d20 were running just fine - there was no need to adjust the other tests. As a point of principle, PR's should not adjust previous tests without good reason (e.g. if an existing test needs to change because new functionality makes old behaviour obsolete). Assuming that existing tests aren't working correctly is a potential slippery slope ...

I'd be happier if you rolled back to 5a49d20 and re-applied just the documentation changes. Then you can see for sure that your PR has not broken anything, and has added a new passing test.

Hugovdberg commented 7 years ago

I'm not sure I understand what you mean. Don't you mean to say I should roll back to 277f952, right before 5a49d20? Because after that I only changed the documentation and made the test for the error on Thumbs.db more general as Travis tripped over the fact that the error message did not contain the same words as on my machine apparently.

connectedblue commented 7 years ago

Oh yes, sorry - I meant 277f952 - the one that had the first working set of tests from Travis. I've got a bit confused with the subsequent issues - I think you're saying that there are further test issues to resolve as well as the documentation?

I think all I'm saying is to keep the commits in this PR focused just on the data_ignore feature. It may well be that some other tests could be coded more efficiently - but as long as they work and are unrelated to data_ignore, then it's best to leave them alone. `

Hugovdberg commented 7 years ago

I reverted the changes to test-load.R, keeping all subsequent edits to other files. This produces again a long list of edits, but before the final merge into master I will squash all commits into one. Or should I do that now to make the review easier?

connectedblue commented 7 years ago

I think you can keep them as they are in the PR - @KentonWhite usually squashes them into one commit when he applies to master.

Sorry for confusing earlier - I see that I applied by comment about clear() against the wrong line. Anyway, all seems good now.

Hugovdberg commented 7 years ago

Let's hope @KentonWhite also thinks it's ok, I'm finishing up the documentation to a new list.data function which already assumes there is such a thing as ignored data :-)

KentonWhite commented 7 years ago

@Hugovdberg @connectedblue thanks for carrying on over the holidays. Sorry I've been traveling and not very active.

I like the PR as it is.

I'm waffling on the choice of defaults. On the one hand, Thumbs.db is windows specific -- does it make sense to have a default that only applies to a single operating system? What if users on other operating systems wish to have their system files included in the defaults? We start to get a long list of ignored files! Could it be cleaner if we detect the OS when the project is created and adjust the defaults accordingly? The trade off is that this becomes tricky to test.

On the other hand, including every possible system file for every possible OS in the ignore defaults does make projects more portable. We can be assured that people moving a project from Linux to Windows will have the same environment because Thumbs.db would not have been accidentally loaded. Also having all OS ignore files listed will keep the ignore list clean when people are collaborating on projects using different OS's.

As you can see there are good arguments on either side. Before merging I'd like to get your thoughts!

connectedblue commented 7 years ago

Hi @KentonWhite, good to see new back, and Happy New Year.

My thought is this is exactly the sort of use case that custom templates is designed for in #176 (closed for now just to re-think the implementation). So users on Windows who like git might have one set of needs which can be templated (and shared), whereas mac users who use Subversion might need something different. The custom templates specifically allow pieces of globals.dcf to be in the template and merge in correctly to newly created projects.

So, I would say ship with no defaults and let custom templates get designed to handle different use cases (maybe there's even a github repository of them somewhere). That way the ProjectTemplate core stays clean of implementation environments.

I realise this is a bit of a "jam tomorrow" answer as we don't have the custom templates yet, but hopefully it'll be soon.

connectedblue commented 7 years ago

Another option might be to leverage the functionality of #169 and put code in as the default value - something like data_ignore: 'local_ignore()' and then a skeleton file local_ignore() is included in the libs directory which could do a local OS dependent read, and also allow users to add their own.

If this solution appeals, it should be done in a separate PR though

Hugovdberg commented 7 years ago

I don't think ignore options should be platform dependent. When people collaborate on a project they might not all use the same platform. Initially we discussed adding a pattern to ignore filenames starting with a dot, but hidden files are ignored already by list.files/dir. So if there are more files that should be ignored we should add them to the defaults.

connectedblue commented 7 years ago

@Hugovdberg the local_ignore() approach doesn't have to return a hard coded list - it can inspect the OS type at load.project time and make some pre-canned OS-specific choices. This guarantees portability. It can also return some optional, additional user-specified ones. If the whole function is provided as a single file in the lib directory the user can delete it entirely if they want, or they can add to it.

I think it's potentially a cleaner approach, otherwise this line of globals.dcf becomes a cluttered mess of default hard coded options which users will find confusing to customise to suit their needs.

However - this doesn't have to be done now. It's an option in the future if the list of defaults gets longer and longer

Hugovdberg commented 7 years ago

I agree settling on the defaults can be done in the future, Now I see what you mean by local_ignore, calling it in load.project could be a flexible and clean solution indeed. By adding it to the default settings it can be easily removed if deemed unnecessary by the user.

connectedblue commented 7 years ago

Yes, one for the future I think.

Just to be precise, load.project wouldn't need to know about, or call local_ignore directly. If the config is coded as data_ignore:`source('lib/local_ignore.R')` then when .load.config() is executed during load.project the data_ignore value will be dynamically calculated by whatever local_ignore.R returns.

KentonWhite commented 7 years ago

So are we going to remove defaults for now and work on a local_ignore + custom templates for the future?

Hugovdberg commented 7 years ago

I've removed the defaults for now, but I'm not so sure whether a local_ignore should be added. The trouble is we can't really build a function that works for everyone, and if we let the users build the list it is essentially part of their template. So I'd say make a great custom templates functionality, and let the users ignore what they want.

KentonWhite / ProjectTemplate

Ignore files in data directory #178