Closed: Hugovdberg closed this 7 years ago
Sorry, apparently I forgot a reload. The merge of my changes to load.project
with the upstream changes isn't complete. Will fix it now.
It's always nice to get a green tick :)
It would be relatively straightforward to add a test into `test-config.R` (you could write a new one based on those already there). I would suggest adding `*.csv` to the `config$data_ignore` parameter, copying `inst/example_data/example_01.csv` into the `data` directory, running `load.project()` and testing that the variable `example_01` doesn't exist.
For belt and braces, you could then reset `config$data_ignore` back to `NULL`, run `load.project()` again and test that the variable `example_01` now exists.
(BTW - what actually happens if someone leaves the `data_ignore` parameter blank?)
An empty parameter is considered a literal filename, as it doesn't start and end with a `/`. Literal filenames are always matched from start to end by prepending `^` and appending `$`, resulting in the regular expression `^$`. So a blank option will actually match only a zero-length filename directly under `data/`, which can never exist as far as I know. Therefore leaving the parameter blank matches none of the files under `data/`, which is exactly as expected. This even works when someone accidentally adds two commas.
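The behaviour described above can be sketched in plain R (the `entry_to_regex` helper below is hypothetical, for illustration only; it is not the actual ProjectTemplate code):

```r
# Turn one data_ignore entry into a regular expression:
# /.../ entries are taken as regexes verbatim; anything else is a
# literal filename (with * and ? wildcards) anchored by ^ and $.
entry_to_regex <- function(entry) {
  if (grepl("^/.*/$", entry)) {
    substr(entry, 2, nchar(entry) - 1)
  } else {
    pat <- gsub("([][.^${}()|+])", "\\\\\\1", entry)  # escape regex metacharacters
    pat <- gsub("\\*", ".*", pat)                     # expand * wildcard
    pat <- gsub("\\?", ".", pat)                      # expand ? wildcard
    paste0("^", pat, "$")
  }
}

entry_to_regex("")           # a blank entry becomes "^$"
entry_to_regex("Thumbs.db")  # "^Thumbs\\.db$"
entry_to_regex("*.csv")      # "^.*\\.csv$"
```

Since `^$` can only match an empty filename, a blank entry is harmless, as described above.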
I don't have any experience with `testthat`, but I'll see how far I can get!
You'll find it quite therapeutic I think - always good to know that future developments won't unwittingly break your change. Let me know if you have any questions.
I would have found this feature very useful in my last project, when I had a lot of corpus data floating around `data` that I wanted to ignore most of the time.
I'll install your branch this evening and test it out a bit.
[offtopic]Sorry I wasn't able to push this earlier: first I was involved in the same capstone project as you were, and later the day job needed my full attention. I switched to the next session starting February, as I just couldn't find the time to get my model to produce any sensible results.[/offtopic]
Yes, same here... It has been such a fabulous course that I didn't want to rush a piece of rubbish for the capstone. So I also delayed to give myself more time.
Perhaps we should collaborate on ideas to get a better model. If we take a look at each others repo, we might get some ideas ...
Hi @Hugovdberg
I've installed your branch but am having difficulty getting it to work. I have a file named `fix_locations.csv` in the `data` directory and I set the following in `globals.dcf`:
data_ignore: /*\\.csv/
I tried various other combinations but couldn't get it to work. What's the format I should use to ignore `.csv` files?
Also, are the slashes around the regular expression really necessary? Couldn't each expression be treated as a regexp, so I would write `Thumbs\\.db` for example?
There is a note in the documentation: you shouldn't escape your backslashes in the `global.dcf` unless you mean a literal `\`. The pattern should thus be `/.*\.csv/`, or simply `*.csv`, depending on whether you want to exclude all csv files including those in subdirectories (the former), or just the csv's directly under `data/` (the latter).
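In `globals.dcf` the two unescaped forms from this discussion would thus look like (illustrative only; a real config would contain a single `data_ignore` line):

```
data_ignore: /.*\.csv/
```

or, for the wildcard form:

```
data_ignore: *.csv
```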
I made the decision to treat the strings as literals by default to make adding a single file as simple as possible. As written in the documentation, wildcards are allowed for both filenames and directory names. Regular expressions are quite advanced in my opinion, and they require some more in-depth knowledge to make sure you exclude exactly the files you want to exclude (as you just discovered yourself ;-) ).
Actually, while writing tests I realised `*.csv` will ignore all csv files, including those in subdirectories, since the wildcard also matches the path separator. I think this behaviour is actually to be expected, and the documentation should be updated accordingly.
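This is easy to check directly, since `*.csv` expands to the anchored regex `^.*\.csv$` (shown here as an illustration of the behaviour described, not the package's internal code):

```r
pattern <- "^.*\\.csv$"               # what the wildcard form "*.csv" expands to
grepl(pattern, "example.csv")         # TRUE
grepl(pattern, "subdir/example.csv")  # TRUE: .* also matches the "/" separator
grepl(pattern, "example.txt")         # FALSE
```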
@Hugovdberg yes, it worked for me. It's nice and clean to just specify `*.csv`.
I realised that when running it several times, the data was being loaded from the cache - I switched off `cache_loading` and `cache_loaded_data` and it all worked repeatably well. I didn't look at the test cases you just posted, but I guess you reached the same conclusion.
Nice set of tests @Hugovdberg :).
One more thing to keep up to date: the man page for `?project.config` needs updating so that the documentation is available right there in R. You can use the same text as for the website, but the formatting is a little different (you'll see when you look at `project.config.R`). You'll have to rebuild before committing, and the man pages will be generated automatically.
There's no need to include information about the default settings in `project.config` because those values are picked up auto-magically from the package files.
I noticed that cached data was loaded as well while writing the tests. I "fixed" the test for now by disabling caching during the test, but I think we can do better than that by building the list of ignored files and then deriving the corresponding cache files from it. The only problem I can see is with R files, which can generate variables other than what the filename implies; I'm not sure if and how those are written to cache. Then again, I don't expect people to create huge datasets from an R file, and if a file isn't one R can load (just as Thumbs.db can't be loaded by R), the ignore pattern will be added before it is cached. Those edge cases should be documented, though.
Regarding `project.config`, I believe that file was created after I started developing the feature originally. I missed it while merging my changes into the most recent upstream master, as there were no conflicts there. I will add the feature documentation there as well.
I'm thinking `.load.cache` and `.load.data` should be integrated. That way it is very easy to get the list of variables to be imported, check whether each is cached, and if so load it from `cache`, otherwise load it from `data`. We should also add a function `list.data` that checks all files in `data/`, tests whether they should be ignored, whether they are cached, and which reader will be used for each file. It would return a data.frame that can be used directly by `load.data`, but which can also be used by the user to verify that the correct files are ignored.
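A rough base-R sketch of what such a `list.data` could return (hypothetical implementation; the real function would look up readers and cache files the way ProjectTemplate does internally):

```r
# Hypothetical sketch of list.data: one row per file under data/, flagging
# whether it is ignored, whether a cache file exists, and which reader
# would be used (guessed naively from the extension here).
list.data <- function(data_dir = "data", cache_dir = "cache",
                      ignore_regex = "^$") {
  files <- list.files(data_dir, recursive = TRUE)
  vars  <- tools::file_path_sans_ext(basename(files))
  data.frame(
    filename   = files,
    varname    = vars,
    is_ignored = grepl(ignore_regex, files),
    is_cached  = file.exists(file.path(cache_dir, paste0(vars, ".RData"))),
    reader     = paste0(tools::file_ext(files), ".reader"),
    stringsAsFactors = FALSE
  )
}
```

`load.data` could then loop over the non-ignored rows, preferring the cache whenever `is_cached` is TRUE.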
If you guys agree we should put this PR on hold to refactor the loading process.
@Hugovdberg I think that your test is fine - disabling caching was important to see the effect of the feature during the test. In real life, you don't need to combine ignore files and cache files, because if they are ignored, they won't be in the cache by definition (if someone adds the `data_ignore` flag after a file has already been cached, all they need to do is a `clear.cache(var)`).
Your point about R files is a good one (I raised a similar issue in #165). I think there is a separate solution to that problem which doesn't interfere with this PR.
I would suggest keeping this PR as it is, and looking at any potential refactor as a separate PR - otherwise too many things will be packed into one change and it'll be hard to see the wood for the trees.
Ok, I'll write a note in the documentation about ignoring already-cached files, and create a new PR for the refactored `load.project`.
Hi @Hugovdberg I think you're trying to do too many things at once in this PR. There's no need to amend other tests here, especially if they are already passing. The comment about using `clear()` was really for your new test.
It's probably not a good idea to update unrelated tests otherwise it might seem as though a regression test is being rewritten to fit the new feature (which I know is not the case here)
Well, the other tests did need the call to `clear`, as those tests assumed an empty environment. Apparently the tests are not run in order, so they failed after I added my test. The tests didn't break because of my changes but because of previously unstated assumptions.
Also, a lot of files contain unnecessary whitespace, and those tests sometimes contained comments that were almost identical to the next line instead of explaining its purpose. So I figured those could use a little updating as well. Next time I'll do it in a separate PR. (Also, git diff makes the changes look far more drastic than, for example, meld does.)
I think the last couple of commits have got the purpose of the PR tangled.
The tests at 5a49d20 were running just fine - there was no need to adjust the other tests. As a point of principle, PRs should not adjust previous tests without good reason (e.g. if an existing test needs to change because new functionality makes old behaviour obsolete). Assuming that existing tests aren't working correctly is a potential slippery slope ...
I'd be happier if you rolled back to 5a49d20 and re-applied just the documentation changes. Then you can see for sure that your PR has not broken anything and has added a new passing test.
I'm not sure I understand what you mean. Don't you mean to say I should roll back to 277f952, right before 5a49d20? Because after that I only changed the documentation and made the test for the error on `Thumbs.db` more general, as Travis apparently tripped over the fact that the error message did not contain the same words as on my machine.
Oh yes, sorry - I meant 277f952, the one that had the first working set of tests from Travis. I've got a bit confused with the subsequent issues - I think you're saying that there are further test issues to resolve as well as the documentation?
I think all I'm saying is to keep the commits in this PR focused just on the `data_ignore` feature. It may well be that some other tests could be coded more efficiently - but as long as they work and are unrelated to `data_ignore`, it's best to leave them alone.
I reverted the changes to `test-load.R`, keeping all subsequent edits to other files. This again produces a long list of edits, but before the final merge into master I will squash all commits into one. Or should I do that now to make the review easier?
I think you can keep them as they are in the PR - @KentonWhite usually squashes them into one commit when he applies to master.
Sorry for the confusion earlier - I see that I applied my comment about `clear()` against the wrong line. Anyway, all seems good now.
Let's hope @KentonWhite also thinks it's ok. I'm finishing up the documentation for a new `list.data` function, which already assumes there is such a thing as ignored data :-)
@Hugovdberg @connectedblue thanks for carrying on over the holidays. Sorry I've been traveling and not very active.
I like the PR as it is.
I'm waffling on the choice of defaults. On the one hand, Thumbs.db is Windows-specific -- does it make sense to have a default that only applies to a single operating system? What if users on other operating systems wish to have their system files included in the defaults? We start to get a long list of ignored files! Could it be cleaner if we detected the OS when the project is created and adjusted the defaults accordingly? The trade-off is that this becomes tricky to test.
On the other hand, including every possible system file for every possible OS in the ignore defaults does make projects more portable. We can be assured that people moving a project from Linux to Windows will have the same environment because Thumbs.db would not have been accidentally loaded. Also having all OS ignore files listed will keep the ignore list clean when people are collaborating on projects using different OS's.
As you can see there are good arguments on either side. Before merging I'd like to get your thoughts!
Hi @KentonWhite, good to see you back, and Happy New Year.
My thought is that this is exactly the sort of use case that custom templates are designed for in #176 (closed for now just to re-think the implementation). So users on Windows who like git might have one set of needs which can be templated (and shared), whereas mac users who use Subversion might need something different. The custom templates specifically allow pieces of `globals.dcf` to be in the template and merge in correctly to newly created projects.
So, I would say ship with no defaults and let custom templates get designed to handle different use cases (maybe there's even a github repository of them somewhere). That way the `ProjectTemplate` core stays clean of implementation environments.
I realise this is a bit of a "jam tomorrow" answer as we don't have the custom templates yet, but hopefully it'll be soon.
Another option might be to leverage the functionality of #169 and put code in as the default value - something like `data_ignore: 'local_ignore()'` - and then a skeleton file for `local_ignore()` is included in the `libs` directory, which could do a local OS-dependent read and also allow users to add their own.
If this solution appeals, it should be done in a separate PR though.
I don't think ignore options should be platform-dependent. When people collaborate on a project they might not all use the same platform. Initially we discussed adding a pattern to ignore filenames starting with a dot, but hidden files are already ignored by `list.files`/`dir`. So if there are more files that should be ignored, we should add them to the defaults.
@Hugovdberg the `local_ignore()` approach doesn't have to return a hard-coded list - it can inspect the OS type at `load.project` time and make some pre-canned OS-specific choices. This guarantees portability. It can also return some optional, additional user-specified ones. If the whole function is provided as a single file in the `lib` directory, the user can delete it entirely if they want, or they can add to it.
I think it's potentially a cleaner approach; otherwise this line of `globals.dcf` becomes a cluttered mess of hard-coded default options which users will find confusing to customise to suit their needs.
However - this doesn't have to be done now. It's an option in the future if the list of defaults gets longer and longer
I agree settling on the defaults can be done in the future. Now I see what you mean by `local_ignore`: calling it in `load.project` could indeed be a flexible and clean solution. By adding it to the default settings it can be easily removed if deemed unnecessary by the user.
Yes, one for the future I think.
Just to be precise, `load.project` wouldn't need to know about, or call, `local_ignore` directly. If the config is coded as `data_ignore: source('lib/local_ignore.R')`, then when `.load.config()` is executed during `load.project`, the `data_ignore` value will be dynamically calculated by whatever `local_ignore.R` returns.
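A sketch of what that `lib/local_ignore.R` file could contain (the file name and function are hypothetical, following the naming used in this thread):

```r
# Hypothetical lib/local_ignore.R: build a data_ignore value from
# OS-specific defaults plus any user-specified extras.
local_ignore <- function(extra = character()) {
  os_defaults <- switch(.Platform$OS.type,
    windows = c("Thumbs.db", "desktop.ini"),
    unix    = c(".DS_Store"),  # covers macOS; plain Linux needs little
    character()
  )
  paste(c(os_defaults, extra), collapse = ", ")
}
```

Since `.load.config()` would evaluate this at `load.project` time, the returned string suits whatever OS the project is loaded on, and users can delete or extend the file freely.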
So are we going to remove the defaults for now and work on `local_ignore` + custom templates for the future?
I've removed the defaults for now, but I'm not so sure whether a `local_ignore` should be added. The trouble is we can't really build a function that works for everyone, and if we let the users build the list, it is essentially part of their template. So I'd say: make a great custom templates functionality, and let the users ignore what they want.
This is an implementation of the feature discussed in issue #153. It adds a function that parses the `data_ignore` option in `global.dcf` and creates a regular expression that is evaluated for all files in the `data` directory. Some basic information about the functionality is on the configuring page of the website; `mastering.markdown` explains the feature in more detail. Besides the content changes, a lot of trailing whitespace was removed by RStudio from the files I edited - sorry for the long list of deletions.