KentonWhite / ProjectTemplate

A template utility for R projects that provides a skeletal project.
http://projecttemplate.net
GNU General Public License v3.0

Query on list.data() function #198

Closed connectedblue closed 7 years ago

connectedblue commented 7 years ago

Hi @Hugovdberg

I'm playing with the new list.data() function and I'm struggling to understand it.

Could you give me some pointers? This is the output I get from one of my current projects:

                   filename                varname is_ignored is_directory is_cached cache_only   reader
1               apple_store                             FALSE         TRUE     FALSE      FALSE
6               desktop.ini                             FALSE        FALSE     FALSE      FALSE
7  google_play_store_load.R google.play.store.load      FALSE        FALSE     FALSE      FALSE r.reader
8              google_store                             FALSE         TRUE     FALSE      FALSE
20        load-salesforce.R        load.salesforce      FALSE        FALSE     FALSE      FALSE r.reader
21                README.md                             FALSE        FALSE     FALSE      FALSE

Basically, I have a couple of .R files which get some data through an API and some standard .csv files in the sub directories shown above.

My questions on the output of list.data() are:

Overall, I'm still unsure exactly what I would use list.data() for. Is it supposed to help me make decisions about how to organise large numbers of data sets (which I typically have in my daily workflow)?

Could you give some examples of how you use it in your projects?

Hugovdberg commented 7 years ago
  1. All files and folders are shown because it's partly a debugging tool. If you expected a file to be loaded but it wasn't, this output shows that no suitable reader was found for it. You are correct in assuming that those files will not be loaded, but we cannot simply drop every line without a reader, since variables that are available from the cache only have no reader associated with them either.
  2. That would be fairly easy to implement, since a recursive scan is always performed and only the files matching the recursive loading option are then shown. Perhaps it would indeed be more consistent to show them all but mark them as not to be loaded. Should that be ORed into the is_ignored variable, or would you suggest adding another variable to the data.frame?
  3. is_cached shows whether a variable is available from the cache; cache_only shows that no obviously matching file was found in data. Variables available from the cache only are loaded first, so caching is still effective for files that return multiple variables, such as Excel files. I did notice that I placed the sorting code in .load.data instead of .list.data.

All in all, I don't use list.data() on a daily basis; it just exposes the internal .list.data function so you can debug why a variable is or isn't loaded when you expected otherwise. The latter, .list.data, tightens the link between cache and source data and makes the loading process more efficient. As said before, the list.data function was created because it required just 4 lines of code to give more insight into the loading process to those interested, but as far as I'm concerned, if it clutters the external interface it might as well be removed altogether.
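
As a rough illustration of the debugging use described in this comment, the table can be filtered for files that will not be loaded. The sketch below rebuilds the output from the first comment by hand as a plain data.frame, so it runs without ProjectTemplate. The column names come from that output; the names `listing` and `not_loadable` and the filtering rule are assumptions made for illustration, not the package's own logic.

```r
# Hand-built copy of the list.data() output shown earlier in this thread,
# so the example runs without ProjectTemplate installed.
listing <- data.frame(
  filename     = c("apple_store", "desktop.ini", "google_play_store_load.R",
                   "google_store", "load-salesforce.R", "README.md"),
  varname      = c("", "", "google.play.store.load", "", "load.salesforce", ""),
  is_ignored   = FALSE,
  is_directory = c(TRUE, FALSE, FALSE, TRUE, FALSE, FALSE),
  is_cached    = FALSE,
  cache_only   = FALSE,
  reader       = c("", "", "r.reader", "", "r.reader", ""),
  stringsAsFactors = FALSE
)

# Assumed rule: a plain file with no reader that is not available from the
# cache will not be loaded. Cache-only rows also lack a reader, so they
# are explicitly kept out of this filter.
not_loadable <- subset(
  listing,
  !is_directory & !cache_only & !is_cached & reader == ""
)

print(not_loadable$filename)
# [1] "desktop.ini" "README.md"
```

In a real project the hand-built table would be replaced by the result of list.data(), called from the project directory.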

connectedblue commented 7 years ago

Ah, OK. I have to say, I think the average ProjectTemplate user would be mystified by the output of this function as it is, so I'm not sure they would use it as a debugging tool. It might be more usable if it output a human-readable narrative of what was loaded, but as it stands today, I think it would be better not to expose it as an external function.
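
To make the idea of a human-readable narrative concrete, here is a minimal sketch of what such output could look like. `describe_entry` is a hypothetical helper, not part of ProjectTemplate, and the wording of each sentence reflects an assumed reading of the columns in the table above.

```r
# Hypothetical helper: turn one row of list.data()-style output into a
# sentence. The interpretation of each column is an assumption.
describe_entry <- function(filename, varname, is_ignored, is_directory,
                           is_cached, cache_only, reader) {
  if (is_directory) {
    sprintf("'%s' is a directory and is not loaded directly.", filename)
  } else if (is_ignored) {
    sprintf("'%s' is marked as ignored.", filename)
  } else if (cache_only) {
    sprintf("Variable '%s' is only available from the cache.", varname)
  } else if (reader == "") {
    sprintf("'%s' has no suitable reader and will not be loaded.", filename)
  } else {
    sprintf("'%s' is read with %s into '%s'%s.", filename, reader, varname,
            if (is_cached) " (a cached copy exists)" else "")
  }
}

# Two rows taken from the table in the first comment.
msg_ini <- describe_entry("desktop.ini", "", FALSE, FALSE, FALSE, FALSE, "")
msg_sf  <- describe_entry("load-salesforce.R", "load.salesforce",
                          FALSE, FALSE, FALSE, FALSE, "r.reader")
cat(msg_ini, msg_sf, sep = "\n")
```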

Just to check on the logic of the internal changes to load.project(): does it still do what the average ProjectTemplate user expects? That is, skip a variable if it's already in memory, then load it from the cache if present, and finally load it from data, with everything loaded in alphabetical order.

I haven't particularly noticed anything unusual on my current projects with the new version, but just wanted to check that the basic rules haven't changed.
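
The expected rules described in this comment can be written down as a small sketch. This is only the precedence the comment describes (memory, then cache, then data, in alphabetical order), not ProjectTemplate's actual implementation; `choose_source`, `plan_load` and the variable names are hypothetical.

```r
# Hypothetical sketch of the expected precedence: skip variables already in
# memory, otherwise prefer the cache, otherwise fall back to the data folder.
choose_source <- function(varname, in_memory, in_cache, in_data) {
  if (varname %in% in_memory) {
    "skip"
  } else if (varname %in% in_cache) {
    "cache"
  } else if (varname %in% in_data) {
    "data"
  } else {
    "missing"
  }
}

# Decide the source for every variable, processed in alphabetical order.
plan_load <- function(varnames, in_memory, in_cache, in_data) {
  varnames <- sort(unique(varnames))
  vapply(varnames, choose_source, character(1),
         in_memory = in_memory, in_cache = in_cache, in_data = in_data)
}

plan <- plan_load(
  varnames  = c("sales", "accounts", "leads"),
  in_memory = "leads",
  in_cache  = c("accounts", "leads"),
  in_data   = c("accounts", "leads", "sales")
)
print(plan)
# accounts    leads    sales
#  "cache"   "skip"   "data"
```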

KentonWhite commented 7 years ago

When I reviewed the changes, I didn't see any of the basic rules changing. If we notice one, let's file an issue. The use case I can see for this is when people report bugs with the caching: it is useful for checking whether the issue is with the code or with their setup.

Closing for now.