geysertimes / geysertimes-r-package

R package for accessing and analyzing the GeyserTimes database
Other
2 stars 4 forks source link

Initial package for getting and loading the data. #3

Closed spkaluzny closed 5 years ago

spkaluzny commented 5 years ago

A complete package for getting the data from the website and loading the data into an R session. See my note under the Package Design issue.

codemasta14 commented 5 years ago

@spkaluzny @TR4Android, @glennon, I've looked through all of it, and it looks good to me. It implements all of the usage I built in my push. I like the standardized naming process as well. I like having gt_ to start the functions. Didn't we discuss having gt_load_data as an internal function and gt_read_data was available to the user? Maybe we could have a wrapper function of the two that is gt_data? I can start on that after we accept this push. Let's go ahead and accept this pull request.

Along the lines of developing and package attribution, @spkaluzny has entered his author information. I can update author information for me and J. Hathaway. @TR4Android and @glennon do you want to provide your name and email so we can update the R description file?

I would like to start working on the help functions. I'll document out the functions and provide a glossary to the variables in the help file.

taltstidl commented 5 years ago

Sorry for the late reply, I just gave this a spin and this looks like an awesome starting point. The way the latest archive version is determined is a bit finicky currently, but that's the best we can currently do. As to the column names, this is something that should definitely be more standardized (@codemasta14 you can find a short glossary on our website, not all fields are contained in the archive though). Discussion on that can continue in issue #4 and I'll merge any needed code changes in a separate pull request.

I love the added documentation, as I see this being an important aspect if we want people to actually use the package 😃. It would be nice if we could keep the vignettes up to date while we develop this, function references can be added at a later point, I guess. I'll try to get a pkgdown website up and running when the package is finished.

@spkaluzny One minor nitpick: can we move everything up one folder (i.e. move the contents of the geysertimes folder to the top level directory)? I feel like this is just an unnecessary indirection here and most packages I know of on GitHub include their R folder at the top level.

Thanks for taking the time to develop and review this!

spkaluzny commented 5 years ago

@TR4Android we can move the package code in my branch up one folder but that folder should be called geysertimes, the name of the package - currently it is called geysertimes-r-package. Is this something I have to do (not that experiences working with Github).

spkaluzny commented 5 years ago

On Wed, May 8, 2019 at 12:28 PM Cody Jorgensen notifications@github.com wrote:

@spkaluzny https://github.com/spkaluzny @TR4Android https://github.com/TR4Android, @glennon https://github.com/glennon, I've looked through all of it, and it looks good to me. It implements all of the usage I built in my push. I like the standardized naming process as well. I like having gt_ to start the functions. Didn't we discuss having gt_load_data as an internal function and gt_read_data was available to the user? Maybe we could have a wrapper function of the two that is gt_data? I can start on that after we accept this push. Let's go ahead and accept this pull request.

I wrote two functions, gt_get_data which is called by the user only once and gt_load_data which the user calls in every session where she wants to analyze the data. Because of CRAN rules, the default for gt_get_data is to put the downloaded data under tempdir() but a message is printing say that using gt_path will allow the data to persist. The default location for gt_get_data is to load the data from gt_path. If we write a wrapper function, gt_data to combine the two operations, the default location for the data will have to be tempdir() and the data will not persist unless the users sets a specific location overriding the default. I think this behavior could lead to users calling gt_data with the default every time which results in the data being downloaded in every session. But I could be persuaded that this would not happen or is not an issue. -Stephen

taltstidl commented 5 years ago

@spkaluzny Two things:

spkaluzny commented 5 years ago

The archive that we submit to CRAN will have to be named geysertimes_1.0.tar.gz (for version 1.0) and it will have to unpack into the folder called geysertimes. On Github the files could be in geysertimes-r-package instead of geysertimes-r-package/geysertimes.

However, most R packages on Github are stored under a folder that is the named the same as the package itself. The function remotes::install_github can be used to install a package directly from Github. If we put the package directly in geysertimes-r-package then the install command would be: remotes::install_github("geysertimes/geysertimes-r-package") but the resulting package that gets installed would be called geysertimes, which could be confusing.

Should I delete my current pull request and update the data-load branch to put the files directly under geysertimes-r-package?

taltstidl commented 5 years ago

I'm guessing that is an issue regardless of whether the subfolder geysertimes exists or not. I don't know the details of the devtools::install_github function, but I'm guessing it will still create a folder called geysertimes-r-package for our package and then download the geysertimes subfolder within that folder (don't know too much about R package development but I can see a possible confusion for the tools here, as it might expect a R folder in the top-level directory?). So if we really wanted to prevent this we'd probably need to rename this repository here on GitHub (which I'm reluctant to do as we might want to develop packages for other programming languages in the future. Python being a prime candidate here as lots of people use it for data analysis and machine learning.). The only option that would come to mind here is to rename both the repository and the package to something like geysertimes-r. Thoughts?

You can simply add any changes to the data-load branch and this pull request will automatically update. No need to close it and then open a new pull request.

spkaluzny commented 5 years ago

I have updated my data-load branch to have the top level package directory be geysertimes-r-package instead of geysertimes-r-package/geysertimes.

With my original structure, you would have to install from Github with: remotes::install_github("geysertimes/geysertimes-r-package/geysertimes") now you should just need remotes::install_github("geysertimes/geysertimes-r-package) but the package that is installed is called geysertimes which is what we want. The R package utilities get the name of the package from the Package: entry in the DESCRIPTION file, not from the directory name or name of the tar.gz archive containing the package source.

taltstidl commented 5 years ago

@spkaluzny Nice, thanks for doing the research 🎉! That should be more aligned with the way most people download packages from GitHub and the package name is also correct 👍. I'm merging this pull request, we can iterate on the data loading process later on if need be. Thanks for your implementation and documentation!