taltstidl opened this issue 5 years ago
I suggest that the single file with all the eruptions in the database be made available in the package. That file, currently about 16 MB, is too large for CRAN. CRAN policy states: "As a general rule, neither data nor documentation should exceed 5MB."
We could provide a function to download and create the full dataset. This would also allow the user to update the data, ensuring they have the latest observations. We do not want this download to happen every time the package is loaded, so we would have to work out a way to store the data locally (the user may not have write access to the directory where the package is installed, so we cannot assume we can store it in the package directory).
Another approach would be to create a smaller package, which can be submitted to CRAN, and a separate data package with the full dataset, which can be hosted on another (non-CRAN) site. The functions in the small package would include code to make sure the full data package is available. This approach, using the `drat` package, is described in the paper "Hosting Data Packages via drat: A Case Study with Hurricane Exposure Data".
We could also provide a function, using the GeyserTimes API, to access individual geyser data for a range of dates. However, this would likely put more request load on the server.
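As a rough illustration of what such an API helper might look like, here is a sketch that only builds the request URL. The endpoint path and query parameters below are assumptions for illustration, not the real GeyserTimes API; the actual API documentation would need to be consulted first.

```r
# Hypothetical sketch only: the endpoint path and query parameter names are
# assumptions, not the real GeyserTimes API.
gt_api_url <- function(geyser, start, end,
                       base = "https://www.geysertimes.org/api") {
  sprintf("%s/eruptions?geyser=%s&start=%s&end=%s",
          base, utils::URLencode(geyser, reserved = TRUE), start, end)
}

gt_api_url("Old Faithful", "2019-01-01", "2019-01-31")
```

The actual download could then be done with `download.file()` or an HTTP client, ideally with caching so the server is not hit repeatedly for the same range.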
A data set with the names and locations (both lat, long and a text description) of all the geysers in the database should be included.
I have no particular analysis in mind for the geyser data. I hope the members of the GeyserTimes team who know more about geysers will give suggestions, and we would make sure the data in the package is appropriate for those analyses.
A vignette for the package that illustrates some simple visualizations, data manipulation, and analyses should be included. Analysis of inter-arrival times for eruptions of a geyser with a long history would be one simple analysis. Cross-correlation of eruption times at some of the geysers is another analysis that could illustrate working with the data and visualizing it.
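To make those two vignette ideas concrete, here is a minimal base-R sketch. Since the package's loading functions don't exist yet, simulated eruption times stand in for the real data; everything else (inter-arrival times via `diff()`, cross-correlation via `stats::ccf()`) is standard base R.

```r
set.seed(42)

# Simulated eruption times (in hours) for two hypothetical geysers;
# real data would come from the package's loading function instead.
times_a <- cumsum(rexp(500, rate = 1 / 1.5))  # ~90-minute mean interval
times_b <- cumsum(rexp(500, rate = 1 / 2.0))

# Inter-arrival times are just differences between consecutive eruptions.
intervals_a <- diff(times_a)
hist(intervals_a, main = "Inter-arrival times", xlab = "Hours")

# Cross-correlation: bin eruptions into daily counts, then use stats::ccf().
days <- seq(0, 24 * 30, by = 24)  # 30 days of 24-hour bins
counts_a <- hist(times_a[times_a < max(days)], breaks = days, plot = FALSE)$counts
counts_b <- hist(times_b[times_b < max(days)], breaks = days, plot = FALSE)$counts
ccf(counts_a, counts_b, main = "Daily eruption counts: A vs B")
```

With real data the binning step would likely use the eruption timestamps from the database instead of simulated hours.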
I like what @spkaluzny put forth for the data storage. I think there are a few other data points that may be worth building out access in R. I will describe all of them below.
As to analysis, once we start processing the data some ideas may come to the team. If we start building out something for geyser analysis it might make sense to have two different packages.
This all looks great. Just as a reminder, I won't have much time to work on this until April.
Sorry for the long silence; I've been a bit busy lately.
I'm also guessing the data storage approach suggested by @spkaluzny is the way to go. There are just a few open questions left in my mind:

Should we use the tidyverse packages (for example, the `read_tsv` function)? Note: see https://www.tidyverse.org/ for more information on the tidyverse packages.

Tentatively, the data function might look like this: `geysertimes::load_data(geyser)`, which takes an optional parameter `geyser` and returns a tibble with all eruptions in the database.
An example script using this might look like the following (just a quick sketch, so don't expect it to run; also, I'm not exactly an R expert, so excuse any mistakes):
```r
library(geysertimes)
library(dplyr)  # for filter() and arrange()

eruptions <- geysertimes::load_data("Old Faithful")
electronic <- filter(eruptions, E == 1)        # keep electronic readings only
electronic <- arrange(electronic, desc(time))  # most recent first
hist(electronic$interval)                      # distribution of intervals
```
This is completely hypothetical usage for now, as the `electronic` data frame currently does not contain the field `interval`, which would need to be calculated manually. This is needed quite often, so a suitable function should probably be distributed as part of the package.
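Such an interval helper might look like the sketch below. The function name and the assumption of a POSIXct `time` column with one row per eruption are mine, not settled package design; `dplyr` is used to match the style of the example above.

```r
library(dplyr)

# Hypothetical helper (name and column assumptions are illustrative): given a
# data frame of eruptions with a POSIXct `time` column, add an `interval`
# column holding the minutes elapsed since the previous eruption.
add_intervals <- function(eruptions) {
  eruptions %>%
    arrange(time) %>%
    mutate(interval = as.numeric(difftime(time, lag(time), units = "mins")))
}

# Quick check on made-up timestamps 0, 90, and 185 minutes apart:
demo <- data.frame(
  time = as.POSIXct("2019-06-01 00:00:00", tz = "UTC") + c(0, 90, 185) * 60
)
add_intervals(demo)$interval  # NA for the first eruption, then 90 and 95
```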
What are your thoughts? Anyone who has the time is more than welcome to proceed with the implementation 😃. @codemasta14 No worries, as opposed to most of your university work, we don't have any deadlines here so you are free to jump in whenever you can 😉.
Thinking more about how to download the data for the package, since the current full data, compressed, is 16 MB, which far exceeds CRAN's 5 MB limit.
I don't think hosting a separate data package on another site using `drat`, as I previously suggested, would work. Someone would need to create a new version of the package every time we wanted to make new data available, and we would need a site that mimics a package repository structure where we could regularly publish new versions of the package.

I think having functions to download the data and optionally save it in an appropriate location is the way to go. @TR4Android listed some of the details we would have to work out for this.
- Download to `tempfile()` by default, but suggest that the user supply a location so that the data is retained beyond the current session. I have recently discovered the `rappdirs` package, which suggests sensible default locations for various types of files (data, configuration, logging, etc.) on all R platforms.
- The `get_data` function would download the data, and `load_data` would make previously downloaded data available for the current R session. The `load_data` function would inform the user if no data had been previously downloaded.
- Use a `tibble` as the data object. A `tibble` is created if we use the `readr::read_tsv` function to read the downloaded data.
- The LAGOSNE (Interface to the Lake Multi-Scaled Geospatial and Temporal Database) package downloads data on U.S. lakes. Many of the ideas described above were developed after looking at how LAGOSNE handles its data.
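A minimal sketch of this two-function design might look like the following. The archive URL and file name are placeholders, and the function names are illustrative rather than a final API; `rappdirs` and `readr` are assumed to be installed, as discussed above.

```r
# Sketch only: archive URL/file name are placeholders, function names are
# illustrative, and rappdirs/readr are assumed installed.
gt_default_dir <- function() {
  rappdirs::user_data_dir("geysertimes")  # OS-appropriate per-user data dir
}

get_data <- function(dest_dir = gt_default_dir(),
                     url = "https://geysertimes.org/archive/eruptions.tsv") {
  dir.create(dest_dir, recursive = TRUE, showWarnings = FALSE)
  tsv <- file.path(dest_dir, "eruptions.tsv")
  utils::download.file(url, tsv, mode = "wb")
  # Store a binary copy so later sessions can load it quickly.
  eruptions <- readr::read_tsv(tsv)
  saveRDS(eruptions, file.path(dest_dir, "eruptions.rds"))
  invisible(eruptions)
}

load_data <- function(dest_dir = gt_default_dir()) {
  rds <- file.path(dest_dir, "eruptions.rds")
  if (!file.exists(rds)) {
    stop("No downloaded data found; run get_data() first.")
  }
  readRDS(rds)
}
```

The key property is that `load_data()` never touches the network: it either finds a previously saved copy or tells the user to run `get_data()` first.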
I agree with your analysis @spkaluzny: a `drat` package is not necessary in our case and would just complicate matters. Let me reply to some of your comments.

- Regarding the `get_data` and `load_data` functions: I think a single function should suffice here, and it would also be less error-prone. Thoughts?
- A `tibble` should work well here, so let's go with this :+1:
- A `geyser` parameter would allow users finer control over the downloaded data, but that's seldom needed and can wait until the need arises (if it ever does).

Feel free to start with the implementation whenever you're ready. I think the download/load process should be good to go. Further down the road we should think about filtering functions (geyser, electronic, primary, etc.) and other helpers.
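The single-function alternative proposed here could be sketched as a download-on-first-use cache. All names, the URL, and the file layout below are illustrative assumptions, not a committed design:

```r
# Sketch of a single-function design: download on first use (or on demand via
# refresh = TRUE), reuse the cached copy afterwards. URL and file names are
# placeholders; readr is assumed installed.
gt_data <- function(dir = tempdir(),
                    url = "https://geysertimes.org/archive/eruptions.tsv",
                    refresh = FALSE) {
  rds <- file.path(dir, "eruptions.rds")
  if (refresh || !file.exists(rds)) {
    tsv <- file.path(dir, "eruptions.tsv")
    utils::download.file(url, tsv, mode = "wb")
    saveRDS(readr::read_tsv(tsv), rds)
  }
  readRDS(rds)
}
```

The trade-off versus the two-function design is that a single entry point is simpler for users, but it mixes a slow, network-dependent operation with a fast local load, which can surprise users on first call.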
Alright everyone, I'm free from school for a couple of months and will have time to work on this. I'm going to dive into what you've said previously and get to work on developing things. But I do want to ask: has any implementation already been done?
Also, as far as tracking work as we go along, would you like to use a project board to keep track of tasks? And would you prefer me to fork the repository and make pull requests, or to just give me push permission and have me clone the repository? I'm the junior here, so I want to do things the way you want and not step on any toes, but I'm excited and ready to dive in.
I have started on the package and had a similar question about posting to / updating the repository. I think we want to post to geysertimes, not geysertimes-r-package.
@codemasta14 @spkaluzny Good to hear that work has started on this project! The suggested way of contributing is to create a new branch on this repository, do the work there (e.g. add the function for loading the database), then open a pull request against the `master` branch so that we can all discuss and iterate on it. Then, once we are all on the same page regarding functionality and implementation, we'll merge it. We'll only take a moderating role here, so feel free to organize yourselves. You all have write access to this repository, so you can clone and push.
tl;dr proposed Development Pipeline (for v1 of this R package):

1. Create a new branch for your work (e.g. `data-load`)
2. Open a pull request against the `master` branch
3. Discuss, iterate, and merge into the `master` branch

If you deviate from this, that's okay. As long as the end result is an organized repository with readable and functional code, we're fine with it. This is just meant to help you get started 😉
I have created a branch, `data-load`, that consists of an initial full package with functions and help files for getting the data from https://geysertimes.org/archive/ and storing an R binary object of the data (a tibble) for subsequent use.
Some notes about the package:
- The package is named `geysertimes`, all lower case. That seems to be the preferred package naming scheme these days.
- I used a `gt_` prefix for the functions, since without the prefix the function names would be quite common (`path`, `version`, etc.).
- There are two functions: one for downloading the data, `gt_get_data` (typically only done once), and one for loading that downloaded data in subsequent R sessions, `gt_load_data`.
- The default download location is `tempdir()`, to meet the CRAN requirement that no files be created by default other than under `tempdir()`. `gt_get_data` does suggest that the user use the value returned by the `gt_path` function as the download location. The `gt_path` function sets an appropriate location based on the user's OS.
- The data is versioned by date, `yyyy-mm-dd`.
- The `DESCRIPTION` file currently lists me as author and maintainer. We need to decide who to add and in what role.

Great. What can I do to add onto this and help?
It would be good for the team to try out the package. Feedback on the design and its usage is welcome.
Hey everyone, I'm sorry I've been out of touch for the last couple of weeks. I've been in the process of moving across the country for my summer internship. I'll still be working on this, but will have a little less time to do so until August, when I finish up here.
@codemasta14 No worries here, this package is done when it's done. We're not on any schedule here, so there's absolutely no rush. Have fun with your internship!
There are two main sources for obtaining the GeyserTimes data in a machine-readable format:
The choice of source largely depends on whether you want to do a more detailed study on a larger chunk of data or whether you want to analyze the most recent behavior, but for a shorter time frame. Please keep in mind that GeyserTimes is running on donated resources and bombarding the server with requests should be avoided at all costs.
Before starting with the development itself, here are a few questions that we'd like to ask in order to improve the tooling that GeyserTimes provides as well as help guide the package design:
Thanks again for your interest! We're already looking forward to the day this package gets published on CRAN for everyone to use.
Side note: At my university it's exam time and thus my availability will be very limited over the next two weeks as I'm busy preparing for exams. I will still check this thread from time to time and do my best to answer any questions you might have.