SimonGoring / neotoma_paper

Repository for the neotoma package paper.
MIT License
3 stars 2 forks source link

Text refers to a `download` object but this doesn't exist #5

Closed gavinsimpson closed 10 years ago

gavinsimpson commented 10 years ago

Throughout, the manuscript refers to an object download, which is neither created nor exists as a class (from what i can tell anyway.)

For example

It is not clear what the intention is here? Are you wishing to refer to a type of object? I.e. each of the downloaded data sets is stored in what you think of as a download object? If so, neotoma should put a class on those objects - at the moment they are just class "list".

However, if the intention was to refer to a named object that was created earlier, then there is an inconsistency in the example code used.

SimonGoring commented 10 years ago

Hi @gavinsimpson, maybe you can help me with this (probably for a later version of the package, but so that the capacity is there now). I was hoping to make 'download' a class of its own, along with dataset and site, so that they would be nested within each other. The hope would be that I could search for a site using get_site, pass any returned sites into get_dataset and then pass those into get_download.

For now I'd just want to make a download class (for example) that simply borrows all its methods from list, presumably using the call class(x) <- c('download', 'list'). The same for site and dataset. Is that the best way to do this? I think it messes up the print function a bit, but it gives me room to grow if we get our next grant :)

SimonGoring commented 10 years ago

I guess the other question is, do I stick the class definitions in another R file in the R directory?

gavinsimpson commented 10 years ago

To some extent the answer to the first comment depends on how far down the rabbit hole you want to go. The S3 class systems is really easy and cheap but you can't enforce that a download object contains dataset components etc. This might not matter as long as you (neotoma) are controlling how these objects are created and what S3 methods you write for the classes. If you want to have a lot of control over the objects and method dispatch etc then S4 is the way to go.

For this, I would suggest S3, at least from what I have seen of the package to date.

As for the second comment, you just assign the class to the object before you return it in the relevant function using as you say:

class(x) <- c('download', 'list')

You don't need to define classes beyond this for S3 as essentially the act of setting the "class" attribute is enough to "define" that class. In another sense, the definition of the class is whatever S3 methods you write for the class will accept as input and work with, without error.

For now, to make the text mesh with the package, I'd attach an S3 class in exactly the manner you indicated, inheriting from class "list" so you don't have to provide a range of methods for the basics like subsetting.

If you want to document the class, or rather the components that should be present in an object of class "download", then you can add an Rd page with name and alias class-download say. You then provide title, description, and detail sections which include info on what components are present and what they contain. You don't have to do this to use the class, so for the time being it might be prudent to just set the class and document it later once the package has settled down a bit more. To actually do this, you just write Roxygen2 data for the indicated parameters and have the only non-documentation info be NULL. For example I have something like this in my coenocliner package

##' @title A coenocline simulation package for R
##'
##' @description \pkg{coenocliner} provides a simple, easy interface for simulating species abundance (counts) or occurrence along gradients.
##'
##' @details One of the key ways quantitative ecologists attempt to understand the properties and behaviour of the methods they use or dream up is through the use of simulated data. \pkg{coenocliner} is an R package that provides a simple interface to coenocline simulation.
##'
##' @author Gavin L. Simpson
##' @docType package
##' @name coenocliner
##' @aliases coenocliner-package
##' @keywords package
##' @seealso \code{\link{coenocline}} for simulating species data, \code{\link{distributions}} for details of the error distributions tha can be used for simulations, and \code{\link{species-response}} for details on the available species response models and the parameters required to use them.
NULL

That is for a package documentation file but you would include similar information just leave out the @docType package line.

The only change that will happen to the print()-ed output if you add the class as you indicate is that it may print out the class attribute information. To suppress this you could easily add a print method that only need to be this

`print.download` <- function(x, ...) {
    class(x) <- "list"
    print(x, ...)
}

and you'd be back to the same printed output. However, I would suggest that we don't just allow objects to print in their native representation in case someone is working with big objects - you just pollute the output with printed material and make the user wait until R has finished printing if someone inadvertently print()s a big object.

You could have a print.download() that just wrote out a few text lines indicating that this was a neotoma download object consisting of n objects for example.

SimonGoring commented 10 years ago

Okay, perfect. This is what I was thinking of anyway. I think the download print would probably have a sitename, lat/long, age range and number of samples, or something like that. Basically, print out a data.frame to represent all the download data.

Thanks Gavin.

gavinsimpson commented 10 years ago

I would suggest you have that as a summary.download() function which returns an object of class "summary.download". You then write a print.summary.download() function to print that object.

If you have a modest data set downloaded you don't want to dump even the summary information to the screen when a user just types in the name of the object. For large objects, R will stop printing before you print the whole thing anyway so this is a bit redundant. Also, you don't really want to be doing much processing in a print method in general. Have a summary() method allows the user to generate that summary object (data frame) from the actual object, and then print it if they wish. You could also have the print.summary.download() method only print the first n rows and a continuation line ...., just like the new dplyr package does.

The main point is that R-wise, print() shouldn't do much computation and shouldn't return things of use. summary() methods are a good way to handle that.

Some additional thoughts. You might want to add to the "download" object list data on the date/time of accessing the data and or any other meta data that might go with the API call. You could add that info to be displayed in the print method for example.

SimonGoring commented 10 years ago

There is now a download and download.list object, each with print methods, The print method is fairly simple, it tells whether there's a single object or multiple objects, and gives the download time. This was all pushed in commit https://github.com/SimonGoring/neotoma/commit/682d648ff30224126d0fe16247919f5d45831f6e and will be extended. I'm going to do the same for datasets and sites. These objects are basically nested, so one should fit into the other and vice versa (one should be extractable from the other). Using the classes will ideally make this easier.