example datasets

NPJuncal commented 10 months ago

There are at least two example datasets:

- bluecarbon_data: df with example cores
- core_comp: df with example field measurements to estimate compaction

Tasks:
- homogenize both dfs so that they share the same core IDs and core_comp can be used to estimate the compression of the cores in bluecarbon_data
- document the example datasets
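One way the homogenization task could be approached is sketched below in base R. The data frames are toy stand-ins, and the column names (`core`, `mind`), the ID formats, and the helper `normalise_id()` are assumptions made for illustration, not the structure of the real datasets.

```r
# Toy stand-ins for the two example datasets; column names and values are
# hypothetical and only illustrate the ID-matching problem.
bluecarbon_data <- data.frame(
  core = c("core_01", "core_01", "core_02", "core_03"),
  mind = c(0, 5, 0, 0),
  maxd = c(5, 10, 5, 5)
)
core_comp <- data.frame(
  core = c("Core 1", "Core 2", "Core 4"),
  compression = c(12, 8, 15)
)

# Bring both ID columns to one format, e.g. "Core 1" -> "core_01".
# normalise_id() is a hypothetical helper, not part of any package.
normalise_id <- function(x) {
  sprintf("core_%02d", as.integer(gsub("\\D", "", x)))
}
bluecarbon_data$core <- normalise_id(bluecarbon_data$core)
core_comp$core <- normalise_id(core_comp$core)

# IDs present in one data frame but missing from the other.
missing_in_comp <- setdiff(unique(bluecarbon_data$core), core_comp$core)
missing_in_data <- setdiff(core_comp$core, unique(bluecarbon_data$core))

# Left join: keep every sample row and attach the field measurements.
merged <- merge(bluecarbon_data, core_comp, by = "core", all.x = TRUE)
```

Checking `missing_in_comp` and `missing_in_data` before merging makes it visible which cores would end up without compaction measurements.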

NPJuncal commented 9 months ago

Hi @Pakillo and @Julenasti,

I want to start documenting the example datasets, but I have never done it before.

According to the R Packages (2e) book, I have to create a new script in the R folder. Do I open a new script just like that, or is there a roxygen tab I have to click to create an R script linked to the dataset? How does R know that the script refers to that dataset?

Could either of you make an example with one of the datasets? Just create the script; there is no need to write the documentation.

Julenasti commented 9 months ago

Hi Nerea, I have never documented a dataset, but from what I have been reading it seems simple and very similar to documenting a function. I assume you have read this (https://r-pkgs.org/data.html); I just want to check that we are consulting the same information. I think we need section 7.1, Exported data. Also consider "7.1.1 Preserve the origin story of package data", which is always nice to have for future improvements. To document it, I would use the template they use, no? Or adapt one of their full scripts: https://github.com/tidyverse/tidyr/blob/main/R/data.R

#' World Health Organization TB data
#'
#' A subset of data from the World Health Organization Global Tuberculosis
#' Report ...
#'
#' @format ## `who`
#' A data frame with 7,240 rows and 60 columns:
#' \describe{
#'   \item{country}{Country name}
#'   \item{iso2, iso3}{2 & 3 letter ISO country codes}
#'   \item{year}{Year}
#'   ...
#' }
#' @source <https://www.who.int/teams/global-tuberculosis-programme/data>
"who"

To answer your question: they say that you document the name of the dataset (written as a quoted string, like "who" above) and save that script in R/.

There are two roxygen tags that are especially important for documenting datasets:

@format gives an overview of the dataset. For data frames, you should include a definition list that describes each variable. It’s usually a good idea to describe variables’ units here.
@source provides details of where you got the data, often a URL.

Never @export a data set.
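Writing the `@format` definition list by hand gets tedious for wide data frames. The sketch below generates a skeleton with one `\item` per column that can then be filled in manually. `roxygen_skeleton()` is a hypothetical helper written for this illustration, not a function from roxygen2 or usethis, and the toy data frame is made up.

```r
# Build a roxygen skeleton for a data frame: a title placeholder, @format
# with one \item per column, a @source placeholder, and the quoted dataset
# name that roxygen attaches the documentation to.
roxygen_skeleton <- function(df, name) {
  col_classes <- vapply(df, function(col) class(col)[1], character(1))
  items <- sprintf("#'   \\item{%s}{TODO: describe, with units (%s)}",
                   names(df), col_classes)
  c(
    "#' TODO: descriptive title",
    "#'",
    sprintf("#' @format ## `%s`", name),
    sprintf("#' A data frame with %d rows and %d columns:", nrow(df), ncol(df)),
    "#' \\describe{",
    items,
    "#' }",
    "#' @source TODO: where the data come from",
    sprintf("\"%s\"", name)
  )
}

# Toy data frame; the real datasets have their own columns.
toy <- data.frame(core = c("a", "b"), compression = c(12, 8))
skeleton <- roxygen_skeleton(toy, "core_comp")
writeLines(skeleton)
```

The printed lines can be pasted into the script in R/ and the TODO entries replaced with real descriptions.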

Pakillo commented 9 months ago

Hi!

Yes, just put an .R script in the R folder (i.e. together with the functions) documenting each dataset using Roxygen. Here you have an example, with a similar structure to the one shown by Julen above.

You could have one R script within the R folder documenting all datasets, or one script per dataset, as you prefer. If in doubt, I think I'd go with the former option, i.e. one script documenting all datasets, which could be called datasets.R or similar.

Then save each dataset (as .rda?) under the corresponding name within the data folder (you could use usethis::use_data).

I agree with Julen (and Hadley) that it'd be good to show clearly the origin of the datasets and any modifications you may have applied. If you are using the datasets exactly as they're available somewhere (with a DOI), you could just use the @source Roxygen tag as in the above example. If you are modifying the data somehow, then I'd save an R script that downloads and modifies the data within the data-raw folder.
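A minimal sketch of the saving step, assuming a toy data frame. Inside the package project, `usethis::use_data()` writes the .rda file into data/; the base R calls below do the equivalent against a temporary directory so the example can run anywhere.

```r
# Toy data frame standing in for one example dataset (values are made up).
core_comp <- data.frame(core = c("core_01", "core_02"), compression = c(12, 8))

# Inside the package project one would typically run:
#   usethis::use_data(core_comp, overwrite = TRUE)
# which writes data/core_comp.rda. Base R version, pointed at a temporary
# directory instead of the package's data folder:
data_dir <- file.path(tempdir(), "data")
dir.create(data_dir, showWarnings = FALSE)
rda_path <- file.path(data_dir, "core_comp.rda")
save(core_comp, file = rda_path)

# load() restores the object under the name it was saved with, which is why
# the object name, the file name and the documented name should all match.
restored <- new.env()
loaded_names <- load(rda_path, envir = restored)
```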
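A data-raw script along these lines might look like the sketch below. The file name, the raw column names and the modifications are all hypothetical; a small made-up table stands in for the download step so the sketch runs on its own. `usethis::use_data_raw()` can create the data-raw folder and an empty script.

```r
# data-raw/bluecarbon_data.R (hypothetical file name)
# Records how the example dataset was derived from its original source.

# 1. Obtain the original data. A real script would download or read it, e.g.
#    raw <- read.csv("<URL or path of the published dataset>")
#    Here a made-up table stands in for it.
raw <- data.frame(
  core_id = c("A1", "A1", "B2"),
  Species = c("Posidonia oceanica", "Posidonia oceanica", NA),
  max_depth_cm = c(5, 10, 5),
  stringsAsFactors = FALSE
)

# 2. Apply every modification in code, so each change stays traceable.
bluecarbon_data <- raw
names(bluecarbon_data) <- c("core", "species", "maxd")  # hypothetical renaming
# Hypothetical recoding of missing species, only to illustrate a modification:
bluecarbon_data$species[is.na(bluecarbon_data$species)] <- "unvegetated"

# 3. Save the result into data/ (run from within the package project):
# usethis::use_data(bluecarbon_data, overwrite = TRUE)
```

Keeping the script under version control means the modifications can be re-run or revised whenever the source data change.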

P.S. Apologies I've been a bit out of touch lately with too much stuff going on. I'll try to get a couple of days soon to focus on this pkg

NPJuncal commented 8 months ago

I have documented the datasets. Could someone check them?

Julenasti commented 8 months ago

Hi Nerea, great job! Here are some comments:

Can you give a more descriptive title?
Then add @description to explain what it contains. Here is an example: https://github.com/tidyverse/tidyr/blob/c6c126a61f67a10b5ab9ce6bb1d9dbbb7a380bbd/R/data.R#L3 Knowing very little about the topic, I'd appreciate a bit more detail about which blue carbon data you are referring to. Seagrass, salt marsh and mangrove?
I think it's always species, both in singular and plural (not specie): https://github.com/EcologyR/BlueCarbon/blob/31c26323b3bffb64051be3622d2e58665911e61a/R/exampledata.R#L13
Can you explain a bit more what compression means? https://github.com/EcologyR/BlueCarbon/blob/31c26323b3bffb64051be3622d2e58665911e61a/R/exampledata.R#L14
with the? same in maxd https://github.com/EcologyR/BlueCarbon/blob/31c26323b3bffb64051be3622d2e58665911e61a/R/exampledata.R#L15C18-L15C63
Not sure what age means. Sampling year or years since sampling? https://github.com/EcologyR/BlueCarbon/blob/31c26323b3bffb64051be3622d2e58665911e61a/R/exampledata.R#L20
It no longer represents any existing data set? https://github.com/EcologyR/BlueCarbon/blob/31c26323b3bffb64051be3622d2e58665911e61a/R/exampledata.R#L23
Also, it may be good to indicate why you modified the data. To cover different cases that may occur when using the package?

The title and description suggestion also applies to the second example.

into the soil? The same in external_distance. I don't understand this one very well, but that's probably due to my lack of knowledge: https://github.com/EcologyR/BlueCarbon/blob/31c26323b3bffb64051be3622d2e58665911e61a/R/exampledata.R#L40
Can you give more descriptive names to the datasets, considering what they contain? And I guess it's enough to put the dataset name at the end and just the title at the beginning, like here: https://github.com/tidyverse/tidyr/blob/c6c126a61f67a10b5ab9ce6bb1d9dbbb7a380bbd/R/data.R#L31

EcologyR / BlueCarbon

example datasets #42