`fortedata` package goals

bpbond commented 5 years ago

Hello @ashiklom @atkinsjeff @cmgough

It was an interesting idea to provide the students with a one-stop, easy-to-use FoRTE R data package. Everyone seemed to have questions or difficulties with the Google Drive download step, which I can understand, and a package might make this much easier.

I think it would be good to build slowly into this, i.e. do the easiest things first, test, expand, test. A proposal:

First priority: things that everyone needs and will never change:

Plot/treatment assignments
Subplot orientation
Lon/lat of plots

Second priority: data that many people need and will change only slowly:

Tree inventory data
Local climate data (from Ameriflux tower)?
...?

Third priority: student-specific data, e.g.:

Kayla's raw Licor data and collar/subplot assignments
Lisa's leaf physiology data
Max's raw diameter data

Thoughts? Thanks.

ashiklom commented 5 years ago

Sounds great! Are the "first priority" data already available somewhere? If so, I can take some time to work on this this week.

Some more ideas for "Second priority":

Functions to download climate reanalysis products for UMBS (for comparison/cross-calibration against local climate data)
Categorical trait data (e.g. from USDA Plants, TRY) for UMBS species. (Maybe also subsets of public quantitative trait data from TRY? Could be useful for comparison)
Functions (and/or downloads) of remote sensing data at UMBS -- definitely Landsat and MODIS, possibly older AVIRIS flights, others?

cmgough commented 5 years ago

Fantastic! Jeff likely has the first priority data at his fingertips, since he's been playing with the stem map data.

For the second priority:

Tree inventory data: it would be slick if we could apply allometries and scalers for the derivation of wood NPP since this is a key measurement of use to many.
Local climate data: could pull in flux tower data and data from the met stations that Jeff and I hope to install before the girdling treatment.
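A minimal sketch of what "applying allometries" to inventory data could look like, assuming the generic power-law form biomass = a * DBH^b. The function names are hypothetical, and the coefficients `a` and `b` must be supplied by the user; none are given in this thread. Scaling to plot area and converting to carbon are left out.

```r
# Generic power-law allometry: biomass = a * DBH^b.
# a and b are species-specific coefficients supplied by the caller (assumption).
stem_biomass <- function(dbh_cm, a, b) {
  a * dbh_cm^b
}

# Summed biomass increment per year between two censuses of the same stems.
wood_increment <- function(dbh_t1, dbh_t2, years, a, b) {
  sum(stem_biomass(dbh_t2, a, b) - stem_biomass(dbh_t1, a, b)) / years
}

# Placeholder inputs for illustration only (not FoRTE data, not real coefficients).
example_increment <- wood_increment(
  dbh_t1 = c(10, 20), dbh_t2 = c(11, 21), years = 1, a = 1, b = 2
)
```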

bpbond commented 5 years ago

> this week

I would like to be careful here. If we build this, and the students start to use it, that's great...but then any significant subsequent change will break their scripts. So it's worth taking a bit of time to think about design options, and try it privately (i.e. among us), before releasing it more broadly.

The simplest but perhaps most fundamental thing to start with might be this table:

Replicate	Plot	Longitude	Latitude	Treatment	Area_m2

This information is all (i) already known and (ii) will never change, right? @atkinsjeff ?

We also want a consistent naming scheme, e.g. this might be fd_plots. Thoughts?
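As a rough illustration of the proposed table, here is a zero-row template with the column names listed above. The column types and both function names are assumptions for this sketch, not the package's actual code.

```r
# Zero-row template for the proposed plots table.
# Column names come from the thread; the column types are assumptions.
fd_plots_template <- function() {
  data.frame(
    Replicate = character(),
    Plot      = integer(),
    Longitude = numeric(),
    Latitude  = numeric(),
    Treatment = character(),
    Area_m2   = numeric(),
    stringsAsFactors = FALSE
  )
}

# Hypothetical helper: does a candidate table carry exactly these columns,
# in this order?
has_plot_columns <- function(x) {
  identical(names(x), names(fd_plots_template()))
}
```

A check like `has_plot_columns()` could run in the package tests, so that a renamed column is caught before it breaks anyone's scripts.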

The advantage of a data package is it should be dead simple to use, but we'll still want a wiki page describing how to install, load, and use it.

ashiklom commented 5 years ago

> it's worth taking a bit of time to think about design options

Good point -- I probably jumped the gun a bit. Though I also think the sooner we have something to play with, the sooner we can get a sense of what works.

> a wiki page

I agree that this should be documented, but I think it would be good if this lived inside the package itself (i.e. not as a GitHub wiki), partially in the .Rd files and partially as RMarkdown vignettes. That documentation could then be rendered as a website using pkgdown with minimal effort. A really cool idea would be to add vignettes representing real exploratory analyses using the data, especially if many of them were prepared by other FoRTE participants.

ashiklom commented 5 years ago

> design options

Although I wouldn't necessarily recommend SQL as a storage backend for this (plain text is conceptually much simpler and easier to version-control), I do think creating an SQL-inspired "database schema" is a useful way to map out the data and their interrelationships and constraints.

bpbond commented 5 years ago

Re documentation–ah, that's a much better idea, agreed.

> database schema

Agreed! Plain text much simpler and perfectly adequate for our purposes, as far as I can see, but a clear schema on how things relate would be useful.
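One way to get schema-like guarantees from plain-text tables is to check key constraints in R after loading them. The sketch below uses two toy tables with placeholder values; the helper names `is_unique_key()` and `has_parent()` are hypothetical.

```r
# Toy stand-ins for two plain-text tables (values are placeholders).
plots <- data.frame(
  Replicate = c("A", "A"),
  Plot      = c(1L, 2L),
  stringsAsFactors = FALSE
)
subplots <- data.frame(
  Replicate = c("A", "A", "A", "A"),
  Plot      = c(1L, 1L, 2L, 2L),
  Subplot   = c("E", "W", "E", "W"),
  stringsAsFactors = FALSE
)

# "Primary key": the key columns identify each row uniquely.
is_unique_key <- function(x, key) {
  !any(duplicated(x[, key, drop = FALSE]))
}

# "Foreign key": every key combination in the child exists in the parent.
has_parent <- function(child, parent, key) {
  child_id  <- do.call(paste, c(child[, key, drop = FALSE], sep = "|"))
  parent_id <- do.call(paste, c(parent[, key, drop = FALSE], sep = "|"))
  all(child_id %in% parent_id)
}

plots_ok    <- is_unique_key(plots, c("Replicate", "Plot"))
subplots_ok <- is_unique_key(subplots, c("Replicate", "Plot", "Subplot"))
linked_ok   <- has_parent(subplots, plots, c("Replicate", "Plot"))
```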

atkinsjeff commented 5 years ago

To answer @bpbond: yes, all those data and metadata exist, as does the matching with the inventory data that @cmgough mentioned, and no, they will never change. Going with the original plan we talked about earlier, I would suggest Replicate (i.e. A, B, C, D), Plot (i.e. 1, 2, 3, 4), Subplot (i.e. E, W), Treatment (i.e. 00, 45, 65, 85), Lat, Long, Area (though area is constant, I can see the utility).

Also, Subplot could be changed to the bottom-up, top-down classification (i.e. B and T) and we can drop the subplot thing if desired, but saying treatment twice is confusing. Thoughts?

[attached image: plot_diagram]

atkinsjeff commented 5 years ago

Looking back at older files we designated the "treatments" as "disturbance_severity" and "disturbance_approach"
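A sketch of the resulting flat layout, assuming the replicate, plot, and subplot labels suggested above and the two column names from the older files. The actual treatment assignments are not given in this thread, so those columns are left as `NA`; the object name `subplot_skeleton` is hypothetical.

```r
# Every replicate x plot x subplot combination named in the thread.
subplot_skeleton <- expand.grid(
  Subplot   = c("E", "W"),
  Plot      = 1:4,
  Replicate = c("A", "B", "C", "D"),
  stringsAsFactors = FALSE
)[, c("Replicate", "Plot", "Subplot")]

# Assignments are not given in the thread, so they stay NA here.
subplot_skeleton$disturbance_severity <- NA_character_
subplot_skeleton$disturbance_approach <- NA_character_

n_rows <- nrow(subplot_skeleton)  # 4 replicates * 4 plots * 2 subplots
```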

atkinsjeff commented 5 years ago

One thing as I am putting this together: NEON keeps a master list of all of their "plots", which includes the TOWER plots (40 x 40 m), the DISTRIBUTED plots (40 x 40 m), PHENOLOGY plots (which are like 100 x 20 m or something), VEG SUB PLOTS (2 x 2 m), etc. They have a "plot_type" column that designates the type, along with an "area" column as @bpbond showed earlier. Do you want me to make the metadata in this fashion? This would include all 9 of the NSPs for each subplot, making it (32 * 9) [NSPs] + 32 [subplots] + 16 [plots] rows long. Sound good? Or would you rather have separate files for each? My preference is for one master and scripts that parse it, but that's a backend question for you, @ashiklom and @bpbond.
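To make the "one master plus scripts that parse it" option concrete, here is a minimal sketch using the counts above. The `plot_type` labels ("plot", "subplot", "nsp") are hypothetical, and areas are left as `NA` because they are not stated here.

```r
n_plots    <- 16L  # counts taken from the comment above
n_subplots <- 32L
n_nsp_each <- 9L   # NSPs per subplot

master <- data.frame(
  plot_type = c(rep("plot", n_plots),
                rep("subplot", n_subplots),
                rep("nsp", n_subplots * n_nsp_each)),
  area_m2   = NA_real_,  # areas are not stated in the thread
  stringsAsFactors = FALSE
)

n_master <- nrow(master)  # 16 + 32 + 288 rows

# One master file can still be parsed into per-type tables.
by_type <- split(master, master$plot_type)
```

The same `split()` call would yield the separate tables if the one-file-per-type design is preferred, so the two options mostly differ in what is stored on disk.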

bpbond commented 5 years ago

Interesting questions. Personally I would prefer to keep separate things separate (see @ashiklom's comment above about schemas). So this would involve one plots table (mapping disturbance level to replicate/plot combination), and one subplots table (mapping disturbance approach to plot/subplot combination).

Just committed (in c7e1fa2e909361cc2bf7c03d5c063ec86d3956b6) an example of this, along with a fd_plots() function to return it.

We could then have a fd_plots_subplots() function that would return a unified table like this (but ~perhaps~ using base R only)

library(dplyr)  # provides %>% and left_join()
fd_plots() %>%
  left_join(fd_subplots(), by = c("Replicate", "Plot"))

ashiklom commented 5 years ago

> using base R only

base::merge!

merge(fd_plots(), fd_subplots(), by = c("Replicate", "Plot"))
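One caveat worth noting: `merge()` defaults to an inner join and sorts the result by the key columns, whereas `left_join()` keeps every row of the left table. Passing `all.x = TRUE` gives left-join behavior. A small sketch with toy tables (placeholder values, not the real assignments):

```r
# Toy tables; values are placeholders, not the real FoRTE assignments.
plots <- data.frame(Replicate = c("A", "A"), Plot = c(1L, 2L),
                    stringsAsFactors = FALSE)
subplots <- data.frame(Replicate = c("A", "A"), Plot = c(1L, 1L),
                       Subplot = c("E", "W"), stringsAsFactors = FALSE)

# Default merge() is an inner join: plot A2 has no subplot rows and is dropped.
inner <- merge(plots, subplots, by = c("Replicate", "Plot"))

# all.x = TRUE keeps every row of the left table, like dplyr::left_join().
left <- merge(plots, subplots, by = c("Replicate", "Plot"), all.x = TRUE)
```

If every plot is guaranteed to have subplot rows, the two calls return the same rows, so the difference only matters for incomplete tables.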

bpbond commented 5 years ago

@atkinsjeff Just in terms of learning–at this point you can clone the repo to your computer, open the RStudio project file, select Build->Install and Restart, and the fortedata package will be loaded.

Now type fd_plots() to see the dummy plot table, and ?fd_plots for its help page. The code for this function is in R/ and the data are in inst/extdata/.
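For orientation, a function like this usually does little more than read a CSV that ships with the package. The sketch below is an assumption about the general pattern, not the actual body of `fd_plots()`; `read_fd_csv()` is a hypothetical name, and a temporary file with placeholder content stands in for the packaged data.

```r
# Hypothetical reader; the real fd_plots() body is not shown in this thread.
read_fd_csv <- function(path) {
  stopifnot(file.exists(path))
  read.csv(path, stringsAsFactors = FALSE)
}

# Files under inst/ are copied to the package's top level on install, so inside
# an installed package the path would typically come from:
#   system.file("extdata", "fd_plots.csv", package = "fortedata", mustWork = TRUE)
# Here a temporary file with placeholder content stands in for it.
tmp <- tempfile(fileext = ".csv")
writeLines(c("Replicate,Plot", "A,1"), tmp)
demo <- read_fd_csv(tmp)
```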

atkinsjeff commented 5 years ago

Where do we want this .csv to go?

bpbond commented 5 years ago

The .csv file lives in inst/extdata–you can see the sample one I put here: https://github.com/FoRTExperiment/fortedata/blob/master/inst/extdata/fd_plots.csv
