DistanceDevelopment / spatial-workshops

Distance sampling workshop content
http://distancesampling.org/workshops/

Stage-by-stage data files #36

Closed dill closed 7 years ago

dill commented 8 years ago

As we talked about: could you send the gdb files as they will be at the start of each R practical session? At the moment the practicals are written "against" the final Analysis.gdb file, so I need to check that the data I use will be available at the right time.

jjrob commented 8 years ago

Go to https://duke.box.com/DSMWorkshopStaging

I have posted two files so far. The Day1 file represents what we have following the day 1 morning computer lab session. Day 2 is following day 2 morning session. Later on I will post days 3 and 4.

The files are zips of the "project directory" that should be used as the R working directory. Inside is the Analysis.gdb as well as directories of covariates, and other stuff supporting part of the GIS analysis. On days 1 and 2, most of these other directories are empty because we don't start on environmental covariates until day 3.

As part of getting started on day 1, the students will install a version of these zips in whatever place the IT people give us for persistent storage. The zips here are not the final version we will use for the workshop. I still have to annotate some Arc workflow that is inside Analysis.gdb (you cannot see it from R) and other minor things like that. So I want to make sure there is no confusion: the staging files I am sending here are for testing only. We will have to test again with the final version, which will probably not be ready until Monday morning (but possibly sooner).

dill commented 8 years ago

I'm done with 1 and 2 now, as you can tell from my previous issue closures. All good so far.

:fireworks: :fireworks: :fireworks:

jjrob commented 8 years ago

I posted 3 and 4 to duke.box.com.

3 is what you will use to run exercise 3-advanced-dsms. The difference between the 2 and 3 zips is that the covariate columns were added and populated in Segments_Centroids in Analysis.gdb. I did not include the original covariate rasters (in the Covariates directory) because they take 800 MB and you do not need them for testing.

4 is what you will use to run exercise 4-prediction. The difference between the 3 and 4 zips is that the Covariates_for_EEZ and Covariates_for_Study_Area directories are populated in 4. Nothing in Analysis.gdb changed. (These rasters do not take 800 MB because they are clipped down to the study areas and reprojected to 10 km scale, making them much smaller.)

Please check these carefully to verify they produce the model results you expect. I checked the Segments_Centroids and verified that the covariate values are exactly the same as what you used to build your exercises. But I did not check the rasters byte-for-byte. (I highly doubt they have changed, though, since the values sampled by Segments_Centroids were identical.)
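A check like the one described above can be sketched in R. This is a minimal sketch, assuming the Segments_Centroids attribute tables from two versions of Analysis.gdb have already been read into data frames; the function name `compare_covariates` and the column names in the toy data are hypothetical placeholders, not the real covariate names.

```r
# Sketch only: `old` and `new` stand for the Segments_Centroids attribute
# tables from two versions of Analysis.gdb, already read into data frames.
# The column names in the toy example below are hypothetical placeholders.
compare_covariates <- function(old, new, cols, tol = 1e-8) {
  stopifnot(nrow(old) == nrow(new))
  vapply(cols, function(col) {
    # all.equal() returns TRUE or a character description of the difference,
    # so isTRUE() turns it into a plain logical
    isTRUE(all.equal(old[[col]], new[[col]], tolerance = tol))
  }, logical(1))
}

# Toy data standing in for the real tables
old <- data.frame(depth = c(10.5, 200, 3500), sst = c(12.1, 14.3, 18.9))
new <- old
new$sst[2] <- 14.4  # simulate one changed value

compare_covariates(old, new, c("depth", "sst"))
```

In the toy example the `depth` column is reported as unchanged and `sst` as changed. This only compares the sampled values in the table; it says nothing about the rasters themselves.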

Let me know how it goes. I will leave this issue open until I have prepared the final zip files that we will bring to the workshop. There will be two of them. The first will be "empty" with only the workflow, survey CSV files, and pre-downloaded data necessary to run the entire thing. Assuming nothing goes haywire, the students will go through the workflows in here using Arc and build out the Analysis.gdb and rasters.

The second final zip will be a "completed" version that will be our backup in case something goes haywire (e.g. the network goes down, or a student has problems with some part and can't keep up). It will have the Analysis.gdb and rasters all done as if the workshop was completed. If everything goes to hell, the students should be able to execute all of the R code against this version without needing to do anything in Arc at all.

jjrob commented 8 years ago

Assigning to you to make sure it shows up on your radar. Assign back to me when you're done testing 3 and 4.

dill commented 8 years ago

Thanks for these!

For 3, copying over the fitted detection functions (df-models.RData) and the process-geodata.Rmd files, then running process-geodata.Rmd, yields the correct files to run 3-advanced-dsms.Rmd. This works well and all the results I expected to see are there.

For 4, need to copy over:

Having done that, running 4-prediction.Rmd goes fine. 5-variance.Rmd has highlighted a bug that is fixed on github but not on CRAN (in dsm.var.prop) so I guess my fun today is submitting a new version of dsm to CRAN...

jjrob commented 8 years ago

Will that change be propagated through CRAN in time for the workshop? Or will we need to install dsm from github instead?

On a related note, maybe this should be a different issue: it is possible that the Nicholas School IT folks or someone else (not sure who) will have installed some packages already. In the beginning of the workshop, we should do whatever we deem necessary to bring the machines all to a consistent state (e.g. update.packages). For example, my understanding is that Simon recently rolled something into mgcv relating to robustness of confidence interval estimation, or something else major that had an effect on confidence interval estimation. Anyway, if you have been building your models with the latest mgcv but the lab machines are on an older version, their results could be different. Ideally, they should get the exact same results as you. (I don't know whether mgcv's optimizers are that deterministic, but they probably are good enough, unless we are presenting them with a wacky situation.)
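One way to check whether a machine is in a consistent state is to compare installed package versions against a list of minimums. This is a minimal sketch using only base R; the helper name `check_versions` is hypothetical and the version numbers are placeholders, not the versions actually used to build the exercises.

```r
# Hypothetical helper: for each named package, report whether it is
# installed at or above the given minimum version.
check_versions <- function(required) {
  vapply(names(required), function(pkg) {
    # system.file() returns "" when the package is not installed
    nzchar(system.file(package = pkg)) &&
      packageVersion(pkg) >= required[[pkg]]
  }, logical(1))
}

# Placeholder minimum versions (assumptions, not the real requirements)
required <- c(mgcv = "1.8-0", dsm = "2.2.0")
check_versions(required)
```

Any package reported as FALSE would need installing or updating, for example with update.packages(), before the practicals start.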

dill commented 8 years ago

Propagation through CRAN usually takes 24-48 hours IIRC, so it should be okay (I think propagation to the RStudio servers, which are the default for RStudio, is fast) -- so provided it's accepted today, the rest of the process will be automated over the weekend. Participants who installed the packages on their own machines this week will need to run update.packages() but we can recommend they all do that to begin with anyway.

dill commented 8 years ago

Okay, actually there are a number of CRAN-based roadblocks at the moment (not least the fact that there is currently no R-devel installer available for Mac), so let's also teach participants that installing from github is easy and a much better way to get an updated version of the package.

Is that okay with you @jjrob?

jjrob commented 8 years ago

That is ok with me. It may be a bit excessively nerdy for the not-so-nerdy in the room, but the nerds will probably enjoy the process.

jjrob commented 8 years ago

I forgot to say that it might not actually be possible. I think on Windows you have to have RTools installed in order to install from github. I don't know if the lab machines will have RTools, and if they don't, I can't remember if you have to be an admin to install RTools. (I don't think so.) Or whether we want to endure the complexity of that.

Anyway, we need to be sure to investigate a github install on Windows. (I haven't done it in a few months so I forget.)
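A quick way to find out whether a given machine can build packages from source is sketched below. This is a sketch under assumptions: it uses pkgbuild::has_build_tools() when the pkgbuild package happens to be installed and otherwise falls back to a crude check for `make` on the PATH. The helper name `has_toolchain` is hypothetical, and the fallback can give a false negative if the tools are installed but not on the PATH.

```r
# Hypothetical helper: TRUE if a toolchain for building R packages from
# source appears to be available on this machine.
has_toolchain <- function() {
  if (requireNamespace("pkgbuild", quietly = TRUE)) {
    # pkgbuild looks for Rtools on Windows and a working compiler elsewhere
    isTRUE(pkgbuild::has_build_tools(debug = FALSE))
  } else {
    # Crude fallback: is a `make` executable on the PATH?
    unname(nzchar(Sys.which("make")))
  }
}

has_toolchain()
```

Running this on one of the lab machines ahead of time would tell us whether a github install is even an option there.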

jjrob commented 8 years ago

There is also the possibility of just patching the problem file, on the fly.

dill commented 8 years ago

oh crap. Okay, well let me think about that...

dill commented 8 years ago

Okay, another option is to just not talk about dsm.var.prop in slides/9-variance.* or exercises/5-variance.Rmd and save this for the advanced topics lecture? Given this is legitimately an advanced topic, I think this might be a better solution. Thoughts?

jjrob commented 8 years ago

Can we talk a bit more about the bug? I think the difference between dsm.var.prop and dsm.var.gam is pretty important, especially because it is not explained in the literature very well (that I know of) in terms that ecologists with modest statistical experience will understand. E.g. Williams et al. (2011) kind of hand waves about it, while the appendix in your recent overview paper contains scary equations. I think it is good to have something in between, and to let people play with it. The current 5-variance material does that pretty well.

I'm on skype.

jjrob commented 8 years ago

Ok, I just re-read Williams et al. (2011) and must retract what I said about hand waving. It is the other category: too hard for non-statisticians to understand. As soon as you start talking about things like, e.g., design matrices and Hessians, you have lost people.

dill commented 8 years ago

Solution:

  1. Make binary using win-builder and R CMD build on Mac for Mac/Linux.
  2. Download via github release
  3. Install from local zip in RStudio

or install via devtools::install_github("DistanceDevelopment/dsm")

(BUT the code does work if you fit the model during the variance exercises, this is an R environment/workspace issue :sob:.)
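The two install routes above can be sketched in R as follows. This is a minimal sketch: the wrapper function names are hypothetical, and `path` stands for wherever the file downloaded from the github release was saved. Nothing is installed until one of the functions is called.

```r
# Route 1: install a pre-built package file that was downloaded locally.
# `path` is a placeholder for the downloaded file's location.
install_dsm_from_file <- function(path) {
  stopifnot(file.exists(path))
  # repos = NULL tells install.packages() to treat `path` as a local file
  install.packages(path, repos = NULL)
}

# Route 2: install straight from GitHub (requires the devtools package;
# a build toolchain may also be needed on Windows).
install_dsm_from_github <- function() {
  devtools::install_github("DistanceDevelopment/dsm")
}
```

In RStudio, route 1 corresponds to choosing a package archive file in the Install Packages dialog.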

dill commented 8 years ago

(Justification for using this approach is that it's actually more realistic for users than either leaving out the content or "monkey patching" dsm -- the package will be updated, and the updates take a long time to reach CRAN.)

dill commented 8 years ago

Windows binary online here, need to test on fresh machine.

dill commented 8 years ago

Install from binary works on my fresh install :+1: