Conte-Ecology / conteStreamTemperature_web

Description and scripts for running the temperature model through the SHEDS web application
MIT License

Where to put data for workflow example structures #4

Closed · djhocking closed this issue 9 years ago

djhocking commented 9 years ago

Right now the serverWorkflow.Rmd file looks for data in subdirectories to show the structure:

load('../../dataOut/tempDataSync.RData')
str(tempDataSync)

However, I haven't moved all the .RData files to this new repo because I didn't want it to get bloated and unwieldy. I could move the data over, or I could get it to you via Dropbox and have the repo reference a local data directory that is ignored by git. The files shouldn't be too big, except maybe the mcmc-list.RData output from JAGS.

Should I add the data or reference a local data subdirectory?

load('../../dataLocal/tempDataSync.RData')
str(tempDataSync)
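If the local-directory option wins out, a one-line .gitignore entry would be enough to keep the data out of the repo (the dataLocal/ name just follows the path in the example above):

```gitignore
# keep large local model data out of version control
dataLocal/
```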
walkerjeffd commented 9 years ago

Ok, I should probably reiterate that some of the code in the serverWorkflow document was not intended to actually be used on the server. I was loading RData files from subdirectories just so I could show the structure of the datasets in that document.

But on the server, all of the input datasets will be loaded from csv files that are generated by the SQL scripts, and then saved as RData files. Specifically, these input datasets are: temperatureData, covariateData, and climateData. So given those three datasets, we should be able to use the scripts to generate all the other datasets (e.g. masterData, tempDataSync, etc.). (Let me know if I missed any other input datasets, but I think those three are the only ones)

But in order to get this system running, it would be useful to have a set of input datasets that we could use for testing the workflow. So if you could put the three RData files for temperatureData, covariateData, and climateData on Dropbox, we can use those for running the scripts until the database is set up. Once it is, those input datasets will be pulled directly from the database using the SQL scripts. I don't think we need to share the other intermediary or final datasets unless we have problems running the scripts to generate them from the input datasets.
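As a sanity check before running anything, a small shell snippet could confirm that the three input RData files are in place (the dataLocal location is an assumption based on the path in the earlier comment, not a fixed convention):

```shell
# Check for the three input datasets named above.
# DATA_DIR is a hypothetical location; adjust to wherever the
# Dropbox files get placed locally.
DATA_DIR="${DATA_DIR:-dataLocal}"
for name in temperatureData covariateData climateData; do
  if [ -f "$DATA_DIR/${name}.RData" ]; then
    echo "found   ${name}.RData"
  else
    echo "missing ${name}.RData"
  fi
done
```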

Does that make sense?

djhocking commented 9 years ago

This makes sense, but I guess I am unclear on where the intermediate datasets go. They get saved as *.RData files while bash is running the series of R scripts, but I'm not sure where they get saved, or whether they're just held temporarily in memory. My impression is that they get saved on the server just as they would on a local hard drive, and then get overwritten the next time the model is run. If that's true, it might be worth thinking about whether any outputs should be kept longer.

Somewhat related: do you need to know exactly what columns are in each file? For example, if you have the structure of tempDataSync, do you also need the specifics of tempDataSyncS, since it will be very similar with just a few different columns?

I will see if I have, or can generate, those three RData files. They are all from Kyle's code, and I've just accessed the info through the readStreamTempData function in the past.

djhocking commented 9 years ago

Ignore the last comment; I found the explanation of the process in serverWorkflow.Rmd.

walkerjeffd commented 9 years ago

Yeah, just to clear this up. The intermediate datasets are saved as *.RData files in the current simulation directory. Each time the model is run, all of the input, intermediate, and final datasets are saved as RData files in a folder named by the current timestamp (e.g. 20141110_1710/), so that we aren't overwriting anything. We'll err on the side of caution and save everything for now. If disk space becomes a problem, we can start deleting the input and intermediate RData files after the model run is complete. Also note that this folder can store the diagnostic/debug output PDFs that we talked about before.
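For what it's worth, the timestamped-folder convention could be sketched in shell like this (SIM_ROOT and the mkdir call are illustrative; the actual scripts may construct the path differently):

```shell
# Create a per-run simulation directory named by the current
# timestamp (matching the 20141110_1710/ example format) so no
# previous run's RData files or PDFs are overwritten.
SIM_ROOT="${SIM_ROOT:-./simulations}"
ts=$(date +%Y%m%d_%H%M)
run_dir="$SIM_ROOT/$ts"
mkdir -p "$run_dir"
echo "saving this run's datasets to $run_dir"
```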

And right now, I don't need to know what the columns are. Once we start looking into saving the output in the database, I will, but for now we'll just save them as RData files.