Ok, I should probably reiterate that some of the code in the serverWorkflow document was not intended to actually be used on the server. I was loading RData files from subdirectories just so I could show the structure of the datasets in that document.

But on the server, all of the input datasets will be loaded from csv files that are generated by the SQL scripts, and then saved as RData files. Specifically, these input datasets are: `temperatureData`, `covariateData`, and `climateData`. So given those three datasets, we should be able to use the scripts to generate all the other datasets (e.g. `masterData`, `tempDataSync`, etc.). (Let me know if I missed any other input datasets, but I think those three are the only ones.)
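For concreteness, a minimal sketch of what that loading step might look like on the server (the csv file names and paths here are placeholders, not the actual names produced by the SQL scripts):

```r
# Hypothetical input-loading step: file names/paths are placeholders,
# not the actual output names of the SQL scripts
temperatureData <- read.csv("temperatureData.csv", stringsAsFactors = FALSE)
covariateData   <- read.csv("covariateData.csv", stringsAsFactors = FALSE)
climateData     <- read.csv("climateData.csv", stringsAsFactors = FALSE)

# Save each input dataset as an RData file for the downstream scripts to load
save(temperatureData, file = "temperatureData.RData")
save(covariateData, file = "covariateData.RData")
save(climateData, file = "climateData.RData")
```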
But in order to get this system running, it would be useful to have a set of input datasets that we could use for testing the workflow. So if you could put three RData files for `temperatureData`, `covariateData`, and `climateData` on Dropbox, then we can use those for running the scripts until we have the database all set up. Once the database is set up, those input datasets will be pulled directly from the database using the SQL scripts. I don't think we need to share the other intermediary or final datasets unless we have problems running the scripts to generate them from the input datasets.
Does that make sense?
This makes sense, but I guess I am unclear on where the intermediate datasets go. They get saved as `*.RData` files while bash is running the series of R scripts, but I'm not sure where they get saved and whether it's just temporary in memory. My impression is that they get saved on the server just like they would get saved to the hard drive locally, and then get overwritten the next time the model is run. If that's true, it might be worth thinking about whether any outputs should be kept longer.
Somewhat related, do you need to know exactly what columns are in each file? For example, if you have the structure of `tempDataSync`, do you also need the specifics of `tempDataSyncS`, since it will be very similar with just a few different columns?
I will see if I have or can generate those three RData files. They are all from Kyle's code, and I've just accessed the info through the `readStreamTempData` function in the past.
Ignore the last comment; I found the explanation of the process in `serverWorkflow.Rmd`.
Yeah, just to clear this up. The intermediate datasets are saved as `*.RData` files in the current simulation directory. Each time the model is run, all of the input, intermediate, and final datasets are saved as RData files in a folder named by the current timestamp (e.g. `20141110_1710/`), so we aren't overwriting anything. We'll err on the side of caution and save everything for now. If disk space becomes a problem, we can start deleting the input and intermediate RData files after the model run is complete. Also note that this folder can store the diagnostic/debug output pdfs that we talked about before.
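As a rough sketch, the timestamped run directory could be created along these lines (the naming follows the `20141110_1710/` example above; the variable names are just illustrative):

```r
# Name the run directory by the current timestamp, e.g. 20141110_1710/
runDir <- format(Sys.time(), "%Y%m%d_%H%M")
dir.create(runDir, showWarnings = FALSE)

# Save input, intermediate, and final datasets inside the run directory
# so successive model runs never overwrite each other
save(temperatureData, file = file.path(runDir, "temperatureData.RData"))
save(masterData, file = file.path(runDir, "masterData.RData"))

# Diagnostic/debug pdfs can go in the same folder, e.g.:
# pdf(file.path(runDir, "diagnostics.pdf")); plot(...); dev.off()
```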
And right now, I don't need to know what the columns are. Once we start looking into saving the output in the database I will, but for now we'll just save them as RData files.
Right now the `serverWorkflow.Rmd` file looks for data in subdirectories to show the structure. However, I haven't moved all the `.RData` files to this new repo because I didn't want it to get bloated and unwieldy. I could move the data over, or I could just get it to you via Dropbox and then the repo could reference a local directory that is ignored by git. The files shouldn't be too big, except maybe the `mcmc-list.RData` output from JAGS. Should I add the data or reference a local data subdirectory?
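If we go with the local-directory option, a minimal sketch of how the scripts could reference it (assuming a `localData/` folder listed in `.gitignore`; the folder name is just an example):

```r
# Assumes a local folder (here "localData/") listed in .gitignore,
# so the large RData files stay out of the repo
dataDir <- "localData"
load(file.path(dataDir, "temperatureData.RData"))
load(file.path(dataDir, "mcmc-list.RData"))  # the large JAGS output
```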