DrylandEcology / rSFSTEP2

R program that interfaces with the STEPWAT2 C code and runs in parallel for multiple sites, climate scenarios, disturbance regimes, and time periods

Remove redundancies in rSFSTEP2 flow #215

Closed chaukap closed 4 years ago

chaukap commented 4 years ago

Currently rSFSTEP2 has a few redundancies when dealing with STEPWAT2 output. The flow is:

1. Run STEPWAT2 to generate output.
2. Rename the output file according to species, soil, site, GCM, etc.
3. Repeat steps 1 and 2 for every species, soil, site, GCM, etc. combination.
4. Iterate across all output (CSV) files and combine them into master CSV files called "total_bmass", "total_mort", etc.
5. Combine these master CSV files into an SQLite database.
6. Delete the CSVs.

Note that this occurs for each site.

Proposal

Steps 2 - 4 are unnecessary. Instead, we should add each STEPWAT2 output to the SQLite database immediately after it is generated:

1. Run STEPWAT2 to generate output.
2. As soon as the output is generated, insert it into the SQLite database along with the species, soil, site, GCM, etc. information.
3. Delete the CSV.

This will save time and memory.
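The proposed flow could look roughly like this with DBI/RSQLite. This is only a sketch: the file names, table name, and metadata values below are placeholders, not the actual rSFSTEP2 names.

```r
library(DBI)
library(RSQLite)

# Toy stand-in for a STEPWAT2 output file (real column names differ)
write.csv(data.frame(Year = 1:3, Biomass = c(10, 12, 11)),
          "bmass_tmp.csv", row.names = FALSE)

con <- dbConnect(RSQLite::SQLite(), "Output_tmp.sqlite3")

# 1) Read the output STEPWAT2 just wrote
out <- read.csv("bmass_tmp.csv")

# 2) Tag the rows with the run's metadata, then append to the database
out$site <- "site01"    # hypothetical metadata values
out$GCM  <- "CanESM2"
dbWriteTable(con, "bmass", out, append = TRUE)

# 3) Delete the CSV now that its contents are in the database
file.remove("bmass_tmp.csv")

dbDisconnect(con)
```

Because `dbWriteTable(..., append = TRUE)` adds rows to an existing table, each subsequent run for another species/soil/GCM combination inserts into the same table rather than producing another CSV to merge later.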

chaukap commented 4 years ago

I did some research and I realized that I need to open separate connections to the database for each parallel instance. I'll address that in the next commit.
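A minimal sketch of that per-worker pattern with the `parallel` package (the function name, site names, and database file names here are hypothetical, not the real rSFSTEP2 code):

```r
library(parallel)
library(DBI)
library(RSQLite)

# A connection object cannot be shared across worker processes, so each
# worker opens (and closes) its own connection.
run_one <- function(site) {
  con <- dbConnect(RSQLite::SQLite(), paste0("Output_", site, ".sqlite3"))
  on.exit(dbDisconnect(con))
  # ... run STEPWAT2 and dbWriteTable() its output here ...
  site
}

cl <- makeCluster(2)
clusterEvalQ(cl, { library(DBI); library(RSQLite) })
res <- parLapply(cl, c("site01", "site02"), run_one)
stopCluster(cl)
```

A side benefit of giving each worker its own per-site database file is that workers never contend for SQLite's single-writer lock.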

chaukap commented 4 years ago

@kpalmqui My latest test run on this branch completed with no errors, so I believe issue #216 has been resolved. However, I think we can improve the copydata.sh script as well.

Also, I'm a little confused about the CombineOutputDatabases.R script. It isn't mentioned in the README file, and I was wondering whether you actually use it. If you do, we need to add it to the README and there are a few changes I'll need to make. If you don't, we could probably get rid of it.

kpalmqui commented 4 years ago

@chaukap wow this is awesome!

The copydata.sh script is meant to put all of the databases in a single folder so they can easily be moved off the supercomputer to a local machine via Globus or secure copy. I am OK with moving the databases instead of copying them.

I use CombineOutputDatabases.R all the time to compile the individual databases into one master database. It also lets me build a master database that may or may not have all of the tables that are in the individual databases. I then connect to and query the master database in R.
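For reference, compiling several SQLite files into one master database can be done with `ATTACH DATABASE`. This is only a sketch with made-up file names, not the actual CombineOutputDatabases.R logic:

```r
library(DBI)
library(RSQLite)

master <- dbConnect(RSQLite::SQLite(), "Master.sqlite3")

# Hypothetical per-site database file names
for (f in c("Output_site01.sqlite3", "Output_site02.sqlite3")) {
  dbExecute(master, sprintf("ATTACH DATABASE '%s' AS src", f))
  tbls <- dbGetQuery(master,
    "SELECT name FROM src.sqlite_master
     WHERE type = 'table' AND name NOT LIKE 'sqlite_%'")$name
  for (tbl in tbls) {
    # Create the table in the master only when a source database has it,
    # so the master can hold whatever subset of tables actually exists
    dbExecute(master, sprintf(
      "CREATE TABLE IF NOT EXISTS %s AS SELECT * FROM src.%s WHERE 0",
      tbl, tbl))
    dbExecute(master, sprintf(
      "INSERT INTO %s SELECT * FROM src.%s", tbl, tbl))
  }
  dbExecute(master, "DETACH DATABASE src")
}
dbDisconnect(master)
```

Driving the copy from each source's `sqlite_master` catalog is what allows the master to contain only the tables the individual databases actually produced.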

I agree that we need to add it to the README.

Would now be a good time for me to test this branch?

chaukap commented 4 years ago

@kpalmqui I changed the naming convention for files slightly, so I need to make a small change to CombineOutputDatabases.R. Once that's done it would be great for you to test!

chaukap commented 4 years ago

@kpalmqui One last question: once we compile all of the site-specific databases into a master database in CombineOutputDatabases.R, should I then delete the site-specific databases? It would cut disk usage in half, but I don't know if you need those databases.

kpalmqui commented 4 years ago

@chaukap let's not delete the individual databases. thanks!

chaukap commented 4 years ago

@kpalmqui Ready to test!

One thing to note:

I shortened the table names in the SQLite databases. It seemed redundant to prefix every table name with "total_", so I removed the prefix.