DrylandEcology / rSFSTEP2

R program that interfaces with the STEPWAT2 C code and runs in parallel for multiple sites, climate scenarios, disturbance regimes, and time periods
0 stars 1 forks source link

Error with multiple threads writing to single SQLite database #216

Closed chaukap closed 4 years ago

chaukap commented 4 years ago

The error message is

Error in { : task 2 failed - "database is locked"

This error is rare: I can get about 10 Gb of data into the SQLite database before the last thread dies, but it is pervasive. It occurs due to RSQLite not knowing how to handle multiple threads all attempting to write at the same time.

I can think of two solutions: 1) Investigate the RSQLite package more and see if there is a parameter I can set that fixes this. 2) Use some other type of lock. This isn't ideal because it means installing another library.

chaukap commented 4 years ago

After attempting a few solutions I'm thinking the best option is a mutex lock. Mutexes are a type of object that can only be "held" by one thread (process) at a time. A thread "grabs" the mutex by calling a function like lock(mutex). In this example:

1. m <- mutex()
2. for(thread in 1:4) %dopar% {
3.     db <- dbConnect("somedatabase.sqlite")
4.     lock(m)
5.     dbWriteTable("sometable", somedata)
6.     unlock(m)
7. }

All of the threads can connect to the database at the same time, but only one thread can write to the table at a time. The other threads must wait for the mutex at line 4.

Mutexes are rare in R, but there is a library called synchronicity which provides support.

chaukap commented 4 years ago

I've tested the mutex solution with 2, sites, 5 climate scenarios, and 6 threads and it seems to have resolved this issue. @kpalmqui the you get a chance could you test this branch to confirm? Thanks!

kpalmqui commented 4 years ago

@chaukap what weather database did you use? And what version of rSOILWAT2?

I just pulled the latest commit on this branch, used the "VicSites" weather database for 2 sites, 3 climate scenarios and 2 threads on rSOILWAT2 master, and got the following error:

Kyles-MacBook-Air:rSFSTEP2 kylepalmquist$ ./run_local.sh 2 Loading required package: rSW2utils Loading required package: rSW2utils Loading required package: foreach Loading required package: foreach Loading required package: iterators Loading required package: iterators Loading required package: parallel Loading required package: parallel Error in { : task 1 failed - "unused argument (na.rm = TRUE)" Calls: source ... withVisible -> eval -> eval -> %dopar% -> Execution halted Error in { : task 1 failed - "unused argument (na.rm = TRUE)" Calls: source ... withVisible -> eval -> eval -> %dopar% -> Execution halted

It is possible I have made some mistake.....

chaukap commented 4 years ago

@kpalmqui The function dbW_estimate_WGen_coefs of MarkovWeatherFileGenerator.R no longer uses the na.rm argument in newer versions of rSOILWAT2. If you remove it from the parameter list the function should work. Sorry about that! I committed the fix in feature_phenology so I didn’t commit it in this branch.

kpalmqui commented 4 years ago

@chaukap ahhhhh yes. Thanks! The code is executing now. Will report back.

kpalmqui commented 4 years ago

@chaukap hit another error:

[1] "rgroup treatment fixed1 done." [1] "GCM Current DONE" [1] "rgroup treatment fixed1 done." [1] "GCM Current DONE" [1] "rgroup treatment fixed1 done." [1] "RCP RCP45 DONE" Error in { : task 2 failed - "cannot change working directory" Calls: source ... withVisible -> eval -> eval -> %dopar% -> Execution halted [1] "rgroup treatment fixed1 done." [1] "RCP RCP45 DONE" Error in { : task 2 failed - "cannot change working directory" Calls: source ... withVisible -> eval -> eval -> %dopar% -> Execution halted Kyles-MacBook-Air:rSFSTEP2 kylepalmquist$

Looks like it never made it to the final GCM, which was associated with RCP8.5.

kpalmqui commented 4 years ago

@chaukap I also tested with two sets of parameters in InputData_Rgroup.csv (fixed 1, fixed 2) but just one set for InputData_Species.csv and InputData_SoilLayers.

chaukap commented 4 years ago

@kpalmqui That is strange... I'm unable to replicate this error, but I do remember that when my Ubuntu machine was running out or memory during testing I would get the error

Error in { : task 2 failed - "cannot change working directory"

Is it possible your laptop is running out of memory? This branch is more memory intensive because there are multiple threads handling daily SOILWAT2 output at the same time.

chaukap commented 4 years ago

@kpalmqui come to think of it I bet that is the problem. Looking at my system memory as it runs the wrapper it looks like each site requires about 1.4 Gb of memory. If you're running 2 sites and 5 climate scenarios concurrently that's about 14 Gb of memory.

I can see two solutions:

  1. The obvious way to reduce the memory is to no longer store daily SOILWAT2 output.
  2. Otherwise, I could add a lock around the code that reads in and stores the daily weather like this:
    lock(daily_lock)
    read in the daily CSV
    lock(database_lock)
    add the daily CSV to the SQLite database
    unlock(database_lock)
    remove the daily CSV from memory
    unlock(daily_lock)

    This would ensure that only one thread can hold daily SOILWAT2 output in memory at a time.

kpalmqui commented 4 years ago

@chaukap thanks for following up on this! Option 2 seems to be our best bet. Do you have any idea of how much this would slow down the program?

kpalmqui commented 4 years ago

@chaukap an update. I just tested on my office Desktop (32GB memory) and monitored memory use when the simulations were running. Memory was never exceeded and I still ran into the same error:

palmquist-mac:rSFSTEP2 palmquist$ ./run_local.sh 2
Loading required package: foreach
Loading required package: foreach
Loading required package: iterators
Loading required package: iterators
Loading required package: parallel
Loading required package: parallel
Impute missing `mkv_prob` values for n = 1 DOYs: 283
storing SOILWAT output aggregated across-iterations (-o flag)
storing SOILWAT output aggregated across-iterations (-o flag)
STEPWAT  init_args() executed successfully 
STEPWAT  init_args() executed successfully 
storing SOILWAT output aggregated across-iterations (-o flag)
storing SOILWAT output aggregated across-iterations (-o flag)
STEPWAT  init_args() executed successfully 
STEPWAT  init_args() executed successfully 
Number of iterations: 1
Number of years: 300
Number of iterations: 1
Number of years: 300
Number of iterations: 1
Number of years: 300
Number of iterations: 1
Number of years: 300
Done                                    
Done                                    
Done                                    
Done                                    
[1] "rgroup treatment fixed2 done."
storing SOILWAT output aggregated across-iterations (-o flag)
STEPWAT  init_args() executed successfully 
Number of iterations: 1
Number of years: 300
[1] "rgroup treatment fixed2 done."
storing SOILWAT output aggregated across-iterations (-o flag)
STEPWAT  init_args() executed successfully 
Number of iterations: 1
Number of years: 300
[1] "rgroup treatment fixed2 done."
storing SOILWAT output aggregated across-iterations (-o flag)
STEPWAT  init_args() executed successfully 
Number of iterations: 1
Number of years: 300
[1] "rgroup treatment fixed2 done."
storing SOILWAT output aggregated across-iterations (-o flag)
STEPWAT  init_args() executed successfully 
Number of iterations: 1
Number of years: 300
Done                                    
Done                                    
Done                                    
Done                                    
[1] "rgroup treatment fixed1 done."
[1] "GCM Current DONE"
[1] "rgroup treatment fixed1 done."
[1] "GCM Current DONE"
[1] "rgroup treatment fixed1 done."
[1] "RCP RCP45 DONE"
Error in { : task 2 failed - "cannot change working directory"
Calls: source ... withVisible -> eval -> eval -> %dopar% -> <Anonymous>
Execution halted
[1] "rgroup treatment fixed1 done."
[1] "RCP RCP45 DONE"
Error in { : task 2 failed - "cannot change working directory"
Calls: source ... withVisible -> eval -> eval -> %dopar% -> <Anonymous>
Execution halted
palmquist-mac:rSFSTEP2 palmquist$ 

Here are the inputs I ran with.

InputData_Rgroup.xlsx InputData_SoilLayers.xlsx InputData_Species.xlsx

Main.txt

kpalmqui commented 4 years ago

@chaukap I figured out the problem - which is pervasive through all of our branches.

The code fails when you specify only particular RCP-GCM combinations (and not all). For example:

#Specify the RCP/GCM combinations
climate.conditions <- c(climate.ambient,  "RCP45.CanESM2", "RCP85.MRI-CGCM3")

I tested using these conditions in both this development branch and the phenology development branch and both runs failed with the error above.

Once I removed the last scenario above, the runs for the streamline_output branch completed.

chaukap commented 4 years ago

@kpalmqui That's interesting... I'l look into it today and we can discuss in our meeting.

chaukap commented 4 years ago

@kpalmqui commit 7fb6985 should prevent the bug you described.

kpalmqui commented 4 years ago

@chaukap I ran a test run overnight on Teton and the code did not fail due to exceeding memory (128 GB). This is good news. However, we can't expect that all machines will have this level of memory. I think it is wise then to implement the change we were discussing with regard to the SOILWAT2 daily output to reduce the memory load? What do you think?

chaukap commented 4 years ago

@kpalmqui I think it's a good idea what will allow the program to run on more machines. I'll implement the changes and get a test run going now.

kpalmqui commented 4 years ago

@chaukap great - thank you!

kpalmqui commented 4 years ago

@chaukap I tested this on a local machine and Teton and all seems to be working as expected with a few exceptions. Was this was you indeeded the print statements to look like? Specificially, ^MDone

^MDone
[1] "RCP RCP85 DONE"
[1] "YEAR d50yrs DONE"
storing SOILWAT output aggregated across-iterations (-o flag)
STEPWAT  init_args() executed successfully
Number of iterations: 1
Number of years: 300
^MDone
[1] "RCP RCP85 DONE"
[1] "YEAR d50yrs DONE"
storing SOILWAT output aggregated across-iterations (-o flag)
STEPWAT  init_args() executed successfully
Number of iterations: 1
Number of years: 300
^MDone
storing SOILWAT output aggregated across-iterations (-o flag)
STEPWAT  init_args() executed successfully
Number of iterations: 1
Number of years: 300
^MDone
storing SOILWAT output aggregated across-iterations (-o flag)
STEPWAT  init_args() executed successfully
Number of iterations: 1
Number of years: 300
^MDone
[1] "RCP RCP85 DONE"
[1] "YEAR d90yrs DONE"
[1] "Soil treatment  soils_fixed2  DONE"
[1] "Species treatment  species_fixed2  DONE"
[1] "GCM FGOALS-s2 DONE"
[1] "RCP RCP85 DONE"
[1] "YEAR d90yrs DONE"
[1] "Soil treatment  soils_fixed2  DONE"
[1] "Species treatment  species_fixed2  DONE"
[1] "GCM MRI-CGCM3 DONE"
[1] "Site 1 Done"
     user    system   elapsed
13057.744   656.482  8532.662

I will post the other exception in a new issue.

chaukap commented 4 years ago

@kpalmqui I'm assuming the "Done" part is from STEPWAT2. As for the "^M", this is the output that you would see from hitting ctrl + M on the keyboard. ctrl + M has no meaning in bash aside from "Enter", so it shouldn't impact the program.

This is strange behavior... did this error appear in the local run or the Teton run?

kpalmqui commented 4 years ago

Teton run

chaukap commented 4 years ago

@kpalmqui Unfortunately I was unable to replicate the ^M in output. Has it caused any serious problems?

kpalmqui commented 4 years ago

@chaukap not that I know of.

kpalmqui commented 4 years ago

Resolved by the above commits