EcoExtreML / STEMMUS_SCOPE

Integrated code of SCOPE and STEMMUS
https://EcoExtreML.github.io/STEMMUS_SCOPE
GNU General Public License v3.0

Running STEMMUS_SCOPE sensitivity analysis on Snellius with parallel computing #187

Open Crystal-szj opened 1 year ago

Crystal-szj commented 1 year ago

@SarahAlidoost Hi Sarah,

I hope this message finds you well. I want to do a sensitivity analysis on STEMMUS_SCOPE by setting different sets of parameters (running the model 380 times). I would like to use parallel computing to finish this part.

To begin, I have created a new executable file via STEMMUS_SCOPE_SS_exe.m that requires two input parameters: one for the config file (config_file) and the other for the parameters (parameter_setting_file). The MATLAB code portion has been completed.

My intention is to utilize the existing _run_STEMMUS_SCOPE_inSnellius framework. If I understand correctly, I need to modify the run_model.py file to iterate through the input parameter file instead of the input forcing data for 170 sites.
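
A rough sketch of the iteration I have in mind (the file layout, glob pattern, and helper names here are hypothetical; launching the model itself is omitted):

```python
from pathlib import Path


def list_parameter_files(param_dir):
    """Collect the parameter-setting files, one per sensitivity run (sketch).

    The "parameters_*.csv" naming pattern is an assumption for illustration.
    """
    return sorted(Path(param_dir).glob("parameters_*.csv"))


def plan_runs(config_file, param_dir):
    """Pair the shared config file with each parameter file.

    The model would then be launched once per (config, parameters) pair;
    the launch itself is left out of this sketch.
    """
    return [(config_file, p) for p in list_parameter_files(param_dir)]
```
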

The pystemmusscope environment is activated. Now I have a couple of questions:

  1. In the run_model.py file, see here, we need to create an instance of the model. However, the parameter_file is not an input for the StemmusScope class. Does this mean I need to modify the StemmusScope class in the pystemmusscope package and reinstall the package?
  2. Since run_model.py requires a job_id input, is it possible for me to test this modified version on my local computer to ensure there are no bugs before submitting it to Snellius? I'm not sure how to handle this in the development phase.

Please let me know if any further information is needed. I would greatly appreciate it if you could share your experience and provide guidance on how to address these questions.

Best regards, Zengjing

SarahAlidoost commented 1 year ago

@Crystal-szj nice job in creating the issue :+1: , thanks. Here are the answers:

  1. You don't need a new function STEMMUS_SCOPE_SS_exe.m that accepts two input variables. Instead, you can implement it as described below.

So, you need to move things around in STEMMUS_SCOPE_SS.m. If the path to parameter_setting_file changes every time you run the model, then in the run_model.py file, write a function that reads the config_file and writes it again with a new path to parameter_setting_file.

SarahAlidoost commented 1 year ago
  1. the variable job_id is only used to write the log file. You can create a run_model_local.py file and remove job_id. Then use run_model_local.py locally.
SarahAlidoost commented 1 year ago

Just to give you ideas about reading and writing config_file in python, here are some examples read_config and update_config. Use them as an example, you need to write your own functions.
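
A minimal sketch of what such functions could look like, assuming the config file uses key=value lines (the function names follow the linked examples, but the bodies and the keys shown are illustrative, not the package's actual implementation):

```python
from pathlib import Path


def read_config(config_path):
    """Read a key=value config file into a dict (sketch).

    Blank lines and '#' comment lines are skipped; comments are not
    preserved on rewrite in this simplified version.
    """
    config = {}
    for line in Path(config_path).read_text().splitlines():
        line = line.strip()
        if line and not line.startswith("#") and "=" in line:
            key, value = line.split("=", 1)
            config[key.strip()] = value.strip()
    return config


def update_config(config_path, new_parameter_path, key="parameter_setting_file"):
    """Rewrite the config file with a new parameter_setting_file path (sketch)."""
    config = read_config(config_path)
    config[key] = str(new_parameter_path)
    text = "\n".join(f"{k}={v}" for k, v in config.items()) + "\n"
    Path(config_path).write_text(text)
    return config
```

Calling update_config once per sensitivity run, before creating the model instance, would let each run pick up its own parameter file without changing the exe.
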

Crystal-szj commented 1 year ago
  1. the variable job_id is only used to write the log file. You can create a run_model_local.py file and remove job_id. Then use run_model_local.py locally.

@SarahAlidoost Hi Sarah, many thanks for your suggestions. I commented out the job_id part and the argparse part, and tried running run_model_local.py on my computer in PyCharm. I'm testing with a test file for AR-SLu, but I get an error when I run this line here. The exit_code is 1 instead of 0 or 139, see here. I copied the error message here:

```
D:\software\Anaconda3\envs\pystemmusscope\python.exe F:\P1\sensitivitiy_analysis_CLM5_scheme\STEMMUS_SCOPE_SS\run_model_on_snellius\run_model_local.py
D:\software\Anaconda3\envs\pystemmusscope\lib\site-packages\xarray\core\accessor_dt.py:72: FutureWarning: Index.ravel returning ndarray is deprecated; in a future version this will return a view on self.
  values_as_series = pd.Series(values.ravel(), copy=False)
Traceback (most recent call last):
  File "F:\P1\sensitivitiy_analysis_CLM5_scheme\STEMMUS_SCOPE_SS\run_model_on_snellius\run_model_local.py", line 93, in <module>
    run_model_local(0)
  File "F:\P1\sensitivitiy_analysis_CLM5_scheme\STEMMUS_SCOPE_SS\run_model_on_snellius\run_model_local.py", line 40, in run_model_local
    model_log = model.run()
  File "D:\software\Anaconda3\envs\pystemmusscope\lib\site-packages\PyStemmusScope\stemmus_scope.py", line 206, in run
    result = _run_sub_process(args, None)
  File "D:\software\Anaconda3\envs\pystemmusscope\lib\site-packages\PyStemmusScope\stemmus_scope.py", line 85, in _run_sub_process
    raise subprocess.CalledProcessError(
subprocess.CalledProcessError: Command '['F:\P1\sensitivitiy_analysis_CLM5_scheme\STEMMUS_SCOPE_SS\run_model_on_snellius\exe\STEMMUS_SCOPE F:\P1\sensitivitiy_analysis_CLM5_scheme\STEMMUS_SCOPE_SS\run_model_on_snellius\input\AR-SLu_2023-06-26-1225\AR-SLu_2023-06-26-1225_config.txt']' returned non-zero exit status 1.

Process finished with exit code 1
```

Could you please help me to figure out what's wrong here? Please let me know if more information is needed. Thanks very much.

SarahAlidoost commented 1 year ago

@Crystal-szj there are several things to check:

  • the version of pystemmusscope and stemmus_scope, see here.
  • if you are running the exe file with MATLAB Runtime, you might need to set LD_LIBRARY_PATH, see the documentation.
  • model.setup() generates input data in an input directory. Could you check if you can run your stemmus_scope code using the input data with MATLAB? It should return more info about errors, if any.
  • check if the generated exe file works.
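
One of the things to check is the LD_LIBRARY_PATH for the MATLAB Runtime. A typical export could look like this (the runtime root and version directory are assumptions; adjust them to your install and release):

```shell
# Root of the MATLAB Runtime install (assumed path; adjust to your system
# and release, e.g. v910 corresponds to R2021a).
MCR_ROOT="$HOME/MATLAB_Runtime/v910"

# Prepend the runtime library directories that the compiled exe needs
# at load time on 64-bit Linux (glnxa64).
export LD_LIBRARY_PATH="$MCR_ROOT/runtime/glnxa64:$MCR_ROOT/bin/glnxa64:$MCR_ROOT/sys/os/glnxa64:${LD_LIBRARY_PATH:-}"

echo "$LD_LIBRARY_PATH"
```
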

Crystal-szj commented 1 year ago

@SarahAlidoost Many thanks for your advice.

However, when I run model.run(), the program breaks and does not continue executing.

Crystal-szj commented 1 year ago

The above problem may be caused by the different operating systems (e.g. Linux and Windows). The steps in the documentation work well on a Linux system, but failed on WSL, see here. In addition, an executable file generated on one system may not be compatible with another. It's better to regenerate the executable file when running it on a new system.

Crystal-szj commented 1 year ago

@Crystal-szj there are several things to check:

  • the version of pystemmusscope and stemmus_scope, see here.
  • if you are running the exe file with MATLAB Runtime, you might need to set LD_LIBRARY_PATH, see the documentation.
  • model.setup() generates input data in an input directory. Could you check if you can run your stemmus_scope code using the input data with MATLAB? It should return more info about errors, if any.
  • check if the generated exe file works.

@SarahAlidoost Hi Sarah, many thanks for your advice. I installed a Linux system, and now the code works well. However, when I did the test run, I encountered the same issue as Qianqian about allocating one core per task. We discussed it together, but it's still a challenge for us to find a solution. I wondered if you encountered a similar situation in your experience of running the 170 sites, and if you could share any insights or suggestions you may have.

All the codes have been uploaded to EcoExtreML/STEMMUS_SCOPE_sensitivity_analysis repository. Here is some detailed information.

  1. To submit the task to Snellius, I used the run_stemmus_scope_snellius.sh. In this shell script, a Python script named run_model_on_snellius_sensitivity_analysis.py is called to execute the MATLAB executable file named STEMMUS_SCOPE_SS. For the test run, I limited it to only 480 timesteps (instead of the complete study period of 10608 timesteps) to assess CPU performance.
  2. To monitor the CPU usage, I used squeue to obtain the node_id information and then accessed the node using ssh node_id. After that, I used the command htop -u <user name> to gather the following information. [htop screenshot]
  3. Here is the log file. [log file screenshot]

Please let me know if you need further information. Any insights or suggestions you can provide would be immensely helpful. Sincerely thanks for your time and support.

SarahAlidoost commented 1 year ago
  1. To submit the task to Snellius, I used the run_stemmus_scope_snellius.sh.

I see that you commented out the for loop. Also, the variables ncores, i, and k are not used in your code. The loop is exactly the place where parallel execution is implemented. I am not sure if you saw the SURF documentation that I have already sent to Qianqian; here are the links:
https://servicedesk.surf.nl/wiki/display/WIKI/Methods+of+parallelization
https://servicedesk.surf.nl/wiki/display/WIKI/Example+job+scripts#Examplejobscripts-Singlenode,concurrentprogramsonthesamenode(CPUandGPU)

Crystal-szj commented 1 year ago

I see that you commented out the for loop. Also, the variables ncores, i, and k are not used in your code. The loop is exactly the place where parallel execution is implemented.

Thanks for your prompt response and links. I understand your approach, where each site is assigned to a separate core for parallel execution. That enables the completion of 170 sites in six rounds, with 32 sites processed per round.

However, considering the need for one task to run on a single core, as both you and Qianqian mentioned, I believe I should follow the 'parallel execution of serial programs' approach, where parallelism is not programmed into the STEMMUS_SCOPE model. According to this method, if I submit one task, only one CPU should be utilized, and if I submit ten tasks, ten CPUs should work concurrently.

I noticed from the above screenshot that multiple cores were active, even though I submitted only one task. Does this indicate the presence of parallelism within the executable file? My question is whether I should ensure "one task, one CPU", or whether I can overlook this issue and proceed with using the for loop to run the 380 cases.

Thanks again for your guidance and expertise.

SarahAlidoost commented 1 year ago

According to this method, if I submit one task, only one CPU should be utilized, and if I submit ten tasks, ten CPUs should work concurrently.

No, this is not the case unless we tell the computer to run ten tasks on ten cores. It means that we should implement a method of parallelization, e.g. the for loop with wait and &. You need to figure out how many cores are used by one task (your code). It is okay if the task needs more than one core. But we need this information, i.e. number of cores, memory usage, etc., to be able to implement a method of parallelization.
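
The for loop with & and wait can be sketched as follows (the case count and the commented-out model invocation are placeholders; on Snellius the real call would be the STEMMUS_SCOPE exe with a per-case config):

```shell
#!/bin/bash
# Run NCASES cases concurrently: each loop iteration starts one
# background process (&), and `wait` blocks until all have finished.
NCASES=4
: > results.txt   # truncate the result log

for i in $(seq 1 "$NCASES"); do
    # Placeholder for the real model invocation, e.g.:
    #   ./exe/STEMMUS_SCOPE "input/case_${i}_config.txt" &
    ( echo "case $i done" >> results.txt ) &
done

wait   # do not let the job script exit before all background cases finish
echo "all cases finished"
```

Without the trailing & each case would run serially; without wait the job script (and the SLURM allocation with it) could end while cases are still running.
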

SarahAlidoost commented 1 year ago

However, considering the need for one task to run on a single core, as both you and Qianqian mentioned, I believe I should follow the 'parallel execution of serial programs' approach, where parallelism is not programmed into the STEMMUS_SCOPE model. According to this method, if I submit one task, only one CPU should be utilized, and if I submit ten tasks, ten CPUs should work concurrently.

Your code is different from Qianqian's code and does not use many Python libraries. If you are just running stemmus_scope, it should use only one core, unless your stemmus_scope is very different from the one in the main branch. If this is not the case, please check the code that builds the exe file and make sure that the argument -R singleCompThread is set.
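
For reference, this is roughly how that flag appears in an mcc build command (a sketch only; the entry-point file, added source path, and output name are placeholders, and the flag spelling follows what is written above):

```shell
# Sketch of building the standalone exe so that the MATLAB Runtime
# uses a single computational thread. Adapt names/paths to the
# project's actual build script.
mcc -m STEMMUS_SCOPE.m \
    -a ./src \
    -o STEMMUS_SCOPE \
    -R singleCompThread
```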

Crystal-szj commented 1 year ago

@SarahAlidoost Hi Sarah, many thanks for your reply.

You need to figure out how many cores are used by one task (your code). It is okay if the task needs more than one core. But we need this information i.e. number of cores, memory usage, ... to be able to implement a method of parallelization.

  1. Test run with 1 case: this shell script. When I used htop -u <username> to check the CPU performance, two cores were activated (one running and one sleeping, see the value of column "S"). [Screenshot from 2023-07-10 15-22-28]

  2. Test run with 2 cases: this shell script, but it threw an error (see screenshot). I added sleep 90 to solve this problem and ran it again, see here. While it was running, 4 cores were activated, with 2 running and 2 sleeping. [Screenshot from 2023-07-10 16-12-57]

  3. Test run with 4 cases: this test run involved four cases submitted via the script. There were 8 cores activated. [Screenshot from 2023-07-10 16-47-36]

I would like to ask whether the occasional CPU usage exceeding 100%, and two cores being activated for one case, are common situations on a supercomputer.

I'm seeking your advice on any additional steps or considerations before executing the 380 cases. Thanks again for your help and time.

Crystal-szj commented 1 year ago

If you are just running stemmus_scope, it should use only one core except that your stemmus_scope is very different than the one in the main branch. If this is not the case, please check the code to build exe file and make sure that the argument -R singleCompThread is set.

The STEMMUS_SCOPE version I used is based on version 1.1.9, and I added the plant hydraulics part as a separate function. I'd like to clarify that I have not used any parallel computing constructs such as parfor within my function. The execution currently runs sequentially.

If you have any further questions or require more details, please let me know. Thanks for your support.