Crystal-szj opened this issue 1 year ago
@Crystal-szj nice job in creating the issue :+1:, thanks. Here are the answers:

- You created `STEMMUS_SCOPE_SS_exe.m` that accepts two input variables. Instead, you can implement it as below: add the `parameter_setting_file` to your `config_file`; for example, at the end of the file add a new line `ParameterSettingsPath = ../../O2_para_lists/para_value_SS001.xlsx`. Use `io.read_config(CFG)` to get the path, for example `[DataPaths, forcingFileName, numberOfTimeSteps, startDate, endDate, gsOption, phsOption, RunningMessages, ParameterSettingsPath] = io.read_config(CFG);`. Then read the parameters with `para_sens = readtable(ParameterSettingsPath);`. So, you need to move things around in `STEMMUS_SCOPE_SS.m`. If the path to `parameter_setting_file` changes every time you run the model, write a function in the `run_model.py` file that reads the `config_file` and writes it again with a new path to `parameter_setting_file`.
- The variable `job_id` is only used to write the log file. You can create a `run_model_local.py` file and remove `job_id`, then use `run_model_local.py` locally. Just to give you ideas about reading and writing the `config_file` in Python, here are some examples: read_config and update_config. Use them as examples; you need to write your own functions.
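For illustration, a minimal sketch of such helper functions, assuming the config file uses a plain `key=value` line format (the actual read_config and update_config examples linked above may differ):

```python
# Hypothetical helpers for the sensitivity-analysis loop: read a key=value
# config file, point ParameterSettingsPath at a new parameter file, and
# rewrite the config before each model run. All paths are placeholders.
from pathlib import Path

def read_config(config_path):
    """Parse a key=value config file into a dict."""
    config = {}
    for line in Path(config_path).read_text().splitlines():
        if "=" in line:
            key, value = line.split("=", 1)
            config[key.strip()] = value.strip()
    return config

def update_config(config_path, parameter_path):
    """Rewrite the config file with a new ParameterSettingsPath."""
    config = read_config(config_path)
    config["ParameterSettingsPath"] = parameter_path
    text = "\n".join(f"{key}={value}" for key, value in config.items())
    Path(config_path).write_text(text + "\n")

# Example: give each sensitivity run its own parameter file,
# then run the model once per parameter set.
for parameter_file in sorted(Path("../../O2_para_lists").glob("para_value_SS*.xlsx")):
    update_config("config_file.txt", str(parameter_file))
```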
> The variable `job_id` is only used to write the log file. You can create a `run_model_local.py` file and remove `job_id`. Then use `run_model_local.py` locally.
@SarahAlidoost Hi Sarah, many thanks for your suggestions. I commented out the `job_id` part and the `argparse` part, and tried running `run_model_local.py` on my computer in PyCharm. I'm testing with a test file at AR-SLu, but I get an error when I run this line here. The `exit_code` is 1 instead of 0 or 139, see here. I copied the error message here:
```
D:\software\Anaconda3\envs\pystemmusscope\python.exe F:\P1\sensitivitiy_analysis_CLM5_scheme\STEMMUS_SCOPE_SS\run_model_on_snellius\run_model_local.py
D:\software\Anaconda3\envs\pystemmusscope\lib\site-packages\xarray\core\accessor_dt.py:72: FutureWarning: Index.ravel returning ndarray is deprecated; in a future version this will return a view on self.
  values_as_series = pd.Series(values.ravel(), copy=False)
Traceback (most recent call last):
  File "F:\P1\sensitivitiy_analysis_CLM5_scheme\STEMMUS_SCOPE_SS\run_model_on_snellius\run_model_local.py", line 93, in <module>
    run_model_local(0)
  File "F:\P1\sensitivitiy_analysis_CLM5_scheme\STEMMUS_SCOPE_SS\run_model_on_snellius\run_model_local.py", line 40, in run_model_local
    model_log = model.run()
  File "D:\software\Anaconda3\envs\pystemmusscope\lib\site-packages\PyStemmusScope\stemmus_scope.py", line 206, in run
    result = _run_sub_process(args, None)
  File "D:\software\Anaconda3\envs\pystemmusscope\lib\site-packages\PyStemmusScope\stemmus_scope.py", line 85, in _run_sub_process
    raise subprocess.CalledProcessError(
subprocess.CalledProcessError: Command '['F:\P1\sensitivitiy_analysis_CLM5_scheme\STEMMUS_SCOPE_SS\run_model_on_snellius\exe\STEMMUS_SCOPE F:\P1\sensitivitiy_analysis_CLM5_scheme\STEMMUS_SCOPE_SS\run_model_on_snellius\input\AR-SLu_2023-06-26-1225\AR-SLu_2023-06-26-1225_config.txt']' returned non-zero exit status 1.

Process finished with exit code 1
```
Could you please help me to figure out what's wrong here? Please let me know if more information is needed. Thanks very much.
@Crystal-szj there are several things to check:

- the version of pystemmusscope and stemmus_scope, see here.
- if you are running the exe file with the MATLAB Runtime, you might need to set `LD_LIBRARY_PATH`, see the documentation.
- `model.setup()` generates input data in an input directory. Could you check if you can run your stemmus_scope code using the input data with MATLAB? It should return more info about errors, if any.
- check if the generated exe file works.
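As a sketch of the `LD_LIBRARY_PATH` point, assuming a Linux-style MATLAB Runtime install (the runtime location below is a placeholder; see the PyStemmusScope documentation for the exact directories):

```python
# Make the MATLAB Runtime libraries visible to the compiled exe before it
# is launched. The runtime root is a placeholder for your own installation.
import os
import subprocess

mcr_root = "/opt/MATLAB_Runtime/v910"  # placeholder install location
lib_dirs = [
    f"{mcr_root}/runtime/glnxa64",
    f"{mcr_root}/bin/glnxa64",
    f"{mcr_root}/sys/os/glnxa64",
]

env = os.environ.copy()
env["LD_LIBRARY_PATH"] = ":".join(
    filter(None, lib_dirs + [env.get("LD_LIBRARY_PATH", "")])
)

# The child process inherits the modified environment.
subprocess.run(["./exe/STEMMUS_SCOPE_SS", "./config_file.txt"], env=env)
```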
@SarahAlidoost Many thanks for your advice.

- I installed `pystemmusscope`. The version is 0.3.0.
- I set `LD_LIBRARY_PATH` accordingly. However, I'm running and debugging the Python code `run_model.py` and currently not using the MATLAB Runtime.
- About `model.setup()`: yes, it created an input directory, including the input files but without the `.nc` file. When I use the `config_file` it generated in the input directory and the netCDF file in `InputPath`, it works.
- The generated exe file is named `STEMMUS_SCOPE_SS.exe`, and the config file is named `config_file_snellius_sensitivity_analysis`.
I can run the exe file via the Python console:

```python
import subprocess
subprocess.run(['.\exe\STEMMUS_SCOPE_SS.exe', '.\config_file_snellius_sensitivity_analysis.txt'])
```

or the WSL terminal:

```
./exe/STEMMUS_SCOPE_SS.exe ./config_file_snellius_sensitivity_analysis.txt
```

Both of the above commands work well, so I think the exe file works.
However, when I run `model.run()`, the program breaks and does not continue executing.
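One way to get more detail in this situation (a sketch, not from the thread) is to invoke the exe directly and capture its output streams, so the model's own error messages are visible instead of only a non-zero exit status:

```python
# Run the exe directly and print its stdout/stderr for debugging.
import subprocess

result = subprocess.run(
    ["./exe/STEMMUS_SCOPE_SS.exe", "./config_file_snellius_sensitivity_analysis.txt"],
    capture_output=True,
    text=True,
)
print("exit code:", result.returncode)
print(result.stdout)
print(result.stderr)
```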
The above problem may be caused by the difference in operating systems (e.g. Linux vs. Windows). The documented workflow works well on a Linux system but failed on WSL, see here. In addition, an executable file generated on one system may not be compatible with another, so it is better to regenerate the executable file when running on a new system.
> @Crystal-szj there are several things to check:
> - the version of pystemmusscope and stemmus_scope, see here.
> - if you are running the exe file with the MATLAB Runtime, you might need to set `LD_LIBRARY_PATH`, see the documentation.
> - `model.setup()` generates input data in an input directory. Could you check if you can run your stemmus_scope code using the input data with MATLAB? It should return more info about errors, if any.
> - check if the generated exe file works.
@SarahAlidoost Hi Sarah, many thanks for your advice. I installed a Linux system, and now the code works well. However, when I did the test run, I encountered the same issue as Qianqian about allocating one core per task. We discussed it together, but it is still a challenge for us to find a solution. I wonder if you encountered a similar situation in your experience running the 170 sites, and whether you could share any insights or suggestions.

All the code has been uploaded to the EcoExtreML/STEMMUS_SCOPE_sensitivity_analysis repository. Here is some detailed information:
- To submit the task to Snellius, I used `run_stemmus_scope_snellius.sh`. In this shell script, a Python script named `run_model_on_snellius_sensitivity_analysis.py` is called to execute the MATLAB executable file named `STEMMUS_SCOPE_SS`. For the test run, I limited it to only 480 timesteps (instead of the complete study period of 10608 timesteps) to assess CPU performance.
- I used `squeue` to obtain the node_id information and then accessed the node using `ssh node_id`. After that, I used the command `htop -u <user name>` to gather the following information.

Please let me know if you need further information. Any insights or suggestions you can provide would be immensely helpful. Sincere thanks for your time and support.
> To submit the task to Snellius, I used `run_stemmus_scope_snellius.sh`.
I see that you commented out the `for` loop. Also, the variables `ncores`, `i`, and `k` are not used in your code. The loop is exactly the place where parallel execution is implemented. I am not sure if you saw the SURF documentation that I already sent to Qianqian; here are the links:

https://servicedesk.surf.nl/wiki/display/WIKI/Methods+of+parallelization
https://servicedesk.surf.nl/wiki/display/WIKI/Example+job+scripts#Examplejobscripts-Singlenode,concurrentprogramsonthesamenode(CPUandGPU)
> I see that you commented out the `for` loop. Also, the variables `ncores`, `i`, and `k` are not used in your code. The loop is exactly the place where parallel execution is implemented.
Thanks for your prompt response and the links. I understand your approach, where each site is assigned to a separate core for parallel execution. That enables the completion of 170 sites in six rounds, with 32 sites processed per round.

However, considering the need for one task to run on a single core, as both you and Qianqian mentioned, I believe I should follow the 'parallel execution of serial programs' approach, where parallelism is not programmed into the STEMMUS_SCOPE model. According to this method, if I submit one task, only one CPU should be utilized, and if I submit ten tasks, ten CPUs should work concurrently.

I noticed from the above screenshot that multiple cores were active, even though I submitted only one task. Does this indicate the presence of parallelism within the executable file? My question is whether I should ensure "one task, one CPU" or whether I can overlook this issue and proceed with using the `for` loop to run the 380 cases.

Thanks again for your guidance and expertise.
> According to this method, if I submit one task, only one CPU should be utilized, and if I submit ten tasks, ten CPUs should work concurrently.

No, this is not the case unless we tell the computer to run the ten tasks on ten cores. It means that we should implement a method of parallelization, e.g. a `for` loop with `&` and `wait`. You need to figure out how many cores are used by one task (your code). It is okay if the task needs more than one core. But we need this information, i.e. number of cores, memory usage, etc., to be able to implement a method of parallelization.
> However, considering the need for one task to run on a single core, as both you and Qianqian mentioned, I believe I should follow the 'parallel execution of serial programs' approach, where parallelism is not programmed into the STEMMUS_SCOPE model. According to this method, if I submit one task, only one CPU should be utilized, and if I submit ten tasks, ten CPUs should work concurrently.

Your code is different from Qianqian's code and does not use many Python libraries. If you are just running stemmus_scope, it should use only one core, unless your stemmus_scope is very different from the one in the main branch. If this is not the case, please check the code that builds the exe file and make sure that the argument `-R singleCompThread` is set.
@SarahAlidoost Hi Sarah, many thanks for your reply.

> You need to figure out how many cores are used by one task (your code). It is okay if the task needs more than one core. But we need this information, i.e. number of cores, memory usage, etc., to be able to implement a method of parallelization.
I used the `for` loop in `run_stemmus_scope_snellius.sh` and performed test runs using 1, 2, and 4 cases. Sometimes the CPU usage per core exceeds 100%, and two cores are activated for each single case. It's worth noting that I set the argument "-R singleCompThread" when building the executable file, see here. Detailed information for each of the test runs:

- Test run with 1 case: this shell script. When I used `htop -u <username>` to check the CPU performance, two cores were activated (one running and one sleeping, see the value of the "S" column).
Test run with 2 cases: this shell script, but it throw an error:
I added sleep 90
to solve this problem and ran it again see here. When it running, 4 cores were activated with 2 running and 2 sleeping
Test run with 4 cases: This test run involving for cases submitted via the script. There were 8 cores activated.
I would like to ask whether this occasional CPU usage exceeding 100%, with two cores activated for one case, is a common situation on a supercomputer.
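As a cross-check against `htop`, per-run CPU and memory can also be sampled from Python (a sketch assuming the third-party `psutil` package; paths are placeholders):

```python
# Sample how many cores and how much memory a single model run uses.
# cpu_percent() of 100% corresponds to one fully busy core; values above
# 100% mean the run is using more than one core.
import subprocess
import psutil  # third-party: pip install psutil

proc = subprocess.Popen(["./exe/STEMMUS_SCOPE_SS", "./config_file.txt"])
ps = psutil.Process(proc.pid)
try:
    while proc.poll() is None:
        cpu = ps.cpu_percent(interval=2.0)
        rss = ps.memory_info().rss / 1e6  # resident memory in MB
        print(f"cpu: {cpu:.0f}%  mem: {rss:.0f} MB  threads: {ps.num_threads()}")
except psutil.NoSuchProcess:
    pass  # the run finished between two samples
print("exit code:", proc.wait())
```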
In addition, the `cores per node` values in the slurm log file changed for the same task, even though I didn't change any setting in `run_stemmus_scope_snellius.sh`. The terminal displayed a message stating "You will be charged for 0.25 core". However, upon checking the slurm_{jobid}.out file, I found that the `cores per node` values varied. For example, I submitted the same job twice, but the slurm_{job_id}.out files show different `cores per node`, `Job Wall-clock time`, `CPU utilization`, `CPU efficiency`, `Memory utilization`, and `Memory efficiency` between the two job executions.

I'm seeking your advice on any additional steps or considerations that should be taken before executing the 380 cases. Thanks again for your help and time.
> If you are just running stemmus_scope, it should use only one core, unless your stemmus_scope is very different from the one in the main branch. If this is not the case, please check the code that builds the exe file and make sure that the argument `-R singleCompThread` is set.
The STEMMUS_SCOPE version I used is based on version 1.1.9, and I added the plant hydraulics part as a separate function. I'd like to clarify that I have not used any parallel computing constructs such as `parfor` within my function. The execution currently runs in a sequential manner.

If you have any further questions or require more details, please let me know. Thanks for your support.
@SarahAlidoost Hi Sarah,

I hope this message finds you well. I want to do a sensitivity analysis on STEMMUS_SCOPE by setting different sets of parameters (running the model 380 times), and I would like to use parallel computing for this part.

To begin, I have created a new executable file via `STEMMUS_SCOPE_SS_exe.m` that requires two input parameters: one for the config file (`config_file`) and the other for the parameters (`parameter_setting_file`). The MATLAB code portion has been completed. My intention is to use the existing `run_STEMMUS_SCOPE_inSnellius` framework. If I understand correctly, I need to modify the `run_model.py` file to iterate through the input parameter files instead of the input forcing data for 170 sites. The pystemmusscope environment is activated. Now I have a couple of questions:
1. In the `run_model.py` file, see here, we need to create an instance of the model. However, the `parameter_file` is not an input of the StemmusScope class. Does this mean I need to modify the StemmusScope class in the `pystemmusscope` package and reinstall the package?
2. Since `run_model.py` requires the input `job_id`, is it possible for me to test this modified version on my local computer to ensure there are no bugs before submitting it to Snellius? I'm not sure how to handle this during the development phase.

Please let me know if any information needs to be provided. I would greatly appreciate it if you could share your experience and provide guidance on how to address these questions.
Best regards, Zengjing