OPM / LBPM

Pore scale modelling
https://lbpm-sim.org/
GNU General Public License v3.0

Overloading issue #74

Closed: yning2 closed this issue 1 year ago

yning2 commented 1 year ago

Dear developer,

I have run into an issue where my job was overloading all of the allocated compute nodes. It appears that each of the processes backing the 150 ranks spawned 10 threads, which leads to the overload. I am using the CPU version, and the simulation seems to produce reasonable results in timelog.csv despite the overloading. Below is my input file. It would be greatly appreciated if you could help figure out why this happens.

Domain {
   Filename = "../LBPM_input.bin"
   ReadType = "8bit"                    // data type
   nproc = 5, 6, 5                      // Number of processors (Npx,Npy,Npz)
   n = 10, 50, 40                       // Size of local domain (Nx,Ny,Nz)
   N = 50, 300, 200                     // size of the input image
   InletLayers = 0, 0, 0                // number of mixing layers at the inlet
   OutletLayers = 0, 0, 0               // number of mixing layers at the outlet
   voxel_length = 1.0                   // voxel length (in microns)
   ReadValues = 0, 1, 2                 // labels within the original image
   WriteValues = 0, 1, 2                // associated labels to be used by LBPM
   BC = 4                               // Flux BC
}
Color {
   timestepMax = 2000000                // maximum timestep
   alpha = 0.005                        // controls interfacial tension
   beta = 0.95;                         // controls the interface width
   rhoA = 1.0                           // controls the density of fluid A
   rhoB = 1.0                           // controls the density of fluid B
   tauA = 0.7                           // controls the viscosity of fluid A
   tauB = 0.7                           // controls the viscosity of fluid B
   F = 0, 0, 0                          // body force
   din = 1.0                            // inlet density (controls pressure)
   dout = 1.0                           // outlet density (controls pressure)
   WettingConvention = "SCAL"           // convention for sign of wetting affinity
   ComponentLabels = 0                  // image labels for solid voxels
   ComponentAffinity = -0.9             // controls the wetting affinity for each label
   Restart = false
   flux = -100.0                        // volumetric flux at the z-inlet in voxels per timestep
}
Analysis {
   analysis_interval = 100              // logging interval for timelog.csv
   subphase_analysis_interval = 1000    // logging interval for subphase.csv
   visualization_interval = 1000        // interval to write visualization files
   restart_interval = 1000000           // interval to write restart file
   restart_file = "Restart"             // base name of restart file
}
Visualization {
   format = "hdf5"
   write_silo = true                    // write SILO databases with assigned variables
   save_8bit_raw = true                 // write labeled 8-bit binary files with phase assignments
   save_phase_field = true              // save phase field within SILO database
   save_pressure = true                 // save pressure field within SILO database
   save_velocity = true                 // save velocity field within SILO database
}
FlowAdaptor {
}
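(For reference, this decomposition is self-consistent: nproc = 5, 6, 5 gives 5 × 6 × 5 = 150 MPI ranks, and along each axis nproc × n matches N, i.e. 5 × 10 = 50, 6 × 50 = 300, 5 × 40 = 200.)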

yning2 commented 1 year ago

I am using Intel MPI, and here is my run command: mpirun -n 150 lbpm_color_simulator twophase.db

JamesEMcClure commented 1 year ago

Can you share a bit more information about the underlying hardware? (i.e. identifying information for the processor, usually available from the /proc/cpuinfo file or similar).

Something like this would most typically occur when you launch MPI with more processes than physical cores (e.g. "hyper-threading"). Hyper-threading is really oriented toward data center applications where individual threads do not fully utilize the CPU (cloud systems often support large numbers of idle threads). On the other hand, MPI applications tend to fully utilize CPU cores, and hyper-threading can actually degrade performance because the threads contend for the resources of a single CPU core.
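For reference, two standard Linux commands will show the core and thread layout (nothing LBPM-specific; the exact output format varies by distribution):

```sh
# Summary of sockets, cores per socket, and threads per core
lscpu

# Per-logical-CPU details (model name, core id, siblings, etc.)
cat /proc/cpuinfo
```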

yning2 commented 1 year ago

Thank you @JamesEMcClure for the reply. Here is the info about the hardware I was running LBPM on.

Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                96
On-line CPU(s) list:   0-95
Thread(s) per core:    1
Core(s) per socket:    24
Socket(s):             4
NUMA node(s):          4
Vendor ID:             GenuineIntel
CPU family:            6
Model:                 85
Model name:            Intel(R) Xeon(R) Platinum 9242 CPU @ 2.30GHz
Stepping:              7
CPU MHz:               3099.890
CPU max MHz:           3800.0000
CPU min MHz:           1000.0000
BogoMIPS:              4600.00
Virtualization:        VT-x
L1d cache:             32K
L1i cache:             32K
L2 cache:              1024K
L3 cache:              36608K
NUMA node0 CPU(s):     0-23
NUMA node1 CPU(s):     24-47
NUMA node2 CPU(s):     48-71
NUMA node3 CPU(s):     72-95

JamesEMcClure commented 1 year ago

Based on this, I would recommend that you run no more than 96 processes per physical node, and that you bind processes to physical cores. See this link for information about how to do this with Intel MPI:

https://www.intel.com/content/www/us/en/docs/mpi-library/developer-reference-linux/2021-8/process-pinning.html

If you run with more than 96 ranks per node, the performance will probably go down, since multiple processes will start to fight each other for the resources of a single core.
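For concreteness, a pinned launch might look something like the sketch below. I_MPI_PIN_DOMAIN and the -ppn (processes per node) option are standard Intel MPI controls covered in the link above, but treat the exact values here as assumptions to adjust for your system.

```sh
# Sketch only: bind each rank to a physical core and cap ranks per node at 96
export I_MPI_PIN=1              # enable process pinning (typically on by default)
export I_MPI_PIN_DOMAIN=core    # one pinning domain per physical core
mpirun -n 150 -ppn 96 lbpm_color_simulator twophase.db
```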

Let me know if this resolves your problem.

yning2 commented 1 year ago

@JamesEMcClure Sorry, I made it confusing with my run script. I submitted the job to 2 nodes of 96 cores each, but the code was run on 150 ranks, which left 42 cores idle.

I think I've found the issue. After I included the variable load_balance = "independent" and also added a flag to my mpirun command to make sure the job is evenly distributed across those two nodes, the issue seems to be resolved.
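For anyone who runs into the same thing, here is a sketch of the input change, assuming load_balance belongs in the Domain section of the input database (the exact mpirun flag I added for even distribution is not shown; an option such as -ppn from the sketch above serves that purpose):

```
Domain {
   // ... existing entries as above ...
   load_balance = "independent"   // sketch: the setting described in this comment
}
```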

Thank you for helping investigate the issue.

yning2 commented 1 year ago

Issue resolved.