ComputationalThermodynamics / MAGEMin

The parallel Mineral Assemblage Gibbs Energy Minimization package
GNU General Public License v3.0

Parallel computation slow on apple silicon mac #50

Closed eazzzon closed 1 year ago

eazzzon commented 1 year ago

Hi

I am trying to run the parallel computation with the MATLAB interface, but it runs very slowly; a single-core run is actually faster.

I followed this issue and installed NLopt, MPICH and LAPACK with Homebrew, then ran the makefile, but got the error below:

image

I'm not very familiar with makefiles. Any idea how to make this work? This might be a beginner's issue. Thanks in advance!

boriskaus commented 1 year ago

Yes, I also noticed a few days ago that it is indeed running slow on the newer apple silicon systems, if using the Julia BinaryBuilder version. It runs very fast if you compile it manually/locally, which suggests that there is a problem with the BB version. I'll see if I can find what the issue is (may take a bit, depending on my time).

Local compilation

On a Mac, the easiest way to get the required libraries (nlopt, mpich, lapack) is to install them through Homebrew. First install Homebrew, followed by:

$ brew install nlopt
$ brew install mpich
$ brew install lapack

Next, you will have to adapt the makefile. We just updated it to link to the default Homebrew directories, but from what you show above it seems you are using an older version. Have a look at the latest version here. The makefile is a text file, so you can comment out the lines you don't need and try again.
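If it is unclear which directories the makefile should point to, Homebrew can report its own prefix. The lines below are a minimal sketch and not part of MAGEMin: the helper name `default_brew_prefix` is made up for illustration, and its fallback values are simply Homebrew's default install locations on macOS (/opt/homebrew on Apple Silicon, /usr/local on Intel).

```shell
#!/bin/sh
# Hypothetical helper (not part of MAGEMin): default Homebrew prefix on
# macOS for a given CPU architecture, as reported by `uname -m`.
default_brew_prefix() {
    case "$1" in
        arm64) echo "/opt/homebrew" ;;   # Apple Silicon default
        *)     echo "/usr/local" ;;      # Intel Mac default
    esac
}

# Prefer asking brew itself; fall back to the default only if brew is not on the PATH.
if command -v brew >/dev/null 2>&1; then
    prefix="$(brew --prefix)"
else
    prefix="$(default_brew_prefix "$(uname -m)")"
fi

echo "headers expected under:   $prefix/include"
echo "libraries expected under: $prefix/lib"
```

The two printed directories are where one would expect the include and library paths in the makefile to point on a default Homebrew setup.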

eazzzon commented 1 year ago

Hi,

Thanks, the new makefile works.

image

But the MATLAB interface seems to work only at refinement level 1 (very fast calculation speed) and then stops with an error. Any idea how to fix the "mpiexec path not found" error?

image

PS: I think I am running with the local executable as shown here. The default version is still super slow (which I believe is what you mentioned in this issue).

image

NicolasRiel commented 1 year ago

Hi,

Here the problem is that the calculation is not performed because the path to mpiexec is likely wrong. As you can see in your screenshot: "/user/bin//mpiexec". Try changing the path to /usr/bin rather than /usr/bin/ (no trailing slash).

Hope this helps!

boriskaus commented 1 year ago

Homebrew installs mpiexec in:

/opt/homebrew/bin

so try that. Indeed, the default version has a problem on Apple Silicon at the moment. I have only had an Apple Silicon machine for a few days, so I am hopeful it will be resolved at some stage.
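To check which directory actually holds mpiexec on a given machine, something like the following can help. It is only a sketch: `find_mpiexec_dir` is a made-up helper name, and the candidate list is an assumption based on the directories mentioned in this thread plus the Homebrew default on Intel Macs.

```shell
#!/bin/sh
# Hypothetical helper: print the first directory from the argument list
# that contains an executable file named mpiexec; return non-zero if none does.
find_mpiexec_dir() {
    for dir in "$@"; do
        if [ -f "$dir/mpiexec" ] && [ -x "$dir/mpiexec" ]; then
            echo "$dir"
            return 0
        fi
    done
    return 1
}

# Homebrew on Apple Silicon first, then Homebrew on Intel, then the system directory.
find_mpiexec_dir /opt/homebrew/bin /usr/local/bin /usr/bin \
    || echo "mpiexec not found in the usual places; try: command -v mpiexec"
```

The directory it prints has no trailing slash, which matches the form suggested above for the MPI path setting.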

eazzzon commented 1 year ago

This perfectly solves the issue.

Changing to /usr/bin didn't work; I guess it can't find mpiexec there because mine is installed under Homebrew.

Thanks a lot for helping!

boriskaus commented 1 year ago

I'll leave this open until we resolve the issue with the BinaryBuilder version of MAGEMin being slow.

eazzzon commented 1 year ago

A bit of feedback: I am curious whether the default BinaryBuilder version is slow because of Julia. I ran a loop with the Julia interface and it takes 0.5 to 1 s per point, occasionally 2 s. It could also be that I didn't write the loop wisely...

boriskaus commented 1 year ago

No, I think it has to do with how the binaries are compiled; it's certainly not a Julia issue. On other architectures (Linux, Apple Intel) it runs much faster. The Julia interface uses the same BinaryBuilder version as the 'default' option in the MATLAB GUI, so it is no surprise that it is slow as well.

eazzzon commented 1 year ago

Hi

With the newly updated 1.3.0 the issue seems to be back, but with a different error:

MAGEMin   1.3.0 [06/03/2023]

zsh:1: no matches found: _pseudosection_output.*.*
/opt/homebrew/bin/mpiexec -n 6 ./MAGEMin --out_matlab=0 --solver=1 --Verb=0 --sys_in=mol --db=ig --File=MAGEMin_input.dat --n_points=49 --test=0

command =

    'export PATH=/Users/easonzz/.julia/artifacts/abb7cbd1c6369f566bf0334f8e033f35b639d0e6/bin:/Users/easonzz/.julia/artifacts/5ead90ea92128f3bba70df07a389c372594e09db/bin ;  export DYLD_LIBRARY_PATH=/Users/easonzz/.julia/artifacts/900c5f3ba53bb0d128142a78da39027c65597b0f/lib:/Applications/Julia-1.8.app/Contents/Resources/julia/lib/julia:/Users/easonzz/.julia/artifacts/bf797a6e6d1fcc01635d6b2723ac0390c82f41d2/lib:/Users/easonzz/.julia/artifacts/abb7cbd1c6369f566bf0334f8e033f35b639d0e6/lib:/Users/easonzz/.julia/artifacts/5ead90ea92128f3bba70df07a389c372594e09db/lib:/Applications/Julia-1.8.app/Contents/Resources/julia/bin/../lib/julia:/Applications/Julia-1.8.app/Contents/Resources/julia/bin/../lib; /opt/homebrew/bin/mpiexec -n 6 ./MAGEMin --out_matlab=0 --solver=1 --Verb=0 --sys_in=mol --db=ig --File=MAGEMin_input.dat --n_points=49 --test=0'

No matching processes belonging to you were found

ans =

     1

--------------------------------------------------------------------------
The value of the MCA parameter "plm_rsh_agent" was set to a path
that could not be found:

  plm_rsh_agent: ssh : rsh

Please either unset the parameter, or check that the path is correct
--------------------------------------------------------------------------
[MBAEZ.local:90369] [[INVALID],INVALID] FORCE-TERMINATE AT Not found:-13 - error plm_rsh_component.c(335)
export PATH=/Users/easonzz/.julia/artifacts/abb7cbd1c6369f566bf0334f8e033f35b639d0e6/bin:/Users/easonzz/.julia/artifacts/5ead90ea92128f3bba70df07a389c372594e09db/bin ;  export DYLD_LIBRARY_PATH=/Users/easonzz/.julia/artifacts/900c5f3ba53bb0d128142a78da39027c65597b0f/lib:/Applications/Julia-1.8.app/Contents/Resources/julia/lib/julia:/Users/easonzz/.julia/artifacts/bf797a6e6d1fcc01635d6b2723ac0390c82f41d2/lib:/Users/easonzz/.julia/artifacts/abb7cbd1c6369f566bf0334f8e033f35b639d0e6/lib:/Users/easonzz/.julia/artifacts/5ead90ea92128f3bba70df07a389c372594e09db/lib:/Applications/Julia-1.8.app/Contents/Resources/julia/bin/../lib/julia:/Applications/Julia-1.8.app/Contents/Resources/julia/bin/../lib; /opt/homebrew/bin/mpiexec -n 6 ./MAGEMin --out_matlab=0 --solver=1 --Verb=0 --sys_in=mol --db=ig --File=MAGEMin_input.dat --n_points=49 --test=0: Signal 115

ForwardSimulation_Time =

    0.2515

Error using sscanf
First argument must be a text scalar.

Error in ReadPseudoSectionData_MAGEMin (line 34)
    A       = sscanf(line,'%f %f %f %f %f %f %f %f %f %f %f %f %f %f %f %f %f %f %f');

Error in PerformMAGEMin_Simulation (line 100)
            [PhaseData, Status] = ReadPseudoSectionData_MAGEMin(newPoints, PhaseData,
            Computation.MinPhaseFraction);

Error in ComputePhaseDiagrams_AMR (line 123)
    [PhaseData, TP_vec, FailedSimulations, CancelComputation] =
    PerformMAGEMin_Simulation(PhaseData, newPoints, TP_vec, VerboseLevel, Chemistry, dlg,
    ComputeAllPoints, UseGammaEstimation, Computation);

Error in PlotPseudosection/StartNewComputation (line 1533)
            [PseudoSectionData, CancelComputation]       =
            ComputePhaseDiagrams_AMR(PseudoSectionData, DisplayPlots);

Error using matlab.ui.control.internal.controller.ComponentController/executeUserCallback (line 386)
Error while evaluating Button PrivateButtonPushedFcn.

The new self-compiled 1.3.0 works:

image

Below are my GUI settings for the local parallel calculation: image

The default path /usr/bin also didn't work.

Any idea what might cause the issue?

NicolasRiel commented 1 year ago

It seems to me that it is trying to run with the BinaryBuilder version while you are giving a local path for MPI. If I recall correctly, installing MAGEMin with the GUI creates an environment variable file; the conflict may come from there. How did you install the latest version?

Did you compile MAGEMin yourself? Or did you install it with the binary builder?

eazzzon commented 1 year ago

Hi,

I installed both the previous version (1.2.8) and 1.3.0 with the MATLAB GUI, and yes, an environment variable .m file is created after the installation. I then compiled MAGEMin myself to enable a local parallel run. It worked with 1.2.8 but doesn't work with 1.3.0.

Here is what is in my environment variable file, in case that is useful:


path_dylib = '/Users/easonzz/.julia/artifacts/900c5f3ba53bb0d128142a78da39027c65597b0f/lib:/Applications/Julia-1.8.app/Contents/Resources/julia/lib/julia:/Users/easonzz/.julia/artifacts/bf797a6e6d1fcc01635d6b2723ac0390c82f41d2/lib:/Users/easonzz/.julia/artifacts/abb7cbd1c6369f566bf0334f8e033f35b639d0e6/lib:/Users/easonzz/.julia/artifacts/5ead90ea92128f3bba70df07a389c372594e09db/lib:/Applications/Julia-1.8.app/Contents/Resources/julia/bin/../lib/julia:/Applications/Julia-1.8.app/Contents/Resources/julia/bin/../lib'; 
path_bin = '/Users/easonzz/.julia/artifacts/abb7cbd1c6369f566bf0334f8e033f35b639d0e6/bin:/Users/easonzz/.julia/artifacts/5ead90ea92128f3bba70df07a389c372594e09db/bin'; 
path_julia = '/Applications/Julia-1.8.app/Contents/Resources/julia/bin';

NicolasRiel commented 1 year ago

What happens if you run this command in a terminal from the MAGEMin directory? Make sure that MAGEMin_input.dat exists; if not, first generate it by running the GUI until it crashes.

/opt/homebrew/bin/mpiexec -n 6 ./MAGEMin --out_matlab=0 --solver=1 --Verb=0 --sys_in=mol --db=ig --File=MAGEMin_input.dat --n_points=49 --test=0

eazzzon commented 1 year ago

Hi, it works well this way and produces output.

image

Looks like this is a miscommunication between the MATLAB GUI and MPI?

On a related note, is there a way to save the results from the terminal like this? I guess I would need the GUI to generate a dat file first?

NicolasRiel commented 1 year ago

So the problem is that the GUI is loading the environment variables. Try to delete the .m file then, and the GUI should work without problems with the local MPI.
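One possible reading of the log above, offered as an assumption rather than a confirmed diagnosis: the command built from the environment file starts with `export PATH=<artifact directories>`, which replaces the PATH instead of extending it, so system tools such as the `ssh` named in the `plm_rsh_agent` message may no longer be found. The sketch below only illustrates that effect; `visible_with_path` is a made-up helper, and an empty temporary directory stands in for the Julia artifact directories.

```shell
#!/bin/sh
# Illustration only (not MAGEMin code): check whether a tool can be found
# when PATH is replaced rather than extended.
visible_with_path() {
    # $1 = tool name, $2 = PATH value to test with.
    # The subshell keeps the PATH change local to this check.
    ( PATH="$2"; command -v "$1" >/dev/null 2>&1 )
}

# An empty directory stands in for the Julia artifact directories.
stand_in="$(mktemp -d)"

# First candidate replaces PATH, second one prepends to the existing PATH.
for candidate in "$stand_in" "$stand_in:$PATH"; do
    if visible_with_path ssh "$candidate"; then
        echo "ssh found"
    else
        echo "ssh not found"
    fi
done

rmdir "$stand_in"
```

With the replaced PATH the lookup fails; with the prepended one it succeeds on any machine where ssh is installed.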

eazzzon commented 1 year ago

amazing, fixed!

boriskaus commented 1 year ago

ok, I am happy to report that we finally fixed the issue on Apple Silicon with the automatically installed MAGEMin version (in version 1.3.1). So you no longer need to compile the code yourself (make sure you do this with File > Install MAGEMin in the GUI).

It now runs at essentially the same speed as when compiling it manually.

Before:

julia> using MAGEMin_jll

julia> run(`$(MAGEMin_jll.MAGEMin())`)

Running MAGEMin 1.2.7 [22/09/2022] on 1 cores {
═══════════════════════════════════════════════
 Status             :            0 
 Mass residual      : +7.90944e-06
 Rank               :            0 
 Point              :            0 
 Temperature        :  +1100.00000       [C] 
 Pressure           :    +12.00000       [kbar]

 SOL = [G: -825.338] (35 iterations, 2109.93 ms)
 GAM = [-1011.909272,-1829.092209,-819.265216,-695.468666,-412.938858,-971.870791,-876.535530,-1073.647034,-276.622011,-1380.309499]

 Phase :      opx      spn       ol      cpx 
 Mode  :  0.23186  0.01393  0.60213  0.15208 
___________________________________
MAGEMin comp time: +2305.751000 ms }

After:

julia> using MAGEMin_jll

julia> run(`$(MAGEMin_jll.MAGEMin())`)

Running MAGEMin 1.3.1 [03/04/2023] on 1 cores {
═══════════════════════════════════════════════
 Status             :            0 
 Mass residual      : +5.13017e-06
 Rank               :            0 
 Point              :            0 
 Temperature        :  +1100.00000       [C] 
 Pressure           :    +12.00000       [kbar]

 SOL = [G: -825.337] (34 iterations, 38.21 ms)
 GAM = [-1011.909615,-1829.092317,-819.264025,-695.467466,-412.947646,-971.889493,-876.545698,-1073.639033,-276.591254,-1380.299192]

 Phase :      opx      cpx       ol      spn 
 Mode  :  0.23189  0.15205  0.60213  0.01393 
___________________________________
MAGEMin comp time: +42.925000 ms }
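To put a number on the difference, the timings printed in the two runs above can be divided directly. This is plain arithmetic on the values shown (in milliseconds); `speedup` is just a throwaway helper name.

```shell
#!/bin/sh
# Timings copied from the two runs above, in milliseconds.
before_total=2305.751   # "MAGEMin comp time" with 1.2.7
after_total=42.925      # "MAGEMin comp time" with 1.3.1
before_solver=2109.93   # time reported on the SOL line with 1.2.7
after_solver=38.21      # time reported on the SOL line with 1.3.1

# Ratio of two timings, rounded to one decimal place.
speedup() {
    awk -v b="$1" -v a="$2" 'BEGIN { printf "%.1f\n", b / a }'
}

echo "total speedup:  $(speedup "$before_total" "$after_total")x"
echo "solver speedup: $(speedup "$before_solver" "$after_solver")x"
```

For this single point, that is roughly a 54-fold reduction in total compute time.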

Issue

If you are interested in what happened: the issue had to do with how LAPACK/BLAS was linked, where the multithreading seemed to interfere with the MPI build. We solved this by changing MAGEMin to use the Apple Accelerate framework (which includes optimised versions of LAPACK) rather than relying on our own compiled versions. This removes one external dependency and should also take care of future hardware improvements (as long as Apple adapts its libraries accordingly).

Profiling the code

This was discovered while profiling the code on an Apple Silicon machine with Xcode and the command-line tools installed. For completeness, here are the steps:

  1. Run MAGEMin for 100 points (any input file will do). This example is for the manually compiled MAGEMin version:

    $ xcrun xctrace record --template "Time Profiler" --launch  /Users/kausb/WORK/MAGEMin/MAGEMin -- --File=/Users/kausb/WORK/MAGEMin/MAGEMin_input.dat --n_points=100
    Starting recording with the Time Profiler template. Launching process: MAGEMin. 
    Ctrl-C to stop the recording
    Target app exited, ending recording...
    Recording completed. Saving output file...
    Output file saved as: Launch_MAGEMin_2023-04-04_10.55.04_4C23A589.trace

    If you want to do the same with the BinaryBuilder version of MAGEMin, you need to add the correct dynamic libraries as well:

    $ xcrun xctrace record --template "Time Profiler" -e DYLD_FALLBACK_LIBRARY_PATH=/Users/kausb/.julia/artifacts/900c5f3ba53bb0d128142a78da39027c65597b0f/lib:/Applications/Julia-1.8.app/Contents/Resources/julia/lib/julia:/Users/kausb/.julia/artifacts/bf797a6e6d1fcc01635d6b2723ac0390c82f41d2/lib:/Users/kausb/.julia/artifacts/abb7cbd1c6369f566bf0334f8e033f35b639d0e6/lib:/Users/kausb/.julia/artifacts/5ead90ea92128f3bba70df07a389c372594e09db/lib:/Applications/Julia-1.8.app/Contents/Resources/julia/bin/../lib/julia:/Applications/Julia-1.8.app/Contents/Resources/julia/bin/../lib:/Users/kausb/lib:/usr/local/lib:/lib:/usr/lib --launch /Users/kausb/.julia/artifacts/5ead90ea92128f3bba70df07a389c372594e09db/bin/MAGEMin -- --File=/Users/kausb/WORK/MAGEMin/MAGEMin_input.dat --n_points=100
  2. Open the trace file:

    $ open  Launch_MAGEMin_2023-04-04_10.55.04_4C23A589.trace

    This will open the Instruments app and let you see where the time is spent:

Screenshot 2023-04-04 at 10 57 45

For the current version of MAGEMin, 59% of the time is spent in NLopt routines.

eazzzon commented 1 year ago

thank you Boris, it works great now