ComputationalThermodynamics / MAGEMin

The parallel Mineral Assemblage Gibbs Energy Minimization package
GNU General Public License v3.0

Parallel computation box greyed out in 1.0.6 #14

Closed · bobmyhill closed this 2 years ago

bobmyhill commented 2 years ago

A couple of plausibly related issues after installing 1.0.6 on a MacBook Pro (M1). Combined here for brevity.

1) The box allowing specification of the path to mpiexec is greyed out:

(screenshot: the mpiexec path box in the GUI, greyed out)

2) Turning off parallel computations allows calculations to run, but every calculation is accompanied by a warning that starts "Do you want the application “MAGEMin” to accept incoming network connections?"

My machine is behind a firewall (university rules, alas), so I can't accept incoming connections, and I can't turn these messages off (Accept and Deny don't appear to do anything to later calculations).

Any suggestions much appreciated.

boriskaus commented 2 years ago
  1. This is expected behavior. The default version of MAGEMin (as downloaded through Julia) ships with MPI and already does parallel computations. If your machine has a multicore processor (essentially all modern processors do), you can test this by setting the # of cores to 1 and then 2 and comparing the computational time (it should be faster with 2). Note that there is a little overhead involved in initialising MPI, so the effect is best observed when you compute a lot of points (say >1000); see the sketch after this list. We have tested this on Windows, Linux and Intel Macs, but not yet on the M1, so it would be great if you can confirm that it works as expected. The mpiexec path is something you only need to set if you compile MAGEMin yourself; once a locally compiled MAGEMin executable exists in the directory, the MAGEMin executable and mpiexec path buttons are no longer greyed out.
  2. Yes, this is indeed related to the binary not being registered in Apple's allowed list. If you have administrator rights on your machine, you can likely mark this binary as safe, but if that is not the case I am not sure what can be done. Note that MAGEMin does not actually require an internet connection and does not send or receive anything over the web; it may, however, be related to MPI sending information between processes. I had a similar issue with compiled PETSc code on my machine; that appears to have been resolved in more recent PETSc versions, so perhaps we can use the same trick.
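
As a rough, untested sketch of that timing test (run from the Julia console, in the same directory as your input file, and assuming the MPICH that ships with MAGEMin_jll):

using MAGEMin_jll

# Time the same point-wise run with 1 and then 2 MPI ranks; with more points
# the speed-up from the second rank becomes easier to see.
# (--File and --n_points are placeholders; use your own input file and point count.)
mpirun = MAGEMin_jll.MPICH_jll.mpiexec()
for n in (1, 2)
    @time run(`$(mpirun) -n $n $(MAGEMin()) --Verb=0 --File=MAGEMin_input.dat --n_points=650 --test=0`)
end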
bobmyhill commented 2 years ago

Thanks for the information. I'm familiar with MPI, less so with MATLAB.

  1. Parallel computations fail with the error message at the end of this post; sorry I didn't make the error clear in my last message.
  2. I set the binary to safe in the firewall settings (which is where the warning told me to go) before raising this issue, but there's no change in behaviour after doing that. I'll chalk it up to Mac weirdness.

In the next few days I'll compile MAGEMin myself and try that version, but for due diligence as one of your reviewers I thought I should have a go with the MATLAB version.

mpiexec -n 8 MAGEMin --Verb=0 --File=MAGEMin_input.dat --n_points=650 --test=0

command =

    'export PATH=/Users/rm16686/.julia/artifacts/9bfa7faf9a21863f996d8317bd5936e051971bd6/bin:/Users/rm16686/.julia/artifacts/ebaa199abbbd88d81060d398297c1aeb83b4486d/bin ;  export DYLD_LIBRARY_PATH=/Applications/Julia-1.7.app/Contents/Resources/julia/lib/julia:/Users/rm16686/.julia/artifacts/9bfa7faf9a21863f996d8317bd5936e051971bd6/lib:/Users/rm16686/.julia/artifacts/900c5f3ba53bb0d128142a78da39027c65597b0f/lib:/Users/rm16686/.julia/artifacts/bf797a6e6d1fcc01635d6b2723ac0390c82f41d2/lib:/Users/rm16686/.julia/artifacts/ebaa199abbbd88d81060d398297c1aeb83b4486d/lib:/Applications/Julia-1.7.app/Contents/Resources/julia/bin/../lib/julia:/Applications/Julia-1.7.app/Contents/Resources/julia/bin/../lib; mpiexec -n 8 MAGEMin --Verb=0 --File=MAGEMin_input.dat --n_points=650 --test=0'

No matching processes belonging to you were found

ans =

     1

Fatal error in internal_Init: Other MPI error, error stack:
internal_Init(59)..................: MPI_Init(argc=0x16bc873ac, argv=0x16bc873a0) failed
MPII_Init_thread(209)..............: 
MPID_Init(77)......................: 
init_world(192)....................: channel initialization failed
MPIDI_CH3_Init(84).................: 
MPID_nem_init(313).................: 
MPID_nem_tcp_init(175).............: 
MPID_nem_tcp_get_business_card(397): 
GetSockInterfaceAddr(370)..........: gethostbyname failed, V3WV9VFXX4 (errno 0)
(The same error stack, ending in "gethostbyname failed, V3WV9VFXX4 (errno 0)", is printed by each of the remaining seven MPI ranks.)

ForwardSimulation_Time =

    0.4238

Error using sscanf
First argument must be a text scalar.

Error in ReadData_MAGEMin (line 34)
    A       = sscanf(line,'%f %f %f %f %f %f %f %f %f %f %f %f %f %f %f %f');

Error in PerformMAGEMin_Simulation (line 97)
        [PhaseData, Status] = ReadData_MAGEMin(newPoints, PhaseData, Computation.MinPhaseFraction);

Error in ComputePhaseDiagrams_AMR (line 124)
    [PhaseData, TP_vec, FailedSimulations, CancelComputation] = PerformMAGEMin_Simulation(PhaseData, newPoints, TP_vec, VerboseLevel, Chemistry, dlg, ComputeAllPoints, UseGammaEstimation, Computation);

Error in PlotPseudosection/StartNewComputation (line 1422)
            [PseudoSectionData, CancelComputation]       =   ComputePhaseDiagrams_AMR(PseudoSectionData, DisplayPlots);

Error using matlab.ui.control.internal.controller.ComponentController/executeUserCallback (line 427)
Error while evaluating Button PrivateButtonPushedFcn.
NicolasRiel commented 2 years ago

I am not sure what is happening with the MPI call through Julia here. But concerning the manual installation of MAGEMin, it should not be too hard; it is quite straightforward on Linux, at least. I don't have much experience with Mac, but the trickier part a year ago was getting the C interface to LAPACK (LAPACKE) installed. Boris managed this by installing it manually from the LAPACK library available on netlib (http://www.netlib.org/lapack/). From what I could read, LAPACKE is now included in the default lapack package available through Homebrew, so hopefully you can get all the needed libraries using Homebrew alone (namely mpich, NLopt and lapacke). Then you need to set the library paths correctly; an example for a Mac system is given in the Makefile. If you have any problems, please come back to us.
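
If you want to stay within Julia for this, a rough sketch of the Homebrew step could look like the following (the printed prefixes are what the LIBS/INC lines in the Makefile need to point to; exact paths will differ per machine):

# Install the libraries mentioned above via Homebrew and print their prefixes.
for pkg in ("mpich", "nlopt", "lapack")
    run(`brew install $pkg`)
    println(pkg, " is installed under ", readchomp(`brew --prefix $pkg`))
end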

boriskaus commented 2 years ago

Thanks for the explanation; we really appreciate your help in checking this, and we would certainly like the MATLAB/Julia-based version to work, as users are likely to prefer it. Unfortunately, I don't have access to an M1 system, which makes debugging a bit tricky.

So the mpiexec-based code seems to fail for you, but the point-wise calculations work. This suggests that there could be a problem with the way mpiexec & friends are compiled for the Apple M1 architecture. You can try running this directly from the Julia console (making sure that you are in the same directory as MAGEMin_input.dat).

We had some discussions last week about how to combine/call MPI with MAGEMin, which you can read here.

Could you do a few tests, to check this?

  1. First load MAGEMin_jll, which should be available on your system

    julia> using MAGEMin_jll
  2. Next, can you run the point wise calculations on a single CPU?

    julia> run(`$(MAGEMin()) --Verb=0 --File=MAGEMin_input.dat --n_points=650 --test=0`);

    I suspect that the Mac firewall message will pop up at this stage (I can look into that later). I expect that this should still work.

  3. Next we can try to follow last week's suggestion:

    julia> const mpirun = if MAGEMin_jll.MPICH_jll.is_available()
               MAGEMin_jll.MPICH_jll.mpiexec()
           elseif MAGEMin_jll.MicrosoftMPI_jll.is_available()
               MAGEMin_jll.MicrosoftMPI_jll.mpiexec()
           else
               nothing
           end

    after which running this in parallel should ideally be possible with:

    julia> run(`$(mpirun) -n 2  $(MAGEMin()) --Verb=0 --File=MAGEMin_input.dat --n_points=650 --test=0`);

Let us know at which step it errors. From MATLAB we do nothing other than load the paths to the required dynamic libraries and make a system call, so if it works from within Julia it should be possible to get this working from MATLAB as well.

boriskaus commented 2 years ago

To get back to this issue:

2. I set the binary to safe in the firewall settings (which is where the warning told me to go) before raising this issue, but there's no change in behaviour after doing that. I'll chalk it up to Mac weirdness.

I was able to reproduce this on an Intel Mac and have pushed a fix for it. The fix essentially blocks incoming traffic for the MAGEMin binary. It is in the file /julia/firewall_macos.jl, which you can run from the terminal with

$julia firewall_macos.jl

Note that you do need the sudo password for your machine. If that is not the case, you will have to ask your system administrator for help.

You will need to run this once for every version of MAGEMin (if you update at some stage in the future, this will likely have to be repeated).
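
The script is essentially a thin wrapper around the macOS application firewall command-line tool. A rough sketch of the idea (not the actual contents of firewall_macos.jl; it assumes MAGEMin_jll exports the usual MAGEMin_path and that /usr/libexec/ApplicationFirewall/socketfilterfw is available):

using MAGEMin_jll

# Register the artifact binary with the application firewall and block
# incoming connections to it, so that the pop-up no longer appears.
fw = "/usr/libexec/ApplicationFirewall/socketfilterfw"
run(`sudo $fw --add $(MAGEMin_jll.MAGEMin_path)`)
run(`sudo $fw --blockapp $(MAGEMin_jll.MAGEMin_path)`)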

boriskaus commented 2 years ago

I rented a virtual M1 system to test this. Lessons learned:

  1. The Julia-downloaded default MAGEMin works on Apple Silicon, but is extremely slow: almost 1 second per point, where it should really be around 100-150 ms per point (weird, as this system should be faster).
  2. I can get the MPI version working as well, but that is even slower (sometimes >60 seconds).
  3. This suggests that there is something wrong with the BinaryBuilder cross-compilation for the Apple M1.
  4. The firewall fix from my last post works.

Next, I followed the Apple installation instructions in the documentation, which install NLopt, MPICH and LAPACKE through Homebrew. That worked, and to simplify this I updated the Makefile to include the correct paths for Homebrew. With this, timings are as expected:

m1@6aa4e15b-9584-41b0-ab59-5a86c2cba2d8 MAGEMin-main % ./MAGEMin 
Running MAGEMin 1.0.6 [18/03/2022] on 1 cores {
‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾

VOL_SYS       +1.582647
RHO_SYS       +3253.910314
MASS_RES      +0.000010
Rank          : 0 
Point         : 0 
Temperature   : 1100.0000    [C] 
Pressure      : 12.00    [kbar]

SOLUTION: [G = -825.337] (37 iterations, 51.88 ms)
[-1011.909244,-1829.091667,-819.265693,-695.468293,-412.942263,-971.879610,-876.528222,-1073.651407,-276.626131,-1380.314708,]
 opx     0.23184 
 cpx     0.15210 
 spn     0.01395 
  ol     0.60211 
Point         0
__________________________________
MAGEMin comp time: +61.462000 ms }

In parallel:

$mpiexec -n 8 ./MAGEMin --Verb=0 --File=MAGEMin_input.dat --n_points=650 --test=0
...
VOL_SYS       +1.664572
RHO_SYS       +3093.711575
MASS_RES      +0.000005
Rank          : 0 
Point         : 648 
Temperature   : 2000.0000    [C] 
Pressure      : 48.00    [kbar]

SOLUTION: [G = -895.546] (87 iterations, 127.23 ms)
[-1090.491405,-2032.852299,-921.753846,-746.640000,-530.391686,-1153.501918,-1014.124017,-1231.255298,-315.646614,-1544.060891,]
 liq     0.99999 
Point         648
__________________________________
MAGEMin comp time: +9957.603000 ms }

Same on 1 core:

$mpiexec -n 1 ./MAGEMin --Verb=0 --File=MAGEMin_input.dat --n_points=650 --test=0
...
__________________________________
MAGEMin comp time: +49174.574000 ms }

So if you have a Mac with Apple Silicon, our current recommendation is to compile MAGEMin manually following the documentation.

bobmyhill commented 2 years ago

Hi @boriskaus

Thanks for looking into this for me. I independently did the same thing as you (in between lectures and practicals) and got similar results, both for a single core and for multiple cores. The only difference is that I use openmpi, so the library and include paths were a bit different:

LIBS    = -lm -framework Accelerate /opt/homebrew/opt/lapack/lib/liblapacke.dylib /opt/homebrew/opt/nlopt/lib/libnlopt.dylib /opt/homebrew/opt/openmpi/lib/libmpi.dylib  
INC     = -I/opt/homebrew/opt/openmpi/include/ -I/opt/homebrew/opt/lapack/include -I/usr/local/include -I/opt/homebrew/opt/nlopt/include/

MATLAB remained unhappy until I removed the version of MAGEMin in julia, and also removed an old matlab.mat file from the root directory. Everything now appears to work, both from the command line and from MATLAB :)

I shall now play around with what looks like a very impressive solution to an age-old problem! Thanks for your help.

boriskaus commented 2 years ago

MATLAB remained unhappy until I removed the version of MAGEMin in julia

Hmm, if both the Julia version and a locally compiled version are present, the button that lets you switch between the two versions should be active. The Julia version is the default in that case.

NicolasRiel commented 2 years ago

Thank you for letting us know the LIBS and INC that you used with openmpi. I am adding this to the documentation.