Stefan-Endres / shgo

Simplicial Homology Global Optimization
https://stefan-endres.github.io/shgo
MIT License

Parallelization #17

Open Stefan-Endres opened 6 years ago

Stefan-Endres commented 6 years ago

Overview

There are many sub-routines in shgo that are low-hanging fruit for parallelization, most importantly the mapping of the objective function over the sampling points (in cases where the problem can be parallelized). In addition, we can optimize the sampling size of an iteration based on the computational resources available.

We need to be careful about the dependencies and code-structure changes we include, for several reasons. First, we already depend on scipy.optimize.minimize and scipy.spatial.Delaunay. Secondly, our ambition to include shgo in scipy.optimize means it should ideally have the same dependencies and structure as SciPy. Finally, we want to minimize reliance on maintenance of other packages, which can lead to issues such as the ones we had with multiprocessing_on_dill in tgo.

My suggestion is to use numba to avoid needing to change the code structure at all. We can do this using tricks such as the one used in poliastro: https://github.com/poliastro/poliastro/blob/0.6.x/src/poliastro/jit.py https://github.com/poliastro/poliastro/blob/master/setup.py#L40

which simply maps the decorator to the real numba implementation if it is installed and to a no-op otherwise. We could obviously use the same trick for other libraries and methods, e.g. by redefining range functions in our code.
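
A minimal sketch of that optional-dependency trick (illustrative only, not shgo's actual code; the name jit stands in for whatever decorator we would expose):

try:
    import numba
    jit = numba.njit
except ImportError:
    def jit(first=None, **kwargs):
        # No-op stand-in for numba.njit when numba is not installed.
        if callable(first):
            return first          # used bare: @jit
        def _decorator(func):
            return func           # used with arguments: @jit(cache=True, ...)
        return _decorator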

While it is possible that SciPy will eventually include numba as a dependency, based on discussions on the scipy-dev mailing list this will not happen in the near future: https://mail.python.org/pipermail/scipy-dev/2018-March/022576.html

However, for now we should be able to maintain numba as an optional dependency as described in the rest of this post. My idea is to provide two main modes of parallelization: CPU-based and GPU-based. Using numba for GPU parallelization would therefore be ideal since it avoids extra dependencies. Finally, numba gives us access to LLVM, which can be used in the sampling generation.

GPU

The user's GPU architecture can be detected (or specified explicitly), and we can then map it to our own decorator.

Nvidia

We can use @numba.cuda.jit. I propose an early test of the objective function so that the user can be warned if it fails to compile for the GPU. https://devblogs.nvidia.com/seven-things-numba/ https://numba.pydata.org/numba-doc/latest/cuda/index.html
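
As a first step, a pre-flight check along these lines could decide whether the CUDA path is worth attempting at all (a sketch only; cuda_available is a hypothetical helper, and some compilation problems will still only surface when the objective is first run on the device):

def cuda_available():
    # Report whether numba is installed and can see a usable CUDA GPU.
    try:
        from numba import cuda
        return cuda.is_available()
    except ImportError:
        return False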

AMD

We can use @hsa.jit(device=True) https://numba.pydata.org/numba-doc/latest/hsa/overview.html

CPU

Our options for CPU parallelization include numba's parallel features (http://numba.pydata.org/numba-doc/dev/user/parallel.html), multiprocessing_on_dill, etc.

However, it appears that parallelization with numba isn't as simple as just adding @jit(parallel=True); explicit loops generally need to be written with prange to be parallelized: https://stackoverflow.com/questions/45610292/how-to-make-numba-jit-use-all-cpu-cores-parallelize-numba-jit

So we should also do a few tests on non-trivial functions to see if it is worth implementing.
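
A micro-benchmark of the kind we would run could look roughly like this (a sketch with a toy objective, not shgo code; real tests should use non-trivial functions):

import numpy as np
from numba import njit, prange

@njit(parallel=True)
def eval_batch(points):
    # Evaluate a toy objective over an (n, d) batch of sample points;
    # prange distributes the loop iterations across CPU cores.
    n = points.shape[0]
    out = np.empty(n)
    for i in prange(n):
        out[i] = points[i, 0] ** 2 + points[i, 1] ** 2
    return out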

microprediction commented 3 years ago

Not sure if the status of this topic has already changed a lot, but I have a pretty expensive function (2-10 minutes per evaluation), so I'd be interested in helping devise something.

fcela commented 3 years ago

Same here. My main interest is in computational and algorithmic changes to better address situations where (1) the objective function is very expensive but heavily vectorized [i.e. the cost of evaluating a single point is very similar to the cost of evaluating a large set of points at the same time]; and (2) we have large numbers of computing nodes at our disposal that can work in parallel.
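
To make point (1) concrete, the distinction is roughly the following (a minimal numpy sketch; the function names are just for illustration):

import numpy as np

def f_single(x):
    # one point of shape (d,) -> one scalar; cost dominated by per-call overhead
    return np.sum(x ** 2)

def f_batch(X):
    # a whole (n, d) batch -> n scalars in one call, at nearly the same cost
    return np.sum(X ** 2, axis=1)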

microprediction commented 3 years ago

I wonder if the pattern I use here (for optuna) might also work for SHGO? The idea is that SHGO only sees an objective function and doesn't need to worry about how it gets computed.

https://github.com/microprediction/embarrassingly  
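
In generic form the pattern is something like the following (a hypothetical sketch, not the embarrassingly API):

from concurrent.futures import ProcessPoolExecutor

def make_offloaded_objective(expensive_fn, executor):
    def objective(x):
        # Ship the evaluation to the executor (processes, a cluster, a remote
        # shell, ...) and block for the result; shgo just calls objective(x).
        return executor.submit(expensive_fn, x).result()
    return objective

# Usage sketch: slow_f = make_offloaded_objective(my_slow_fn, ProcessPoolExecutor(8))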


Stefan-Endres commented 3 years ago

Hi @microprediction @fcela.

In the most recent update (7e83bb8) I've added the workers argument for shgo to allow for basic parallelization.

I would greatly appreciate any feedback and/or error reports using the argument. Currently I have only tested it with very simple Python objective functions. I suspect there might be issues such as pickling errors with more complex functions. Since all the unit tests are passing I have also uploaded it to PyPI for a more convenient install, but I would like to test the implementation more before expanding the code and the documentation for downstream repositories.

Minimum working example:

from shgo import shgo
import numpy as np
import time

# Toy problem
def f(x):
    time.sleep(0.1)
    return x[0] ** 2 + x[1] ** 2

bounds = np.array([[0, 1],]*2)

ts = time.time()
res = shgo(f, bounds, n=50, iters=2)
print(f'Total time serial: {time.time()- ts}')
print('-')
print(f'res = {res}')
ts = time.time()
res = shgo(f, bounds, n=50, iters=2, workers=8)
print('=')
print(f'Total time par: {time.time()- ts}')
print('-')
print(f'res = {res}')

CLI output:

Total time serial: 10.341249465942383
-
res =      fun: 0.0
    funl: array([0.])
 message: 'Optimization terminated successfully.'
    nfev: 103
     nit: 2
   nlfev: 3
   nlhev: 0
   nljev: 1
 success: True
    tnev: 103
       x: array([0., 0.])
      xl: array([[0., 0.]])
=
Total time par: 1.9465992450714111
-
res =      fun: 0.0
    funl: array([0.])
 message: 'Optimization terminated successfully.'
    nfev: 103
     nit: 2
   nlfev: 3
   nlhev: 0
   nljev: 1
 success: True
    tnev: 103
       x: array([0., 0.])
      xl: array([[0., 0.]])

Relevant code snippet (uses the multiprocessing library):

https://github.com/Stefan-Endres/shgo/blob/7e83bb8291a3420ff1f8c665647af005e568e229/shgo/_shgo_lib/_vertex.py#L436-L449

The parallelization occurs while evaluating the objective function during the sampling stage; the local minimization step still uses serial evaluations. In the future I would like to add parallelization there as well, providing each core with a starting point plus the chosen local minimisation function, ideally using only standard-library and scipy dependencies. A rough sketch of that idea is shown below.
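
For reference, a minimal sketch of that future idea (not the current implementation; parallel_local_minimize is a hypothetical helper and the objective must be picklable):

from concurrent.futures import ProcessPoolExecutor
from scipy.optimize import minimize

def parallel_local_minimize(f, start_points, bounds=None, method='SLSQP', workers=4):
    # Run independent scipy local minimisations from several starting points.
    with ProcessPoolExecutor(max_workers=workers) as pool:
        futures = [pool.submit(minimize, f, x0, method=method, bounds=bounds)
                   for x0 in start_points]
        # Collect the OptimizeResult objects; the caller keeps the best minima.
        return [fut.result() for fut in futures]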

microprediction commented 3 years ago

That's super helpful. Will try to get to it today.


fcela commented 3 years ago

Since you are using multiprocessing, let me see if I can get it to work on Ray. The new wrapper for multiprocessing in Ray looks very promising, and if it works well, that may be all that is needed to scale up to multiple nodes.

https://docs.ray.io/en/master/multiprocessing.html



microprediction commented 3 years ago

Nice. Sorry I haven't tested yet. I got derailed by Monday night football, as you can see here: https://www.microprediction.com/blog/nine


fcela commented 3 years ago

Ray parallelization appears to work without any problem -- just replacing import multiprocessing as mp with import ray.util.multiprocessing as mp in shgo/_shgo_lib/_vertex.py.
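
Explicitly, the swap is the following (Ray must be installed; its Pool is a drop-in replacement for multiprocessing.Pool and starts or attaches to a Ray cluster when created):

# In shgo/_shgo_lib/_vertex.py:
# import multiprocessing as mp            # original
import ray.util.multiprocessing as mp     # Ray's drop-in replacement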

This is what I get for the minimal example above, multiprocessing vs ray.

Multiprocessing

Total time serial: 10.44602346420288
-
res =      fun: 0.0
    funl: array([0.])
 message: 'Optimization terminated successfully.'
    nfev: 104
     nit: 2
   nlfev: 4
   nlhev: 0
   nljev: 1
 success: True
    tnev: 104
       x: array([0., 0.])
      xl: array([[0., 0.]])
=
Total time par: 2.0377724170684814
-
res =      fun: 0.0
    funl: array([0.])
 message: 'Optimization terminated successfully.'
    nfev: 104
     nit: 2
   nlfev: 4
   nlhev: 0
   nljev: 1
 success: True
    tnev: 104
       x: array([0., 0.])
      xl: array([[0., 0.]])

Ray

Total time serial: 10.438774108886719
-
res =      fun: 0.0
    funl: array([0.])
 message: 'Optimization terminated successfully.'
    nfev: 104
     nit: 2
   nlfev: 4
   nlhev: 0
   nljev: 1
 success: True
    tnev: 104
       x: array([0., 0.])
      xl: array([[0., 0.]])
2020-11-02 14:55:04,351 INFO services.py:1166 -- View the Ray dashboard at http://127.0.0.1:8265
=
Total time par: 3.972384452819824
-
res =      fun: 0.0
    funl: array([0.])
 message: 'Optimization terminated successfully.'
    nfev: 104
     nit: 2
   nlfev: 4
   nlhev: 0
   nljev: 1
 success: True
    tnev: 104
       x: array([0., 0.])
      xl: array([[0., 0.]])

Of course, for a problem this small, Ray's parallelization overhead makes it overkill.

microprediction commented 3 years ago

I created this example https://github.com/microprediction/humpday/blob/main/Embarrassingly_SHGO.ipynb

Seems smooth enough, but still a toy example I suppose.

Taking a look at your code now to see what your concern might be re: pickle, but at least for my use case the objective function (or "pre-objective" as I've called it) is probably going to ssh off somewhere and shell out.

Perhaps...