esa / pygmo2

A Python platform to perform parallel computations of optimisation tasks (global and local) via the asynchronous generalized island model.
https://esa.github.io/pygmo2/
Mozilla Public License 2.0

Feature/scipywrapper #31

Closed mlooz closed 4 years ago

mlooz commented 4 years ago

A wrapper for scipy.optimize.minimize. Available as a user-defined algorithm (UDA), pygmo.scipy, if scipy is installed.

Options are either passed to the algorithm constructor or extracted from the problem.

Constraints are not (yet) supported: only COBYLA, SLSQP and trust-constr in scipy support them anyway, they require different formats, and SLSQP is already available via NLopt. In case of overwhelming demand, I can add them later.
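As a rough sketch of the UDA shape being discussed (hypothetical names throughout — `Population` and `LocalSearchUDA` below are simplified stand-ins, not the real pygmo API, and the "descent" step is a toy): `evolve()` reads the best decision vector from the population, runs a local optimisation from it, and writes the result back.

```python
# Schematic sketch of the UDA pattern, without pygmo or scipy.
# "Population" and "LocalSearchUDA" are simplified stand-ins,
# not the real pygmo API.
class Population:
    def __init__(self, xs, fitness):
        self.xs = xs            # decision vectors
        self.fitness = fitness  # objective function

    def champion(self):
        return min(self.xs, key=self.fitness)


class LocalSearchUDA:
    def __init__(self, steps=100):
        # options are passed to the constructor, as in the wrapper
        self.steps = steps

    def evolve(self, pop):
        # start the local optimizer from the current best individual
        x = pop.champion()
        for _ in range(self.steps):
            candidate = [0.9 * xi for xi in x]  # toy "descent" step
            if pop.fitness(candidate) < pop.fitness(x):
                x = candidate
        pop.xs.append(x)
        return pop


pop = Population([[1.0, 1.0], [2.0, 2.0]], lambda x: sum(v * v for v in x))
pop = LocalSearchUDA(steps=50).evolve(pop)
print(pop.fitness(pop.champion()))  # a small value near 0
```

The actual wrapper presumably hands the selected individual to scipy.optimize.minimize instead of the toy step, then reinserts the optimised individual into the population.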

mlooz commented 4 years ago

Thanks a lot for these very thorough and detailed comments! I'll work through them and update the PR.

bluescarni commented 4 years ago

@mlooz some of the build failures are unrelated to this PR and fixed in #34. I should be able to complete it soon, then you can merge from master back into this PR.

mlooz commented 4 years ago

@bluescarni Thanks, I'll do it as soon as #34 is merged.

mlooz commented 4 years ago

@bluescarni Even after I've merged the master branch including #34, all the CircleCI tests fail because the archipelago test case hangs. The changes regarding the scipy wrapper are pretty much independent of the archipelago. Can you take a look at this or help me debug it?

bluescarni commented 4 years ago

@bluescarni Even after I've merged the master branch including #34, all the CircleCI tests fail because the archipelago test case hangs. The changes regarding the scipy wrapper are pretty much independent of the archipelago. Can you take a look at this or help me debug it?

I am totally stumped. This behaviour is really bizarre: first because it fails on Linux but works on OSX and Windows (usually it's the other way around), and then because the Linux builds on Travis are actually working fine. It's only the Linux builds on CircleCI that fail, and they fail consistently.

I just tried on my system as well and it works (Gentoo Linux).

I noticed a couple of oddities while running the scipywrapper tests here:

* the CPU utilization is strangely low (10% or so measured with `top`),

* it seems like there are a lot of new `python` processes being created while the tests are being run (I realised this while monitoring the process IDs with `top`; it seems like circa 10 python processes per second are being created while the scipy tests are being run).

So perhaps there's something going on with excessive process creation which is not handled well by the virtual machine environment on CircleCI? It's pretty far-fetched, but this is all I have at this time.

Could you perhaps try to disable the scipy tests altogether and see what happens?

mlooz commented 4 years ago

Thanks for looking into it!

* the CPU utilization is strangely low (10% or so measured with `top`),

* it seems like there are a lot of new `python` processes being created while the tests are being run (I realised this while monitoring the process IDs with `top`, it seems like circa 10 python processes per second are being created while the scipy tests are being run).

This is indeed very strange! I didn't expect any of this to happen; I will try to reproduce it. It must be scipy itself spawning these, but 10 per second is way too many.

So perhaps there's something going on with excessive process creation which is not handled well by the virtual machine environment on CircleCI? It's pretty far fetched, but this is all I have at this time.

Must be, but what is the cause? It seems unrelated to the scipy wrapper, since the archipelago test hangs before the scipy tests are even executed. How did you fix the original bug?

Could you perhaps try to disable the scipy tests altogether and see what happens?

Will try.

darioizzo commented 4 years ago

What's the status of this?

mlooz commented 4 years ago

Currently depends on fixing the archipelago hang, which I don't know how to do quickly. I thus postponed it to look into it after the PPSN deadline.

mlooz commented 4 years ago

I've now instantiated the circleCI image locally on my machine, there the tests pass. This keeps getting stranger.

bluescarni commented 4 years ago

I've now instantiated the circleCI image locally on my machine, there the tests pass. This keeps getting stranger.

Just to be clear, when you say that the tests pass, are you referring to the latest commit or to an earlier one where the scipywrapper was still being imported?

mlooz commented 4 years ago

I've now instantiated the circleCI image locally on my machine, there the tests pass. This keeps getting stranger.

Just to be clear, when you say that the tests pass, are you referring to the latest commit or to an earlier one where the scipywrapper was still being imported?

An earlier one including the scipywrapper and also its unit tests.

bluescarni commented 4 years ago

I've now instantiated the circleCI image locally on my machine, there the tests pass. This keeps getting stranger.

Just to be clear, when you say that the tests pass, are you referring to the latest commit or to an earlier one where the scipywrapper was still being imported?

An earlier one including the scipywrapper and also its unit tests.

Ok.

So if I am reading the CI results correctly, it seems like the mere act of importing the wrapper is what causes the hang right? The latest commit with the import removed now passes on circleci.

mlooz commented 4 years ago

I've now instantiated the circleCI image locally on my machine, there the tests pass. This keeps getting stranger.

Just to be clear, when you say that the tests pass, are you referring to the latest commit or to an earlier one where the scipywrapper was still being imported?

An earlier one including the scipywrapper and also its unit tests.

Ok.

So if I am reading the CI results correctly, it seems like the mere act of importing the wrapper is what causes the hang right? The latest commit with the import removed now passes on circleci.

Yes, that's what it looks like. To be sure, I've pushed another commit where I just re-added the import and nothing else.

bluescarni commented 4 years ago

On your local setup with CircleCI, are you using conda packages for the dependencies?

mlooz commented 4 years ago

On your local setup with CircleCI, are you using conda packages for the dependencies?

My workflow is this:

    $ docker run -t -i --mount type=bind,source="$(pwd)",target=/app circleci/buildpack-deps:bionic bash
    $ cd app
    $ sudo bash tools/circleci_bionic_gcc7_pagmo_head_38.sh

This script clones pagmo, uses pygmo from the local directory, and uses conda for everything else.

bluescarni commented 4 years ago

On your local setup with CircleCI, are you using conda packages for the dependencies?

My workflow is this:

    $ docker run -t -i --mount type=bind,source="$(pwd)",target=/app circleci/buildpack-deps:bionic bash
    $ cd app
    $ sudo bash tools/circleci_bionic_gcc7_pagmo_head_38.sh

This script clones pagmo, uses pygmo from the local directory, and uses conda for everything else.

Ok, sounds reasonable.

The fact that on travis+linux, where we use pip for the dependencies, everything works ok, made me think that perhaps the issue is with the linux conda packages we use on circleci. But that does not seem to be the case, since you are also using conda packages.

bluescarni commented 4 years ago

This is a long shot, but can you maybe try to rename the wrapper from scipy to something else, e.g., scipy_optimize or similar?

We'll have to do this anyway at one point because it name-clashes with the scipy module itself, but I was wondering if this might be related to the hang.

mlooz commented 4 years ago

This is a long shot, but can you maybe try to rename the wrapper from scipy to something else, e.g., scipy_optimize or similar?

We'll have to do this anyway at one point because it name-clashes with the scipy module itself, but I was wondering if this might be related to the hang.

The same thought had crossed my mind, too. I don't think it is related to a name collision, since the hang also occurs if the real scipy package is never imported. Will try it next.

mlooz commented 4 years ago

It looks like an interaction with numba is causing the hang. In commits e5d06b7 and f6a5219, I renamed the typing import and disabled the numba import, and most of the CircleCI runs pass. (One still fails due to wrong dependency configs.)

In commits 855d74b and 3ff22c5, the only thing I changed was re-enabling numba, and the hang reappears. Numba is used to compile two Python functions to native code, but these functions are then never executed.

bluescarni commented 4 years ago

It looks like an interaction with numba is causing the hang. In commits e5d06b7 and f6a5219, I renamed the typing import and disabled the numba import, and most of the CircleCI runs pass. (One still fails due to wrong dependency configs.)

In commits 855d74b and 3ff22c5, the only thing I changed was re-enabling numba, and the hang reappears. Numba is used to compile two Python functions to native code, but these functions are then never executed.

I am a bit confused... does numba come into play during the import phase of the module? Or is it imported only if the scipywrapper is actually used (as in, constructed and invoked)?

bluescarni commented 4 years ago

It looks like an interaction with numba is causing the hang. In commits e5d06b7 and f6a5219, I renamed the typing import and disabled the numba import, and most of the CircleCI runs pass. (One still fails due to wrong dependency configs.) In commits 855d74b and 3ff22c5, the only thing I changed was re-enabling numba, and the hang reappears. Numba is used to compile two Python functions to native code, but these functions are then never executed.

I am a bit confused... does numba come into play during the import phase of the module? Or is it imported only if the scipywrapper is actually used (as in, constructed and invoked)?

Ahhh maybe I get it now... it's implicitly imported when Python parses the @maybe_jit decorator...
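The mechanism can be shown with a pure-Python stand-in for `maybe_jit` (the real helper presumably wraps `numba.jit`; the one below just records that it ran): a decorator's body executes at definition time, i.e. when the module is imported, not when the decorated function is first called.

```python
executed = []

def maybe_jit(func):
    # stand-in for the real maybe_jit helper: record that the
    # decorator ran, then return the function unchanged
    executed.append(func.__name__)
    return func

@maybe_jit
def translate(x):
    return x + 1

# The decorator body has already executed at module definition
# (i.e. import) time, before translate() is ever called.
print(executed)  # ['translate']
```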

bluescarni commented 4 years ago

@mlooz This looks suspiciously like #42. Can you try to adopt the workarounds suggested there? Specifically, apply the jit decorator to functions defined locally within the methods, so that we avoid messing around with numba during import.

mlooz commented 4 years ago

@mlooz This looks suspiciously like #42. Can you try to adopt the workarounds suggested there? Specifically, apply the jit decorator to functions defined locally within the methods, so that we avoid messing around with numba during import.

Yes, that works! Finally! Thank you! :)

mlooz commented 4 years ago

I'm now happy with the current state. Constraint values are cached properly, selection policies are possible and test coverage is at 95%. Ready for review.

bluescarni commented 4 years ago

@mlooz did you have a chance to look into the performance impact of the jit decorator being used from within the method?

mlooz commented 4 years ago

@mlooz did you have a chance to look into the performance impact of the jit decorator being used from within the method?

Defining the decorated functions within other functions is much slower than not using numba at all. Benchmarking the version where the decorated functions are defined at the class level requires a bit of refactoring; I'm doing that now.

These numbers are from simple benchmarks similar to the ones used in the unit tests:

Times for numba defined in method, in seconds:

* Simple Test with gradient: 5.902128219604492
* Hessian and Hessian sparsity: 6.287413835525513
* Constraints with Hessians: 3.191514730453491

Times for no numba, in seconds:

* Simple Test with gradient: 3.376042366027832
* Hessian and Hessian sparsity: 1.3210361003875732
* Constraints with Hessians: 0.476820707321167

mlooz commented 4 years ago

Times for decorated functions on class level:

* Simple Test with gradient: 3.644361972808838
* Hessian and Hessian sparsity: 1.9270591735839844
* Constraints with Hessians: 0.32841968536376953

So: Using numba-decorated functions defined on the class level is also slightly slower than not using numba, except in cases with many Hessians.

These numbers are preliminary, of course, since they were just one run each. The penalty for defining functions within other functions is pretty striking, though. If we cannot get the archipelago working with globally decorated numba functions, it would be better to disable numba entirely in the scipy wrapper.

bluescarni commented 4 years ago

These numbers are preliminary, of course, since they were just one run each. The penalty for defining functions within other functions is pretty striking, though. If we cannot get the archipelago working with globally decorated numba functions, it would be better to disable numba entirely in the scipy wrapper.

If I understand correctly, these are not averages over multiple runs then?

Regardless of whether it is used from the class scope or within the methods, the numba JIT compiler has a very large overhead for the first function call, but after that it should be much faster (with the caveat that we don't know yet if defining the jitted functions within other methods triggers recompilation each time the method is invoked).

mlooz commented 4 years ago

These numbers are preliminary, of course, since they were just one run each. The penalty for defining functions within other functions is pretty striking, though. If we cannot get the archipelago working with globally decorated numba functions, it would be better to disable numba entirely in the scipy wrapper.

If I understand correctly, these are not averages over multiple runs then?

Not averages, but summed over several methods/problems: four scipy methods on the Rosenbrock problem for the first test, three methods on two problems each for the second, and two methods on Hock_Schittkowsky_71 for the third. All problems were of dimension 10.

Regardless of whether it is used from the class scope or within the methods, the numba JIT compiler has a very large overhead for the first function call, but after that it should be much faster (with the caveat that we don't know yet if defining the jitted functions within other methods triggers recompilation each time the method is invoked).

Yes, the overhead is definitely noticeable. For numba to be useful, the decorated functions would need to be called much more often than they are compiled. Maybe the times on the third test of the class-level benchmark were better because the compiled version was already in memory?

bluescarni commented 4 years ago

I went ahead and wrote a little benchmark:

from numba import jit, float64
import numpy as np

# Random array.
arr = np.random.rand(1000)

# No numba.
def test_no_numba(a):
    ret = 0.
    for i in range(len(a)):
        ret += a[i]
    return ret

# Numba on the function definition.
@jit(float64(float64[:]), nopython=True)
def test_numba1(a):
    ret = 0.
    for i in range(len(a)):
        ret += a[i]
    return ret

# Numba inside the function definition.
def test_numba2(a):
    @jit(float64(float64[:]), nopython=True)
    def test_numba_in(a):
        ret = 0.
        for i in range(len(a)):
            ret += a[i]
        return ret

    return test_numba_in(a)

if __name__ == "__main__":
    import time

    # Run all 3 one time for warmup.
    test_no_numba(arr)
    test_numba1(arr)
    test_numba2(arr)

    # Performance runs.
    start = time.perf_counter()
    res = test_no_numba(arr)
    print("no numba: res={}, time={}".format(res, time.perf_counter() - start))

    start = time.perf_counter()
    res = test_numba1(arr)
    print("numba1: res={}, time={}".format(res, time.perf_counter() - start))

    start = time.perf_counter()
    res = test_numba2(arr)
    print("numba2: res={}, time={}".format(res, time.perf_counter() - start))

Perhaps you can run it locally to confirm the findings... At any rate, here it prints:

no numba: res=510.2145532921285, time=0.00023823699666536413
numba1: res=510.2145532921285, time=2.1800005924887955e-06
numba2: res=510.2145532921285, time=0.03908383299858542

Thus, numba directly on the function definition is about 2 orders of magnitude faster than vanilla python (for this stupid benchmark anyway). It's pretty clear that numba re-compiles a function that was jitted within another function body each time the outer function is invoked, which is consistent with your findings above.

So this is a problem for us, because numba on the outside runs into the serialization issues, but numba on the inside is pathetically slow. At this point we can either dig in and find a way to work around the issue (e.g., delve into numba looking for the right option to delay the jitting, or perhaps use a cache that manages the jitted functions within the scipy wrapper), or we can forget about numba altogether.

bluescarni commented 4 years ago

@mlooz actually it looks like if jit is used without the function signature, then the compilation is deferred until the first use of the function:

http://numba.pydata.org/numba-doc/0.17.0/reference/compilation.html#numba.jit

When the builds were hanging, were you using the jit decorator with or without a function signature?
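The eager-versus-deferred distinction can be sketched with pure-Python stand-ins (the real behaviour is numba's; `eager` and `lazy` below only mimic *when* the "compilation" work happens):

```python
events = []

def eager(func):
    # with a signature, numba compiles at decoration time
    events.append("compiled " + func.__name__)
    return func

def lazy(func):
    # without a signature, compilation is deferred to the first call
    compiled = []
    def wrapper(*args):
        if not compiled:
            events.append("compiled " + func.__name__)
            compiled.append(True)
        return func(*args)
    return wrapper

@eager
def f(x):
    return x

@lazy
def g(x):
    return x

print(events)  # ['compiled f'] -- g not compiled yet
g(1)
g(2)
print(events)  # ['compiled f', 'compiled g'] -- compiled once, on first use
```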

mlooz commented 4 years ago

@mlooz actually it looks like if jit is used without the function signature, then the compilation is deferred until the first use of the function:

http://numba.pydata.org/numba-doc/0.17.0/reference/compilation.html#numba.jit

When the builds were hanging, were you using the jit decorator with or without a function signature?

Without a signature, like it is right now: `jit(nopython=True)(func)`. Removing the nopython flag also did not prevent the hangs.

mlooz commented 4 years ago

I went ahead and wrote a little benchmark:

Numbers look equivalent on my machine:

no numba: res=522.6711784969735, time=0.0012607330008904682
numba1: res=522.6711784969735, time=6.082000254536979e-06
numba2: res=522.6711784969735, time=0.2916524690008373

mlooz commented 4 years ago

Using the cache=True flag is faster, but still with significant overhead:

    # Numba inside the function definition, but with caching.
    def test_numba3(a):
        @jit(float64(float64[:]), nopython=True, cache=True)
        def test_numba_in2(a):
            ret = 0.
            for i in range(len(a)):
                ret += a[i]
            return ret

        return test_numba_in2(a)

no numba: res=512.7943961246789, time=0.0015854929988563526
numba1: res=512.7943961246789, time=7.84700023359619e-06
numba2: res=512.7943961246789, time=0.2593506810007966
numba3: res=512.7943961246789, time=0.0039038379982230254

bluescarni commented 4 years ago

Using the cache=True flag is faster, but still with significant overhead

I guess that even with the cache, it still has to parse the body of the inside function every time the outer function is invoked :(

At this point I'd be inclined to drop the numba support altogether.

@mlooz I saw another comment of yours earlier, but it seems like it's gone now. You were talking about the fact that even with the decorator outside the function, the use of numba might not be particularly beneficial in the case of scipy_wrapper because the jit overhead dominates. But perhaps then I am not understanding the logic inside the wrapper, because I thought that the functions translating hessians/gradients between pagmo's and scipy's formats would be called many times during an evolve()?

mlooz commented 4 years ago

I guess that even with the cache, it still has to parse the body of the inside function every time the outer function is invoked :(

The cache is on disk, so maybe there's even I/O overhead. I sure hope it doesn't read the inner function from disk every time the outer function is called.

At this point I'd be inclined to drop the numba support altogether.

I agree. I think that in general, figuring out how to combine numba with multiprocessing in an efficient way would be very useful, but this is a larger issue.

@mlooz I saw another comment of yours earlier, but it seems like it's gone now. You were talking about the fact that even with the decorator outside the function, the use of numba might not be particularly beneficial in the case of scipy_wrapper because the jit overhead dominates. But perhaps then I am not understanding the logic inside the wrapper, because I thought that the functions translating hessians/gradients between pagmo's and scipy's formats would be called many times during an evolve()?

I had deleted that comment because I wasn't sure about my argument anymore and wanted to come back when I could support it with numbers. I'd hoped that you hadn't seen it, but too late. :-) Unpacking the hessians happens once per call to problem.hessians(), and unpacking the gradients once per gradient evaluation times the fitness dimension. My impression is that these unpacking steps take less than 10-20% of the total optimization workflow, because even if they are called often, everything else is too. Optimizing these steps could thus gain at most a 10-20% improvement.

The reason why I'm not completely sure about this line of reasoning is that I had so far measured these things only on toy problems of dimension 10, which are solved with <100 evaluations of the hessian. Things could look different on longer-running problems.
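For reference, the kind of unpacking being discussed can be sketched in pure Python (a hypothetical helper, not the wrapper's actual code): if I recall the formats correctly, pagmo reports a hessian as a flat list of values ordered by a lower-triangular sparsity pattern of `(i, j)` index pairs, while scipy's trust-constr expects a dense symmetric matrix.

```python
def unpack_hessian(values, sparsity, dim):
    # Hypothetical helper: expand a sparse lower-triangular hessian
    # representation (flat values + (i, j) index pairs) into a
    # dense symmetric matrix, as nested lists.
    H = [[0.0] * dim for _ in range(dim)]
    for v, (i, j) in zip(values, sparsity):
        H[i][j] = v
        H[j][i] = v  # mirror into the upper triangle
    return H

# 2x2 example: lower-triangle entries (0,0), (1,0), (1,1)
print(unpack_hessian([2.0, 1.0, 4.0], [(0, 0), (1, 0), (1, 1)], 2))
# [[2.0, 1.0], [1.0, 4.0]]
```

This is the per-call translation step whose cost the numba decoration was meant to reduce.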

bluescarni commented 4 years ago

@mlooz further reports from #42 indicate that changing the serialization backend may fix the numba issue. Would you mind trying this out?

mlooz commented 4 years ago

@mlooz further reports from #42 indicate that changing the serialization backend may fix the numba issue. Would you mind trying this out?

Currently trying it out. I can't use set_serialization_backend since it is defined in `__init__.py` and thus not yet available in `_py_algorithms.py`. Changing the default in `__init__.py` causes other test failures, since many parallel things cannot be pickled. For example:

E AttributeError: Can't pickle local object 'mp_island_test_case.run_basic_tests.<locals>.<lambda>'
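The failure is reproducible with the standard library alone: pickle serializes functions by reference, so a lambda defined inside another function cannot be looked up by its qualified name (`make_local` below is just an illustration, not the actual test code).

```python
import pickle

def make_local():
    # a nested lambda, like the one in mp_island_test_case
    return lambda x: x + 1

try:
    pickle.dumps(make_local())
except (AttributeError, pickle.PicklingError) as exc:
    # the "Can't pickle local object" failure from the test log
    print(type(exc).__name__)
```

cloudpickle avoids this by serializing the function's code object by value rather than by reference.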

mlooz commented 4 years ago

Unfortunately, numba needs pickle to work correctly, while some features in the archipelago need cloudpickle. I currently see no way to combine these.

bluescarni commented 4 years ago

@mlooz in order for the tests to pass, we could temporarily switch the serialization backend within the test suite. But that is not a good solution in any case, I think, because users would have to be aware that they need to fiddle with the serialization backend if they want to use the wrapper in a multiprocessing environment.

Perhaps one possible solution would be to manually implement the pickle protocol within the scipy wrapper. That is, instead of letting cloudpickle figure out how to serialize the UDA on its own, we explicitly implement the __getstate__()/__setstate__() methods in the wrapper, so that cloudpickle would end up invoking those instead of doing whatever is causing the pickling issue.
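A minimal stdlib sketch of that idea (hypothetical `Wrapper` class, with a lambda standing in for the numba-compiled function): drop the unpicklable member in `__getstate__()` and rebuild it in `__setstate__()`.

```python
import pickle

class Wrapper:
    def __init__(self):
        # stands in for a numba-compiled (unpicklable) function
        self._jitted = lambda x: 2 * x

    def __getstate__(self):
        # exclude the unpicklable member from the serialized state
        state = self.__dict__.copy()
        del state["_jitted"]
        return state

    def __setstate__(self, state):
        # restore the state, then recreate the compiled function
        self.__dict__.update(state)
        self._jitted = lambda x: 2 * x

w = pickle.loads(pickle.dumps(Wrapper()))
print(w._jitted(3))  # 6
```

The cost is that the "compilation" (here, just a lambda definition) is redone once per deserialization, which is far less often than once per method call.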

However, at this point I am also ok with removing the numba support altogether and revisiting this at a later stage, since I imagine this might be getting pretty frustrating.

mlooz commented 4 years ago

I'm reading up on cloudpickle and finding an alarming absence of documentation. As long as Dario is still busy with the code review, I'll play around with the pickling and try to find a workaround. If I don't find anything quickly, I'm in favor of merging this PR without numba and revisiting it later.

Have a good weekend! There is a public holiday in the Netherlands on Tuesday and I've taken Monday off, so talk to you on Wednesday!

darioizzo commented 4 years ago

:) I agree @mlooz, I will quickly review this and we can proceed without numba once the review and the CI are green! Have a nice holiday!

bluescarni commented 4 years ago

I'm reading up on cloudpickle and finding an alarming absence of documentation. As long as Dario is still busy with the code review, I'll play around with the pickling and try to find a workaround. If I don't find anything quickly, I'm in favor of merging this PR without numba and revisiting it later.

We rely on cloudpickle because the standard pickle module cannot serialize functions/classes defined in an interactive Python session, which is a problem for interactive uses of pygmo. Goes to show how much more problematic multiprocessing can be than multithreading :(

Have a good weekend! There is a public holiday in the Netherlands on Tuesday and I've taken Monday off, so talk to you on Wednesday!

Have a nice holiday and thanks for persevering :)

darioizzo commented 4 years ago

What's the decision on jit for this PR? I thought we decided to get rid of it, since with the workaround we found, computations are slowed down instead of accelerated? Or maybe I missed some messages?

mlooz commented 4 years ago

What's the decision on jit for this PR? I thought we decided to get rid of it, since with the workaround we found, computations are slowed down instead of accelerated? Or maybe I missed some messages?

I discussed some ideas with Francesco just now that solve another issue with numba, but not this one. As you see in 6fa1438, I've removed the numba decorators. I won't change anything on this PR anymore.

darioizzo commented 4 years ago

Ok, then for me this is good to go. I will merge it when the CI completes.