Replace multiprocessing with pathos

dcambie commented 3 years ago

Since Light direction and position have to be passed as callable, the code defining a scene often employs closures. However, those are not picklable[1] and therefore incompatible with multiprocessing. Pathos replaces pickle with dill, which can properly serialize nested functions thus solving this issue.

(Note: cli code not tested)

[1] e.g. resulting in AttributeError: Can't pickle local object 'create_standard_scene..light_position'

danieljfarrell commented 3 years ago

Oh pathos is pretty neat, not heard of that before.

I’ve been working on CLI for the last few months.

It allows you do do things like,

pvtrace-cli simulate —workers 8 scene.yml

as a part of this development I’ve refactored those closures to methods on objects which implement __call__; this gets around the pickling problem you encountered. Closures are still available for backwards compatibility, but sill probably deprecate them at some point.

For example,

https://github.com/danieljfarrell/pvtrace/blob/d1cef68ecf4e66754278c675aab9326ac1ecd500/pvtrace/light/light.py#L123

I’m not opposed to merging this now, but probably when the CLI branch gets merged I’ll revert back to multiprocessing to avoid an extra dependency. What do you think?

Also, need to change the requirements.txt file too if you want this merged. Can’t remember if we have a unit test for multiprocessing? If there is one modifying it to check pathos works would be great (don’t worry if there is not one).

dcambie commented 3 years ago

I see... Passing objects implementing the call dunder method is a nice approach that would indeed fix my pickle problem!

That is probably a better solution than adding another external dependency (pathos) - multiprocessing is probably preferable as part of the standard library?

If you are interest in this anyway I can look more in details (adding test and doing some benchmarking vs. multiprocessing) but otherwise I like your solution :)

danieljfarrell commented 3 years ago

Incidentally, I initially used ProcessPoolExecutor, but found it slower and more buggy than simply creating a Pool of processes. So I’m slightly more cautious than normal when dealing with concurrency in Python.

If you are interest in this anyway I can look more in details (adding test and doing some benchmarking vs. multiprocessing)

I would be interested to know how pathos compares to standard library, maybe it is more efficient at dispatching tasks? If it is better in that sense, than multiprocessing, and stable, it would be worth including it here in the CLI branch and I can test it out more. So yes, if you feel motivated, some benchmarks would be a great place to start.

If the performance is similar to worse we should hold off, until a future release in which we can look at options for improving concurrency. I’m particularly interested in the distributed possibilities pathos may offer; i.e. pvtrace cluster! That would be be a welcome addition which is not possible with the standard library.

danieljfarrell commented 3 years ago

PS. A bit later on, I will push some changes to the CLI branch that’s include docs and YML files, this may make bench marking easier.

danieljfarrell commented 3 years ago

OK, just pushed some updates to the CLI (I've not been pushing incrementally so it's a bit of a dump).

Also, I've been trying to write documentation as I go. You might be able to test things out wit Pathos easily using the CLI, check the docs http://danieljfarrell.github.io/pvtrace/tutorials/lsc_with_luminophore/#full-simulation. This is HTTP, it seems HTTPS is not working yet.

The basic command, as shown in the docs, is,

pvtrace-cli simulate --rays 1000 scene.yml

A hello world scene (save this as scene.yml) with luminescent dye looks like this,

version: "1.0"

nodes:
  world:
    sphere:
      radius: 10.0
      material:
        refractive-index: 1.0

  ball-lens:
    location: [0, 0, 2]
    sphere:
      radius: 1.0
      material:
        refractive-index: 1.5
        components:
          - "my-lumogen-dye"

  green-laser:
    location: [0, 0, 0]
    light:
      mask:
        direction:
          cone:
            half-angle: 22.5

components:
  my-lumogen-dye:
    luminophore:
      hist: true
      absorption:
        coefficient: 5
        spectrum:
          name: lumogen-f-red-305
          range:
            min: 500
            max: 1000
            spacing: 2
      emission:
        quantum-yield: 0.98
        phase-function: "isotropic"
        spectrum:
          name: lumogen-f-red-305
          range:
            min: 500
            max: 1000
            spacing: 2

update forgot to include the component on the ball-lens node.

dcambie commented 3 years ago

Wow, great, I will look into it next week!

A pvtrace cluster would have been great to have this week, we have been performing a large batch of pvtrace analyses over a few PCs to simulate the yearly trend of LSC-PM productivity... Code to be published together with the paper hopefully soon :)

dcambie commented 3 years ago

So, from a limited testing (code used below) it seems that the multiprocessing implementation (i.e. feature/cli branch) is not really accelerating the execution time at all (!) while pathos (this PR) is working as intended with some overhead (up to the 8 logical cores of my PC).

Below, the number of photons per workers has been kept constant in all the simulations. Red line represents no improvement vs. single threaded, and the blue represents an ideal scale-up with no overhead

multiprocessing_20photons_per_worker pathos_20photons_per_worker

from pathlib import Path
from pvtrace.cli.parse import parse

FULL_EXAMPLE_YML = Path("./tests/data/pvtrace-scene-spec.yml")

if __name__ == '__main__':
    import time
    import logging
    logging.disable(logging.CRITICAL)

    PHOTONS_FOR_WORKER = 20
    num_workers = []
    time_for_sim = []

    for workers in range(1, 9):
        num_workers.append(workers)
        num_photons = PHOTONS_FOR_WORKER * workers

        scene = parse(FULL_EXAMPLE_YML)
        # Setup-time takes about 4 seconds. remove its impact by running a short simulation first
        scene.simulate(num_rays=workers, workers=workers)
        start = time.time()
        for _ in range(10):
            scene.simulate(num_rays=num_photons, workers=workers)
        time_for_sim.append((time.time()-start)/10)
        print(f"Time with {workers} workers and {num_photons} photons was {time_for_sim[-1]}")

    from scipy.stats import linregress
    import matplotlib.pyplot as plt
    import numpy as np
    fig, ax = plt.subplots(1, 1)
    ax.set_xlabel("Num workers")
    ax.set_ylabel("Time (s)")
    # ax.set_title("PvTrace with pathos ProcessPool")
    ax.set_title("PvTrace with multiprocess Pool")

    x = np.linspace(1, 8, 100)
    y = time_for_sim[0] * x
    plt.plot(x, y, '-r', label="worst case")

    x2 = np.linspace(1, 8, 100)
    y2 = np.repeat(time_for_sim[0], 100)
    plt.plot(x2, y2, '-g', label="best case")

    plt.scatter(num_workers, time_for_sim, label="simulation data")
    ax.set_xlim([0, 9])
    ax.set_ylim([0, max(time_for_sim)+2])

    slope, intercept, r, _, _ = linregress(num_workers, time_for_sim)
    y_f = list(slope*np.array(num_workers)+intercept)
    plt.plot(num_workers, y_f, '-b', label=f"fitting R^2={r*r:.3f}")
    plt.legend(loc="lower right")

    plt.savefig("results_20photons_per_worker.png")

danieljfarrell commented 3 years ago

This is great! 👍

Probably 20 rays per worker is not enough to gain benefit over starting the subprocesses. The fact that Pathos does show an improvement with small numbers is interesting. I will see how this scales with larger number of rays per worker. But looks very promising for Pathos.

danieljfarrell commented 3 years ago

Thanks again for putting this together. I had a really busy few weeks, but I managed to catch up a bit tonight.

The plot below is generated from a modified version of your script on this testing branch https://github.com/danieljfarrell/pvtrace/tree/feature/mp-testing

I'm plotting throughput speed-up as more and more worker processes are added. Ideal scaling should have a gradient of 1: every time the workers double the throughput doubles (because the total compute time halves). I'm also running for a total of 5000 rays.

This is the plot for standard library multiprocessing,

and for Pathos,

and on the same axes,

So they perform more or less identically.

What I do notice is that the scaling factor is well below 1, so there is some bottleneck somewhere.

MarkusFF commented 3 years ago

AMD Ryzen 5 3600 running at 4.somthing Ghz, 32Gb RAM, 6 core, 12 logical processors, a few minor background apps, win 10, python 3.7.4 (64 bit), pvtrace at e0eb6c2a3857e9a6bf5a42b5fcde37683bee3ff3 mp_check_pathos.zip

danieljfarrell commented 3 years ago

Scaling seems a bit better than on my laptop. Maybe because the machine is much better! But still a bottleneck somewhere.

Thanks!

MarkusFF commented 3 years ago

MacBookPro16,1; 8-Core Intel Core i9; 2.3 GHz; 16 logical processors, 16 Gb RAM, minor BG apps including UI (iStatMenus, some Antivirus), Mac OS 11.1, python 3.7.4 (64), pvtrace at e0eb6c2a3857e9a6bf5a42b5fcde37683bee3ff3 mp_check_pathos.csv.zip

danieljfarrell commented 3 years ago

This is the MacBook Pro

So they are all very similar.

I'm running some more tests which call the actual CLI as a subprocess just to check times that way too.

dcambie commented 3 years ago

I've run your script a few times on different setup. Interestingly, I obtained different results :)

The main reason is that numpy already has some multi-threaded function (and it spawns a number of workers that depends on python version and platform).¹

Once numpy is forced to a single thread² the best performance are obtained. Multiprocessing is always slightly better than pathos (reasonable, as pathos uses multiprocessing as backend).³

With 12 cores and numpy set to 1 thread I got throughput_rays_per_sec around 430 with both Py 3.7.9, Py 3.8.5 and P 3.9.1 ⁴

Default Setting numpy to 1 thread

[1] This is evident by CPU usage that is double or 4 times the expected. How many workers are present depends on numpy compile settings for accelerated algebra libraries (BLAST and friends) and environmental variables. More info with:

import numpy as np
np.show_config()

[2] i.e. by setting the following environmental variable before importing numpy

import os
NUMPY_THREADS = 1
os.environ["MKL_NUM_THREADS"] = str(NUMPY_THREADS)
os.environ["NUMEXPR_NUM_THREADS"] = str(NUMPY_THREADS)
os.environ["OMP_NUM_THREADS"] = str(NUMPY_THREADS)

[3] If pathos.pools.ProcessPool is used, performance are further reduced (roughtly 10%).

[4] For Py>=3.8 the following line is needed to prevent an AttributeError in multiprocessing (see: https://bugs.python.org/issue39414)

atexit.register(pool.close)

peter-spencer commented 3 years ago

It looks like the scaling behaviour is linear but with a saturation effect.

The data being posted above can be fitted well with:

y = 1 / ( 1/x^p + 1/s^p )^(1/p)

Where s is the saturated (maximum) value and p is the power law that describes how sharply the data saturates towards s (higher value of p = sharper saturation). For example, the red points are a fit to Dan's data with s=7.5 and p=2:

Scaling graph

The consequence of this is that by using this fitting equation you easily get the maximum speed up, and some information about how the bottleneck is scaling with worker threads/processes.

danieljfarrell commented 3 years ago

@dcambie great detective work, I didn't know about those threading options inside numpy!

I'm trying to understand a bit better; your default system install had numpy running additional threads? And you needed to disable that to get good scaling? Is this caused by a particular python distribution, OS, or other custom setup? Or it just (unfortunately) happened to be default/stock for you? Any ideas in what situations threading is enabled?

@MarkusFF (I think) tested uses stock Python installer on Windows and macOS. I tend to use brew and pyenv on macOS only.

If so, we definitely need to make users aware of this. Thanks for figuring out a solution with the setting environmental variables.

danieljfarrell commented 3 years ago

@peter-spencer Included in the script now as a dashed line! Yes that fit's really well.

danieljfarrell commented 3 years ago

@dcambie I looked a bit more and see that I also had >1 default number of threads. So ignore my previous comment. It's just that the scaling performance seems to be very dependent on the system.

As an additional complexity, I noticed different (worse) scaling behaviour with the CLI than running a Python script! The CLI sets up a manager queue so that it can receive results as they occur from all the child processes, I'm guess that this is the bottleneck? I will need to look into this, and see if Pathos can help.

But yet again, I realise that multiple processing in Python is always more complicated than it appears.

danieljfarrell commented 3 years ago

CLI scaling

OK CLI scaling makes a little more sense.

Going to write up some results here which I drop into documentation at some point.

pvtrace cli scaling

tl;dr

IO bottleneck causes terrible scaling when the full history of all rays is written to disk. Writing fewer events using the --end-rays option improves the scaling dramatically. Use two workers with full ray history simulation, use four workers (or the number of physical CPUs) when when using --end-rays option.

Setup

The simulation speed-up is measured for pvtrace-cli with and without --end-rays option enabled.

Two duration are measured: 1) the simulation time is just the time to complete raytracing, and, 2) the complete time which is the total duration. The latter includes time to write all information to disk. We will see as the number of worker processes increases getting the data into the database fast enough becomes a bottleneck, particularly with full history simulation (i.e. --end-rays is not use).

Full history simulations

The simulation speed-up (blue line, left plot) scales linearly with two workers and then becomes sub-linear. However, the complete speed-up (purple line, left plot) saturates after two workers -- the time to remains more of less constant as additional workers are added. This is due to writing full ray-histories to database. The bottleneck becomes IO bound rather than CPU bound as data needs to be transfer to the database file process and also written to disk.

`--end-rays` simulations

Running with --end-rays reduces IO demand significantly because full ray history is not first transferred into a queue and then written to disk, resulting is much better scaling for the simulation speed-up (blue line, right plot) and complete speed-up (purple line, right plot). In fact, IO bottleneck is remove entirely as the two metrics become virtually indistinguishable. Although not shown in the plots, total time is improved slightly with using end rays option by around 20%.

Recommendations

Choose --workers option carefully when using pvtrace-cli. Run a small simulation with your script with around 1000 rays, starting with --workers 1, increase the workers until you get best performance.

Timing tests performed here are run on a system with 4 physical cores, results on your system might vary:

Best scaling with achieved with 2 workers when collecting full ray history.
Best results are achieved with 4 workers using the --end-rays option (on system with 4 physical CPUs). Improvements still occur using more workers but become sub-linear.

danieljfarrell commented 3 years ago

Oh, that’s fine then! I think I just got a notification because your cli branch is a pull request to my CLI, with all the Pathos stuff.

Still unsure about Pathos. What do you think?

I mean I like it, just that it does not seem to out-perform multiprocessing and adds a dependency. And we are not using any of the other advanced features it provides.

I’ve been looking into Cython! I think I have a way to speed up the core, see roadmap https://github.com/danieljfarrell/pvtrace/projects I thought as it’s a major change, might be best for version 3. This way, we can do all the concurrency in threads at the C-level and won’t need multiple processes at all (eventually.... hopefully... 🤞 )

Anyway, let me know your thoughts on Pathos and let’s decide if it’s worth changing over to it. I mean if you have ideas you want to try which require Pathos let’s do it! Otherwise, maybe we hold off.

dcambie commented 3 years ago

Oh, ok! Pathos, yes probably not worth it. My multiprocessing was just not performing well enough but now it is working as intended so I agree that there is no point in having it.

cython sounds interesting, but I have no experience with it so I cannot really comment on that :)

danieljfarrell / pvtrace