caracal-pipeline / crystalball

Distributed prediction of visibilities from a sky model
GNU General Public License v2.0

performance optimisation #15

Closed: paoloserra closed this issue 4 years ago

paoloserra commented 5 years ago

I still have to learn how to make the most of crystalball. My tests so far have given the following results:

components: 2,513 rows: 429,000 channels: 20 file size: 650M execution time: 8m

components: 2,513 rows: 6,443,430 channels: 20 file size: 8G execution time: 2h30m

components: 2,513 rows: 6,443,430 channels: 100 file size: 45G execution time: 8h50m

In all cases crystalball made decent use of my 8 CPUs (constant at about 80%), with a constant memory usage of 20 GB.

In all cases I used a chunk size of 100 for both model components and data rows. Trying with larger row chunks led to excessive memory usage.

All the above was on a 4-core, 8-thread i7-7700K CPU at 4.20 GHz with 64 GB of RAM.

@o-smirnov @sjperkins maybe this is something we could work on during the busy week.

sjperkins commented 5 years ago

Agreed, I haven't had a good chance to look at this from my side.

sjperkins commented 5 years ago

Statement: we choose small row and model chunk sizes to avoid running out of memory. This results in sub-optimal performance because numba isn't given sufficient work to do.

First Solution

Use the dask distributed scheduler and wait for the dask developers to solve the memory backpressure issues tracked in https://github.com/dask/distributed/issues/2602

Second Solution

Coarsen the granularity of dask parallelism. Briefly, rather than submitting the entire problem to the scheduler, we move most of the RIME internals into a single per-row-chunk function, which is managed by dask.
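
A minimal sketch of the idea (hypothetical predict_chunk standing in for the numba RIME kernel; not the actual crystalball code):

```python
import numpy as np
import dask.array as da


def predict_chunk(uvw, lm, stokes, frequency):
    """Hypothetical stand-in for the numba RIME kernel: computes the
    model visibilities for one row chunk in a single coarse call."""
    nrow, nchan = uvw.shape[0], frequency.shape[0]
    vis = np.zeros((nrow, nchan), dtype=np.complex128)
    # ... phase delay, brightness and the sum over sources would go here ...
    return vis


def predict(uvw, lm, stokes, frequency):
    # One dask task per row chunk: dask handles IO and scheduling,
    # while the kernel does all the per-chunk numerical work.
    return da.blockwise(predict_chunk, ("row", "chan"),
                        uvw, ("row", "uvw"),
                        lm, ("source", "lm"),
                        stokes, ("source", "corr"),
                        frequency, ("chan",),
                        concatenate=True,
                        dtype=np.complex128)
```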

Third Solution

Restrict the amount of parallelism in the graph by serialising large parallel reductions into fewer, serial chains.
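
A toy sketch of what that could look like with dask.delayed (hypothetical helper; not the actual implementation):

```python
import operator
import dask


def serial_sum(contributions):
    """Fold per-chunk visibility contributions one after another, so the
    graph is a single serial chain rather than a wide parallel reduction:
    only the running total and one contribution are live at any time."""
    total = contributions[0]
    for contribution in contributions[1:]:
        total = dask.delayed(operator.add)(total, contribution)
    return total
```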

paoloserra commented 5 years ago

What is the indication of sub-optimal performance? I thought that in my examples above, with CPU usage constantly above 80% and not excessive RAM usage, the performance was as good as it could possibly be. Is that a correct assessment?

sjperkins commented 5 years ago

I think CPU usage should be > 95%.

On the other hand, we're using einsum to multiply all the Jones terms together, and this may not be the optimal Jones matrix multiplication. It is, however, the most flexible method for combining Jones terms. Something else to drill down into. I'm going to have to profile it properly to see what's taking most of the time.
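
For illustration, a small numpy sketch of the einsum-based combination being discussed (made-up shapes; the "srf, sfij -> srfij" subscripts mirror the logs quoted later in this thread):

```python
import numpy as np

nsrc, nrow, nchan = 16, 128, 8

# Per-source complex phase term and per-channel 2x2 brightness matrix
phase = np.exp(2j * np.pi * np.random.rand(nsrc, nrow, nchan))        # (s, r, f)
brightness = np.random.rand(nsrc, nchan, 2, 2).astype(np.complex128)  # (s, f, i, j)

# Flexible: extra Jones terms can be appended to the subscript string,
# at the possible cost of a less specialised multiplication.
jones = np.einsum("srf,sfij->srfij", phase, brightness)               # (s, r, f, 2, 2)

# Source coherencies are then summed over the source axis
vis = jones.sum(axis=0)                                                # (r, f, 2, 2)
```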

Thanks for reminding me of this, swapping everything back into short-term memory. Thank goodness for writing things down :-)

paoloserra commented 5 years ago

I think CPU usage should be > 95%.

That'd be nice -- but it wouldn't change things too dramatically compared to ~80%.

My hope is that, given a machine's specs and possibly some user constraints (e.g., one may only want to use N threads, or limit RAM usage to a certain fraction), we can make crystalball decide how best to chunk up the problem.

sjperkins commented 5 years ago

My hope is that, given a machine's specs and possibly some user constraints (e.g., one may only want to use N threads, or limit RAM usage to a certain fraction), we can make crystalball decide how best to chunk up the problem.

So I did a lot of this in montblanc and it worked fairly well, at the cost of flexibility. The problem is that the dimensionality of the inputs, intermediate arrays and outputs, and the problem sizes (time, chan, baselines, ants, beam cube size, etc.) have to be specified upfront, and the expression has to be analysed (i.e. a graph derived from it) to figure out how memory flows through it.

Specifying the problem so exactly was one of the factors that made it very difficult to add new features.

We're encountering this kind of issue across multiple software products and I've been discussing this with colleagues in the office -- I've added a third possible solution to https://github.com/paoloserra/crystalball/issues/15#issuecomment-499010189

paoloserra commented 5 years ago

https://github.com/paoloserra/crystalball/commit/6bd232fbb80000870df4e90ac3fd0eb299d18924 implements automated setting of the chunk size based on the system RAM and nr of available threads. Users can control the fraction of system RAM that they want crystalball to use (default 50%).

To set the chunk size automatically, pass -mc 0 -rc 0.

When setting the chunk size automatically, the ratio Nr(rows)/Nr(models) per chunk is hardcoded to 100, and a fudge_factor of 2 is hardcoded to inflate the expected memory usage relative to the theoretical one. See https://github.com/paoloserra/crystalball/blob/f8167af54251af69d4abfaa3238d4b3c971721f3/Crystalball/budget.py#L5 .
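
For illustration, a rough sketch of this kind of budgeting (hypothetical auto_chunks function, assuming psutil for the RAM query; the real logic lives in budget.py and may differ):

```python
import numpy as np
import psutil


def auto_chunks(n_rows, n_chans, n_corrs, n_threads,
                memory_fraction=0.5, row_model_ratio=100, fudge_factor=2):
    """Pick sources/rows per chunk so that n_threads concurrent chunks fit
    inside memory_fraction of system RAM. Assumes the dominant array is a
    per-chunk complex buffer of shape (sources, rows, chans, corrs);
    fudge_factor inflates the theoretical footprint to cover intermediates."""
    budget = psutil.virtual_memory().total * memory_fraction
    bytes_per_vis = np.dtype(np.complex128).itemsize

    # Memory per (source, row) pair in one chunk, across all threads at once
    per_pair = n_chans * n_corrs * bytes_per_vis * fudge_factor * n_threads

    # rows_per_chunk = row_model_ratio * sources_per_chunk, so solve for sources
    sources_per_chunk = int(np.sqrt(budget / (per_pair * row_model_ratio)))
    rows_per_chunk = min(n_rows, row_model_ratio * sources_per_chunk)
    return max(1, sources_per_chunk), max(1, rows_per_chunk)
```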

I've tested this on a 350,000-row, 98-channel, 2-polarisation .MS with 7,000 sources, and it completed in 1h10m using 8 threads at a ~90% level and, as expected, 50% RAM. It would be good to test it on different machines and more challenging datasets.

paoloserra commented 5 years ago

Here are the results of yet another test, the largest I've run so far.

components: 4,603 rows: 6,443,430 channels: 100 execution time: 14h40m

This is less than the 16h10m expected based on the largest test listed in the first post of this issue. Part of the reason might be that this time I let crystalball decide how to chunk up the problem: it ran with 114 sources and 11474 rows per chunk. As a result the memory usage was higher than in the tests above (30-45 GB), so it seems that crystalball made better use of my machine and was therefore faster.

The RAM usage was a little higher than the nominal 50% based on the default crystalball settings.

Unlike in the tests above, this time around both CPU and RAM usage seemed to oscillate -- but with a pretty high average value.

(As a reminder, this was on a 4-core, 8-thread i7-7700K CPU at 4.20 GHz with 64 GB of RAM.)

sjperkins commented 5 years ago

This is less than the 16h10m expected based on the largest test listed in the first post of this issue. Part of the reason might be that this time I let crystalball decide how to chunk up the problem: it ran with 114 sources and 11474 rows per chunk.

I think this is because the chunk sizes were larger, resulting in a smaller graph and more time spent in the numba compute rather than in the Python glue.
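
A toy illustration of that effect (not crystalball itself): with dask.array, the number of tasks in the graph scales inversely with the row chunk size, and every task carries a fixed amount of Python/scheduler overhead.

```python
import dask.array as da

n_rows, n_chans = 1_000_000, 100  # illustrative sizes only

for rows_per_chunk in (100, 11_474):
    x = da.zeros((n_rows, n_chans), chunks=(rows_per_chunk, n_chans))
    # Count the tasks dask would have to schedule for a simple computation
    tasks = len((x + 1).sum().__dask_graph__())
    print(f"rows per chunk: {rows_per_chunk:>6}  tasks in graph: {tasks}")
```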

paoloserra commented 5 years ago

https://github.com/ska-sa/codex-africanus/pull/143 gives hope for better performance. To be followed up!

paoloserra commented 5 years ago

I've updated codex (git describe --tags gives 0.1.8-29-gd4f1244) and re-run some of the above tests. These are the results.

crystalball 1532133354-NGC1399_corr.ms -sm ../../myreduction/output/edb_2-sources.txt -iuvw -sp
Successful read/write open of default-locked table 1532133354-NGC1399_corr.ms: 26 columns, 6443430 rows
-------------------------------------------
system RAM = 62.78 GB
nr of logical CPUs = 8
nr sources = 4603
nr rows    = 6443430
nr chans   = 100
nr corrs   = 2
sources per chunk = 114 (auto setting)
rows per chunk    = 11474 (auto setting)
expected memory usage = 31.19 GB
-------------------------------------------

old execution time: 14h40m; new execution time: 10h40m

crystalball ngc1399_20MHz-afr.ms -sm ../../myreduction/output/edb_2-sources.txt -iuvw -ns 2500 -sp
Successful read/write open of default-locked table ngc1399_20MHz-afr.ms: 28 columns, 439200 rows
-------------------------------------------
system RAM = 62.78 GB
nr of logical CPUs = 8
nr sources = 2500
nr rows    = 439200
nr chans   = 20
nr corrs   = 2
sources per chunk = 256 (auto setting)
rows per chunk    = 25657 (auto setting)
expected memory usage = 31.32 GB
-------------------------------------------

old execution time: 8m; new execution time: 5m

So for the largest dataset crystalball was indeed faster, but only by a factor of 1.4. For the smallest dataset crystalball was a factor of 1.6 faster.

In the case of the smallest dataset I've also made a run without the -sp option and that didn't change the execution time. Finally, a run on the smallest dataset using -rc 100 -mc 100 (for consistency with my initial tests reported above) completed in 6m (it used to be 8m, and it is 5m with automated chunking).

As noted in a previous post, since the Cagliari busy week the memory usage oscillates by a fair amount (see attached image; memory usage used to be flat). I don't know whether this is because of the automated chunking, heavier tests or fundamental changes in the code.

(Attached image: crystalball_performance, showing the oscillating memory usage.)

IanHeywood commented 5 years ago

I can post my timings here if it's helpful to you?

crystalball -sm img_1539630057_sdp_l0.full_1284.full_pol_wtspec_J0332-2807.ms_pcalmask-sources.txt -o DIR1 -w parsets/PKS0326-288.reg --spectra -j 32 -mf 0.8 1539630057_sdp_l0.full_1284.full_pol_wtspec_J0332-2807.ms
Successful read/write open of default-locked table 1539630057_sdp_l0.full_1284.full_pol_wtspec_J0332-2807.ms: 29 columns, 5336402 rows
-------------------------------------------
system RAM = 503.81 GB
nr of logical CPUs = 32
nr sources = 54
nr rows    = 5336402
nr chans   = 1024
nr corrs   = 4
sources per chunk = 45 (auto setting)
rows per chunk    = 4542 (auto setting)
expected memory usage = 399.20 GB
-------------------------------------------
Nr sources        = 54
-------------------------------------------
stokes.shape      = (54, 4)
frequency.shape   = (1024,)
Is.shape          = (54, 1024)
spectrum.shape    = (54, 1024, 4)
brightness.shape  = (54, 1024, 2, 2)
phase.shape       = (54, 5336402, 1024)
-------------------------------------------
Attempting phase-brightness einsum with "srf, sfij -> srfij"
jones.shape       = (54, 5336402, 1024, 2, 2)
-------------------------------------------
All sources are Delta functions
-------------------------------------------
[########################################] | 100% Completed | 20hr 13min 39.7s

sjperkins commented 5 years ago

Thanks, all useful. @paoloserra I still haven't found time for performance testing on stills.

paoloserra commented 5 years ago

Thank you @IanHeywood , that's very useful to know. As you may gather from this thread, we are keen to improve the performance.

Execution time aside, I would like to know whether the end result was good. (Maybe worth a GitHub issue of its own.)

sjperkins commented 5 years ago

As noted in a previous post, since the Cagliari busy week the memory usage oscillates by a fair amount (see attached image; memory usage used to be flat). I don't know whether this is because of the automated chunking, heavier tests or fundamental changes in the code.

@JSKenyon and I were just discussing this w.r.t. CubicalV2, and we postulated that, because the source dimension is larger, the memory chunks for the RIME terms are commensurately larger. This results in a "heartbeat" pattern where each pulse corresponds to a row chunk.

paoloserra commented 4 years ago

@sjperkins ever since the speed increase linked to, I think, https://github.com/ska-sa/codex-africanus/pull/143 (and possibly more) I've experienced somewhat unpredictable memory patterns when running crystalball on my desktop.

It typically results in substantial swapping, which I had never seen before that change. In some cases both RAM and swap slowly rise to about 100% over the course of a few hours (both are 64 GB on my desktop) and crystalball crashes with a memory error or gets killed. I've had cases where this happened even though the expected memory usage was only 10% of the available RAM. In all these cases crystalball was the only significant task running.

For some of these cases I've gone back to the older version, currently available at https://github.com/paoloserra/crystalball/tree/backto14 and still working in Python 2, and found much more stable and predictable memory usage with no swapping, using the same settings that led to memory issues with crystalball 0.2.2.

If you can think of any tests that could help us understand why this has changed compared to the old crystalball please let me know. I'd be happy to run them. Also, if you need more precise measurements of, e.g., RAM and swap vs time let me know and I can try.

sjperkins commented 4 years ago

@sjperkins ever since the speed increase linked to, I think, ska-sa/codex-africanus#143 (and possibly more) I've experienced somewhat unpredictable memory patterns when running crystalball on my desktop.

It typically results in substantial swapping, which I had never seen before that change. In some cases both RAM and swap slowly rise to about 100% over the course of a few hours (both are 64 GB on my desktop) and crystalball crashes with a memory error or gets killed. I've had cases where this happened even though the expected memory usage was only 10% of the available RAM. In all these cases crystalball was the only significant task running.

For some of these cases I've gone back to the older version, currently available at https://github.com/paoloserra/crystalball/tree/backto14 and still working in Python 2, and found much more stable and predictable memory usage with no swapping, using the same settings that led to memory issues with crystalball 0.2.2.

If you can think of any tests that could help us understand why this has changed compared to the old crystalball please let me know. I'd be happy to run them. Also, if you need more precise measurements of, e.g., RAM and swap vs time let me know and I can try.

I remember we discussed this and you said that you saw stability in the memory profile inside MeerKATHI docker containers, but instability outside. Does this apply to your description of the problem above?

paoloserra commented 4 years ago

No, it doesn't: all of the above applies to runs of stand-alone crystalball inside a Python venv, with no containers.

sjperkins commented 4 years ago

I've been taking a look at this today but haven't found a solution yet.

paoloserra commented 4 years ago

Thanks for the effort @sjperkins. Just to clarify, can you confirm that you see the same memory issues? Until now I thought I was the only one who had run into them.

sjperkins commented 4 years ago

@paoloserra The following PR https://github.com/ska-sa/dask-ms/pull/75 solves the OutOfMemory problem for me, without having to resort to using the distributed scheduler approach in https://github.com/paoloserra/crystalball/pull/33.

By solve, I mean that the memory usage oscillates around 60% (with memory_fraction=0.25) after an hour with testcb.ms and 40,000 sources.

$ crystalball -sm ~/data/edb_2-sources-max.txt ~/data/testcb.ms/ -mf 0.25
Successful read/write open of default-locked table /home/sperkins/data/testcb.ms/: 26 columns, 349530 rows
-------------------------------------------
system RAM = 15.49 GB
nr of logical CPUs = 8
nr sources = 41427
nr rows    = 349530
nr chans   = 98
nr corrs   = 2
sources per chunk = 40 (auto setting)
rows per chunk    = 4071 (auto setting)
expected memory usage = 3.80 GB
-------------------------------------------
Nr sources        = 41427
-------------------------------------------
stokes.shape      = (41427, 4)
frequency.shape   = (98,)
/home/sperkins/work/ska/code/codex-africanus/africanus/rime/dask.py:41: PerformanceWarning: Increasing number of chunks by factor of 86
  dtype=infer_complex_dtype(lm, uvw, frequency))
brightness.shape  = (41427, 2)
phase.shape       = (41427, 349530, 98)
-------------------------------------------
Attempting phase-brightness einsum with "srf, si -> srfi"
jones.shape       = (41427, 349530, 98, 2)
-------------------------------------------
All sources are Delta functions
-------------------------------------------
[########                                ] | 21% Completed |  1hr  1min 21.6s

Would you be willing to try https://github.com/ska-sa/dask-ms/pull/75 with crystalball master branch?

sjperkins commented 4 years ago

Still going with the same average memory usage at ~3 hours:

$ crystalball -sm ~/data/edb_2-sources-max.txt ~/data/testcb.ms/ -mf 0.25
Successful read/write open of default-locked table /home/sperkins/data/testcb.ms/: 26 columns, 349530 rows
-------------------------------------------
system RAM = 15.49 GB
nr of logical CPUs = 8
nr sources = 41427
nr rows    = 349530
nr chans   = 98
nr corrs   = 2
sources per chunk = 40 (auto setting)
rows per chunk    = 4071 (auto setting)
expected memory usage = 3.80 GB
-------------------------------------------
Nr sources        = 41427
-------------------------------------------
stokes.shape      = (41427, 4)
frequency.shape   = (98,)
/home/sperkins/work/ska/code/codex-africanus/africanus/rime/dask.py:41: PerformanceWarning: Increasing number of chunks by factor of 86
  dtype=infer_complex_dtype(lm, uvw, frequency))
brightness.shape  = (41427, 2)
phase.shape       = (41427, 349530, 98)
-------------------------------------------
Attempting phase-brightness einsum with "srf, si -> srfi"
jones.shape       = (41427, 349530, 98, 2)
-------------------------------------------
All sources are Delta functions
-------------------------------------------
[########                                ] | 21% Completed |  1hr  1min 21.6s^Z
[1]+  Stopped                 crystalball -sm ~/data/edb_2-sources-max.txt ~/data/testcb.ms/ -mf 0.25
(crystalball) sperkins@simon-Y700:~$ fg
crystalball -sm ~/data/edb_2-sources-max.txt ~/data/testcb.ms/ -mf 0.25
[####################                    ] | 52% Completed |  2hr 56min 39.2s^Z
[1]+  Stopped                 crystalball -sm ~/data/edb_2-sources-max.txt ~/data/testcb.ms/ -mf 0.25

sjperkins commented 4 years ago

And finishes after ~6 hours:

$ crystalball -sm ~/data/edb_2-sources-max.txt ~/data/testcb.ms/ -mf 0.25
Successful read/write open of default-locked table /home/sperkins/data/testcb.ms/: 26 columns, 349530 rows
-------------------------------------------
system RAM = 15.49 GB
nr of logical CPUs = 8
nr sources = 41427
nr rows    = 349530
nr chans   = 98
nr corrs   = 2
sources per chunk = 40 (auto setting)
rows per chunk    = 4071 (auto setting)
expected memory usage = 3.80 GB
-------------------------------------------
Nr sources        = 41427
-------------------------------------------
stokes.shape      = (41427, 4)
frequency.shape   = (98,)
/home/sperkins/work/ska/code/codex-africanus/africanus/rime/dask.py:41: PerformanceWarning: Increasing number of chunks by factor of 86
  dtype=infer_complex_dtype(lm, uvw, frequency))
brightness.shape  = (41427, 2)
phase.shape       = (41427, 349530, 98)
-------------------------------------------
Attempting phase-brightness einsum with "srf, si -> srfi"
jones.shape       = (41427, 349530, 98, 2)
-------------------------------------------
All sources are Delta functions
-------------------------------------------
[########                                ] | 21% Completed |  1hr  1min 21.6s^Z
[1]+  Stopped                 crystalball -sm ~/data/edb_2-sources-max.txt ~/data/testcb.ms/ -mf 0.25
(crystalball) sperkins@simon-Y700:~$ fg
crystalball -sm ~/data/edb_2-sources-max.txt ~/data/testcb.ms/ -mf 0.25
[####################                    ] | 52% Completed |  2hr 56min 39.2s^Z
[1]+  Stopped                 crystalball -sm ~/data/edb_2-sources-max.txt ~/data/testcb.ms/ -mf 0.25
(crystalball) sperkins@simon-Y700:~$ fg
crystalball -sm ~/data/edb_2-sources-max.txt ~/data/testcb.ms/ -mf 0.25
[########################################] | 100% Completed |  5hr 54min 31.8s
(crystalball) sperkins@simon-Y700:~$

paoloserra commented 4 years ago

Thanks to @sjperkins's work, and following an initial improvement in crystalball 0.2.4, the new crystalball 0.3.0 has stable memory usage and efficient CPU usage. Here are the first 24 hours of a run where I requested 40% memory usage on 42/56 CPUs.

(Attached image: sj_4, showing stable memory and CPU usage over the first 24 hours of the run.)

I think that we are at a good point now, and this issue can be closed. Further improvements might be possible but let's discuss them in a new issue whenever the time is right.