Call for Review: DASH Tutorial at HIPEAC Stockholm

knuedd commented 7 years ago

Dear DASH crew,

if you like have a look at the main example for the upcoming DASH tutorial in Stockholm in January.

SPOILER ALERT: If you are a tutorial attendee, read further on your own risk. You will miss half the fun ;)

Please look at https://github.com/dash-project/dash-apps/tree/master/dash_tutorial_2017-01 --> 05-astro_countobjects/. The 02 to 04_ examples are the same thing but with the later steps missing. The 01_intro/ is tbd. Read the README.md first for the dependencies.

The astro-assignment.cpp is what we should give to the attendees together with some explanations. The astro-solution.cpp is ... well ... the solution. Use 'make diff' to see the difference between the two.

First of all, let's discuss what you like and what you don't like. Keep in mind that we only have 3 hours and want to bring the key idea across. I also plan to have the example implemented with MPI RMA directly to demonstrate how many lines of code we save.

With this code I also discovered a few issues, that I'll report under https://github.com/dash-project/dash.

Cheers, Andreas

fuchsto commented 7 years ago

@knuedd Yes, please report issues as soon as possible, we will give them highest priority! Also, I like that you give Hubble some love.

fuchsto commented 7 years ago

To answer your questions from the implementation of the tutorial application

The barrier after computation of local histograms is not needed. dash::transform ist a one-sided, non-collective operation and does not require any synchronization. Values in local histograms don't depend on other units' results so there is no need to synchronize local completion. Every unit just adds its local result onto the master unit's histogram whenever it is ready. We just have to synchronize units for the overall completion of the global histogram.
The transformation function passed to dash::transform is executed as atomic operation on single elements in the range. Eventually, dash::transform corresponds to MPI_Accumulate + MPI_Win_flush (blocking until target signals remote completion) so consistency and behavior corresponds to the MPI specs: http://www.mpich.org/static/docs/v3.1/www3/MPI_Accumulate.html http://www.mpich.org/static/docs/v3.1/www3/MPI_Win_flush.html

Also added as comments in your code:

https://github.com/dash-project/dash-apps/blob/master/dash_tutorial_2017-01/04-astro_marker_color/astro.cpp#L284

https://github.com/dash-project/dash-apps/blob/master/dash_tutorial_2017-01/04-astro_marker_color/astro.cpp#L300

You are referring to my implementation of the histogram benchmark (bench.04.histo-tf): https://github.com/dash-project/dash/blob/development/dash/examples/bench.04.histo-tf/main.cpp

This variant is recommended for clusters. On a single node (< 48 cores) and for smaller ranges, the variant in bench.04.histo is slightly more efficient but has terrible scaling. Might be interesting to compare them as a motivation to use DASH algorithms whenever possible.

fuchsto commented 7 years ago

Another feature you could demonstrate is how threaded parallelism is automatically balanced for hardware capacities. So if there is some time left on the schedule, integrating OpenMP for any process configuration is pretty simple:

dash::util::UnitLocality uloc;
// number of threads available to this unit
auto n_threads = uloc.num_domain_threads();
#pragma omp parallel num_threads(n_threads)
{
   // For example, on a 32-core system:
   // - running with mpirun -n 8 -> all units run 4 threads
   // - running with mpirun -n 5 -> two units run 7 threads, three units run 6 threads
}

knuedd commented 7 years ago

Hi all,

I need some help again. Please look at the example dash-apps/dash_tutorial_2017-01/05-astro_count_objects/astro-benchmark.cpp which is the tutorial code with some debug and benchmark output.

I see bad bugs and might have done things totally wrong. I see strange results when running it with different numbers of units.

I already changed dash::Matrix to dash::NArray, now the local extents are correct ... wait, I just realized one bug when doing the local pointer arithmetic. Well, ignore everything after /* ** part 5 ... / for now.

But look at /* ** part 4 ... / please. There every unit iterates over its local block with the local iterator and computes the sum of all pixel brightnesses. When you run it with 3 units, no block is underfilled and everything is fine. But with 4 units that are arranged as 2x2 units, then 3 out of 4 blocks are underfilled. The last unit (unit 3) produces the sum of all pixel brightnesses as 0! This means all the pixel values read from the image did not make it to that unit. It is always only the very last unit of you use any number of units that leads to an actual nxm arrangement of units with n,m>1.

Do you have any idea why this could be?

Thanks, Andreas

dhinf commented 7 years ago

If you run the example with 9 units -> 3x3 the result is even worse. The reason is dash::copy. The local to global copy doesn't work well for more than one remote unit. I implemented the setup for the matrix without dash::copy and everything worked as expected. Maybe Tobias can explain why dash::copy behaves this way.

fuchsto commented 7 years ago

I'm on it, I suspect the recent refactoring runs to have harmed semantics.

dash-project / dash-apps

Call for Review: DASH Tutorial at HIPEAC Stockholm #1