Can something executed in parallel be slower than a sequential version (with a slight change in code logic)?

Hi,

I have a more subjective question that came up while working on one of the puzzles. I was able to make a block of code execute in parallel instead of sequentially, without taking on any overhead that I know of in the conversion. Yet the parallel version takes around 3 minutes to do exactly the same thing that the sequential version does in 1 minute. I would love to be more specific about the problem, but that would go against the requirements of this assignment. So, again: can something done in parallel, which does pretty much what the sequential version does, be slower?

Platform: Mac OS X (using the CPU right now)

Other tidbits about the problem: the parallel version has a user time of 1m38s and a real time of 24.083 seconds, while the sequential version has a real time of 12 seconds and a user time of 10 seconds.

Yes, it can definitely be slower. Some common causes are:

Not enough work happening in each task (i.e. agglomeration too low), due to under-estimating the compute intensity at each point.
Some kind of hidden communication or contention at the hardware level, usually due to some kind of hidden sharing. For example, an atomic variable, or an array where different tasks are writing to different but adjacent memory locations (fighting over cache lines).
Some kind of hidden synchronisation due to a function call that contains some kind of lock. Library functions like printf, malloc, new, and so on often contain mutual exclusion to serialise access.
(Less common) Converting a piece of code from inline code to a lambda or kernel can remove an optimisation opportunity that the compiler was able to see when it was inline. This is usually related to poor agglomeration, though it can sometimes be related to calculations that should be hoisted outside of loops.
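To make the "fighting over cache lines" case concrete, here is a minimal sketch rather than anything taken from the coursework. All names (`PackedCounter`, `PaddedCounter`, `run_counters`, `seconds`) are hypothetical, and the 64-byte cache-line size is an assumption about the CPU. Each thread only ever touches its own counter, so the code is logically independent in both variants; the only difference is whether the counters share a cache line.

```cpp
#include <atomic>
#include <cassert>
#include <chrono>
#include <cstdint>
#include <thread>
#include <vector>

// Counters packed next to each other: several fit in one cache line.
struct PackedCounter { std::atomic<std::uint64_t> value{0}; };

// Counters padded to 64 bytes each (assumed cache-line size), so
// neighbouring counters should not share a line.
struct alignas(64) PaddedCounter { std::atomic<std::uint64_t> value{0}; };

// Each thread increments only its own counter `iters` times.
// Returns the sum of all counters, i.e. n_threads * iters.
template <class Counter>
std::uint64_t run_counters(unsigned n_threads, std::uint64_t iters) {
    assert(n_threads > 0);
    std::vector<Counter> counters(n_threads);
    std::vector<std::thread> workers;
    for (unsigned t = 0; t < n_threads; ++t) {
        workers.emplace_back([&counters, t, iters] {
            for (std::uint64_t i = 0; i < iters; ++i) {
                counters[t].value.fetch_add(1, std::memory_order_relaxed);
            }
        });
    }
    for (auto& w : workers) w.join();
    std::uint64_t total = 0;
    for (auto& c : counters) total += c.value.load();
    return total;
}

// Small wall-clock timing helper for comparing the two variants.
template <class F>
double seconds(F&& f) {
    auto start = std::chrono::steady_clock::now();
    f();
    std::chrono::duration<double> elapsed = std::chrono::steady_clock::now() - start;
    return elapsed.count();
}
```

Timing something like `seconds([] { run_counters<PackedCounter>(4, 100000000); })` against the same call with `PaddedCounter` is one way to see whether adjacent writes are costing you anything on your machine; the size of the gap, if any, depends on the hardware.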

You should probably try playing with agglomeration/grain-size first, including making it so big that there is only one task, and see whether that runs at the same speed as sequential. After that, a sampling profiler like perf might lead you to a particularly contended or slow area of code (compared to the profile for the over-agglomerated version).
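The grain-size experiment could be sketched roughly as follows. This is an illustration using plain `std::thread`, not the framework or the puzzle from the coursework: `work`, `make_data`, `sum_sequential` and `sum_chunked` are hypothetical names, and `work` is a stand-in for whatever the real per-element computation is. Worker threads claim chunks of `grain` elements from a shared atomic index, so a tiny grain means lots of traffic on that index and a grain equal to the input size means a single task.

```cpp
#include <algorithm>
#include <atomic>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <numeric>
#include <thread>
#include <vector>

// Hypothetical stand-in for the real per-element computation.
inline std::uint64_t work(std::uint64_t x) { return x * x + 1; }

// Test input: 0, 1, 2, ..., n-1.
std::vector<std::uint64_t> make_data(std::size_t n) {
    std::vector<std::uint64_t> data(n);
    std::iota(data.begin(), data.end(), std::uint64_t{0});
    return data;
}

std::uint64_t sum_sequential(const std::vector<std::uint64_t>& data) {
    std::uint64_t total = 0;
    for (auto x : data) total += work(x);
    return total;
}

// Threads repeatedly claim the next `grain` elements until the input runs out.
std::uint64_t sum_chunked(const std::vector<std::uint64_t>& data,
                          std::size_t grain, unsigned n_threads) {
    assert(grain > 0 && n_threads > 0);
    const std::size_t n = data.size();
    grain = std::min(grain, std::max<std::size_t>(n, 1));  // clamp oversized grains
    std::atomic<std::size_t> next{0};
    std::vector<std::uint64_t> partial(n_threads, 0);
    std::vector<std::thread> workers;
    for (unsigned t = 0; t < n_threads; ++t) {
        workers.emplace_back([&, t] {
            std::uint64_t local = 0;  // accumulate privately, publish once at the end
            for (;;) {
                const std::size_t begin = next.fetch_add(grain);
                if (begin >= n) break;
                const std::size_t end = begin + std::min(grain, n - begin);
                for (std::size_t i = begin; i < end; ++i) local += work(data[i]);
            }
            partial[t] = local;
        });
    }
    for (auto& w : workers) w.join();
    return std::accumulate(partial.begin(), partial.end(), std::uint64_t{0});
}
```

Timing `sum_chunked(data, grain, threads)` for a range of grains, from 1 up to `data.size()`, and comparing each against `sum_sequential(data)` gives a rough picture of where the per-task overhead stops dominating. If the single-task case is already slower than the sequential loop, the problem is more likely in how the code was restructured than in the parallelism itself.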

HPCE / hpce-2017-cw5