HPCE / hpce-2017-cw5


Memory Efficiency #19

Closed. benfwalker closed this issue 6 years ago

benfwalker commented 6 years ago

I just wanted to double-check that the only performance metric is speed, and that memory efficiency is not important?

m8pple commented 6 years ago

Yes, speed is the only important thing. However, you do have a "subject to" constraint: maximise speed, subject to not exhausting the available memory in the given AWS instance (i.e. don't crash).

Another recent issue touches on this, and it is worth thinking about time versus space complexity in general - if space has a similar complexity to time, then how much acceleration is possible given that memory is usually slower than compute (we measure bandwidth in GB/sec, but we measure compute in TOp/s)?

giuliojiang commented 6 years ago

So is it worth making our optimized versions use less memory than the Reference versions, so that they can run larger inputs without crashing? Or is it ok for our optimized one to use the same amount of memory as the Reference but be a lot faster?

Making it use less memory could add some overhead at small problem sizes, but would make ridiculously large input sizes feasible. I imagine it depends on what the user wants to do with the algorithm, but it would be useful to know whether it's worth spending time on this in this coursework.

m8pple commented 6 years ago

When you encounter conflicts between speed at small scales versus speed at large scales, then remember that you can always switch between two different versions, depending on scale. For sufficiently small scales, the reference may be the fastest.
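
As a minimal sketch of what that switch could look like (the solver names and the crossover value below are made up and purely illustrative, not part of the coursework framework):

```c++
// Minimal sketch of switching implementations by scale; the solver names and
// the crossover value are assumptions, to be tuned per puzzle.
#include <cstddef>
#include <cstdint>
#include <vector>

std::vector<uint32_t> solve_reference(const std::vector<uint32_t> &input);  // hypothetical
std::vector<uint32_t> solve_optimised(const std::vector<uint32_t> &input);  // hypothetical

std::vector<uint32_t> solve(const std::vector<uint32_t> &input)
{
    // Empirically chosen crossover: below it, the reference's lack of
    // threading/setup overhead tends to make it the faster option.
    const std::size_t kCrossover = 1u << 14;

    return input.size() < kCrossover ? solve_reference(input)
                                     : solve_optimised(input);
}
```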

From an acceleration/optimisation point of view, there is often not a lot of difference between optimising compute complexity and memory complexity, unless there is a conflict between the two. Taking something from O(n^2) to O(n) in memory will often have somewhat similar benefits to taking something from O(n^2) to O(n) in compute. If you think of it in terms of communication (which is what memory accesses are), then performing O(1) transfers to each of O(n^2) memory locations is going to take more time than performing O(n) transfers to each of O(n) memory locations, even though the total number of transfers is the same, because an O(n^2)-sized memory will be more expensive to access than an O(n)-sized one.

A Xeon has something like 20MB of cache, which is (let's say) 10x faster than DDR. Something that is O(n) is going to sit in that cache, and as a result be up to 10x faster on all memory transfers.
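
To put rough numbers on that (assuming 8-byte elements and a 20MB last-level cache):

```latex
O(n)\ \text{working set}:\; 8n \le 20\,\text{MB} \;\Rightarrow\; n \lesssim 2.5 \times 10^{6}
\qquad
O(n^2)\ \text{table}:\; 8n^2 \le 20\,\text{MB} \;\Rightarrow\; n \lesssim 1600
```

So the O(n) version stays cache-resident well past the scales at which the O(n^2) version has already spilled out to DDR.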

Whether you do it is up to you, and comes down to your personal evaluation of design time versus improvement. There is no expected level of achievement for these puzzles, so there is no set of things that should or shouldn't be done, nor some level of performance that should be achieved. However, most people will hit the point of diminishing returns, where each extra step becomes more complex and has much less impact, and will decide to stop.

benfwalker commented 6 years ago

I guess a related question is, will our implementations ever be tested with inputs that the reference implementations can't handle, like very large scales?

m8pple commented 6 years ago

There is no real need for the reference implementation to be run on the same machine (and for time/money reasons it is better to pre-calculate reference answers offline). So the reference version works fine in 64-bit, I've got TBs of disk space locally for virtual memory, and the OS is more than happy to start swapping as the scale increases (well, not happy, but it will do it).

So the reference version can handle very large scales (with the right compiler, OS, and configuration), but will become 1000s of times slower. If the performance gets so bad that the reference version effectively doesn't finish, is that the same as the reference version not working?

The short answer is: yes, the scale schedule does go higher than the reference version can handle on the AWS machine. In fact, there is no limit on the upper bound of scale, only on time. See #1

giuliojiang commented 6 years ago

Then I think the best of both worlds is to try to allocate the memory for the normal version and, if the allocation fails, fall back to the memory-efficient (and slower) one. The fallback is considerably slower because of the algorithm that automatically evicts blocks of memory that are no longer needed, and because it makes a large number of calls to the allocator for small chunks instead of a single massive block.
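
Something like the following shape, as a minimal sketch (the two solver functions are hypothetical stand-ins for the full-table and memory-efficient implementations):

```c++
// Minimal sketch of "try the fast path, fall back if the allocation fails".
// The two solver functions are hypothetical stand-ins for the real versions.
#include <cstdint>
#include <new>      // std::bad_alloc
#include <string>
#include <vector>

uint32_t edit_distance_full_table(const std::string &a, const std::string &b,
                                  std::vector<uint32_t> &table);              // O(n^2) memory
uint32_t edit_distance_low_memory(const std::string &a, const std::string &b); // O(n) memory, slower

uint32_t edit_distance(const std::string &a, const std::string &b)
{
    try {
        // Allocate the whole DP table up front, so running out of memory shows
        // up here as std::bad_alloc rather than part-way through the work.
        std::vector<uint32_t> table((a.size() + 1) * (b.size() + 1));
        return edit_distance_full_table(a, b, table);
    } catch (const std::bad_alloc &) {
        // Not enough memory for the full table: use the memory-efficient path.
        return edit_distance_low_memory(a, b);
    }
}
```

One caveat: on Linux with memory overcommit enabled, a huge allocation can appear to succeed and the process only dies (OOM-killed) once the pages are actually touched, so it can be worth checking the problem size against the physically available memory as well as catching bad_alloc.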

benfwalker commented 6 years ago

@giuliojiang out of interest, are you talking about a specific puzzle?

giuliojiang commented 6 years ago

@benfwalker edit_distance

benfwalker commented 6 years ago

@giuliojiang interesting :P

giuliojiang commented 6 years ago

@benfwalker Because if you try edit_distance with size 60000 (or more if you have more than 16GB of RAM), the Reference crashes instantly. (I also have no swap partition; having swap might push that limit up a bit.)
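
Back-of-the-envelope, assuming one 4-byte entry per cell of the full DP table:

```latex
60000^2\ \text{cells} \times 4\,\text{bytes/cell} \approx 1.44 \times 10^{10}\,\text{bytes} \approx 14.4\,\text{GB}
```

which, alongside the OS and everything else resident, doesn't fit in 16GB without swap; with 8-byte cells it would be roughly 28.8GB.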

benfwalker commented 6 years ago

@giuliojiang indeed