At present, the whole iteration space is done in one go. This means that if twelve buffers are needed to execute a program, twelve full-sized buffers are created and used. Not only does this waste memory, but it makes the cache work harder.
If only the final buffer is full-sized, the iteration space can be broken into chunks, meaning that every other buffer can be relatively small (say 1024 iterations). We could also then use this to implement multithreading down the line by having a pool of threads pick up chunks.
At present, the whole iteration space is done in one go. This means that if twelve buffers are needed to execute a program, twelve full-sized buffers are created and used. Not only does this waste memory, but it makes the cache work harder.
If only the final buffer is full-sized, the iteration space can be broken into chunks, meaning that every other buffer can be relatively small (say 1024 iterations). We could also then use this to implement multithreading down the line by having a pool of threads pick up chunks.