Turns out that the default value for `max_shared_memory` (https://github.com/funkey/gunpowder/blob/e523b49ca846a9fd46ab6fc0dd1040cc4a4d53b4/gunpowder/tensorflow/nodes/predict.py#L71) does not allocate 1GB but rather 4GB, because the value type `ctypes.c_float` is used when creating the `RawArray` (https://github.com/funkey/gunpowder/blob/e523b49ca846a9fd46ab6fc0dd1040cc4a4d53b4/gunpowder/tensorflow/nodes/predict.py#L90).

With 4GB for each input and output array, each predict worker allocates 8GB of shared memory. With 4 workers, job memory should then be at least 32GB. Now, here is the performance bug: I did not know about the shared memory requirement and have always run my inference pipeline with 4 workers and only 8GB of memory (to minimize my resource usage counting :)). You'd think that gunpowder would run out of memory and be killed, but it actually does not! Turns out the reason is that the Python multiprocessing package creates a temp file and mmaps it whenever a `RawArray` is used: src. So no matter how many workers are run and how much shared memory is allocated, Python will happily chug along, albeit possibly slowed down by memory being swapped in and out to disk.
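To make the 1GB-vs-4GB discrepancy concrete, here is a minimal sketch, assuming the default is `1024**3` and is passed to `RawArray` as an element count (which is what the linked lines suggest):

```python
import ctypes
from multiprocessing.sharedctypes import RawArray

# Assumed default, intended to mean "1 GB" (1024**3):
max_shared_memory = 1024 * 1024 * 1024

# RawArray's second argument is an element count, not a byte count, so the
# actual allocation is count * sizeof(c_float), i.e. 4x larger than intended.
bytes_per_array = max_shared_memory * ctypes.sizeof(ctypes.c_float)
print(bytes_per_array / 1024**3)  # -> 4.0 (GiB per array)

# This is roughly what each predict worker does, once for inputs and once for
# outputs (left commented out here, since it would actually allocate 8 GiB):
# in_array = RawArray(ctypes.c_float, max_shared_memory)
# out_array = RawArray(ctypes.c_float, max_shared_memory)
```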
The most immediate slowdown is during initialization, when the array is set to zero. I have seen my inference jobs take more than 15 minutes to initialize all of the `RawArray`s in their disk temp files (vs. less than 1 minute when there is enough memory).
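For reference, CPython's `RawArray` zero-fills the whole buffer with `ctypes.memset` when given a size, which touches every page of the mmap'ed backing store up front. A small, self-contained way to observe that cost (the 256 MiB size here is arbitrary, not gunpowder's default):

```python
import ctypes
import time
from multiprocessing.sharedctypes import RawArray

# 64M floats = 256 MiB; big enough to see the zeroing cost, small enough to run anywhere.
n = 64 * 1024 * 1024

start = time.perf_counter()
buf = RawArray(ctypes.c_float, n)  # zero-initializes (and therefore touches) all 256 MiB
elapsed = time.perf_counter() - start
print(f"allocated and zeroed {n * ctypes.sizeof(ctypes.c_float) / 2**20:.0f} MiB "
      f"in {elapsed:.2f}s")
```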
The second-order bug is at runtime, when data is paged out from main memory to disk. In my inference jobs, once past initialization, 8GB was actually enough for four workers, but I can imagine scenarios where not enough memory is requested for the job and data is paged out on every transfer. I don't know exactly what mechanism the OS uses to decide when to page something out of a memory-mapped file, but we should probably avoid this scenario at all times because it can be an opaque performance bug.
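One Unix-specific way to check whether transfers are hitting disk is to watch the process's major page fault counter; a diagnostic sketch (not part of gunpowder):

```python
import resource

def major_faults() -> int:
    # Major faults are page faults that had to read the page back from disk,
    # which is what happens when the mmap'ed RawArray data gets paged out.
    return resource.getrusage(resource.RUSAGE_SELF).ru_majflt

before = major_faults()
# ... push one batch through the shared-memory transfer here ...
after = major_faults()
print(f"major page faults during transfer: {after - before}")
```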
My recommendations are:
- At the very least, the `max_shared_memory` argument should be made more transparent to the user. Maybe something like `shared_memory_per_worker_GB`, from which the appropriate `max_shared_memory` is calculated (see the sketch after this list).
- The default for `max_shared_memory` should be decreased substantially. I'm guessing that most production jobs won't transfer more than a few hundred MB, so the default could be capped at something like 64MB or 128MB, and more experimental users can increase it accordingly.
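As a sketch of the first recommendation (the name `shared_memory_per_worker_GB` and the helper below are hypothetical, not existing gunpowder API; the two-arrays-per-worker split follows from the input/output arrays described above):

```python
import ctypes

def max_shared_memory_from_gb(shared_memory_per_worker_GB: float) -> int:
    """Convert a per-worker shared-memory budget in GB into the element count
    that would be passed to RawArray(ctypes.c_float, ...).

    Each worker holds two arrays (input and output), so the budget is split in
    half before dividing by the element size."""
    bytes_per_array = shared_memory_per_worker_GB * 1024**3 / 2
    return int(bytes_per_array // ctypes.sizeof(ctypes.c_float))

# A 1 GB per-worker budget -> 134,217,728 c_float elements per array (512 MiB each).
print(max_shared_memory_from_gb(1))
```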