GeoscienceAustralia / uncover-ml

Machine Learning system for Geoscience Australia uncover project
Apache License 2.0

MPI shared memory #104

Open brenmous opened 4 years ago

brenmous commented 4 years ago

When implementing bagging/bootstrapped ensemble models I tried to be clever and set up an MPI shared memory window. The intention was that, instead of duplicating the training data in every process (very memory intensive), we create a single shared window on the root node, letting us train big ensemble models in parallel with a minimal memory footprint. There are a few issues though:

  1. I don't know whether it actually works. When I first tested it, I created the shared window and used the shared data in the same scope, and this showed a decrease in memory usage. I then refactored the shared memory creation routine into mpiops.py, sharing the data as soon as it was intersected and passing it around as a SharedTrainingData named tuple (defined in geoio.py). I suspect that with the window going out of scope, the numpy arrays are being copied by each process and are no longer shared. My assumption was that the buffer created by the shared window and used to initialise the numpy arrays would be preserved, but memory usage picked back up after this change (see the sketch after this list).

  2. We can't use shared memory across multiple compute nodes on the NCI. If a job spreads across more than one node, the shared memory window approach no longer works, as a window can only be shared within a single node's physical memory. This means really memory-hungry jobs have to use a hugemem queue.
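
For reference, here is a minimal sketch of the pattern described above (not the exact code in mpiops.py), assuming mpi4py and the training data as a numpy array:

```python
import numpy as np
from mpi4py import MPI

def create_shared_array(shape, dtype=np.float64, comm=MPI.COMM_WORLD):
    """Allocate a node-local shared-memory window and expose it to every
    rank as a numpy array. Returns (array, node_comm, window); the window
    must stay referenced for as long as the array is in use."""
    # Ranks that share physical memory (i.e. are on the same node).
    node_comm = comm.Split_type(MPI.COMM_TYPE_SHARED)

    itemsize = np.dtype(dtype).itemsize
    # Only the node-local root allocates the buffer; other ranks attach to it.
    nbytes = int(np.prod(shape)) * itemsize if node_comm.Get_rank() == 0 else 0
    win = MPI.Win.Allocate_shared(nbytes, itemsize, comm=node_comm)

    # Query the buffer owned by the node-local root and wrap it with numpy.
    buf, _ = win.Shared_query(0)
    arr = np.ndarray(buffer=buf, dtype=dtype, shape=shape)
    return arr, node_comm, win
```

If point 1 is a lifetime problem, keeping a reference to the returned window for as long as the arrays are used (e.g. stashing it in the SharedTrainingData tuple) should rule out the buffer being released prematurely; the node-local root would fill the array and call node_comm.Barrier() before other ranks read it.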

We need to either double down on the shared memory approach, making sure it works as intended and getting it to work across networked compute nodes, or scrap it and go back to duplicating the training data when parallel training is needed.
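
On point 2, one possible direction for cross-node support (not something the code does today) is MPI one-sided RMA: the root exposes the training data through a window over ordinary memory and remote ranks Get only the rows they need, rather than holding a full copy. A rough sketch, assuming mpi4py, contiguous float64 data, and a hypothetical fetch_rows helper:

```python
import numpy as np
from mpi4py import MPI

def fetch_rows(comm, data, row_start, row_count, ncols):
    """Rank 0 exposes `data` (a contiguous float64 2D array) through an RMA
    window; any rank copies just the rows it needs with a one-sided Get.
    Unlike a shared-memory window this works across nodes, but every
    access is a network transfer rather than a local memory read."""
    itemsize = MPI.DOUBLE.Get_size()

    if comm.Get_rank() == 0:
        win = MPI.Win.Create(data, disp_unit=itemsize, comm=comm)
    else:
        # Non-root ranks expose no memory of their own.
        win = MPI.Win.Create(None, comm=comm)

    block = np.empty((row_count, ncols), dtype=np.float64)
    win.Lock(0, MPI.LOCK_SHARED)
    # Read row_count * ncols doubles starting at row row_start on rank 0.
    win.Get(block, target_rank=0,
            target=(row_start * ncols, row_count * ncols, MPI.DOUBLE))
    win.Unlock(0)
    win.Free()
    return block
```

Win.Create and Win.Free are collective, so in practice the window would be created once up front and reused; the Lock/Get/Unlock calls themselves are one-sided. Whether the per-access latency is acceptable for training would need benchmarking on the NCI.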