Closed: MircoValentiniECMWF closed this 1 year ago
However, I'm not so convinced I understand the PR itself: to add in-memory caching for reading ORCA grids? I see this plugin as a black box for dependencies, as in, client code shouldn't concern itself with Atlas's plugins directly. This cache mapping could be downstream at the client code level, where one could cache any grid. Even at Atlas's level this would make more sense (as an optional, runtime mechanism in a factory).
I think a broader discussion is needed. There are grids other than ORCA that are heavy to load, and a performance measurement should not include interpolation -- it's an orthogonal concern. Sure, you see a benefit there, but it translates into the overall run the same way. So, fine, if this solves a current problem, but I think there are more impactful ways to solve it.
> However, I'm not so convinced I understand the PR itself: to add in-memory caching for reading ORCA grids? I see this plugin as a black box for dependencies, as in, client code shouldn't concern itself with Atlas's plugins directly. This cache mapping could be downstream at the client code level, where one could cache any grid. Even at Atlas's level this would make more sense (as an optional, runtime mechanism in a factory).
I agree that at Atlas's level this should be a runtime option, ideally in the factory. Whether the client code itself should implement the caching instead is debatable. In any case, the direct client here is mir and not multio. This is all in the context of interpolation/re-gridding, not that of fetching the co-ordinates to encode in GRIB2 and output to disk/FDB. So caching in MIR is perhaps a better solution.
> I think a broader discussion is needed. There are grids other than ORCA that are heavy to load, and a performance measurement should not include interpolation -- it's an orthogonal concern. Sure, you see a benefit there, but it translates into the overall run the same way. So, fine, if this solves a current problem, but I think there are more impactful ways to solve it.
See above. My understanding is that this is very much in the context of interpolation and the behaviour of mir re-loading the grids each time we carry it out. Yes, it is because we set up a separate MIRJob for each interpolation. Is it something we shouldn't do? Perhaps, but I believe it was discussed before and the upshot was that otherwise we'd need to cache a MIRJob for all different types of source grids, etc. A local cache closer to where the grid is actually loaded/created (in the factory indeed) seems sensible. Happy to discuss.
@wdeconinck It is evident that we require a mechanism to clear the cache. However, due to the current approach of invoking mir from multio, it cannot be integrated into the ORCA destructor. In my view, the most appropriate time to clear the cache would be during plugin deregistration. This perspective aligns with the idea that caching should be a runtime option within the Atlas Grid factory, coupled with logic to determine which grids should be cached (i.e. grids that require I/O).
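For illustration only, a minimal sketch of what such a clearable cache could look like; OrcaGridCache, OrcaGridData and the onAtlasOrcaPluginDeregistration hook are hypothetical names, not the actual atlas-orca API, and the real plugin deregistration entry point will differ:

```cpp
// Illustration only (not the actual atlas-orca code): a process-wide cache of
// decoded ORCA grid data that can be cleared explicitly, e.g. from the plugin
// deregistration path rather than from the ORCA grid destructor.
#include <map>
#include <mutex>
#include <string>
#include <vector>

struct OrcaGridData {                      // hypothetical placeholder payload
    std::vector<double> lon, lat;
};

class OrcaGridCache {
public:
    static OrcaGridCache& instance() {
        static OrcaGridCache cache;
        return cache;
    }

    // Returns nullptr on a miss; the pointer stays valid only until the next
    // insert() or clear().
    const OrcaGridData* find(const std::string& name) {
        std::lock_guard<std::mutex> lock(mutex_);
        auto it = entries_.find(name);
        return it != entries_.end() ? &it->second : nullptr;
    }

    void insert(const std::string& name, OrcaGridData data) {
        std::lock_guard<std::mutex> lock(mutex_);
        entries_[name] = std::move(data);
    }

    void clear() {                         // explicit eviction of all entries
        std::lock_guard<std::mutex> lock(mutex_);
        entries_.clear();
    }

private:
    std::map<std::string, OrcaGridData> entries_;
    std::mutex mutex_;
};

// Hypothetical teardown hook: the real atlas plugin deregistration entry
// point will differ, but this is where clear() would be invoked.
void onAtlasOrcaPluginDeregistration() {
    OrcaGridCache::instance().clear();
}
```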
Please make this a runtime option. Code in CMake like ORCA_IN_MEMORY_CACHE going down to #if defined(HAVE_ORCA_IN_MEMORY_CACHE) is not desirable.
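As a rough sketch of the requested direction (not actual atlas code): the compile-time guard could be replaced by a check performed once at runtime, e.g. via an environment variable or, more likely in this stack, an atlas/atlas-orca configuration value. The variable name ATLAS_ORCA_CACHING and the reuse of the hypothetical OrcaGridCache from the sketch above are assumptions:

```cpp
// Sketch only: gate the cache at runtime rather than at build time.
#include <cstdlib>
#include <cstring>

bool orcaCachingEnabled() {
    // Evaluated once; ATLAS_ORCA_CACHING is an illustrative variable name.
    static const bool enabled = [] {
        const char* env = std::getenv("ATLAS_ORCA_CACHING");
        return env != nullptr && std::strcmp(env, "1") == 0;
    }();
    return enabled;
}

// At the point where the grid data would otherwise be re-read from file
// (readGridFromFile is hypothetical), the flow could look like:
//
//   if (orcaCachingEnabled()) {
//       if (const auto* cached = OrcaGridCache::instance().find(name)) {
//           return *cached;                      // reuse, no I/O
//       }
//   }
//   OrcaGridData data = readGridFromFile(name);  // the expensive I/O path
//   if (orcaCachingEnabled()) {
//       OrcaGridCache::instance().insert(name, data);
//   }
//   return data;
```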
Thanks Mirco for this. See individual comments. What's missing is a means to clear the cache. We should be able to clear it when wanted, and it should be cleared upon plugin library destruction.
Agreed. We need to be able to control caches, including choosing to evict (clear) them.
I also tend to agree with @pmaciel that, in fact, caching would be better done at a higher level, e.g. the Grid object, and the same could be said about e.g. an expensive Mesh generation step. That would trigger a whole design change in multiple packages, e.g. MIR keeping an in-memory cache of grid objects alive across invocations. This would then indeed also cater for any unstructured grid. From a design point of view it is rather obscure to have to rely on plugin or specialised grid construction mechanisms to do the caching itself for performance.
> I also tend to agree with @pmaciel that, in fact, caching would be better done at a higher level, e.g. the Grid object, and the same could be said about e.g. an expensive Mesh generation step. That would trigger a whole design change in multiple packages, e.g. MIR keeping an in-memory cache of grid objects alive across invocations. This would then indeed also cater for any unstructured grid. From a design point of view it is rather obscure to have to rely on plugin or specialised grid construction mechanisms to do the caching itself for performance.
I also agree. But please note we need an intermediate solution soon, before we rearchitect much. I suggest we rearchitect the design once the Grid library is available. Meanwhile, please have a solution, but a dynamic one, without ifdefs and hardwired CMake build options.
Certainly several solutions to this particular multio performance problem could co-exist.
Following the track of this atlas-orca-specific approach will require an atlas-orca config file setting (and/or an environment variable) to enable caching. Then, without rearchitecting designs, we can only clear the cache on atlas-orca plugin destruction (deregistration). This caching should preferably be enabled only for the multio-pipelines MPI ranks, not for the compute MPI ranks.
I leave it up to @pmaciel to clarify the much-preferred MIR track further, with proper cache management, and how much design change this really entails.
This approach sounds very good to me, @wdeconinck -- thank you.
Closing as no longer needed. Thanks @MircoValentiniECMWF for your effort though!
Still a rough implementation. Some notes:
1) An std::map has been chosen over an std::unordered_map. The reason is that this map is expected to contain never more than 5 elements, since the grid resolution remains constant during a simulation. With so few elements, the computational cost of the hash function in an unordered map can sometimes exceed the gain over the ordered map's key comparisons (see the sketch after this list). Open to discussing this point.
2) The CMakeLists.txt may seem unwieldy with its extensive if-else structure. Unfortunately, finding a more elegant solution proved challenging. Open to suggestions for improvement in this area.
3) Although using pointers within the map could potentially improve performance by reducing the need for copies, implementing this change would require more extensive code modifications (also illustrated in the sketch below). For the sake of code stability, I decided against such invasive changes. Open to suggestions here as well.
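To make points 1 and 3 concrete, a small sketch under the assumption that the cached payload is essentially the coordinate arrays (all names illustrative): with only a handful of fixed keys an ordered std::map lookup is cheap, and storing shared_ptr values would let a cache hit avoid copying the arrays without restructuring the surrounding code much.

```cpp
#include <map>
#include <memory>
#include <string>
#include <vector>

// Illustrative payload: the expensive-to-read part of an ORCA grid.
struct GridPayload {
    std::vector<double> lon, lat;
};

// Point 1: with never more than a handful of resolutions as keys, an ordered
// std::map is adequate; hashing std::string keys in an std::unordered_map
// buys little at this size.
using CacheByValue = std::map<std::string, GridPayload>;

// Point 3: storing shared_ptr instead of values means a cache hit hands back
// the already-decoded arrays without copying them.
using CacheByPointer = std::map<std::string, std::shared_ptr<const GridPayload>>;

std::shared_ptr<const GridPayload> lookup(const CacheByPointer& cache,
                                          const std::string& name) {
    auto it = cache.find(name);
    return it != cache.end() ? it->second : nullptr;   // no payload copy on hit
}
```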
I tested on my personal machine, and this modification gives a 10% to 20% improvement in pure interpolation timing. We are still working on a benchmark with the full coupled simulation.