MPAS-Dev / MPAS-Analysis

Provides analysis for the MPAS components of E3SM

18to6 performance #172

Closed vanroekel closed 7 years ago

vanroekel commented 7 years ago

At the request of @milenaveneziani I have been doing timing/testing of MPAS-Analysis. I want to log some information here

1) The MOC on develop does not work for the 18to6 as expected. It hangs in the "load data" phase of the computation. Note this test was for a single year of data.

2) I have run the diagnostics for all_seaIce for the 18to6 on 4 years of data. It took 7.5 hours to complete the seaIce diagnostics alone. This seems concerning to me. If we assume the ocean diagnostics are roughly equivalent, it would be 15 hours for the timeSeries and regridded plots alone for 4 years of data. I will test on the individual tasks to break timings down further.

xylar commented 7 years ago

So the first thing we can do to speed things up is task parallelism (#129). This would be further helped by breaking analysis tasks like sea_ice.modelvsobs and sea_ice.timeseries into subtasks (#133).

From there, we would need to know where the bottlenecks are. If it is loading data, we may need to discuss storing data in a more analysis-friendly way, with fewer variables and more time slices per file. Or we may need an intermediate (and optimized) step that pulls the data required by analysis out of the data files and caches it.
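The caching idea might look something like the sketch below. The helper names and the JSON store are hypothetical stand-ins (a real version would extract variables with xarray and cache NetCDF), but the shape of the logic is the same: pay the expensive read once, then reuse the small extracted result.

```python
import hashlib
import json
from pathlib import Path

def cached_extract(path, variables, cache_dir, loader):
    """Pull only the variables the analysis needs out of a large file,
    caching the (small) result so later runs skip the expensive read.
    `loader(path, var)` is a stand-in for the real NetCDF read."""
    key = hashlib.md5(
        "{}:{}".format(path, sorted(variables)).encode()).hexdigest()
    cache_file = Path(cache_dir) / (key + ".json")
    if cache_file.exists():
        # cheap path: reuse the previously extracted subset
        return json.loads(cache_file.read_text())
    data = {var: loader(path, var) for var in variables}  # expensive step
    cache_file.write_text(json.dumps(data))
    return data
```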

If it is computing climatologies, we should explore using nco instead of python or adding threading or MPI parallelism to MPAS-Analysis (a considerable challenge).

@vanroekel, thanks for getting us started. I don't know what is feasible but it might be helpful to move 1 year worth of data (including namelists, streams files, timeSeriesStatsMonthly history files and one restart file each for ocean and sea ice) to a location on turquoise or NERSC. Or I just need to get an account on Titan.

pwolfram commented 7 years ago

@vanroekel, #129 as developed by @xylar should help out a ton with speedup.

Note that we need to be cognizant of the cost of loading data from disk into memory, which depends on on-node bus speed as well as the total amount of data that needs to be read. This could be the ultimate bottleneck with very large datasets. At that point, we have no real recourse but to use the dask.distributed backend to xarray, which lets us use multiple nodes to read data simultaneously; we could then effectively cut the run time just by adding nodes, because reading from disk would be parallel. In that scenario, having more time slices in fewer files actually hurts performance. Hence, I would be careful about changing the file structure in the short term: it could be a lot of work that doesn't deliver the performance gain we expect in the long term.
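As a single-node stand-in for the idea, the sketch below uses only the standard library to read files concurrently; dask.distributed applies the same pattern across nodes rather than threads, which is where the real scaling would come from. The function name is illustrative, not an existing API.

```python
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path

def parallel_read(paths, max_workers=4):
    """Read a set of files concurrently. With dask.distributed the same
    idea spans nodes instead of threads, so wall-clock read time shrinks
    as workers are added (until the file system saturates)."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # map preserves input order, so results line up with `paths`
        return list(pool.map(lambda p: Path(p).read_bytes(), paths))
```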

I think the lowest-hanging fruit past #129 is probably caching strategies, e.g., computing climatologies incrementally, similar to what @xylar has already been doing.
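The incremental-climatology idea could be sketched as follows; the class is a hypothetical illustration (a real version would persist the running sums to disk between analysis runs), but it shows why each new year of data only costs one pass:

```python
class IncrementalClimatology:
    """Accumulate a monthly climatology incrementally: each new year of
    data requires one pass over that year only, and the running sums and
    counts can be cached between analysis runs."""

    def __init__(self):
        self.sums = {}    # month -> running sum
        self.counts = {}  # month -> number of samples seen

    def add(self, month, value):
        self.sums[month] = self.sums.get(month, 0.0) + value
        self.counts[month] = self.counts.get(month, 0) + 1

    def mean(self, month):
        return self.sums[month] / self.counts[month]
```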

vanroekel commented 7 years ago

@xylar I am now moving a year of 18to6 (ocean timeSeriesStatsMonthly) data to NERSC (/global/cscratch1/sd/lvroekel/18to6Titan). I'll fix up permissions once the transfer is done. Note that surface area averages are not on for this data, so the SST time series analysis will not work. Once the ocean data transfers, I'll move seaIce.

A bit of an update on timing, computing ice timeseries alone took 4 hours for the 4 years of data.

xylar commented 7 years ago

Thanks, @vanroekel.

> A bit of an update on timing, computing ice timeseries alone took 4 hours for the 4 years of data.

This is almost certainly all about file read time, since the data itself isn't likely to be that big and averaging it should be fast.

@pwolfram, it would be hard to argue against the idea of having fewer variables per file so you don't have to seek through a file to find what you need (except, of course, that you then have a ton of different files). I get what you're saying about multiple time slices in a single file.

For now, I don't think we have the expertise to solve these issues. I guess we're going to have to develop it.

pwolfram commented 7 years ago

> @pwolfram, it would be hard to argue against the idea of having fewer variables per file so you don't have to seek through a file to find what you need (except, of course, that you then have a ton of different files). I get what you're saying about multiple time slices in a single file.

I agree in general, although care is needed here: some file systems have more trouble with many small files than with fewer large ones, so this may not be a clear-cut win.

> For now, I don't think we have the expertise to solve these issues. I guess we're going to have to develop it.

@xylar, can you please specify what expertise we don't have that you think we need? If we use dask.distributed and can load files in parallel effectively, that would largely solve the problem, because we can just use more nodes to get close to ideal speed-up. This seems like a low-hanging-fruit win to me, assuming there aren't too many problems with the admittedly nascent dask.distributed. I'm happy to tap my xarray and dask networks to bring in the right ideas; those individuals certainly have the necessary expertise. Is there any reason we shouldn't plan on using dask.distributed on top of xarray?

xylar commented 7 years ago

> @xylar, can you please specify what expertise we don't have that you think we need? If we use dask.distributed and can load files in parallel effectively, that would largely solve the problem, because we can just use more nodes to get close to ideal speed-up. This seems like a low-hanging-fruit win to me, assuming there aren't too many problems with the admittedly nascent dask.distributed. I'm happy to tap my xarray and dask networks to bring in the right ideas; those individuals certainly have the necessary expertise. Is there any reason we shouldn't plan on using dask.distributed on top of xarray?

This is the area where we don't have the necessary expertise ourselves and may need to develop it. I'm not as optimistic as you seem to be that using dask.distributed will a) be easy or b) have the performance we need (reasonable total execution time). Maybe once Luke has some files at NERSC for us to play with, we can see what the options are.

vanroekel commented 7 years ago

One other possibility is to utilize the timeSeriesStatsClimatology capability in MPAS (suggested in discussion with @milenaveneziani). This would require far fewer files to be read.

pwolfram commented 7 years ago

@xylar, how about this:

I think the simplest thing we compute is a mean, e.g., a "climatology", so I would propose a compute-off "contest": given a common set of netCDF files, the fastest and cleanest implementation of the mean, both including and excluding writing results to disk, wins. The metrics:

  1. loading time of data and compute time (together because this is really the time to solution)
  2. writing time of solution to disk (we need this because we probably should be caching data results)
  3. number of code lines written (I would argue this is secondary to 1 and 2, up to a limit: e.g., as long as we are within a factor of 2-3, I would take a 10-line solution over a 100-1000 line solution that is marginally faster.)
  4. programmer time (This is less of an issue for easy to implement operations, but more of an issue for complex operations. We should keep track of the implementation time for this solution as part of our metrics, however.)

Does this seem reasonable to you? We would use the same dataset on NERSC and the computation and its metrics would be computed via a submitted job so that dedicated nodes are used (and to also avoid spurious performance due to burst-buffer caching).
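Metrics 1 and 2 could be measured with a small harness along these lines; the function name and the best-of-N policy are illustrative choices, not an agreed-upon protocol:

```python
import time

def benchmark(candidate, *args, repeats=3):
    """Run a candidate implementation several times and keep the best
    wall-clock time, to damp noise from I/O caching and node sharing."""
    best = float("inf")
    result = None
    for _ in range(repeats):
        start = time.perf_counter()
        result = candidate(*args)
        best = min(best, time.perf_counter() - start)
    return result, best
```

Running each candidate inside a submitted job, as proposed, keeps the timings comparable because every contestant gets dedicated nodes.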

pwolfram commented 7 years ago

@vanroekel, you are proposing to compute more analytics online, is that correct? That would most certainly alleviate many of the challenges we are facing.

xylar commented 7 years ago

So part of the problem in terms of plotting time series is that we're storing the regionally averaged (therefore tiny) data sets in the same files as the 3D time-averaged data sets. If we put regionally averaged data in their own files, these would be small, manageable files and time series could be produced in about the same amount of time regardless of the size of the MPAS mesh. I think this should be added to the agenda of tomorrow's ACME ice-ocean meeting, and maybe spawn a special topic meeting later this month.
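The file-splitting idea amounts to a simple routing rule at write time; the sketch below uses hypothetical group names and takes each variable's dimension tuple as a proxy for its size:

```python
def group_by_size_class(variables):
    """Route variables to separate output files by size class, so that
    reading a tiny regional-average time series never pages through 3D
    fields. `variables` maps a variable name to its dimension tuple."""
    groups = {"regional": {}, "2d": {}, "3d": {}}
    for name, dims in variables.items():
        if len(dims) >= 3:
            groups["3d"][name] = dims
        elif len(dims) == 2:
            groups["2d"][name] = dims
        else:
            groups["regional"][name] = dims
    return groups
```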

xylar commented 7 years ago

@pwolfram, your compute-off sounds reasonable if there is time to develop several competing solutions. I really need to be developing land-ice related metrics instead of improving what we have. So I won't be able to put time into this particular area of optimization. My main concern is that performance is going to drain a lot of our already very limited resources. I don't know what to do about that, other than try to have as much of the heavy lifting done on the MPAS side, rather than on the analysis side.

vanroekel commented 7 years ago

@xylar I think a special meeting regarding the analysis is on @milenaveneziani's to-do list. And @pwolfram yes, online climatology is indeed what is proposed.

pwolfram commented 7 years ago

@xylar, I think having more work done on the MPAS side is always going to be the way to go, because the data is in memory and easy to compute on in the AM. Writing a small, reduced dataset out to disk is cheap.

My initial vision of MPAS-Analysis was just that, essentially a 2D and timeseries plotting routine that takes output from MPAS AMs to produce plots. It has obviously evolved into more than that but fundamentally if we can keep that as its bread and butter we should be very successful.

pwolfram commented 7 years ago

@vanroekel, @xylar and @milenaveneziani I agree that we should meet.

xylar commented 7 years ago

> My initial vision of MPAS-Analysis was just that, essentially a 2D and timeseries plotting routine that takes output from MPAS AMs to produce plots. It has obviously evolved into more than that but fundamentally if we can keep that as its bread and butter we should be very successful.

So @vanroekel's tests basically prove that this only really works if the resulting data files are also of a reasonable size. If you want to pick out a reasonable amount of data from an unreasonably large file, you apparently spend 1 hour per simulation year doing so. This is where I think we need to work with the ocean team to rethink how timeSeriesStatsMonthly is being written out. So, to be clear, what I have in mind is not a meeting of the people currently working on MPAS-Analysis but rather of anyone working on any aspect of MPAS AMs, so their output can be better designed.

milenaveneziani commented 7 years ago

I see a very interesting discussion here, give me a minute to catch up. Also, I would like to set a meeting for Th to discuss these high-res issues: would that work for you?

xylar commented 7 years ago

There is a Land Ice team meeting from 10 to 11 on Thursday. I'm free before or after.

milenaveneziani commented 7 years ago

ah! I don't see a meeting notes page for that. I guess we should do Mon at 10 then..

xylar commented 7 years ago

It's not an open meeting, so not using the special topic page. Just happens to be at the same time.

On Apr 4, 2017 5:56 PM, "Milena Veneziani" notifications@github.com wrote:

> ah! I don't see a meeting notes page for that. I guess we should do Mon at 10 then..


vanroekel commented 7 years ago

18to6 data from year 10 is now on Edison:

/global/cscratch1/sd/lvroekel/18to6Titan

Permissions should be set, but if you have issues, let me know.

milenaveneziani commented 7 years ago

Special meeting set for this coming Th at 10. In my mind, I had thought that loading some light 2D variables (AM output) out of a large file would be OK, but now I understand this is not the case. Solving this problem will definitely help with the time series, for which we don't have much choice but to load the full length of the requested data. Instead, for climatologies, we may still have to rely on a different tool to do the computations: either timeSeriesStatsClimatology or the nco-based climo script that Charlie has developed.

xylar commented 7 years ago

> Instead, for climatologies, we may still have to rely on a different tool to do the computations: either timeSeriesStatsClimatology or the nco-based climo script that Charlie has developed.

It's not clear to me why nco would perform substantially better than we do. If it does, presumably it's by using threading, and we should try to do the same. @pwolfram, you have thoughts on this, right?

pwolfram commented 7 years ago

@xylar, I think we are asking way too much from an interpreted language without resorting to complicated tricks (e.g., https://www.ibm.com/developerworks/community/blogs/jfp/entry/Python_Meets_Julia_Micro_Performance?lang=en). A compiled approach, particularly one as sophisticated and established as nco, should outperform a python-based approach, even with advanced capabilities such as dask. dask essentially provides threading capabilities, but at a high overhead relative to OpenMP, because it is necessarily interpreted, untyped, and not compiled. There is no way around this challenge. dask.distributed was designed to circumvent it by allowing the use of multiple nodes (e.g., burning up cycles) in order to get enhanced throughput. This works, but it is far from the optimal approach for the computation. What it provides, however, is a low-programmer-cost way to open and query data quickly, especially interactively; e.g., see http://matthewrocklin.com/blog/work/2016/02/26/dask-distributed-part-3.

There is an even bigger problem that has been disconcerting to me: we really should be doing production compute-heavy operations in MPAS-O, not in python. Python-based approaches, even xarray/dask/dask.distributed, are all about convenient ways to access the data, which may be OK for analysis that is run one time for a paper. (On a laptop, it turns out you can get good performance, because data access on a local solid-state drive costs less than on an HPC machine.) But when we are talking about large data and the same analysis each time (i.e., non-exploratory), we really should do as many operations as possible while the data is still in memory, e.g., in MPAS-O Analysis Members. This is especially true for very, very long datasets like ours. This is essentially the exascale challenge: we can only afford to compute on data while it is loaded, and we can only afford to load it once, when the simulation is run. We can't afford to do post-processing beyond loading very small datasets for plotting purposes.

My recommendation is that we take this approach: do computations in MPAS-O, do plotting (and depending upon the size maybe not even interpolation) in MPAS-Analysis.

One question @milenaveneziani et al,

> So, in my mind, I thought that loading in some light 2d variables (AM output) out of a large file would be OK, but now I understand this is not the case.

I think it may still be possible to load and plot a single time slice on a single node, as long as the 2D grid fits into memory. Is there something I'm missing here?

vanroekel commented 7 years ago

@pwolfram @xylar @milenaveneziani Update here. After talking with @pwolfram about moving all packages to conda-forge and the newest versions, the sea-ice time series diagnostic improved dramatically: from 4 hours previously to 10 minutes with fully updated conda-forge packages. I don't have any idea why this is the case. The only version truly updated was netcdf4 (1.2.4 to 1.2.7); all others were simply moved to conda-forge builds.

pwolfram commented 7 years ago

This provides support for the need to have a unified python environment *.yml file across the project that @xylar is helping coordinate.
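A unified environment file along these lines would pin the whole team to the same conda-forge builds; the package list and pins below are illustrative, not the project's actual file:

```yaml
# Illustrative environment file -- the package list and pins are
# examples, not MPAS-Analysis's actual requirements.
name: mpas_analysis
channels:
  - conda-forge
dependencies:
  - numpy
  - scipy
  - matplotlib
  - xarray
  - dask
  - netcdf4=1.2.7   # the version the speedup was observed with
```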

xylar commented 7 years ago

@pwolfram, thanks for clarifying your thinking about performance generally. Clearly we have a lot to discuss tomorrow.

> I think this may still be possible to load and plot a single timeslice on a single node as long as the 2d grid fits into memory. Is there something I'm missing here?

Right, but these small fields are part of giant files, and I think that's hurting performance a lot. It would be better to store 3D fields in separate files from 2D fields, and regionally averaged values separate from either of those two, so the time to load small datasets is not affected by the presence of big ones.

pwolfram commented 7 years ago

@xylar I'm fine with this approach (I actually assumed it as I wrote that text!) and it seems like the sensible way to handle this challenge. I'm glad to see we are on the same page.

xylar commented 7 years ago

@vanroekel, are there ongoing issues with 18to6 performance that merit leaving this issue open? If not, could you close it?

vanroekel commented 7 years ago

This issue was addressed via an update of conda packages and moving to conda-forge. #177 also helped here.