GEOS-ESM / MAPL

MAPL is a foundation layer of the GEOS architecture, whose original purpose is to supplement the Earth System Modeling Framework (ESMF).
https://geos-esm.github.io/MAPL/
Apache License 2.0

1D output is slow when there are many points #702

Open LiamBindle opened 3 years ago

LiamBindle commented 3 years ago

I'm trying to use 1D diagnostics to mimic an SSO (sun-synchronous orbit) satellite overpass diagnostic (sample each grid box at a constant local solar time). It's kludgy, but it's an old diagnostic from GEOS-Chem Classic and it's still needed for a couple of projects.
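For anyone unfamiliar with the trick: for a sun-synchronous overpass, sampling at a fixed local solar time just means each grid box's UTC sampling time follows from its longitude. A minimal sketch of that relationship (the overpass hour below is only a placeholder, not necessarily what my track file uses):

program overpass_times
   implicit none
   real, parameter :: local_solar_hour = 13.5   ! target overpass local solar time (placeholder)
   real :: lon_deg, utc_hour

   ! Local solar time is roughly UTC + longitude/15, so the UTC sampling hour
   ! for a grid box at longitude lon_deg is:
   lon_deg  = -75.0
   utc_hour = modulo(local_solar_hour - lon_deg/15.0, 24.0)
   print '(a,f7.2,a,f6.2)', 'longitude ', lon_deg, ' deg -> sample at UTC hour ', utc_hour
end program overpass_times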

The issue I'm running into is that it's very slow at high resolution. At C360 this 1D diagnostic brings my simulation's throughput from 7 days/day down to 2 days/day. Here is what my track file looks like:

$ ncdump -h tropomi_overpass_c360.nc
netcdf tropomi_overpass_c360 {
dimensions:
        time = 777600 ;
variables:
        float time(time) ;
                time:_FillValue = NaNf ;
                time:long_name = "time" ;
                time:units = "hours since 1900-01-01 00:00:00" ;
        float longitude(time) ;
                longitude:_FillValue = NaNf ;
                longitude:long_name = "longitude" ;
                longitude:units = "degrees_east" ;
        float latitude(time) ;
                latitude:_FillValue = NaNf ;
                latitude:long_name = "latitude" ;
                latitude:units = "degrees_north" ;
        float nf(time) ;
                nf:_FillValue = NaNf ;
        float Ydim(time) ;
                Ydim:_FillValue = NaNf ;
        float Xdim(time) ;
                Xdim:_FillValue = NaNf ;
}

Note that I use the nf, Ydim, and Xdim variables to ravel the 1D output back onto the C360 grid in post-processing. :upside_down_face:
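For reference, the raveling step is just an index scatter. A minimal sketch of what I do in post (hypothetical variable names, assuming the nf/Ydim/Xdim values are 1-based indices already read into integer arrays):

program ravel_to_c360
   implicit none
   integer, parameter :: im = 360                      ! C360 face size
   real    :: field(im, im, 6)                         ! Xdim x Ydim x face
   real,    allocatable :: samples(:)                  ! 1D diagnostic values from History
   integer, allocatable :: nf(:), ydim(:), xdim(:)     ! indices from the track file
   integer :: i

   ! In practice these come from the 1D output and track files; tiny dummy data here.
   samples = [1.0, 2.0, 3.0]
   nf      = [1, 3, 6]
   ydim    = [10, 20, 30]
   xdim    = [100, 200, 300]

   field = -999.0                                      ! fill value for unsampled boxes
   do i = 1, size(samples)
      field(xdim(i), ydim(i), nf(i)) = samples(i)
   end do
   print *, 'example value on face 3:', field(200, 20, 3)
end program ravel_to_c360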

Is there anything I could do to speed up this diagnostic? @bena-nasa, perhaps you have some ideas?

bena-nasa commented 3 years ago

Liam, I suspect I did this in an inefficient manner. To do the sampling I create an ESMF LocStream (i.e. a set of geospatial points). When the LocStream is created, each point is assigned to a PET (MPI rank), and I think for simplicity I put them all on root. That means the transform I create between the LocStream and the grid to do the sampling is essentially doing all of the work on root. I did it this way as a first attempt because it was simple: when it is time for file I/O I don't have to worry about where the data is. This is my first guess as to why it is slow.

Can you provide the input file you are using for the track in History, along with the HISTORY.rc? I could take a look and see whether implementing something more efficient would be quick or difficult.
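For context, here is a rough sketch of that pattern, not the actual MAPL code (the ESMF calls and argument names are from my reading of the ESMF docs, so treat the details as approximate): build a LocStream from the track's lon/lat points, create a field on it, and precompute a route handle so the gridded field can be interpolated onto the points. If localCount places every point on a single PET, the regrid funnels all of that work to one rank.

subroutine sample_track(src_field, lons, lats, dst_field, rh, rc)
   use ESMF
   implicit none
   type(ESMF_Field),        intent(in)  :: src_field        ! field on the cubed-sphere grid
   real(ESMF_KIND_R4),      intent(in)  :: lons(:), lats(:) ! this PET's share of the track points
   type(ESMF_Field),        intent(out) :: dst_field        ! 1D sampled values on the LocStream
   type(ESMF_RouteHandle),  intent(out) :: rh
   integer,                 intent(out) :: rc

   type(ESMF_LocStream) :: locstream

   ! localCount controls how many track points live on this PET; putting the
   ! whole track on root serializes the interpolation there.
   locstream = ESMF_LocStreamCreate(localCount=size(lons), &
        coordSys=ESMF_COORDSYS_SPH_DEG, rc=rc)
   call ESMF_LocStreamAddKey(locstream, keyName="ESMF:Lon", farray=lons, &
        datacopyflag=ESMF_DATACOPY_VALUE, rc=rc)
   call ESMF_LocStreamAddKey(locstream, keyName="ESMF:Lat", farray=lats, &
        datacopyflag=ESMF_DATACOPY_VALUE, rc=rc)

   dst_field = ESMF_FieldCreate(locstream, typekind=ESMF_TYPEKIND_R4, rc=rc)

   ! Precompute the grid -> LocStream interpolation once; it is then applied
   ! each output step with ESMF_FieldRegrid(src_field, dst_field, routehandle=rh).
   call ESMF_FieldRegridStore(srcField=src_field, dstField=dst_field, &
        routehandle=rh, regridmethod=ESMF_REGRIDMETHOD_BILINEAR, rc=rc)
end subroutine sample_track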

LiamBindle commented 3 years ago

Hi @bena-nasa, thanks, that makes sense. Here is my track file and HISTORY.rc: track_and_history.zip.

Thanks for taking a look. If there's something I can do on my end, please let me know.

bena-nasa commented 3 years ago

I see what you mean. At C48, running GEOS with no History output versus a single collection sampled with the track you provided showed a 2X slowdown in throughput. I have one idea to try to see if it helps.

bena-nasa commented 3 years ago

I think I have a tweak that seemed to help a lot in my test. I now distribute the points for the LocStream across all processors. Of course there is no free lunch: it means I have to do a gatherV to get the full 1D LocStream field back to root for the time sampling and writing. Still, at least with the file you provided, performance in my limited test seems much better. I pushed a branch if you want to try it: https://github.com/GEOS-ESM/MAPL/tree/feature/bmauer%2Foptimize_1d_output
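Schematically, the gather step looks something like this (an illustrative sketch with made-up names, not the code in the branch): each PET contributes its contiguous block of sampled values and root reassembles the full 1D track in order.

subroutine gather_track_to_root(local_vals, global_vals, comm)
   use mpi
   implicit none
   real,              intent(in)  :: local_vals(:)   ! this PET's sampled points
   real, allocatable, intent(out) :: global_vals(:)  ! full 1D track field, valid on root
   integer,           intent(in)  :: comm

   integer :: rank, npes, i, ierr
   integer :: nlocal(1)
   integer, allocatable :: counts(:), displs(:)

   call MPI_Comm_rank(comm, rank, ierr)
   call MPI_Comm_size(comm, npes, ierr)

   allocate(counts(npes), displs(npes))
   nlocal(1) = size(local_vals)

   ! Root needs to know how many points each rank contributes.
   call MPI_Gather(nlocal, 1, MPI_INTEGER, counts, 1, MPI_INTEGER, 0, comm, ierr)

   if (rank == 0) then
      displs(1) = 0
      do i = 2, npes
         displs(i) = displs(i-1) + counts(i-1)
      end do
      allocate(global_vals(sum(counts)))
   else
      allocate(global_vals(0))   ! receive buffer is ignored away from root
   end if

   ! Reassemble the full track on root; counts/displs are only significant there.
   call MPI_Gatherv(local_vals, nlocal(1), MPI_REAL, &
        global_vals, counts, displs, MPI_REAL, 0, comm, ierr)
end subroutine gather_track_to_root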

LiamBindle commented 3 years ago

Thanks, @bena-nasa! I'll give it a try and get back to you on Monday.

LiamBindle commented 3 years ago

Hi @bena-nasa, I got a runtime error because global_dst_3d was already allocated. I looked briefly at the code and added these deallocate() calls (not sure if they're correct, but I did it quickly so it could run over the weekend):

--- a/base/MAPL_HistoryTrajectoryMod.F90
+++ b/base/MAPL_HistoryTrajectoryMod.F90
@@ -478,6 +478,7 @@ module HistoryTrajectoryMod
                         call this%file_handle%put_var(trim(item%xname),global_dst_2d(interval(1):interval(2)),&
                           start=[this%number_written+1],count=[number_to_write]) 
                      end if
+                     deallocate(global_dst_2d)
                   else if (rank==3) then
                      call ESMF_FieldGet(src_field,farrayptr=p_src_3d,rc=status)
                      _VERIFY(status)
@@ -506,6 +507,7 @@ module HistoryTrajectoryMod
                         call this%file_handle%put_var(trim(item%xname),global_dst_3d(interval(1):interval(2),:),&
                           start=[this%number_written+1,1],count=[number_to_write,size(global_dst_3d,2)])                          
                      end if
+                     deallocate(global_dst_3d)
                   end if
                else if (item%itemType == ItemTypeVector) then
                   _VERIFY(status)

With this update I see a marginal improvement (1.8 days/day -> 2.5 days/day). Note that the output files are ~13 GB. Do you have any suggestions? Alternatively, I could write a post-processor that does the SSO overpass diagnostic, but a native solution would be a lot better.

bena-nasa commented 3 years ago

Hi Liam, yeah, although perhaps it would be better to not allocate it if it is already allocated. That was an oversight that apparently Intel doesn't choke on.
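Something like this guarded allocation would be the alternative to the deallocate() calls above (a sketch; the dimension names are placeholders):

! Only allocate the gathered buffer the first time through and reuse it after
! that (assumes the required shape does not change between calls).
if (.not. allocated(global_dst_3d)) then
   allocate(global_dst_3d(npts_total, nlevels))
end if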

Is that 13 GB for a single day? As a quick back-of-the-envelope calculation: the tropomi_overpass file is 7 MB, so naively you would expect each level to take up another 7 × (2/3) MB, i.e. roughly 7 × (2/3) × number_of_levels MB for a 3D variable. For example, outputting one 2D variable and one 3D variable with 73 levels for the day gives me about 270 MB (still less than the naive estimate). There's not much you can do about that other than putting fewer variables in each collection.
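(For concreteness, with the 73-level collection that naive estimate works out to about 7 × (2/3) × 73 ≈ 340 MB for the day, so the ~270 MB observed is indeed somewhat below it.)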

As for the performance, without more extensive profiling I'm not sure. I got the idea to try something else: create the LocStream distributed by the grid you are going to sample from, since I learned that is an option. In principle the sampling to the LocStream should then be entirely local. But I then had to do a redist to get the points back to root for writing and time sampling, and the performance was unexpectedly bad, no better than the original, which I found a bit surprising.

There is a possibility we may have another use for this in the GMAO, so I put in some timers. As I suspected, it is the redist that takes up the majority of the time. So the original implementation was probably slow because everything was being done on root, and this latest attempt was slow because it is expensive to get everything back to root in the right order. The optimization you tried was in the middle: a naive distribution of the points across processors, but one where a simple MPI_GatherV suffices to get them back to root.

I'm not sure where to go from here. I was doing the naive thing of making a LocStream for the entire time series. I could instead only sample the points in the current time interval, but then I have to regenerate the LocStream and the transform every time, which could get expensive too. Still, I think that will not be hard to try, so it's worth a shot.
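The selection itself would be cheap; something along these lines (illustrative only, with made-up names) picks out the track points for the current interval, and the expensive part is rebuilding the LocStream and route handle from that subset each time:

subroutine select_interval_points(track_time, track_lon, track_lat, t_start, t_end, &
                                  lon_sub, lat_sub)
   implicit none
   real,              intent(in)  :: track_time(:), track_lon(:), track_lat(:)
   real,              intent(in)  :: t_start, t_end
   real, allocatable, intent(out) :: lon_sub(:), lat_sub(:)

   integer, allocatable :: idx(:)
   integer :: i

   ! Indices of track points whose time stamp falls in [t_start, t_end).
   idx = pack([(i, i = 1, size(track_time))], &
              track_time >= t_start .and. track_time < t_end)
   lon_sub = track_lon(idx)
   lat_sub = track_lat(idx)
end subroutine select_interval_points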

stale[bot] commented 3 years ago

This issue has been automatically marked as stale because it has not had recent activity. If there are no updates within 7 days, it will be closed. You can add the "long term" tag to prevent the Stale bot from closing this issue.

mathomp4 commented 3 years ago

I think @bena-nasa knows of this, and again this is probably a long-term issue, so I'll label it long-term.

bena-nasa commented 3 years ago

I did take a second look at the particular extreme case Liam referenced where he was noticing a slowdown, but I was unable to make any notable improvements.