GenericMappingTools / gmt

The Generic Mapping Tools
https://www.generic-mapping-tools.org
Other
853 stars 357 forks source link

The pros and cons of the GMT grid padding #4358

Open PaulWessel opened 4 years ago

PaulWessel commented 4 years ago

The GMT team will be discussing the perceived problems of the GMT padded grids, especially in the context of external environments passing grids in and out of GMT. This issue tries to explain what the pad is, why it is there, what it buys us, and what the concerns may be, as well as solutions.

What is the grid padding?

Before the birth of the external tools built on top of GMT (GMT/MEX, PyGMT, Julia), we made the decision that internally, GMT grids should be stored in memory as a 2-D matrix that is extended by two extra rows at the top and bottom, and two extra columns at the left and right sides. Thus, a data grid of size ny by nx is stored as a matrix of size my by mx, where typically mx = nx + 4 and my = ny + 4. Generally designed to deal with files, GMT will read in a file (which of course has no padding) and place it into the padded array. Once it is time for writing the output to a new file, we only write out the original size and the pad is never written to file.

Why is there a pad?

In one word, coding simplicity. Many GMT grid operations, but not all, involve algorithms that require a grid to have boundary conditions of second order. This means the grid z(x,y) has to meet two conditions on each side. These boundary conditions can be translated into the values of the two artificial nodes beyond the boundary. Then, when the algorithm of interest sweeps over the interior matrix, it can access nodes that are one or two steps away in any direction from all the interior nodes. This simplifies coding as the alternative would be separate loops for the sides with modified algorithms that builds the boundary conditions into the operations. In GMT, we simply have one function that can be called to set the padding values from the boundary condition settings (should there not be data available). Note that the padding, while typically 2, can be changed internally, and there are places where a pad of 1 is used.

What are the boundary conditions?

GMT supports both natural, periodic, geographic, and data boundary conditions. Natural means the curvature at the edge is zero, while periodic means we have a repeating boundary (360 longitude wrapping is such a boundary condition, where lon = 361 is same as lon = 1). Geographic is a special mix of periodic in longitude but pole-wrapping in latitude (as we go over the pole along a meridian, longitude changes by 180). Finally, data boundary condition means we are extracting a subset of a data grid, but since it is a subset (and hence data beyond it are available) we can use real data to fill in the pad instead of applying some mathematical condition. Now, when a derivative is required at the boundary it will be data-based. This is clearly superior to any other scheme.

OK, make sense, so what is the problem?

Once the basic input/output of grids in GMT are not files but memory references to grids maintained by Matlab, Python, or Julia, then two situations arise that require attention. Input: Receiving a grid from an external environment is the equivalent of reading a grid file, except now there may be no padding associated with the grid. If the module that receives this pad-less grid actually requires padding for its algorithm to work, and the input is read-only memory, we must duplicate the grid and place it in a new padded matrix and do the work on that padded matrix. This increases memory requirements. Output: At the point of writing the output, the internal matrix is typically padded, but the receiver (say, Julia or Python) has no need for a pad nor expects it. Thus, sending padded output from GMT to externals is generally not appreciated and leads to coding complications for the external developer.

Which GMT modules require a grid pad?

While some modules that create a grid from other input (e.g., surface) needs padding, here I am only concerned with the modules that take grid input, since that is where the padding issue will be present. These include grdimage and grdview (for projections to properly fill new nodes near boundaries), grdproject (same reason), grdsample (2-D bicubic sampling ), grdtrack (involves 2-D bicubic sampling). I am not sure if grdcontour may also be in this category, and there may be others [forcing _GMT_CreateSession to set pad to 0, recompile, and run tests might tell us more!]. Also, note that some GMT modules call other modules, for instance grdblend may call grdsample, and thus inherits the need for a pad. This does not change the conclusions that (a) many grid-reading modules do not require boundary conditions (e.g., grdinfo, grdcut, grdclip, etc.) and (b) those that do really require it and rewriting the code to not work on matrices with pads would be extensive, bug-prone, and not likely a good investment of time.

So how are things working now with memory grids coming in?

While we are still making adjustments to deal with this relatively new scenario, by and large GMT itself is coping with the input. Those modules that must have padding will now check if input has a pad and if not we do the data duplication/expansion, work on the temporary matrix instead, and free it when done with it. This is largely unnoticeable by the user as we are not talking huge amounts of memory unless the input is truly large and the user's environment is memory-limited.

How are things with externals receiving GMT-produced grids by memory?

I suspect this is where there is room for improvement. Regardless of the need to use padding internally to compute results, there seems to be no good reason to return a padded array back to the calling environment. Thus, when _GMT_WriteData is called and the output is a grid destined fo a memory location, the pad should always be stripped off. If that was true then the external user would never notice the padding issue.

Are there downsides of not having pads outside GMT?

Yes there are. Consider an external tool that extracts the subregion 0/20/0/20 from a global topography grid, then passes that in memory via a call to grdtrack to sample it along lines that crosses the domain. Because of no padding, grdtrack has to apply natural boundary conditions during the interpolation along tracks. The interpolated values very close to the grid boundary will generally differ from what the GMT command-line interface (CLI) equivalent reading that subset would do since the pad would have actual data, not pseudo-values derived from the natural boundary condition. So the CLI result and the external call to grdtrack would differ in subtle ways, with the CLI being the correct result. Of course, an informed user could compensate and ask for -1/22/-2/22 as the subset, do the interpolation, and then crop to the desired window, but this involves more work as well as considerable awareness of the effects we have discussed. With the GMT developers historically being very concerned with "doing the right thing" and "screw people with low-RAM computers", you can see why we rather want an exact result than an almost exact one.

Hopefully the @GenericMappingTools/core can comment on this, and if something is incorrect or missing please let me know so I can update this description.

WalterHFSmith commented 4 years ago

I think this is good and clear. Thanks. W

On Oct 22, 2020, at 4:45 PM, Paul Wessel notifications@github.com wrote:

The GMT team will be discussing the perceived problems of the GMT padded grids, especially in the context of external environments passing grids in and out of GMT. This issue tries to explain what the pad is, why it is there, what it buys us, and what the concerns may be, as well as solutions.

What is the grid padding?

Before the birth of the external tools built on top of GMT (GMT/MEX, PyGMT, Julia), we made the decision that internally, GMT grids should be stored in memory as a 2-D matrix that is extended by two extra rows at the top and bottom, and two extra columns at the left and right sides. Thus, a data grid of size ny by nx is stored as a matrix of size my by mx, where typically mx = nx + 4 and my = ny + 4. Generally designed to deal with files, GMT will read in a file (which of course has no padding) and place it into the padded array. Once it is time for writing the output to a new file, we only write out the original size and the pad is never written to file.

Why is there a pad?

In one word, coding simplicity. Many GMT grid operations, but not all, involve algorithms that require a grid to have boundary conditions of second order. This means the grid z(x,y) has to meet two conditions on each side. These boundary conditions can be translated into the values of the two artificial nodes beyond the boundary. Then, when the algorithm of interest sweeps over the interior matrix, it can access nodes that are one or two steps away in any direction from all the interior nodes. This simplifies coding as the alternative would be separate loops for the sides with modified algorithms that builds the boundary conditions into the operations. In GMT, we simply have one function that can be called to set the padding values from the boundary condition settings (should there not be data available). Note that the padding, while typically 2, can be changed internally, and there are places where a pad of 1 is used.

What are the boundary conditions?

GMT supports both natural, periodic, geographic, and data boundary conditions. Natural means the curvature at the edge is zero, while periodic means we have a repeating boundary (360 longitude wrapping is such a boundary condition, where lon = 361 is same as lon = 1). Geographic is a special mix of periodic in longitude but pole-wrapping in latitude (as we go over the pole along a meridian, longitude changes by 180). Finally, data boundary condition means we are extracting a subset of a data grid, but since it is a subset (and hence data beyond it are available) we can use real data to fill in the pad instead of applying some mathematical condition. Now, when a derivative is required at the boundary it will be data-based. This is clearly superior to any other scheme.

OK, make sense, so what is the problem?

Once the basic input/output of grids in GMT are not files but memory references to grids maintained by Matlab, Python, or Julia, then two situations arise that require attention. Input: Receiving a grid from an external environment is the equivalent of reading a grid file, except now there may be no padding associated with the grid. If the module that receives this pad-less grid actually requires padding for its algorithm to work, and the input is read-only memory, we must duplicate the grid and place it in a new padded matrix and do the work on that padded matrix. This increases memory requirements. Output: At the point of writing the output, the internal matrix is typically padded, but the receiver (say, Julia or Python) has no need for a pad nor expects it. Thus, sending padded output from GMT to externals is generally not appreciated and leads to coding complications for the external developer.

Which GMT modules require a grid pad?

While some modules that create a grid from other input (e.g., surface) needs padding, here I am only concerned with the modules that take grid input, since that is where the padding issue will be present. These include grdimage and grdview (for projections to properly fill new nodes near boundaries), grdproject (same reason), grdsample (2-D bicubic sampling ), grdtrack (involves 2-D bicubic sampling). I am not sure if grdcontour may also be in this category, and there may be others [forcing GMT_Create_Session to set pad to 0, recompile, and run tests might tell us more!]. Also, note that some GMT modules call other modules, for instance grdblend may call grdsample, and thus inherits the need for a pad. This does not change the conclusions that (a) many grid-reading modules do not require boundary conditions (e.g., grdinfo, grdcut, grdclip, etc.) and (b) those that do really require it and rewriting the code to not work on matrices with pads would be extensive, bug-prone, and not likely a good investment of time.

So how are things working now with memory grids coming in?

While we are still making adjustments to deal with this relatively new scenario, by and large GMT itself is coping with the input. Those modules that must have padding will now check if input has a pad and if not we do the data duplication/expansion, work on the temporary matrix instead, and free it when done with it. This is largely unnoticeable by the user as we are not talking huge amounts of memory unless the input is truly large and the user's environment is memory-limited.

How are things with externals receiving GMT-produced grids by memory?

I suspect this is where there is room for improvement. Regardless of the need to use padding internally to compute results, there seems to be no good reason to return a padded array back to the calling environment. Thus, when GMT_Write_Data is called and the output is a grid destined fo a memory location, the pad should always be stripped off. If that was true then the external user would never notice the padding issue.

Are there downsides of not having pads outside GMT?

Yes there are. Consider an external tool that extracts the subregion 0/20/0/20 from a global topography grid, then passes that in memory via a call to grdtrack to sample it along lines that crosses the domain. Because of no padding, grdtrack has to apply natural boundary conditions during the interpolation along tracks. The interpolated values very close to the grid boundary will generally differ from what the GMT command-line interface (CLI) equivalent reading that subset would do since the pad would have actual data, not pseudo-values derived from the natural boundary condition. So the CLI result and the external call to grdtrack would differ in subtle ways, with the CLI being the correct result. Of course, an informed user could compensate and ask for -1/22/-2/22 as the subset, do the interpolation, and then crop to the desired window, but this involves more work as well as considerable awareness of the effects we have discussed. With the GMT developers historically being very concerned with "doing the right thing" and "screw people with low-RAM computers", you can see why we rather want an exact result than an almost exact one.

Hopefully the @GenericMappingTools/core can comment on this, and if something is incorrect or missing please let me know so I can update this description.

— You are receiving this because you are on a team that was mentioned. Reply to this email directly, view it on GitHub, or unsubscribe.

stale[bot] commented 3 years ago

This issue has been automatically marked as stale because it has not had activity in the last 90 days. It will be closed if no further activity occurs within 7 days. Thank you for your contributions.