Rethink the testing mechanism for images

seisman commented 3 years ago

If you're unclear about how PyGMT tests images, please read the "Testing plots" section in the contributing guides first.

In short, for image-based tests, we need to specify the baseline/reference image. When we make any changes to the code, we can generate the new "test" image and compare it with the "baseline" image. If the two images are different, then we know the changes break the tests. The most important thing is, to ensure that the "baseline" images are correct.

Currently, we have two different methods to generate the "baseline" image and compare them:

using the decorator @pytest.mark.mpl_image_compare
using the decorator @check_figures_equal()

The @pytest.mark.mpl_image_compare method is the most straightforward way to do image testing. Using the decorator, we need to generate baseline images, check their correctness, and store them in the repository (https://github.com/GenericMappingTools/pygmt/tree/master/pygmt/tests/baseline).

Pros:

We can visually check the baseline images to make sure they are correct

Cons:

Have to store the static PNG images in the repository. The repository size grows quickly.

To avoid storing many large static images in the repository, we (mainly @weiji14 and @seisman) had some discussions (in #451, #522) and developed the @check_figures_equal decorator (#555, #590, #600).

Below is an example test using the @check_figures_equal() decorator:

https://github.com/GenericMappingTools/pygmt/blob/e0579275d00c7a69d25a82b4c8936e0b0b1b7577/pygmt/tests/test_basemap.py#L67-L77

In this example, the baseline/reference image fig_ref is generated using basemap(R="0/360/0/1000", J="P6i", B="afg"), while the test image fig_test is generated using basemap(region=[0, 360, 0, 1000], projection="P6i", frame="afg"). We can't see what the baseline image looks like, but we're somehow confident that the baseline image is correct, because the basemap wrapper is very simple.

Pros:

Don't need to store static images in the repository, thus keep the repository size small

Cons:

For each test, we have to generate two images (baseline and test images), which doubles the execution time
We can't visually check the correctness of the baseline images
If we decided to disable single-character parameters (i.e, J="X10c/10c" is disallowed) as proposed in #262 (also related to #256), then most of the code for generating reference images will be invalid.

For some complicated wrappers, we even can't easily know if the reference image is correct. For example,

https://github.com/GenericMappingTools/pygmt/blob/e0579275d00c7a69d25a82b4c8936e0b0b1b7577/pygmt/tests/test_subplot.py#L30-L42

In this test, we expect that the baseline image has a 2-row-by-1-column subplot layout. However, if we make a silly mistake in Figure.subplot, resulting in a 1-row-by-2-column layout, the test still passes, because both the baseline and test images have the same "wrong" layout. Then the test is useless to us.

Almost every plotting tools have to decide if they want to store static images in the repository. There are some similar discussions in the upstream GMT project (https://github.com/GenericMappingTools/gmt/issues/3470) and the matplotlib project (https://github.com/matplotlib/matplotlib/issues/16447).

As we're having more active developers now, I think we should rethink how we want to test PyGMT.

willschlitzer commented 3 years ago

I'm a fan of switching to @check_figures_equal(), although I understand the difficulties associated with having our testing functions make so many plots. I mostly like the idea of not having a large repository of reference plots, especially for when we have changes in the plotting function, like in GMT 6.2.

That being said, I have begun to realize that it doesn't seem like we're always effectively testing all that much when we're just passing the same parameter to single-letter and PyGMT arguments and testing that the plots turn out the same. As we've discussed in #771, the thing we should be most concerned about is the "Python" parts of the function, not the aliases that pass arguments to the GMT API. I think we can reduce the testing workload by consolidating some of the tests that test nothing more than aliases (which includes many of the tests I've written) and focus on the Python parts (my recent example is testing the Python parts of solar to make sure the terminator and datetime inputs produce the same result as a GMT string passed to the T parameter).

seisman commented 3 years ago

the thing we should be most concerned about is the "Python" parts of the function, not the aliases that pass arguments to the GMT API. I think we can reduce the testing workload by consolidating some of the tests that test nothing more than aliases and focus on the Python parts.

Yes, I agree. This is what we must do.

Still, the biggest challenge is "how to make sure that baseline images are correct.".

We had some discussions in #451.

One solution is storing static images in a separate repository (e.g. pygmt_baseline_image), but it is too complicated when we want to add a new test.

Should we start using submodules? I.e. split the pygmt/tests/baseline folder into a separate repository (see https://docs.github.com/en/github/using-git/splitting-a-subfolder-out-into-a-new-repository).

I've been reading up on git submodules/subtrees/git-lfs and there doesn't seem to be an easy way to do this, there will be a learning curve in any case. Matplotlib currently has a big PR at https://github.com/matplotlib/matplotlib/pull/17557 to move their baseline images into a separate place, and I really do not want myself or anyone to handle that in X years.

It's too complicated. When we add a test, we have to open two separate PRs in two repositories, one for the baseline images and one for the tests. How can the tests PR know it should get the new baseline images in the corresponding branch?

Yeah, and I don't think it will be friendly for new contributors either. Surely there must be a better way to store the images, or test them

Another solution is generating baseline images by directly calling Session.call_module(). It's almost testing the equivalence of a Python script (using PyGMT syntax) and a bash script (using GMT CLI). This is the method I prefer.

For other non-grd tests, perhaps we could generate the reference images by directly passing arguments to GMT modules. For examples,
fig_test.basemap(region=[0, 10, 0, 10], projection='X10c/10c', frame=['xaf', 'yaf', 'WSen'])
should be identical to the reference image generated by:
lib.call_module("basemap", "-R0/10/0/10 -JX10c/10c -Bxaf -Byaf -BWSen")

willschlitzer commented 3 years ago

Another solution is generating baseline images by directly calling Session.call_module(). It's almost testing the equivalence of a Python script (using PyGMT syntax) and a bash script (using GMT CLI). This is the method I prefer.
For other non-grd tests, perhaps we could generate the reference images by directly passing arguments to GMT modules. For examples,
fig_test.basemap(region=[0, 10, 0, 10], projection='X10c/10c', frame=['xaf', 'yaf', 'WSen'])
should be identical to the reference image generated by:
lib.call_module("basemap", "-R0/10/0/10 -JX10c/10c -Bxaf -Byaf -BWSen")

I like this idea. I think it more effectively tests that the aliases and Python functions line up with the expected outcome from GMT, as opposed to seeing if passing the same arguments to PyGMT twice will produce different results. We assume that if the "correct" inputs are sent to GMT, the figure will turn out as expected, much like a reference image. The downside is that it will expect someone to learn GMT commands, but I don't think this is too advanced from someone wrapping a new module.

Why wouldn't this also be applicable for grd tests? Since we use standard GMT-hosted grids, wouldn't we be able to add @earth_relief in the string passed to lib.call_module()?

How would this work with @check_figures_equal()? The GMT 6.2 release seems like an ideal time to make this switch, since we will have to either update the tests or update the reference images.

seisman commented 3 years ago

We assume that if the "correct" inputs are sent to GMT, the figure will turn out as expected, much like a reference image.

Yes, it sounds reasonable and valid assumption.

Why wouldn't this also be applicable for grd tests? Since we use standard GMT-hosted grids, wouldn't we be able to add @earth_relief in the string passed to lib.call_module()?

It can also be applied to grid tests.

How would this work with @check_figures_equal()?

https://github.com/GenericMappingTools/pygmt/blob/e0579275d00c7a69d25a82b4c8936e0b0b1b7577/pygmt/tests/test_grd2cpt.py#L23-L37

Just take this test (written by you) as an example, I think I mentioned before that the test may still pass even if grd2cpt() doesn't work as expected. The test can be rewritten to:

from pygmt.clib import Session

@check_figures_equal()
def test_grd2cpt(grid):
    """
    Test creating a CPT with grd2cpt to create a CPT based off a grid input and
    plot it with a color bar.
    """
    # reference image
    fig_ref = Figure()
    with Session() as lib:
        lib.call_module("basemap", "-Ba -JW0/15c -Rd")
        lib.call_module("grd2cpt", "@earth_relief_01d")
        lib.call_module("colorbar", "-Ba2000")

    # test image
    fig_test = Figure()
    fig_test.basemap(frame="a", projection="W0/15c", region="d")
    grd2cpt(grid=grid)
    fig_test.colorbar(frame="a2000")

    return fig_ref, fig_test

willschlitzer commented 3 years ago

@GenericMappingTools/python-contributors Does anyone have opinions on this? I'm in support of @seisman's example using with Session() as lib.

core-man commented 3 years ago

As a new PyGMT and an old GMT user, it seems that the test method by with Session() as lib is better.

If the right figure cannot be generated by PyGMT, I guess there are two possible reasons: 1) some bugs exist in PyGMT; 2) GMT has some bugs. We should fix the first one in PyGMT, while we should report GMT bugs to upstream.

But if the PyGMT project plans to develop more functions that are not in GMT, this testing mechanism will not work.

seisman commented 3 years ago

But if the PyGMT project plans to develop more functions that are not in GMT, this testing mechanism will not work.

Any new functions in PyGMT would reply on GMT, so we can always find equivalent GMT command lines. For example, in the new Figure.hlines() (https://github.com/GenericMappingTools/pygmt/pull/923) function, we can call low-level gmt plot commands to generate the reference images.

michaelgrund commented 3 years ago

But if the PyGMT project plans to develop more functions that are not in GMT, this testing mechanism will not work.

Any new functions in PyGMT would reply on GMT, so we can always find equivalent GMT command lines. For example, in the new Figure.hlines() (#923) function, we can call low-level gmt plot commands to generate the reference images.

You're right, so far I made no use in the tests to compare the images as you suggested @seisman. Hopefully I have time to work on this the upcoming weekend.

willschlitzer commented 3 years ago

@seisman Should we begin working on rewriting the tests, or should we wait until the GMT 6.2 release? I'm assuming we want to prioritize the rewriting the tests that use @pytest.mark.mpl_image_compare, but will ideally update the @check_figures_equal tests to use with Session as lib()?

weiji14 commented 3 years ago

Not to throw a spinner into things, but do we want to reconsider using @pytest.mark.mpl_image_compare? I've mentioned it before at https://github.com/GenericMappingTools/pygmt/issues/451#issuecomment-655874898 about storing the PNG images offsite, while commiting the 'hash' of the image here on the pygmt repo using things like git-lfs.

There's also solutions like dvc (basically 'git' but for data) that are maturing quite nicely and might be worth considering, especially if we can automate things using Github Actions (e.g. a bot checks if images differ from baseline, and we can do /yes-bot or /no-bot to update the image). Might be able to host it on Github Artifacts or the like.

Alternatively, I wonder if storing SVG instead of PNG would make things lighter?

And yes, all this should be done closer to the GMT 6.2.0 release. We have the GMT dev tests set up on that CI for that matter and should be able to fix most tests before the actual GMT 6.2.0 package is out on conda-forge.

seisman commented 3 years ago

Not to throw a spinner into things, but do we want to reconsider using @pytest.mark.mpl_image_compare? I've mentioned it before at #451 (comment) about storing the PNG images offsite, while commiting the 'hash' of the image here on the pygmt repo using things like git-lfs.

I agree that @pytest.mark.mpl_image_compare is the most accurate way to compare reference and test images. As you said, if we can store images offsite, @pytest.mark.mpl_image_compare is the best solution.

There's also solutions like dvc (basically 'git' but for data) that are maturing quite nicely and might be worth considering, especially if we can automate things using Github Actions (e.g. a bot checks if images differ from baseline, and we can do /yes-bot or /no-bot to update the image). Might be able to host it on Github Artifacts or the like.

Not sure if it really works for us. How can we download and update baseline images if we want to run tests locally?

Alternatively, I wonder if storing SVG instead of PNG would make things lighter?

Unfortunately, GMT doesn't support SVG anymore (because recent Ghostscript versions drop the SVG support). Even GMT can save figures in SVG formats, I doubt that it may still not work. The GMT project stores PS files (ASCII) in the repository, and the repository size still grows quickly, because images (especially figures generated by grdimage) are saved as binary data in ASCII PS files (Not sure if I explain it clearly, but you can plot an image using grdimage and open it using your text editor.).

weiji14 commented 3 years ago

There's also solutions like dvc (basically 'git' but for data) that are maturing quite nicely and might be worth considering, especially if we can automate things using Github Actions (e.g. a bot checks if images differ from baseline, and we can do /yes-bot or /no-bot to update the image). Might be able to host it on Github Artifacts or the like.

Not sure if it really works for us. How can we download and update baseline images if we want to run tests locally?

The hash of the images will be stored in a .dvc file. To download/update the PNG images locally, use dvc pull (similar to git pull). Adding files would be through dvc add (similar to git add). In fact, most of the dvc commands are based on git (see also https://realpython.com/python-data-version-control/), so the learning curve shouldn't be too steep (hopefully). They also have a Python API at https://dvc.org/doc/api-reference we could plug into.

I'll probably need to open up a demo PR to illustrate how things would work, but things we'll need to do are:

Think about where to store the PNG images in the cloud.
Change our testing workflow (for plotting images) back to @pytest.mark.mpl_image_compare

seisman commented 3 years ago

I'll probably need to open up a demo PR to illustrate how things would work

Yes, that would be better.

Think about where to store the PNG images in the cloud.

If it works, can we just store the PNG images in another github repository?

weiji14 commented 3 years ago

Think about where to store the PNG images in the cloud.

If it works, can we just store the PNG images in another github repository?

Will need to have a think about where to store things as I create that PR. Probably won't have time to do this until v0.4.0 though.

Edit: Just mirrored the PyGMT repo at https://dagshub.com/GenericMappingTools/pygmt. DAGsHub is a web platform for data version control (see FAQ). Give me a few days or weeks and I'll try and get a pipeline of some sort set-up for us to start uploading images!

weiji14 commented 3 years ago

Ok, #1036 has been merged which sets up data version control (dvc) for the PyGMT repo. The new dvc based workflow addresses the main con of using @pytest.mark.mpl_image_compare (storing large images in the repo) by storing them on https://dagshub.com/GenericMappingTools/pygmt instead (please create an account there everyone using your GitHub login).

We will slowly migrate the tests from @check_figures_equal to @pytest.mark.mpl_image_compare. Instructions are documented at https://github.com/GenericMappingTools/pygmt/blob/master/CONTRIBUTING.md#using-data-version-control-dvc-to-manage-test-images. The test migration will proceed as follows over the next few weeks/months:

After this PR #1036, change recommended way of testing from @check_figures_equal() to @pytest.mark.mpl_image_compare

Bump minimum GMT version to 6.2.0

Fix all the test images that have changed, storing the new test images with dvc on DAGsHub

Optional - Migrate @check_figures_equal tests to @pytest.mark.mpl_image_compare (can prioritize the slow tests as reported in #835/#840)

Optional - Fully deprecate @check_figures_equal(), removing it from codebase and documentation in CONTRIBUTING.md, also close https://github.com/matplotlib/pytest-mpl/pull/95?

Write new tests for new functionality using @pytest.mark.mpl_image_compare only

_Originally posted by @weiji14 in https://github.com/GenericMappingTools/pygmt/pull/1036#discussion_r595519351_

Update the install instructions, because pygmt.test() won't work for users.

Maybe add the baseline images as a release asset when making releases.

_Originally posted by @seisman in https://github.com/GenericMappingTools/pygmt/pull/1036#discussion_r595545462_

I'd encourage everyone to use for their open PRs when creating test images, and feel free to ask any questions if things are unclear!

maxrjones commented 3 years ago

Wow, @weiji14 and @seisman this looks fantastic! Great job! I'm excited to try it out and find out whether it could be a solution for core-gmt's testing woes as well :smile:

seisman commented 3 years ago

@weiji14 Perhaps you could open a separate issue or several issues with a list of TODOs so that people who want to help have a better idea of what to do.

After this PR #1036, change recommended way of testing from @check_figures_equal() to @pytest.mark.mpl_image_compare

Bump minimum GMT version to 6.2.0

Fix all the test images that have changed, storing the new test images with dvc on DAGsHub

Optional - Migrate @check_figures_equal tests to @pytest.mark.mpl_image_compare (can prioritize the slow tests as reported in #835/#840)

Question: Do we want to do the migration before GMT 6.2.0 or after? I prefer to do the migration before v6.2.0, although it means more work for us. After bumping GMT to 6.2.0, most tests will fail due to the changes in GMT 6.2.0, but I feel it's also a good opportunity for us to learn the GMT changes and find potential bugs by comparing the images generated by GMT 6.1.1 and 6.2.0.

FYI, one week ago, I built the PyGMT documentation using GMT 6.2.0, and found several issues with the GMT dev version (https://github.com/GenericMappingTools/gmt/issues/4955), and they were all fixed in less than one week!

Optional - Fully deprecate @check_figures_equal(), removing it from codebase and documentation in CONTRIBUTING.md, also close https://github.com/matplotlib/pytest-mpl/pull/95?

I think we may still need @check_figures_equal() in some cases, especially for grid plotting functions. You may remember that we found some upstream bugs by checking if the images generated from a file and a xarray.DataArray are the same. Sometimes upstream bugs can cause tiny differences between these two images and are difficult to identify by visually checking baseline images.

maxrjones commented 3 years ago

@weiji14, could I please get added on DAGshub?

weiji14 commented 3 years ago

@weiji14, could I please get added on DAGshub?

Ok done

seisman commented 3 years ago

After this PR #1036, change recommended way of testing from @check_figures_equal() to @pytest.mark.mpl_image_compare

Done in #1108.

Bump minimum GMT version to 6.2.0

Waiting for the GMT v6.2.0 release.

Fix all the test images that have changed, storing the new test images with dvc on DAGsHub

Tracked by issue #1131.

Optional - Migrate @check_figures_equal tests to @pytest.mark.mpl_image_compare (can prioritize the slow tests as reported in #835/#840)

Tracked by issue #1131.

Optional - Fully deprecate @check_figures_equal(), removing it from codebase and documentation in CONTRIBUTING.md, also close matplotlib/pytest-mpl#95?

I think we still need it when testing grids.

Write new tests for new functionality using @pytest.mark.mpl_image_compare only

Yes, it's already documented in the contributing guides.

Update the install instructions, because pygmt.test() won't work for users.

I just opened issue #1200 for discussions.

Maybe add the baseline images as a release asset when making releases.

I just opened issue #1201 for discussions.

seisman commented 3 years ago

I think we can close the issue.

GenericMappingTools / pygmt

Rethink the testing mechanism for images #963