E3SM-Project / e3sm_diags

E3SM Diagnostics package
https://e3sm-project.github.io/e3sm_diags
BSD 3-Clause "New" or "Revised" License

[Feature]: Replace image diff checking in integration tests with metrics checking instead #756

Open tomvothecoder opened 9 months ago

tomvothecoder commented 9 months ago

Is your feature request related to a problem?

Currently, tests/integration/test_diags.py runs the all_sets.cfg diagnostics and diffs the resulting images against a baseline (whatever is on Chrysalis), allowing at most 2% of pixels to be non-zero in the diff. The issue with diffing two images is that any noise can break the test (e.g., a change in matplotlib formatting, a shifted legend, floating point formatting, different font sizes). The baseline results sometimes need to be updated when matplotlib updates introduce side effects. The integration tests are also challenging to debug and take a long time to run (#643), which bogs down development.
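For reference, the pixel-threshold check described above can be sketched roughly like this (a minimal sketch with numpy; the function name and threshold handling are illustrative, not the actual test code):

```python
import numpy as np


def fraction_of_differing_pixels(actual: np.ndarray, expected: np.ndarray) -> float:
    """Return the fraction of pixels where any RGB channel differs.

    Both inputs are (H, W, 3) uint8 arrays, e.g. loaded from PNGs.
    """
    diff = np.abs(actual.astype(int) - expected.astype(int))
    # A pixel counts as "different" if any of its channels differ.
    differing = np.any(diff != 0, axis=-1)
    return float(differing.mean())


# The test passes only if at most 2% of pixels differ:
# assert fraction_of_differing_pixels(actual_img, expected_img) <= 0.02
```

Even a one-pixel legend shift changes every pixel the legend covers, which is why this style of check is so noise-sensitive.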

For example, below are the actual image, the expected image, and the diff of the two. Notice that the diff is basically just noise from the legend shifting over a bit and a change in the "Test" name.

[Images: feedback-TREFHT-NINO3-TS-NINO3 actual, expected, and diff]

Describe the solution you'd like

We should compare the underlying metrics in the .json files instead. Users should manually validate that the plots look as expected based on the metrics being plotted, since comparing metrics is more reliable than comparing pixels.
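A metrics-based check could look something like this (a minimal sketch; the JSON key structure and tolerance are hypothetical assumptions, not the actual e3sm_diags metrics layout):

```python
import json
import math


def load_metrics(path: str) -> dict:
    """Load a metrics .json file produced by a diagnostics run."""
    with open(path) as f:
        return json.load(f)


def assert_metrics_close(actual: dict, expected: dict, rel_tol: float = 1e-5) -> None:
    """Recursively compare numeric metrics within a relative tolerance.

    Non-numeric values (e.g. units, variable names) must match exactly.
    """
    assert actual.keys() == expected.keys(), "metric keys differ"
    for key, exp_val in expected.items():
        act_val = actual[key]
        if isinstance(exp_val, dict):
            assert_metrics_close(act_val, exp_val, rel_tol)
        elif isinstance(exp_val, (int, float)):
            assert math.isclose(act_val, exp_val, rel_tol=rel_tol), (
                f"{key}: {act_val} != {exp_val}"
            )
        else:
            assert act_val == exp_val, f"{key}: {act_val} != {exp_val}"
```

This is robust to plotting noise (fonts, legend placement) while still catching real regressions in the computed values, and a relative tolerance leaves headroom for benign floating point differences across machines.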

Describe alternatives you've considered

No response

Additional context

No response

forsyth2 commented 9 months ago

Thanks @tomvothecoder, I agree this would be a more reliable test. I suppose zppy could do the same.