HypothesisWorks / hypothesis

Hypothesis is a powerful, flexible, and easy to use library for property-based testing.
https://hypothesis.works

FailedHealthCheck and slow performance with just 400 examples of builtin strategies #2641

Closed: theoden-dd closed this issue 3 years ago

theoden-dd commented 4 years ago

Maybe the answer is simply "disable the health check then", as in #2195; I'm mainly reporting a performance reproducer.

I've got a simple profile strategy:

from hypothesis import strategies as st
from hypothesis.provisional import urls

profile_data_lists = st.lists(
    st.fixed_dictionaries({
        'name': st.text(),
        'username': st.text(),
        'id': st.integers(min_value=1),
        'file': st.fixed_dictionaries({
            'type': st.sampled_from(ImageFileType.values()),
            'url': urls(),
            'image': st.fixed_dictionaries({
                'small': urls(),
                'big': urls(),
            }),
        }),
        'idol_status': st.sampled_from(IdolStatus.names()),
    }),
    max_size=100,
)

where ImageFileType and IdolStatus are some enums.
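
(The exact enum definitions don't matter for the repro; a simplified, hypothetical stand-in would be something like:)

from enum import Enum

class ImageFileType(Enum):
    # Hypothetical stand-in for the real project enum.
    JPEG = 'jpeg'
    PNG = 'png'

    @classmethod
    def values(cls):
        # Plain values for st.sampled_from().
        return [member.value for member in cls]

class IdolStatus(Enum):
    # Hypothetical stand-in for the real project enum.
    ACTIVE = 'active'
    RETIRED = 'retired'

    @classmethod
    def names(cls):
        # Member names for st.sampled_from().
        return [member.name for member in cls]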

With 10 and 100 examples it works just fine, but starting from 400 examples it fails with errors like:

hypothesis.errors.FailedHealthCheck: Data generation is extremely slow: Only produced 7 valid examples in 1.20 seconds (0 invalid ones and 2 exceeded maximum size). Try decreasing size of the data you're generating (with e.g. max_size or max_leaves parameters).

I wonder why 100 elements in a list became a "large data set".

As a workaround I reduced the list's max_size to 50, but a single test still takes 80-90 seconds. The test itself is simple; the lag is almost entirely in data generation. For comparison, with 10 examples it takes 1.45 seconds.

platform linux -- Python 3.8.2, pytest-5.4.3, py-1.8.2, pluggy-0.13.1
hypothesis-5.16.2

Zac-HD commented 4 years ago

Can you share a complete reproducing example (i.e. including the test function), and run with --hypothesis-show-statistics to give an idea of the breakdown between generating data and running the test?

FWIW I agree that "hundreds of elements" is not a large dataset, but it's on the large side for Hypothesis to generate: we're working with about 8K of entropy, and we might have to do a lot of work to find a minimal example if the test ever fails.
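
To make that concrete with rough numbers (illustrative, not exact internals): the budget is about 8192 bytes per example, and each profile dict draws two text values, three URLs, an integer, and two sampled choices. If a single profile costs on the order of 80 bytes of that buffer, a 100-element list needs roughly 8000 bytes, i.e. essentially the whole budget; that's why examples show up as "exceeded maximum size".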

I think my advice will end up being to consider the trade-off between a few large/slow examples and many small/fast examples: which would find more bugs per minute of runtime? If the large ones, disabling the health check is probably the way to go :confused:
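
For reference, suppressing it per-test looks something like this (a sketch against the profile_data_lists strategy above; the test name and body are placeholders):

from hypothesis import HealthCheck, given, settings

@settings(suppress_health_check=[HealthCheck.too_slow])
@given(profile_data_lists)
def test_profiles(profile_data_list_example):
    # The health check no longer fails, but generation is still slow;
    # this trades the error for longer runtimes rather than removing the cost.
    assert isinstance(profile_data_list_example, list)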

theoden-dd commented 4 years ago

What exactly from the --hypothesis-show-statistics output is needed?

The test is the simplest possible one for a list endpoint of a RESTful resource:

from django.urls import reverse
from hypothesis import given
from rest_framework import status


class TestProfileList(object):
    """Test reading the profile list."""

    # @given fills positional arguments from the right, so the generated
    # list lands in `profile_data_list_example`; `api_client` stays a fixture.
    @given(profile_data_lists)
    def test_ok(
        self,
        api_client,
        profile_data_list_example,
    ):
        """Test ok path."""
        # mock_v0_response is a project helper that stubs the upstream API.
        with mock_v0_response(profile_data_list_example):
            res = api_client.get(reverse('api_v1:profile-list'))
        assert res.status_code == status.HTTP_200_OK, res.data

Thank you for the general trade-off idea. In my case smaller sizes should work better; I had hoped to get long-data testing as a free bonus along the way. :)

theoden-dd commented 4 years ago

I reduced the list's max_size to 50, but still:

hypothesis.errors.FailedHealthCheck: Data generation is extremely slow: Only produced 8 valid examples in 1.07 seconds (0 invalid ones and 1 exceeded maximum size). Try decreasing size of the data you're generating (with e.g. max_size or max_leaves parameters).

When the error fires, the statistics flag adds nothing to the output:

---------------------------------- Hypothesis ----------------------------------
You can add @seed(270804462227342365511253416120965193394) to this test or run pytest with --hypothesis-seed=270804462227342365511253416120965193394 to reproduce this failure.
============================ Hypothesis Statistics ============================
=========================== slowest 5 test durations ===========================

I'll try to reduce size/examples slightly to get the statistics out.

theoden-dd commented 4 years ago

Reduced max_size to 30, here are the stats:

tests/test_api/v1/test_profile.py::TestProfileList::test_ok:

  - during reuse phase (5.26 seconds):
    - Typical runtimes: 16-255 ms, ~ 88% in data generation
    - 40 passing examples, 0 failing examples, 0 invalid examples

  - during generate phase (84.35 seconds):
    - Typical runtimes: 7-289 ms, ~ 88% in data generation
    - 360 passing examples, 0 failing examples, 31 invalid examples

  - Stopped because settings.max_examples=400
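
So of the 84.35-second generate phase, roughly 88%, about 74 seconds, is spent purely on data generation.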

theoden-dd commented 4 years ago

One last interesting observation: the running time grows non-linearly. E.g. compare the stats for max_size=50:

  - during generate phase (1.10 seconds):
    - Typical runtimes: 9-26 ms, ~ 81% in data generation
    - 20 passing examples, 0 failing examples, 27 invalid examples

  - Stopped because settings.max_examples=20

  - during generate phase (3.15 seconds):
    - Typical runtimes: 8-100 ms, ~ 82% in data generation
    - 25 passing examples, 0 failing examples, 16 invalid examples

  - Stopped because settings.max_examples=25

+25% examples vs +186% time

  - during generate phase (11.38 seconds):
    - Typical runtimes: 9-230 ms, ~ 89% in data generation
    - 30 passing examples, 0 failing examples, 32 invalid examples

  - Stopped because settings.max_examples=30

+20% examples vs +261% time

theoden-dd commented 4 years ago

Well, things have got worse. I don't know whether I should open a new issue for this, but the source data and the test are the same.

Eventually I went down to max_size=20 x max_examples=20. Even that raised "data too large" from time to time. So the current settings are:

    @settings(
        # Unreliable test timings! On an initial run, this test took 924.45ms, which exceeded
        # the deadline of 200.00ms, but on a subsequent run it took 5.59 ms, which did not.
        # If you expect this sort of variability in your test timings, consider turning
        # deadlines off for this test by setting deadline=None.
        deadline=None,
        max_examples=20,
        suppress_health_check=[HealthCheck.data_too_large],
    )

But that wasn't the end! Now it complains:

hypothesis.errors.FailedHealthCheck: Data generation is extremely slow: Only produced 9 valid examples in 1.08 seconds (0 invalid ones and 10 exceeded maximum size). Try decreasing size of the data you're generating (with e.g. max_size or max_leaves parameters).

@Zac-HD, why does Hypothesis decide for the developer which example size is too big? Should we add a setting to configure it per project? Or maybe drop the max-size filter when data_too_large is suppressed?

I'm feeling kind of frustrated: I suppressed the data size limit, and Hypothesis still says, in effect: "OK, now I'll just become slower for your big data anyway!"

Zac-HD commented 3 years ago

Having looked into this again, I don't think there's much that we can do, sorry 😥 The basic problem arises from a combination of the facts discussed above: the fixed entropy budget of roughly 8K per example, and how much of that budget complex strategies consume.

So I'd recommend suppressing HealthCheck.too_slow as well, and/or using less complex inputs where possible. The primary culprit here is probably urls(), which is a far more complex strategy than you might think.
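
If you want that suppression project-wide rather than per test, one way to do it is a registered settings profile, e.g. in conftest.py (a sketch; the profile name is arbitrary):

from hypothesis import HealthCheck, settings

settings.register_profile(
    "big_inputs",
    suppress_health_check=[HealthCheck.too_slow, HealthCheck.data_too_large],
    deadline=None,
)
settings.load_profile("big_inputs")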

I wish I had a better solution, but here we are 😕

crypdick commented 3 years ago

We're doing an integration test on a pipeline that requires the input dataset to be at least a minimum size. Since this integration test is rather slow, we are happy testing it on a single sample (we also unit-test separately):

import hypothesis
from hypothesis import given, settings

@given(df=pandera_schema.strategy(size=9))
@settings(
    max_examples=1,
    deadline=3000,  # 3 s, in ms
    suppress_health_check=[
        hypothesis.HealthCheck.function_scoped_fixture,
        hypothesis.HealthCheck.large_base_example,
        hypothesis.HealthCheck.data_too_large,
    ])
def long_test(df): ...

As you can see, we've disabled the data_too_large and large_base_example health checks, yet we still get the "0 valid examples [...] (0 invalid ones and 5 exceeded maximum size)" error. If I add too_slow as suggested, the tests take minutes to run.

We are working around this maximum-size check on our (tiny, IMO) dataframes by doing something like df = df_strategy.example() inside the test, without @given or @settings. I timed that line and it takes a couple of seconds, much faster than what we were seeing with the too_slow check suppressed.
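
Concretely, the workaround is roughly this (df_strategy is the same pandera_schema.strategy(size=9) object as above; run_pipeline stands in for the real test body):

def long_test():
    # Draw a single example directly, bypassing @given and its health checks.
    # Note: .example() is documented for interactive exploration only, so
    # Hypothesis emits a NonInteractiveExampleWarning when it runs in a test.
    df = df_strategy.example()
    run_pipeline(df)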

In short, I agree with @theoden-dd that disabling the data_too_large health check should also disable the "exceeded maximum size" error.

Zac-HD commented 3 years ago

I agree with theoden-dd that disabling the data_too_large health check should also disable the "exceeded maximum size" error.

Unfortunately the error is telling you that Hypothesis literally ran out of random bytes to parse into your dataframe, and there's not really anything that we can do about that.

I suspect that in your case the underlying problem is actually an inefficient implementation in pandera; the way they compose checks with .filter() makes pathological performance issues unfortunately quite common. Additionally, Hypothesis always tries the "all-zeros" minimal example first, so if you're seeing more varied data, the filters are definitely at work :-/
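
To illustrate the mechanism with a toy example (nothing pandera-specific): a strategy built on a highly selective .filter() forces Hypothesis to generate and discard most draws, whereas a constructive .map() produces only valid values:

from hypothesis import strategies as st

# Rejection sampling: only ~1 in 100 draws survives the predicate,
# so roughly 99% of the generation work is thrown away.
slow = st.integers(min_value=0, max_value=10_000).filter(lambda n: n % 100 == 0)

# Constructive: every draw is valid, with no rejection at all.
fast = st.integers(min_value=0, max_value=100).map(lambda n: n * 100)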

I suggest changing the pipeline to allow fewer rows and/or columns, or opening an issue on pandera about this performance problem. FWIW, while this is too niche for us to fix for free, we also do consulting contracts if this is actually important to your company (e.g. Stripe was very happy the last time we worked with dataframes).

crypdick commented 3 years ago

Ah! If pandera is using rejection sampling then this problem makes total sense.

Zac-HD commented 3 years ago

Exactly 😁

"Improve pandera performance" is definitely on my todo list (albeit a long way from the top), and #2701 should also help.

DrUpir commented 3 years ago

Thank you so much for an excellent property-based testing library :pray:

Is there any way to bind HealthCheck.too_slow to the amount of CPU work (cycles) instead of elapsed time?

Context: health check results differ between CPUs with different clock rates. For example, to reliably reproduce locally (on an Intel(R) Core(TM) i5-9600K CPU @ 3.70GHz) errors that occur on a 2.2 GHz server, I used:

# docker-compose.local.yml
services:
  service_name:
    deploy:
      resources:
        limits:
          cpus: '0.3'

I decided to post this here instead of opening a new issue because I assume there may be good reasons not to have such a binding.