egraphs-good / egglog

egraphs + datalog!
https://egraphs-good.github.io/egglog/
MIT License

Update Python example #444

Closed Β· saulshanabrook closed this 4 weeks ago

saulshanabrook commented 1 month ago

In this PR we change the benchmarks to only run long-running examples instead of all of them. This is to reduce variance caused by indeterminacy in the memory allocator, which is overly prominent in short benchmarks (https://github.com/oxc-project/backlog/issues/89).

It also updates the Python example to more accurately reflect the current workload, including optimizing the program and converting it into a final Python string, all with CSE applied (so the generated file is smaller than before; repeated subexpressions such as y == np.array(0) are bound to a variable once and reused). It should output a final string that looks something like this:

def __fn(X, y):
    assert X.dtype == np.dtype(np.float64)
    assert X.shape == (150, 4,)
    assert np.all(np.isfinite(X))
    assert y.dtype == np.dtype(np.int64)
    assert y.shape == (150,)
    assert set(np.unique(y)) == set((0, 1, 2,))
    _0 = y == np.array(0)
    _1 = np.sum(_0)
    _2 = y == np.array(1)
    _3 = np.sum(_2)
    _4 = y == np.array(2)
    _5 = np.sum(_4)
    _6 = np.array((_1, _3, _5,)).astype(np.dtype(np.float64))
    _7 = _6 / np.array(float(150))
    _8 = np.zeros((3, 4,), dtype=np.dtype(np.float64))
    _9 = np.sum(X[_0], axis=0)
    _10 = _9 / np.array(X[_0].shape[0])
    _8[0, :,] = _10
    _11 = np.sum(X[_2], axis=0)
    _12 = _11 / np.array(X[_2].shape[0])
    _8[1, :,] = _12
    _13 = np.sum(X[_4], axis=0)
    _14 = _13 / np.array(X[_4].shape[0])
    _8[2, :,] = _14
    _15 = _7 @ _8
    _16 = X - _15
    _17 = np.sqrt(np.array(float(1 / 147)))
    _18 = X[_0] - _8[0, :,]
    _19 = X[_2] - _8[1, :,]
    _20 = X[_4] - _8[2, :,]
    _21 = np.concatenate((_18, _19, _20,), axis=0)
    _22 = np.sum(_21, axis=0)
    _23 = _22 / np.array(_21.shape[0])
    _24 = np.expand_dims(_23, 0)
    _25 = _21 - _24
    _26 = np.square(_25)
    _27 = np.sum(_26, axis=0)
    _28 = _27 / np.array(_26.shape[0])
    _29 = np.sqrt(_28)
    _30 = _29 == np.array(0)
    _29[_30] = np.array(float(1))
    _31 = _21 / _29
    _32 = _17 * _31
    _33 = np.linalg.svd(_32, full_matrices=False)
    _34 = _33[1] > np.array(0.0001)
    _35 = _34.astype(np.dtype(np.int32))
    _36 = np.sum(_35)
    _37 = _33[2][:_36, :,] / _29
    _38 = _37.T / _33[1][:_36]
    _39 = np.array(150) * _7
    _40 = _39 * np.array(float(1 / 2))
    _41 = np.sqrt(_40)
    _42 = _8 - _15
    _43 = _41 * _42.T
    _44 = _43.T @ _38
    _45 = np.linalg.svd(_44, full_matrices=False)
    _46 = np.array(0.0001) * _45[1][0]
    _47 = _45[1] > _46
    _48 = _47.astype(np.dtype(np.int32))
    _49 = np.sum(_48)
    _50 = _38 @ _45[2].T[:, :_49,]
    _51 = _16 @ _50
    return _51[:, :2,]
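
For context, and not part of the PR output itself: the asserted shapes and label set match scikit-learn's iris dataset (150 samples, 4 float64 features, integer classes {0, 1, 2}), so a quick, hypothetical way to exercise the generated function, once the emitted string has been defined as __fn in a module, would be:

import numpy as np
from sklearn.datasets import load_iris

# Hypothetical sanity check: load data matching the generated assertions
# (150 x 4 float64 feature matrix, 150 int64 labels drawn from {0, 1, 2}).
X, y = load_iris(return_X_y=True)
X = X.astype(np.float64)
y = y.astype(np.int64)

out = __fn(X, y)
print(out.shape)  # (150, 2) when at least two components survive the rank cutoffs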
codspeed-hq[bot] commented 1 month ago

CodSpeed Performance Report

Merging #444 will not alter performance

Comparing saulshanabrook:update-python-example (0082005) with main (b0db068)

Summary

βœ… 6 untouched benchmarks

πŸ†• 2 new benchmarks

Benchmarks breakdown

Benchmark                 | main | saulshanabrook:update-python-example | Change
πŸ†• python_array_optimize  | N/A  | 6.6 s                                | N/A
πŸ†• stresstest_large_expr  | N/A  | 2.9 s                                | N/A
Alex-Fischman commented 1 month ago

This seems wrong to me. We want to be able to catch regressions in our whole codebase, not just in 6 slow tests, right?

saulshanabrook commented 1 month ago

I think the issue is that the shorter tests have too much run-to-run variability to benchmark reliably. Alternatively, we could try a deterministic memory allocator for our benchmarks. The thread I linked above outlines the issue.

My sense is that if a change doesn't affect our larger benchmarks it's probably fine to pull in... as long as they are representative enough of all the code.
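
For intuition only (a hypothetical stdlib sketch, unrelated to how CodSpeed actually measures): repeatedly timing a very short workload shows a much larger relative spread than timing a longer one, which is the kind of noise that makes short benchmarks hard to track.

import timeit

# Hypothetical illustration: relative spread of repeated timings for a short
# vs. a longer workload. The short one is dominated by fixed overheads and
# allocator noise, so its relative variance is much higher.
def relative_spread(stmt: str) -> float:
    times = timeit.repeat(stmt, repeat=10, number=1)
    return (max(times) - min(times)) / min(times)

print("short:", relative_spread("sum(range(100))"))
print("long: ", relative_spread("sum(range(10_000_000))"))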

Alex-Fischman commented 1 month ago

If the large tests are representative then this is a good change; I just didn't think that they were.

Alex-Fischman commented 1 month ago

Looking at the codspeed regressions on a different PR (#448) I'm seeing that a lot of the variance comes from the parser.

yihozhang commented 1 month ago

Aren't most of the regressions from short-running programs uninteresting because many are one-off costs like parsing, compilation, and bookkeeping? We care more about performance-critical components like query/apply/rebuild, which only become prominent in longer-running programs.

yihozhang commented 1 month ago

I think the old python example is a good stress test for the type checker. Maybe we can still keep it but under a different name?

Alex-Fischman commented 1 month ago

I think that we should increase the regression threshold back to 10%: look at this report. Even on a benchmark that takes >10ms, the rebuild method still gets significantly slower. (Or should we fix rebuild to not be randomly slow? It's unclear how feasible that is.)

saulshanabrook commented 1 month ago

Another option: I could "ignore" all benchmarks that aren't long-running: https://docs.codspeed.io/features/ignoring-benchmarks/

That way they still get recorded and can be viewed manually but won't influence the reports.

The only downside here is that if we add additional short examples, they will have to be manually added to the ignore list in the UI.

Alex-Fischman commented 1 month ago

That last option seems strictly better than the solution in this PR to me: opt-out vs. opt-in. Hopefully it would motivate us to create better benchmarks in the future.

saulshanabrook commented 1 month ago

> That last option seems strictly better than the solution in this PR to me: opt-out vs. opt-in. Hopefully it would motivate us to create better benchmarks in the future.

OK sounds good! I will update this PR to just add the Python benchmark (and keep the old one) and then just ignore the smaller benchmarks for now :)

saulshanabrook commented 4 weeks ago

OK I think this is ready to merge