apache / iceberg-python

Apache PyIceberg
https://py.iceberg.apache.org/
Apache License 2.0
434 stars 159 forks source link

Integration tests performance degradation #604

Open kevinjqliu opened 6 months ago

kevinjqliu commented 6 months ago

Apache Iceberg version

main (development)

Please describe the bug 🐞

Integration tests feel significantly slower than before.

make test-integration

Running on the latest main branch took 266 seconds, which is more than 4 minutes

/usr/bin/time make test-integration

=============================================================== 765 passed, 4 skipped, 2415 deselected, 24 warnings in 217.18s (0:03:37) ================================================================
      265.95 real        17.03 user         7.11 sys

Compare this to pyiceberg-0.6.0rc6 tag, integration tests only took 100 seconds.

git checkout pyiceberg-0.6.0rc6
/usr/bin/time make test-integration

============================================================== 508 passed, 4 skipped, 2305 deselected in 48.61s ==============================================================
       99.89 real         9.43 user         4.16 sys

Here are the top 10 slowest tests:

PYTEST_ARGS="--durations=10" make test-integration

============================================================================ slowest 10 durations ============================================================================
5.93s call     tests/integration/test_add_files.py::test_add_files_to_unpartitioned_table[1]
3.52s call     tests/integration/test_reads.py::test_ray_nan[catalog_hive]
3.12s setup    tests/integration/test_add_files.py::test_add_files_to_unpartitioned_table[1]
2.66s call     tests/integration/test_writes/test_partitioned_writes.py::test_query_filter_null_partitioned[1-int]
2.62s call     tests/integration/test_writes/test_partitioned_writes.py::test_query_filter_v1_v2_append_null[int]
2.46s call     tests/integration/test_writes/test_partitioned_writes.py::test_query_filter_appended_null_partitioned[2-binary]
2.35s call     tests/integration/test_writes/test_partitioned_writes.py::test_query_filter_null_partitioned[1-float]
2.26s call     tests/integration/test_writes/test_partitioned_writes.py::test_query_filter_v1_v2_append_null[bool]
2.21s call     tests/integration/test_writes/test_partitioned_writes.py::test_query_filter_null_partitioned[1-long]
2.21s call     tests/integration/test_reads.py::test_table_properties_dict[catalog_hive]
kevinjqliu commented 6 months ago

Some potential optimizations:

Gowthami03B commented 6 months ago

Interesting, @kevinjqliu are you already working on this?

kevinjqliu commented 6 months ago

Yeah, I started looking at ways to optimize all the tests. I'm blocked on making tests run in isolation. There's a wip PR #598

kevinjqliu commented 6 months ago

Here's what I'm blocked on specifically.

Parallelize this test test_query_filter_appended_null,

PYTEST_ARGS="-n auto -k test_query_filter_appended_null" /usr/bin/time make test-integration

The problem is the table default.arrow_table_v1_with_null used has scope="session", autouse=True, which because of the session scope, can't run in parallel

Fokko commented 6 months ago

I would love to check where the time is being spent.

Concerning parallelization, I'm a bit hesitant since I think this will also apply more pressure on the rest catalog. I'm doubtful if it will really speed up the tests, but the numbers will tell.

kevinjqliu commented 6 months ago

Making all the tests run in parallel requires lots and lots of changes.

For the short term, the best way forward is to look at individual tests. pytest's --duration shows parameterized tests as separate tests. I wrote a script to roll up the parameterized tests

Here are the top 10 tests by duration, group by test name for parameterized tests

tests/integration/test_writes/test_partitioned_writes.py:test_query_filter_appended_null_partitioned, 22 tests, took 47.68
tests/integration/test_writes/test_partitioned_writes.py:test_query_filter_null_partitioned, 22 tests, took 39.73
tests/integration/test_writes/test_partitioned_writes.py:test_query_filter_only_nulls_partitioned, 22 tests, took 29.37
tests/integration/test_writes/test_partitioned_writes.py:test_query_filter_v1_v2_append_null, 11 tests, took 22.38
tests/integration/test_writes/test_partitioned_writes.py:test_query_filter_without_data_partitioned, 22 tests, took 17.06
tests/integration/test_add_files.py:test_add_files_to_unpartitioned_table, 3 tests, took 13.00
tests/integration/test_partitioning_key.py:test_partition_key, 28 tests, took 6.25
tests/integration/test_rest_schema.py:test_disallowed_updates, 337 tests, took 5.88
tests/integration/test_reads.py:test_ray_nan, 2 tests, took 3.88
tests/integration/test_writes/test_writes.py:test_query_filter_appended_null, 24 tests, took 3.66

Top 5 tests are all from test_partitioned_writes.py, totaling 156.22 seconds (2.6 minutes). These tests are heavily parameterized

@pytest.mark.parametrize(
    "part_col", ['int', 'bool', 'string', "string_long", "long", "float", "double", "date", "timestamptz", "timestamp", "binary"]
)
github-actions[bot] commented 1 week ago

This issue has been automatically marked as stale because it has been open for 180 days with no activity. It will be closed in next 14 days if no further activity occurs. To permanently prevent this issue from being considered stale, add the label 'not-stale', but commenting on the issue is preferred when possible.