Open kevinjqliu opened 6 months ago
Some potential optimizations:
- Parallelize pytest execution; this requires that each test can be run independently.
- Tests such as test_query_filter_only_nulls are parameterized by many args, making the number of tests equal to the args multiplied together.
Interesting, @kevinjqliu are you already working on this?
Yeah, I started looking at ways to optimize all the tests. I'm blocked on making tests run in isolation. There's a wip PR #598
Here's what I'm blocked on specifically.
Parallelize this test, test_query_filter_appended_null:
PYTEST_ARGS="-n auto -k test_query_filter_appended_null" /usr/bin/time make test-integration
The problem is that the table used, default.arrow_table_v1_with_null, is created by a fixture with scope="session", autouse=True, which, because of the session scope, can't run in parallel.
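One possible way around the shared table, sketched here as an assumption rather than what the WIP PR does: under pytest-xdist each worker is a separate process and runs session-scoped fixtures itself, so a fixture could derive a per-worker table name. pytest-xdist exposes the worker name (for example gw0) in the PYTEST_XDIST_WORKER environment variable; the helper name worker_table_name is hypothetical.

```python
import os


def worker_table_name(base: str, env=os.environ) -> str:
    """Return a table identifier that is unique to the current xdist worker.

    Hypothetical helper: not part of the pyiceberg test suite.
    """
    # pytest-xdist sets PYTEST_XDIST_WORKER (e.g. "gw0") in each worker process.
    # The variable is absent when the tests run without -n.
    worker = env.get("PYTEST_XDIST_WORKER")
    if worker is None:
        return base
    return f"{base}_{worker}"
```

A session-scoped fixture could then create worker_table_name("default.arrow_table_v1_with_null") instead of the fixed identifier, at the cost of creating the table once per worker.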
I would love to check where the time is being spent.
Concerning parallelization, I'm a bit hesitant since I think this will also put more pressure on the REST catalog. I doubt it will really speed up the tests, but the numbers will tell.
Making all the tests run in parallel requires lots and lots of changes.
For the short term, the best way forward is to look at individual tests. pytest's --durations shows parameterized tests as separate tests, so I wrote a script to roll up the parameterized tests.
Here are the top 10 tests by duration, grouped by test name for parameterized tests:
tests/integration/test_writes/test_partitioned_writes.py:test_query_filter_appended_null_partitioned, 22 tests, took 47.68
tests/integration/test_writes/test_partitioned_writes.py:test_query_filter_null_partitioned, 22 tests, took 39.73
tests/integration/test_writes/test_partitioned_writes.py:test_query_filter_only_nulls_partitioned, 22 tests, took 29.37
tests/integration/test_writes/test_partitioned_writes.py:test_query_filter_v1_v2_append_null, 11 tests, took 22.38
tests/integration/test_writes/test_partitioned_writes.py:test_query_filter_without_data_partitioned, 22 tests, took 17.06
tests/integration/test_add_files.py:test_add_files_to_unpartitioned_table, 3 tests, took 13.00
tests/integration/test_partitioning_key.py:test_partition_key, 28 tests, took 6.25
tests/integration/test_rest_schema.py:test_disallowed_updates, 337 tests, took 5.88
tests/integration/test_reads.py:test_ray_nan, 2 tests, took 3.88
tests/integration/test_writes/test_writes.py:test_query_filter_appended_null, 24 tests, took 3.66
The top 5 tests are all from test_partitioned_writes.py, totaling 156.22 seconds (2.6 minutes).
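The roll-up script itself is not included in the thread. A minimal sketch of the idea, assuming the usual pytest --durations line format (for example "1.50s call     tests/a.py::test_x[int]") and counting only the call phase, might look like this. The names DURATION_LINE and roll_up are illustrative, and the output format differs from the list above.

```python
import re
from collections import defaultdict

# Matches lines such as "1.50s call     tests/a.py::test_x[int]".
DURATION_LINE = re.compile(
    r"^\s*(?P<secs>\d+(?:\.\d+)?)s\s+(?P<phase>setup|call|teardown)\s+(?P<nodeid>\S.*)$"
)


def roll_up(lines):
    """Group duration lines by test name, ignoring the [param] suffix.

    Returns (test name, number of parameterized cases, total seconds)
    tuples, slowest first. Only the "call" phase is counted (an assumption).
    """
    totals = defaultdict(lambda: [0, 0.0])
    for line in lines:
        match = DURATION_LINE.match(line)
        if match is None or match["phase"] != "call":
            continue  # skip headers, blank lines, setup and teardown rows
        base_name = match["nodeid"].split("[", 1)[0]
        entry = totals[base_name]
        entry[0] += 1
        entry[1] += float(match["secs"])
    rows = [(name, count, secs) for name, (count, secs) in totals.items()]
    return sorted(rows, key=lambda row: row[2], reverse=True)
```

Feeding it the captured output of a run with --durations=0 would give one row per test function rather than one per parameter combination.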
These tests are heavily parameterized:
@pytest.mark.parametrize(
"part_col", ['int', 'bool', 'string', "string_long", "long", "float", "double", "date", "timestamptz", "timestamp", "binary"]
)
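To illustrate how stacked parameters multiply, the sketch below counts the combinations for the 11 part_col values above. The second parameter, format_version with values 1 and 2, is an assumption that is not shown in this thread; it is chosen only because it is consistent with the 22 tests reported per function.

```python
from itertools import product

# The 11 partition column types quoted from the parametrize call above.
part_cols = [
    "int", "bool", "string", "string_long", "long", "float",
    "double", "date", "timestamptz", "timestamp", "binary",
]

# Assumption: a second parametrize over the table format version.
format_versions = [1, 2]

# pytest generates one test per element of the cartesian product
# when parametrize decorators are stacked.
cases = list(product(part_cols, format_versions))
test_count = len(cases)  # 11 * 2 combinations
```

Each additional stacked parameter multiplies the count again, which is why trimming one parameter list can remove many tests at once.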
This issue has been automatically marked as stale because it has been open for 180 days with no activity. It will be closed in next 14 days if no further activity occurs. To permanently prevent this issue from being considered stale, add the label 'not-stale', but commenting on the issue is preferred when possible.
Apache Iceberg version
main (development)
Please describe the bug 🐞
Integration tests feel significantly slower than before.
Running on the latest main branch took 266 seconds, which is more than 4 minutes.
Compare this to the pyiceberg-0.6.0rc6 tag, where integration tests only took 100 seconds.
Here are the top 10 slowest tests: