NVIDIA / spark-rapids

Spark RAPIDS plugin - accelerate Apache Spark with GPUs
https://nvidia.github.io/spark-rapids
Apache License 2.0

[FEA] Cache input data to reduce integration test time #9763

Open · ttnghia opened this issue 11 months ago

ttnghia commented 11 months ago

As of now, each run of the integration tests can take more than 3 hours (more than 4 hours on Databricks). We could consider caching the input data: store all of the randomly generated data in a static map and reuse it throughout the tests. Doing so could save a significant amount of time.

ttnghia commented 11 months ago

For example, if two tests use IntegerGen(), they can share exactly the same input. Special generators such as IntegerGen().with_special_case() must be treated as distinct from the plain IntegerGen(), for example by hashing the generator's internal state, as in the sketch below.
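
For illustration, here is a minimal sketch of that idea. Every class, method, and variable name below is hypothetical, not the plugin's actual API:

```python
# Illustrative sketch only: DataGen, IntegerGen, and _gen_cache are
# hypothetical names, not the plugin's actual API.
import random

_gen_cache = {}  # static map: generator key -> previously generated values

class DataGen:
    def cache_key(self):
        # Two generators share data only if their keys match exactly.
        return repr(self)

    def gen_values(self, count, seed):
        raise NotImplementedError

    def gen_cached(self, count, seed=0):
        key = (self.cache_key(), count, seed)
        if key not in _gen_cache:
            _gen_cache[key] = self.gen_values(count, seed)
        return _gen_cache[key]

class IntegerGen(DataGen):
    def __init__(self, special_cases=()):
        self.special_cases = tuple(special_cases)

    def __repr__(self):
        # Including the special cases in the repr makes
        # IntegerGen().with_special_case(0) hash differently
        # from a plain IntegerGen().
        return f'IntegerGen(special_cases={self.special_cases})'

    def with_special_case(self, value):
        return IntegerGen(self.special_cases + (value,))

    def gen_values(self, count, seed):
        rng = random.Random(seed)
        vals = [rng.randint(-2**31, 2**31 - 1) for _ in range(count)]
        for i, v in enumerate(self.special_cases):
            vals[i % count] = v
        return vals

# Two tests using the same plain IntegerGen() share the exact same list...
assert IntegerGen().gen_cached(10) is IntegerGen().gen_cached(10)
# ...while a generator with a special case gets its own cache entry.
assert IntegerGen().gen_cached(10) is not IntegerGen().with_special_case(0).gen_cached(10)
```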

thirtiseven commented 11 months ago

Currently we already have cached dataframes: https://github.com/NVIDIA/spark-rapids/blob/30c3df35ab7e87f71ecd84789529d25d9a289848/integration_tests/src/main/python/data_gen.py#L758-L763

And the DataGens have a `_cache_repr` used as the hash key. Is that the same thing as this issue?
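
For reference, a rough sketch of what that list-level caching might look like. The bodies below are assumptions for illustration; only the `gen_df` name and the `_cache_repr` method come from the thread and data_gen.py:

```python
# Rough sketch only: these bodies are assumptions, not the actual
# data_gen.py implementation.
import functools
import random

@functools.lru_cache(maxsize=None)
def _cached_values(cache_repr, length, seed):
    # cache_repr plays the role of DataGen._cache_repr(): a string that
    # must encode everything that changes the generated values.
    rng = random.Random(seed)
    return tuple(rng.randint(-2**31, 2**31 - 1) for _ in range(length))

def gen_df(spark, data_gen, length=2048, seed=0):
    values = list(_cached_values(data_gen._cache_repr(), length, seed))
    # Only the Python values are cached; the Spark DataFrame itself is
    # rebuilt on every call, which is what the comments below discuss.
    return spark.createDataFrame([(v,) for v in values], 'a int')
```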

ttnghia commented 11 months ago

Oh wow, I didn't know that. But all of our tests still run very slowly, and that decoration only caches Python lists of values, not Spark DataFrames. Is it possible to cache Spark DataFrames instead?

thirtiseven commented 11 months ago

> That decoration only caches Python lists of values, not Spark DataFrames. Is it possible to cache Spark DataFrames instead?

Yes, it is possible. We are not doing it now because runtime configs in the Spark session may affect the resulting DataFrames. In my tests it made the integration tests run only slightly faster. It is tracked by https://github.com/NVIDIA/spark-rapids/issues/8524
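
For illustration, a minimal sketch of conf-aware DataFrame caching; every name here, including the choice of confs and the `gen_values` helper, is an assumption, not code from that issue:

```python
# Minimal sketch of conf-aware DataFrame caching; all names here are
# assumptions for illustration, not code from issue #8524.
_df_cache = {}

# Session confs that can change the contents or semantics of a generated
# DataFrame; deciding which confs actually matter is exactly the hard part.
_RELEVANT_CONFS = ('spark.sql.session.timeZone', 'spark.sql.ansi.enabled')

def gen_df_cached(spark, data_gen, length=2048, seed=0):
    # Fold the relevant runtime confs into the cache key so that tests
    # running with different settings do not share a stale DataFrame.
    conf_key = tuple((c, spark.conf.get(c, None)) for c in _RELEVANT_CONFS)
    key = (data_gen._cache_repr(), length, seed, conf_key)
    if key not in _df_cache:
        rows = [(v,) for v in data_gen.gen_values(length, seed)]
        _df_cache[key] = spark.createDataFrame(rows, 'a int')
    return _df_cache[key]
```

Note that a DataFrame is also tied to the SparkSession that created it, so any test that restarts the session would invalidate these entries, which limits how much such a cache can help.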

The current performance bottleneck of the integration tests seems to be https://github.com/NVIDIA/spark-rapids/issues/8447