ttnghia opened 11 months ago
For example, if two tests use IntegerGen(),
they can share exactly the same input.
Special generators such as IntegerGen().with_special_case()
must be treated as distinct from the plain IntegerGen(),
e.g. through internal object hashing.
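The idea above could be sketched roughly as follows. This is a hypothetical, minimal mock, not the spark-rapids implementation: the `IntegerGen`, `with_special_case`, and `_cache_repr` names mirror those mentioned in this thread, but the bodies here are invented for illustration.

```python
import random

class IntegerGen:
    """Toy generator: the real spark-rapids class is richer; this only shows keying."""
    def __init__(self, special_cases=()):
        self._special_cases = tuple(special_cases)

    def with_special_case(self, value):
        # Returns a new generator, so its cache key differs from the plain one.
        return IntegerGen(self._special_cases + (value,))

    def _cache_repr(self):
        # Stable key: two plain IntegerGen() instances produce the same key
        # (so they can share data), while any special case changes the key.
        return ("IntegerGen", self._special_cases)

    def generate(self, n, seed=0):
        rng = random.Random(seed)
        return list(self._special_cases) + [rng.randint(-100, 100) for _ in range(n)]

# Static map of generated data, shared across tests.
_data_cache = {}

def cached_data(gen, n):
    key = (gen._cache_repr(), n)
    if key not in _data_cache:
        _data_cache[key] = gen.generate(n)
    return _data_cache[key]
```

With this keying, two tests asking for `cached_data(IntegerGen(), n)` get the identical list back, while `IntegerGen().with_special_case(...)` generates and caches its own data.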
Currently we already have cached dataframes: https://github.com/NVIDIA/spark-rapids/blob/30c3df35ab7e87f71ecd84789529d25d9a289848/integration_tests/src/main/python/data_gen.py#L758-L763
and a _cache_repr
in the data gens for hashing. Is that the same thing as this issue?
Oh wow, I didn't know that. But all the tests still run very slowly. However, that decoration only caches Python lists of values, not Spark dataframes. Is it possible to cache Spark dataframes instead?
Yes, it is possible. We are not doing it now because the runtime config in the Spark session may affect the resulting dataframes. In my tests it makes the IT run slightly faster. It is tracked by https://github.com/NVIDIA/spark-rapids/issues/8524
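The session-config concern could be handled by folding the relevant confs into the cache key. This is a hedged sketch of that direction, not the tracked implementation: `relevant_confs` and the `build_df` callback are hypothetical stand-ins.

```python
# Cache of built dataframes, keyed on the generator's repr plus the session
# confs that can change the generated data.
_df_cache = {}

def cached_df(spark_confs, gen_repr, build_df,
              relevant_confs=("spark.sql.session.timeZone",)):
    """Return a cached dataframe, rebuilding only when a relevant conf changes.

    spark_confs: dict of the current session's config values.
    gen_repr:    stable repr of the data generator (e.g. its _cache_repr()).
    build_df:    zero-arg callback that actually builds the dataframe.
    """
    # Only confs that can affect generated data participate in the key,
    # so unrelated config changes don't defeat the cache.
    conf_key = tuple((k, spark_confs.get(k)) for k in relevant_confs)
    key = (gen_repr, conf_key)
    if key not in _df_cache:
        _df_cache[key] = build_df()
    return _df_cache[key]
```

Two sessions that agree on the relevant confs then share one dataframe build; changing a relevant conf (for example the session time zone) forces a rebuild under a new key.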
The current performance bottleneck of IT seems to be https://github.com/NVIDIA/spark-rapids/issues/8447
As of now, each run of the integration tests can take more than 3 hours (more than 4 hours on Databricks). We could cache the input data, storing all the randomly generated data in a static map and reusing it throughout the tests. By doing so, we could probably save a significant amount of time.