gchq / Gaffer

A large-scale entity and relation database supporting aggregation of properties
Apache License 2.0

Optimise SortFullGroup in ParquetStore by using repartitionByRange #1916

Closed by gaffer01 2 years ago

gaffer01 commented 6 years ago

Spark 2.3 introduced a repartitionByRange method on DataFrames. This could make SortFullGroup in the Parquet store considerably more efficient, possibly by avoiding the need to drop down to RDDs.
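
To illustrate why range partitioning helps here, below is a minimal, Spark-free sketch of the idea. If rows are split into partitions by key range and each partition is then sorted on its own, reading the partitions in order yields a fully sorted result, with no separate global sort step. In Spark the corresponding DataFrame calls would be repartitionByRange followed by sortWithinPartitions. The helper names, the sample rows and the way boundaries are chosen are assumptions made for this illustration and are not taken from the ParquetStore code; Spark itself estimates boundaries by sampling rather than by sorting every key.

```python
import bisect


def range_partition(rows, num_partitions, key):
    """Split rows into num_partitions buckets covering increasing key ranges.

    Hypothetical helper: boundaries are taken at evenly spaced positions in
    the sorted keys, which is a simplification of what Spark does.
    """
    keys = sorted(key(row) for row in rows)
    if not keys:
        return [[] for _ in range(num_partitions)]
    # num_partitions - 1 boundaries; bucket i holds keys in [bounds[i-1], bounds[i])
    bounds = [keys[(i * len(keys)) // num_partitions]
              for i in range(1, num_partitions)]
    parts = [[] for _ in range(num_partitions)]
    for row in rows:
        parts[bisect.bisect_right(bounds, key(row))].append(row)
    return parts


def sort_within_partitions(parts, key):
    """Sort each bucket independently; no data moves between buckets."""
    return [sorted(part, key=key) for part in parts]


# Illustrative (vertex, count) rows, not real Gaffer data.
rows = [("v7", 1), ("v2", 4), ("v9", 2), ("v2", 1), ("v5", 3), ("v1", 8)]
by_vertex = lambda row: row[0]

parts = sort_within_partitions(range_partition(rows, 3, by_vertex), by_vertex)

# Reading the buckets in order gives the same result as one global sort.
flattened = [row for part in parts for row in part]
```

Because every key in one bucket is no greater than every key in the next, each bucket could be written out as its own sorted file, which is the property a full sort of a group relies on.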

This requires #1902 to be merged first.

n3101 commented 2 years ago

We are not supporting Parquet in v2.0, so this will not be done.