SegmentPurger does not handle schema evolution gracefully

apache / pinot

Apache Pinot - A realtime distributed OLAP datastore

https://pinot.apache.org/

Apache License 2.0

SegmentPurger does not handle schema evolution gracefully #6334

Open mayankshriv opened 3 years ago

mayankshriv commented 3 years ago

We ran into an issue where SegmentPurger failed due to schema evolution as follows:

1. New columns were added to the schema.
2. The table config was updated to add an inverted index on some of the newly added columns.
3. No explicit backfill was performed.

When the SegmentPurger tried to purge older segments, it failed with the following error: java.lang.IllegalStateException: Cannot create inverted index for column: <xxx> because it is not in schema

This is likely because SegmentPurger used the schema in the segment as opposed to the schema in the controller. It would be desirable for SegmentPurger to gracefully handle this scenario.

mcvsubbu commented 3 years ago

If the schema was updated with the new columns, then the schema in the controller would have the new columns right? Perhaps you meant the other way around (i.e. "used the schema in controller as opposed to the schema in the segment") ?

Speaking of which, I think it will be super useful to retain the schema evolution in zookeeper (i.e. versioned schemas with some metadata on when an update was done). It can be used to make decisions such as those by segment purger. In this case, the purger could also have decided to backfill the columns with default values, for example.
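To make the "backfill the columns with default values" idea concrete, here is a rough sketch in plain Java. It does not use any Pinot classes: `DefaultValueBackfill`, the row-as-map representation, and the `latestSchemaDefaults` map are all hypothetical stand-ins for whatever the purger would actually read from the segment and from the latest schema.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper for illustration only; not part of Pinot.
public class DefaultValueBackfill {

  /**
   * Returns a copy of the row in which every column from the latest schema
   * that the row lacks is filled with that column's default value.
   * Values already present in the row are left untouched.
   */
  public static Map<String, Object> backfill(
      Map<String, Object> row, Map<String, Object> latestSchemaDefaults) {
    Map<String, Object> result = new LinkedHashMap<>(row);
    for (Map.Entry<String, Object> entry : latestSchemaDefaults.entrySet()) {
      // Only adds the default when the row has no value for this column.
      result.putIfAbsent(entry.getKey(), entry.getValue());
    }
    return result;
  }
}
```

With something like this applied to each record of an old segment, the rewritten segment would contain the newly added columns, so building an inverted index on them would no longer hit a missing column.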

mayankshriv commented 3 years ago

No, SegmentPurger uses the table config from the controller (to identify that it needs to build an inverted index for a column), but it uses the schema in the segment and does not find the newly added column (as neither a segment reload nor a backfill happened), hence the error. Hope this answers your question.
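One way to picture the mismatch, and one possible graceful behavior, is the sketch below. It is plain Java rather than actual Pinot code: `IndexColumnReconciler` and its inputs are hypothetical, with the list standing in for the inverted index columns from the table config and the set standing in for the column names in the segment's own schema.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

// Hypothetical helper for illustration only; not part of Pinot.
public class IndexColumnReconciler {

  /** Configured index columns, split by whether the segment's schema has them. */
  public static final class Result {
    public final List<String> indexable;
    public final List<String> skipped;

    Result(List<String> indexable, List<String> skipped) {
      this.indexable = indexable;
      this.skipped = skipped;
    }
  }

  /**
   * Splits the inverted index columns from the table config into those the
   * segment's schema contains and those it does not, instead of failing on
   * the first column that is missing.
   */
  public static Result reconcile(
      List<String> invertedIndexColumns, Set<String> segmentSchemaColumns) {
    List<String> indexable = new ArrayList<>();
    List<String> skipped = new ArrayList<>();
    for (String column : invertedIndexColumns) {
      if (segmentSchemaColumns.contains(column)) {
        indexable.add(column);
      } else {
        skipped.add(column);
      }
    }
    return new Result(indexable, skipped);
  }
}
```

Under this assumption the purger would build indexes only for the `indexable` columns and log the `skipped` ones, leaving them to be picked up by a later reload or backfill.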