apache / pinot

Apache Pinot - A realtime distributed OLAP datastore
https://pinot.apache.org/
Apache License 2.0

Segment reload in upsert table took too much time and caused disruption #7569

Open chenboat opened 2 years ago

chenboat commented 2 years ago

Reloading a segment in a Pinot upsert table takes around 20 seconds (see the timeline below). On a server with hundreds of segments, reloading all segments could take hours and cause disruption.

As a comparison, if we turn off the upsert feature for the same realtime table, a reload of a segment of the same size takes just 0.2 seconds, roughly 100 times faster.

@Jackie-Jiang @yupeng9 @icefury71

chenboat commented 2 years ago

As can be seen from the event sequence below, the call logged as `Removing upsert metadata for segment: ...` took about 20s to finish and is the main culprit of the slow reload.

During a segment reload, an existing segment A is replicated to another copy A' to continue serving; when the segment finishes reloading, A' is destroyed. The upsert table's metadata manager then needs to delete all references to the old segment from the upsert metadata map _primaryKeyToRecordLocationMap. This step currently takes too long to finish and should be optimized. One interesting note: during a segment reload none of the record locations actually change logically, so they could be left untouched.

| Event | Timestamp |
| --- | --- |
| Closed segment: rta_ads_metrics372__20210611T1841Z of table: rta_ads_me... | 2021-10-12T22:01:04.719+00:00 |
| Reloaded segment: rta_ads_metrics372__20210611T1841Z in table: rtaads... | 2021-10-12T22:01:04.719+00:00 |
| Removing upsert metadata for segment: rta_ads_metrics372__20210611T1841... | 2021-10-12T22:00:42.629+00:00 |
| Closing segment: rta_ads_metrics372__20210611T1841Z of table: rta_ads_m... | 2021-10-12T22:00:42.628+00:00 |
| Adding immutable segment: rta_ads_metrics372__20210611T1841Z to table: ... | 2021-10-12T22:00:42.628+00:00 |
| Replaced immutable segment: rta_ads_metrics372__20210611T1841Z of table... | 2021-10-12T22:00:42.628+00:00 |
| Trying to destroy segment : rta_ads_metrics372__20210611T1841Z | 2021-10-12T22:00:42.628+00:00 |
| Adding upsert metadata for segment: rta_ads_metrics372__20210611T1841Z | 2021-10-12T22:00:41.431+00:00 |
| Successfully loaded segment rta_ads_metrics372__20210611T1841Z with con... | 2021-10-12T22:00:41.431+00:00 |
| Found inverted index for segment: rta_ads_metrics372__20210611T1841Z, c... | 2021-10-12T22:00:41.426+00:00 |
| Found inverted index for segment: rta_ads_metrics372__20210611T1841Z, c... | 2021-10-12T22:00:41.426+00:00 |
| Found inverted index for segment: rta_ads_metrics372__20210611T1841Z, c... | 2021-10-12T22:00:41.426+00:00 |
| Reloading segment: rta_ads_metrics372__20210611T1841Z in table: rta_ads... | 2021-10-12T22:00:41.378+00:00 |
Jackie-Jiang commented 2 years ago

@chenboat Great findings! Can you please also check the size of _primaryKeyToRecordLocationMap? The current algorithm loops over all the entries in the map and could be slow when the map is huge.

yupeng9 commented 2 years ago

Those are very good insights, thanks @chenboat. The map holds about 1.6 billion entries across 8 partitions, so each partition's metadata manager has about 200 million entries per map. I wonder if we can make the deletion asynchronous so that it does not block the restart? @Jackie-Jiang

Jackie-Jiang commented 2 years ago

@yupeng9 I feel we have to delete all the entries before the segment is loaded again to ensure correctness. One thing we can optimize: when the map is larger than the segment's validDocIds, remove the map entries by iterating over the validDocIds instead of the map.

yupeng9 commented 2 years ago

Could you elaborate on this? Also, 200 million entries taking 20s seems a bit long to me. @richardstartin any insight?

Jackie-Jiang commented 2 years ago

From the validDocIds of a segment we know which records are valid within the map, so we can remove just those primary keys instead of scanning over every entry in the map.

yupeng9 commented 2 years ago

I see. Is there an efficient implementation of this, i.e. overlaying the bitmap on the map?
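A minimal, self-contained sketch of what that overlay could look like (not Pinot's actual API; `RecordLocation` is simplified to just the owning segment, and `docIdToPrimaryKey` stands in for however the primary key would be read back from the segment at a given docId):

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.IntFunction;

import org.roaringbitmap.RoaringBitmap;

// Sketch of the bitmap overlay: walk only the docIds that are still valid in
// the old segment and remove just those primary keys, instead of scanning
// every entry in the map.
class ValidDocIdsRemovalSketch {
  // RecordLocation is simplified to just the owning segment reference.
  record RecordLocation(Object segment) {}

  private final ConcurrentHashMap<Object, RecordLocation> _primaryKeyToRecordLocationMap =
      new ConcurrentHashMap<>();

  void removeSegment(Object segment, RoaringBitmap validDocIds,
      IntFunction<Object> docIdToPrimaryKey) {
    validDocIds.forEach((int docId) -> {
      // docIdToPrimaryKey stands in for reading the PK column at this docId.
      Object primaryKey = docIdToPrimaryKey.apply(docId);
      // Remove only if the entry still points at the old segment, so a key
      // that was just updated to the new copy is left untouched.
      _primaryKeyToRecordLocationMap.computeIfPresent(primaryKey,
          (pk, loc) -> loc.segment() == segment ? null : loc);
    });
  }
}
```

Since a single segment's validDocIds is typically far smaller than the ~200 million entries per partition, this touches only the keys the segment actually owns.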

richardstartin commented 2 years ago

Can we profile it? E.g. install async-profiler and set -agentpath:/path/to/libasyncProfiler.so=start,event=cpu,file=cpu.html, then run lots of segment reloads. This should cut out some guesswork.

yupeng9 commented 2 years ago

A high-level question: why do we need to remove the segment during reload? I feel we could simply load segments one by one without removing any segment, which would skip the expensive step shown in the log. @Jackie-Jiang @chenboat

chenboat commented 2 years ago

@yupeng9 The segment removal happens in the following step when adding a segment:

https://github.com/apache/pinot/blob/c9e36dd0a53370d7b8dc24459426dc77497eb8ac/pinot-core/src/main/java/org/apache/pinot/core/data/manager/BaseTableDataManager.java#L157

During a segment reload, a replica of the segment is in fact added via the step above.
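For illustration, a paraphrased, self-contained sketch of that reload-as-replace flow (the types and names here are stand-ins, not real Pinot classes):

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Condensed model of the flow described above.
class ReloadFlowSketch {
  interface Segment {
    void destroy(); // for an upsert segment this purges its upsert metadata
  }

  private final Map<String, Segment> _segmentsByName = new ConcurrentHashMap<>();

  // During reload, the freshly loaded replica is added through this path.
  void addSegment(String name, Segment newCopy) {
    // Registering the new copy replaces any existing copy of the segment
    // (the step at the line linked above)...
    Segment oldCopy = _segmentsByName.put(name, newCopy);
    if (oldCopy != null) {
      // ...and destroying the old copy triggers "Removing upsert metadata
      // for segment", the ~20s step seen in the timeline.
      oldCopy.destroy();
    }
  }
}
```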

yupeng9 commented 2 years ago

Okay, I think the issue could be the iteration over the entire map: https://github.com/apache/pinot/blob/7e9ca6a5a4afe0d4e283ac1307c45430e474cbf2/pinot-segment-local/src/main/java/org/apache/pinot/segment/local/upsert/PartitionUpsertMetadataManager.java#L243

```java
_primaryKeyToRecordLocationMap.forEach((primaryKey, recordLocation) -> {
  if (recordLocation.getSegment() == segment) {
    // Check and remove to prevent removing the key that is just updated
    _primaryKeyToRecordLocationMap.remove(primaryKey, recordLocation);
  }
});
```

Perhaps we could build an index of segment -> PKs, or change _primaryKeyToRecordLocationMap to a sorted map?
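A minimal sketch (again not Pinot code) of the segment -> PKs side index idea: maintain a reverse map alongside the main one, so that removing a segment becomes one lookup plus per-key deletes instead of a full scan:

```java
import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Reverse index from segment to its primary keys, maintained next to the
// main map so that removing a segment no longer scans all entries.
class SegmentToPrimaryKeysIndexSketch<K, S> {
  private final Map<K, S> _primaryKeyToSegment = new ConcurrentHashMap<>();
  private final Map<S, Set<K>> _segmentToPrimaryKeys = new ConcurrentHashMap<>();

  void upsert(K primaryKey, S segment) {
    S previous = _primaryKeyToSegment.put(primaryKey, segment);
    if (previous != null && previous != segment) {
      // The key moved to a new segment; drop it from the old reverse entry.
      Set<K> keys = _segmentToPrimaryKeys.get(previous);
      if (keys != null) {
        keys.remove(primaryKey);
      }
    }
    _segmentToPrimaryKeys.computeIfAbsent(segment, s -> ConcurrentHashMap.newKeySet())
        .add(primaryKey);
  }

  void removeSegment(S segment) {
    Set<K> keys = _segmentToPrimaryKeys.remove(segment);
    if (keys == null) {
      return;
    }
    for (K primaryKey : keys) {
      // Remove only entries that still point at this segment.
      _primaryKeyToSegment.remove(primaryKey, segment);
    }
  }
}
```

The tradeoff is memory: at ~200 million keys per partition the reverse index is substantial, which is a point in favor of the validDocIds overlay sketched earlier.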

chenboat commented 2 years ago

Here is my follow-up investigation: (1) There is no unit test on the following method for a normal Pinot table, let alone an upsert table. We should add unit tests to verify the behavior of both types of tables:

https://github.com/apache/pinot/blob/c9e36dd0a53370d7b8dc24459426dc77497eb8ac/pinot-server/src/main/java/org/apache/pinot/server/starter/helix/HelixInstanceDataManager.java#L195

(2) Similarly, the addSegment() method in RealtimeTableDataManager is not unit-tested for the upsert table segment reload case: https://github.com/apache/pinot/blob/a5f3dc507e6441baca35dae5bbfad122356683b6/pinot-core/src/main/java/org/apache/pinot/core/data/manager/realtime/RealtimeTableDataManager.java#L358-L363

For an upsert table segment reload, today's implementation first adds the segment upsert metadata (lines 359-360) and then removes the upsert metadata (line 362) for the same segment copy. This sounds like a redundant sequence of operations, and I cannot tell whether the final metadata hashmap state for an upsert table is correct. A unit test should verify that and could lead to possible optimizations here.
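A sketch of the kind of test this calls for, written against a simplified in-memory model of the upsert metadata map rather than real Pinot classes; the invariant is that the add-then-remove reload sequence must not drop keys that now point at the new copy:

```java
import static org.junit.Assert.assertEquals;

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

import org.junit.Test;

// Hypothetical test outline for the add-then-remove reload sequence.
public class UpsertReloadMetadataSketchTest {

  static final class Model {
    final Map<String, Object> pkToSegment = new ConcurrentHashMap<>();

    void addSegment(Object segment, String... pks) {
      for (String pk : pks) {
        pkToSegment.put(pk, segment); // add upsert metadata for this copy
      }
    }

    void removeSegment(Object segment) {
      pkToSegment.values().removeIf(s -> s == segment); // drop stale locations
    }
  }

  @Test
  public void reloadKeepsEveryPrimaryKeyPointingAtTheLiveCopy() {
    Model model = new Model();
    Object oldCopy = new Object();
    Object newCopy = new Object();
    model.addSegment(oldCopy, "pk1", "pk2", "pk3");

    // Reload: metadata for the new copy is added first, then the old copy's
    // metadata is removed (the sequence at lines 359-362 discussed above).
    model.addSegment(newCopy, "pk1", "pk2", "pk3");
    model.removeSegment(oldCopy);

    // The remove step must not drop keys that now point at the new copy.
    assertEquals(3, model.pkToSegment.size());
  }
}
```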

hpvd commented 1 month ago

Is this issue solved by the merged upsert enhancements or by any of the other upsert changes from the last 2 years? Or are there still open items to be addressed?