apple / foundationdb

FoundationDB - the open source, distributed, transactional key-value store
https://apple.github.io/foundationdb/
Apache License 2.0
14.41k stars 1.31k forks source link

Bulk Loading Framework #11369

Closed kakaiu closed 2 months ago

kakaiu commented 5 months ago

We want to significantly improve the bulk loading speed by directly injecting data to storage servers instead of going through transaction system. Therefore, the new bulk load mechanism is expected to fully leverage the parallelism of data distribution system. The new bulk loading feature is heavily implemented as a part of data distribution.

Related issue: https://github.com/apple/foundationdb/issues/1002

When defining bulkLoad behavior, our goal is to avoid users making mistakes when using the tool. A bulk load task involves a specified range and a set of files to be loaded into that range. Users must specify the range to load the data from the input files. Any data outside this range will be ignored. Upon completion of the bulk load task, the injected data will become visible to users. Any pre-existing data within the specified range will be discarded. To prevent user mistakes, we halt all traffic within the range during the bulk loading process (this functionality will be included in a separate PR).

User interface design

We aim to develop a user interface that facilitates the reliable submission of bulk load tasks and simplifies the monitoring of their progress. A key focus of the design is to ensure users are fully aware of their actions. To achieve this, we have embedded several design features. For instance, bulk load tasks remain persisted until users acknowledge their completion.

There are four key operations:

  1. Set bulkLoad mode;
  2. Submit a bulkLoad task;
  3. Get bulkLoad task status;
  4. Acknowledge the completion of a bulkLoad task.

In general, bulk load system provides both FDBCLI and transactional way to conduct the above all operations (This PR only includes FDBCLI based user interface for testing purposes. Transaction based interface will be included in a separate PR).

Setting bulk load mode Both FDBCLI and transactional methods submit the task by setting \xff/bulkLoadMode key. When the mode is turned on, DD restarts and keeps monitoring bulk load tasks submitted to \xff/bulkLoad/ and DD triggers bulk load tasks accordingly. When the mode is turned off, DD restarts and cancels all data moves issued for the bulk load task.

The FDBCLI command to set the bulkLoad mode: bulkload mode <on|off>

Submit a task Both FDBCLI and transactional methods submit the task by setting \xff/bulkLoad/. This key space will be read by DD which triggers task accordingly. It is noted that the input range must be within the user key space (aka. "" ~ \xff).

FDBCLI example to trigger a bulkLoad task: bulkload local 1 2 "/root/sim-load/bulkLoad/1" "/root/sim-load/bulkLoad/1/95af52bf2eddebe59a954db83895fa74-data.sst" "/root/sim-load/bulkLoad/1/51da3c579419558de34f00126efa25a0-bytesample.sst"

Get bulk load task status After a bulk load task is triggered, users can use fdbcli or transactional API to get the status of the task. By providing a range, the client will get all bulk load tasks intersecting the range.

FDBCLI example to get a bulkLoad task status to get all bulk loading of the entire user space: bulkload status "" \xff

Output of the bulk load status:

[Complete]: BulkLoadState: [Range]: 1 - 2, [Type]: 1, [TransportMethod]: 1, [InjectMethod]: 1, [Phase]: 4, [Folder]: /root/bulkLoadLargeData/1, [DataFiles]: /root/bulkLoadLargeData/1/1_2_8490786aed2af0dddb7d3f07f389a6ad-data.sst, [SubmitTime]: 1721444374.400875, [TriggerTime]: 1721444374.748643, [StartTime]: 1721444374.754608, [CompleteTime]: 1721444395.008611, [RestartCount]: 0, [ByteSampleFile]: /root/bulkLoadLargeData/1/1_2_520c5b91be58e3290d06ad657c4977fd-bytesample.sst, [DataMoveId]: f5d88e9b6314879dd9b1c55013000704, [TaskId]: 00792a12b959887df3452088ccd0f8fd
[Running]: BulkLoadState: [Range]: 2 - 3, [Type]: 1, [TransportMethod]: 1, [InjectMethod]: 1, [Phase]: 3, [Folder]: /root/bulkLoadLargeData/2, [DataFiles]: /root/bulkLoadLargeData/2/2_3_410cdcd09dd7a21f09a20f2b9f54c5e7-data.sst, [SubmitTime]: 1721444377.225051, [TriggerTime]: 1721444397.024549, [StartTime]: 1721444397.029248, [CompleteTime]: 0.000000, [RestartCount]: 0, [ByteSampleFile]: /root/bulkLoadLargeData/2/2_3_af27630f707cb049196a487f76abcbc5-bytesample.sst, [DataMoveId]: f5ac49a76f5a4c70744d7bf28f1f0704, [TaskId]: 8c406a1648f9147ee2e31a7a983b18b4
[Running]: BulkLoadState: [Range]: 3 - 4, [Type]: 1, [TransportMethod]: 1, [InjectMethod]: 1, [Phase]: 3, [Folder]: /root/bulkLoadLargeData/3, [DataFiles]: /root/bulkLoadLargeData/3/3_4_71139ad46eecbe1e078bcfde7feda48d-data.sst, [SubmitTime]: 1721444379.184671, [TriggerTime]: 1721444397.012706, [StartTime]: 1721444397.017751, [CompleteTime]: 0.000000, [RestartCount]: 0, [ByteSampleFile]: /root/bulkLoadLargeData/3/3_4_8231c3dc13b1481866977b471be4a260-bytesample.sst, [DataMoveId]: 6c71cd55450ee428d22257ac7c6d0704, [TaskId]: fb98642d309300711c204d0723708783
Finished task count 1 of total 3 tasks

The output shows that there are three bulk loading tasks overlapping the input range. In the status, it includes settings of tasks, such as file names and loading range. Moreover, the task status reports the progress. A bulkLoad task has 4 major stages: (1) submitted; (2) triggered; (3) start running; (4) running complete. The status records the time for each stage when the stage is end. If a stage is not end, the corresponding time is unset. User can use those time to track the progress of a task. The status also includes information for debugging purpose, such as the ID of the data move that is working on the bulk loading task. If a bulk load task is stuck, users can use this data move ID to check the progress of the data move.

Acknowledge the completion of a bulk load task After a bulk load task is completed, users can use fdbcli or transactional API to mark the task as acknowledged on metadata by providing a task ID and a range. DD erases acknowledged tasks' metadata in background.

FDBCLI example to acknowledge completion of a task: bulkload acknowledge 7599f0eadaf7df66222cc540b98bf224 1 2 This command will get the task status of range [1, 2) at first and if the task ID matches, the task metadata will be erased.

Backend design

We aim to develop a bulk loading backend compatible with various storage engines. Therefore, the backend design must be flexible, straightforward, and aligned with the current data distribution architecture. When a user submits a task, it is stored in the bulkLoad system key space as previously described. The Data Distributor (DD) periodically checks for new tasks. Upon detecting a new task, the DD ensures its completion by initiating data moves. These data moves signal storage servers by setting the ServerKey in the system key space. When a storage server (SS) receives a bulk load task, it downloads or copies the necessary files to a local directory and injects the data into the key-value store. Once the data move is completed, the injected data becomes accessible, and the bulk load task completes.

A bulk load task is defined by:

Running a bulk load task has following constraints:

Life cycle of a bulk load task --- 5 stages:

  1. Submitted: A bulk load task is created by users and the task is initialized with BulkLoadPhase::Invalid and a random taskID and all necessary configuration. Then this task is persisted to \xff/bulkLoad/ key space;
  2. Triggered: When DD bulk load mode is enabled, DD periodically takes tasks from \xff/bulkLoad/ key space to handle. When DD starts handling this task, DD persisted the task phase as BulkLoadPhase::Triggered;
  3. Running Started: DD triggers a data move to work on the task. When the data move metadata is persisted, SS starts loading the data for the task. At this point, DD persists the phase as BulkLoadPhase::Running (when updating serverKey and keyServer) of the data move;
  4. Running Complete: When all SSes complete the bulk loading, DD persisted the phase as BulkLoadPhase::Complete (when updating serverKey and keyServer) of the data move. At this point, the data move completes, and the corresponding task completes, and the new data is visible to the client.
  5. Acknowledged: When users observe the completion of a task, users acknowledge it and the task metadata will be cleared by DD in an async way.

Bulk load tasks are persisted on \xff/bulkLoad/ in a form of key range map, therefore any range has at most one bulk loading task at a time. Each task has a task id to distinguish different tasks with the same range.

Key invariants of updating bulk loading task map:

Key bulk loading mechanisms and invariants:

  1. When a bulk load task on a range is observed, DD (DDTracker) splits the shard to get a shard strictly same as the task range and trigger a data move for this shard with healthy priority.
  2. DD marks the range as the bulk loading range. Any shard boundary change is disabled on the loading range. When the bulk load task completes, the boundary change on the range is re-enabled.
  3. After DD triggers a data move for a bulk load task, DD listens on the complete signal channel of the task. DD will get notified once the task completes.
  4. When DD selects a destination team for the bulk load data move, DD randomly chooses an healthy team that has no overlapping data with the source team. This mechanism guarantees that at any time, client reads from consistent data. Random team selection provides a certain level of load balancing.
  5. Bulk loading signal delivers at a destination storage server by a normal path of data move --- via changing serverKeys space. When SS receives a bulk loading signal, it reads the bulk load task from the data move metadata. Then, the SS injects the external files as specified by the bulk load task. Note that if a SS shard is set to do bulk loading, this SS shard must not be a readWrite shard because of the point 4.
  6. When DD restarts, bulk loading data moves are cancelled. This guarantees that any bulk load task cannot be alive across different DDs.

Simulation tests

Currently, in a certain frequent case, the MachineAttrition workload kills too many storage servers such that the simulation is stuck when finding remote destination teams when the number of data center is 6. To avoid this issue, we set generateFearless=false such that the maximum number of data centers is 4. In rare cases, the simulation cannot complete because CC is failed to detect halted DD, therefore no DD is running then. This is caused by the network partition between CC and DD. To avoid this issue, the simulation disables the network partition injection.

100k correctness test with 1 external timeout for sharded rocksdb engine: 20240723-173043-zhewang-ef27ca81eb64aea1 compressed=True data_size=37277237 duration=5947713 ended=100000 fail=1 fail_fast=10 max_runs=100000 pass=99999 priority=100 remaining=0 runtime=1:37:50 sanity=False started=100000 stopped=20240723-190833 submitted=20240723-173043 timeout=5400 username=zhewang

100K bulkLoading tests: 20240723-173130-zhewang-494457bfa21d04cb compressed=True data_size=37310972 duration=24308992 ended=100000 fail_fast=10 max_runs=100000 pass=100000 priority=100 remaining=0 runtime=2:24:45 sanity=False started=100000 stopped=20240723-195615 submitted=20240723-173130 timeout=5400 username=zhewang

Feature dependency

For the simplicity and performance, this PR relies on the existing infrastructure of physical shard moves and ShardedRocksDB. So, temporarily, the bulk loading feature is only available with ShardedRocksDB engine and with following features enabled:

However, this PR sets up a general framework of bulk loading for any storage engine. So, in the future, we will support RocksDB and SQLite and range-based data moves with setting shard_encode_location_metadata = true only.

Future works

TODO: (1) Disable user traffic when bulkLoad; (2) Add file checksum; (3) Check whether all data are within the input range when injecting the data to storage server; (4) Bulk load cancellation (Clear bulk load task metadata and trigger a new data move on the same range which will be a normal data move without bulk load task); (5) Allow bulk loading on a readWrite SS shard (optional).

Code-Reviewer Section

The general pull request guidelines can be found here.

Please check each of the following things and check all boxes before accepting a PR.

For Release-Branches

If this PR is made against a release-branch, please also check the following:

foundationdb-ci commented 5 months ago

Result of foundationdb-pr on Linux CentOS 7

foundationdb-ci commented 5 months ago

Result of foundationdb-pr-clang on Linux CentOS 7

foundationdb-ci commented 5 months ago

Result of foundationdb-pr-clang-ide on Linux CentOS 7

foundationdb-ci commented 5 months ago

Result of foundationdb-pr-cluster-tests on Linux CentOS 7

foundationdb-ci commented 5 months ago

Result of foundationdb-pr-macos on macOS Ventura 13.x

foundationdb-ci commented 5 months ago

Result of foundationdb-pr-macos-m1 on macOS Ventura 13.x

foundationdb-ci commented 5 months ago

Result of foundationdb-pr-clang-ide on Linux CentOS 7

foundationdb-ci commented 5 months ago

Result of foundationdb-pr-clang on Linux CentOS 7

foundationdb-ci commented 5 months ago

Result of foundationdb-pr on Linux CentOS 7

foundationdb-ci commented 5 months ago

Result of foundationdb-pr-cluster-tests on Linux CentOS 7

foundationdb-ci commented 5 months ago

Result of foundationdb-pr-clang-ide on Linux CentOS 7

foundationdb-ci commented 5 months ago

Result of foundationdb-pr-macos-m1 on macOS Ventura 13.x

foundationdb-ci commented 5 months ago

Result of foundationdb-pr on Linux CentOS 7

foundationdb-ci commented 5 months ago

Result of foundationdb-pr-macos on macOS Ventura 13.x

foundationdb-ci commented 5 months ago

Result of foundationdb-pr-cluster-tests on Linux CentOS 7

foundationdb-ci commented 5 months ago

Result of foundationdb-pr-clang on Linux CentOS 7

foundationdb-ci commented 4 months ago

Result of foundationdb-pr-clang-ide on Linux CentOS 7

foundationdb-ci commented 4 months ago

Result of foundationdb-pr on Linux CentOS 7

foundationdb-ci commented 4 months ago

Result of foundationdb-pr-clang on Linux CentOS 7

foundationdb-ci commented 4 months ago

Result of foundationdb-pr-cluster-tests on Linux CentOS 7

foundationdb-ci commented 4 months ago

Result of foundationdb-pr-clang-ide on Linux CentOS 7

foundationdb-ci commented 4 months ago

Result of foundationdb-pr-clang on Linux CentOS 7

foundationdb-ci commented 4 months ago

Result of foundationdb-pr on Linux CentOS 7

foundationdb-ci commented 4 months ago

Result of foundationdb-pr-cluster-tests on Linux CentOS 7

foundationdb-ci commented 4 months ago

Result of foundationdb-pr-clang-ide on Linux CentOS 7

foundationdb-ci commented 4 months ago

Result of foundationdb-pr-macos-m1 on macOS Ventura 13.x

foundationdb-ci commented 4 months ago

Result of foundationdb-pr-clang-ide on Linux CentOS 7

foundationdb-ci commented 4 months ago

Result of foundationdb-pr-clang on Linux CentOS 7

foundationdb-ci commented 4 months ago

Result of foundationdb-pr-macos on macOS Ventura 13.x

foundationdb-ci commented 4 months ago

Result of foundationdb-pr on Linux CentOS 7

foundationdb-ci commented 4 months ago

Result of foundationdb-pr-cluster-tests on Linux CentOS 7

foundationdb-ci commented 4 months ago

Result of foundationdb-pr on Linux CentOS 7

foundationdb-ci commented 4 months ago

Result of foundationdb-pr-clang on Linux CentOS 7

foundationdb-ci commented 4 months ago

Result of foundationdb-pr-cluster-tests on Linux CentOS 7

foundationdb-ci commented 4 months ago

Result of foundationdb-pr-clang-ide on Linux CentOS 7

foundationdb-ci commented 4 months ago

Result of foundationdb-pr-clang on Linux CentOS 7

foundationdb-ci commented 4 months ago

Result of foundationdb-pr-macos-m1 on macOS Ventura 13.x

foundationdb-ci commented 4 months ago

Result of foundationdb-pr on Linux CentOS 7

foundationdb-ci commented 4 months ago

Result of foundationdb-pr-cluster-tests on Linux CentOS 7

foundationdb-ci commented 4 months ago

Result of foundationdb-pr-macos on macOS Ventura 13.x

foundationdb-ci commented 4 months ago

Result of foundationdb-pr-clang-ide on Linux CentOS 7

foundationdb-ci commented 4 months ago

Result of foundationdb-pr-macos-m1 on macOS Ventura 13.x

foundationdb-ci commented 4 months ago

Result of foundationdb-pr-clang on Linux CentOS 7

foundationdb-ci commented 4 months ago

Result of foundationdb-pr on Linux CentOS 7

foundationdb-ci commented 4 months ago

Result of foundationdb-pr-macos on macOS Ventura 13.x

foundationdb-ci commented 4 months ago

Result of foundationdb-pr-cluster-tests on Linux CentOS 7

foundationdb-ci commented 4 months ago

Result of foundationdb-pr-clang-ide on Linux CentOS 7

foundationdb-ci commented 4 months ago

Result of foundationdb-pr-macos-m1 on macOS Ventura 13.x

foundationdb-ci commented 4 months ago

Result of foundationdb-pr-clang on Linux CentOS 7

foundationdb-ci commented 4 months ago

Result of foundationdb-pr on Linux CentOS 7