kakaiu commented 5 months ago

We want to significantly improve bulk loading speed by injecting data directly into storage servers instead of going through the transaction system. The new bulk load mechanism is therefore expected to fully leverage the parallelism of the data distribution system, and the feature is implemented largely as part of data distribution.

When defining bulkLoad behavior, our goal is to prevent users from making mistakes when using the tool. A bulk load task consists of a specified range and a set of files to be loaded into that range. Users must specify the range into which the data from the input files is loaded; any data outside this range is ignored. Upon completion of the bulk load task, the injected data becomes visible to users, and any pre-existing data within the specified range is discarded. To prevent user mistakes, we halt all traffic within the range during the bulk loading process (this functionality will be included in a separate PR).
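The range semantics described above can be modeled with a small in-memory sketch. This is not FoundationDB code: the dict-based store and the function name `apply_bulk_load` are assumptions made purely for illustration of the intended behavior.

```python
def apply_bulk_load(store, begin, end, file_kvs):
    """Model the intended semantics on a plain dict (hypothetical, not FDB code).

    store     -- dict of existing key -> value pairs (bytes keys)
    begin/end -- the task range [begin, end)
    file_kvs  -- (key, value) pairs read from the input files
    """
    # Pre-existing data inside the range is discarded.
    result = {k: v for k, v in store.items() if not (begin <= k < end)}
    # Only file data that falls inside the range is injected.
    for k, v in file_kvs:
        if begin <= k < end:
            result[k] = v
    return result


existing = {b"0a": b"keep", b"1a": b"old", b"2a": b"keep"}
loaded = apply_bulk_load(
    existing, b"1", b"2", [(b"1b", b"new"), (b"3a", b"outside the range")]
)
print(sorted(loaded))
```

Keys outside `[b"1", b"2")` are untouched, the old key inside the range disappears, and the file entry outside the range is dropped.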

User interface design

We aim to develop a user interface that facilitates the reliable submission of bulk load tasks and simplifies the monitoring of their progress. A key focus of the design is to ensure users are fully aware of their actions. To achieve this, we have embedded several design features. For instance, bulk load tasks remain persisted until users acknowledge their completion.

There are four key operations:

Set bulkLoad mode;
Submit a bulkLoad task;
Get bulkLoad task status;
Acknowledge the completion of a bulkLoad task.

In general, the bulk load system provides both an FDBCLI and a transactional way to perform all of the above operations. (This PR includes only the FDBCLI-based user interface, for testing purposes; the transaction-based interface will be included in a separate PR.)

Setting bulk load mode

Both the FDBCLI and transactional methods set the mode by writing the \xff/bulkLoadMode key. When the mode is turned on, DD restarts, then monitors the bulk load tasks submitted to \xff/bulkLoad/ and triggers them accordingly. When the mode is turned off, DD restarts and cancels all data moves issued for bulk load tasks.

The FDBCLI command to set the bulkLoad mode: bulkload mode <on|off>

Submit a task

Both the FDBCLI and transactional methods submit a task by writing to \xff/bulkLoad/. DD reads this key space and triggers tasks accordingly. Note that the input range must lie within the user key space (i.e., "" ~ \xff).

FDBCLI example to trigger a bulkLoad task: bulkload local 1 2 "/root/sim-load/bulkLoad/1" "/root/sim-load/bulkLoad/1/95af52bf2eddebe59a954db83895fa74-data.sst" "/root/sim-load/bulkLoad/1/51da3c579419558de34f00126efa25a0-bytesample.sst"

bulkload is the command name;
local indicates that the files are loaded from a local file system accessible to every storage server in the (simulated) cluster (for testing purposes, such as simulation and loopback cluster testing);
1 2 indicates that the files are loaded into KeyRange [1, 2);
/root/sim-load/bulkLoad/1 is the folder dedicated to the files to load;
/root/sim-load/bulkLoad/1/95af52bf2eddebe59a954db83895fa74-data.sst is the data file to load;
/root/sim-load/bulkLoad/1/51da3c579419558de34f00126efa25a0-bytesample.sst is the bytes sample file to load. The bytes sample is used by the FDB data distribution system for load balancing.
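As a convenience for scripting, the argument order explained above can be captured in a small helper that assembles the command string. This is only a sketch: the helper name `build_bulkload_command` is hypothetical, it assumes plain printable keys that need no escaping, and it covers only the `local` transport shown in the example.

```python
def build_bulkload_command(begin, end, folder, data_file, bytes_sample_file):
    """Assemble the fdbcli `bulkload local ...` line shown in the example above."""
    if not begin < end:
        raise ValueError("range must be non-empty: begin < end")
    if end > "\xff":
        # The task range has to stay inside the user key space ("" ~ \xff).
        raise ValueError("range must stay within the user key space")
    paths = " ".join(f'"{p}"' for p in (folder, data_file, bytes_sample_file))
    return f"bulkload local {begin} {end} {paths}"


folder = "/root/sim-load/bulkLoad/1"
cmd = build_bulkload_command(
    "1",
    "2",
    folder,
    folder + "/95af52bf2eddebe59a954db83895fa74-data.sst",
    folder + "/51da3c579419558de34f00126efa25a0-bytesample.sst",
)
print(cmd)
```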

Get bulk load task status

After a bulk load task is triggered, users can use fdbcli or the transactional API to get the status of the task. Given a range, the client gets all bulk load tasks intersecting that range.

FDBCLI example to get the status of all bulk load tasks in the entire user key space: bulkload status "" \xff

Output of the bulk load status:

[Complete]: BulkLoadState: [Range]: 1 - 2, [Type]: 1, [TransportMethod]: 1, [InjectMethod]: 1, [Phase]: 4, [Folder]: /root/bulkLoadLargeData/1, [DataFiles]: /root/bulkLoadLargeData/1/1_2_8490786aed2af0dddb7d3f07f389a6ad-data.sst, [SubmitTime]: 1721444374.400875, [TriggerTime]: 1721444374.748643, [StartTime]: 1721444374.754608, [CompleteTime]: 1721444395.008611, [RestartCount]: 0, [ByteSampleFile]: /root/bulkLoadLargeData/1/1_2_520c5b91be58e3290d06ad657c4977fd-bytesample.sst, [DataMoveId]: f5d88e9b6314879dd9b1c55013000704, [TaskId]: 00792a12b959887df3452088ccd0f8fd
[Running]: BulkLoadState: [Range]: 2 - 3, [Type]: 1, [TransportMethod]: 1, [InjectMethod]: 1, [Phase]: 3, [Folder]: /root/bulkLoadLargeData/2, [DataFiles]: /root/bulkLoadLargeData/2/2_3_410cdcd09dd7a21f09a20f2b9f54c5e7-data.sst, [SubmitTime]: 1721444377.225051, [TriggerTime]: 1721444397.024549, [StartTime]: 1721444397.029248, [CompleteTime]: 0.000000, [RestartCount]: 0, [ByteSampleFile]: /root/bulkLoadLargeData/2/2_3_af27630f707cb049196a487f76abcbc5-bytesample.sst, [DataMoveId]: f5ac49a76f5a4c70744d7bf28f1f0704, [TaskId]: 8c406a1648f9147ee2e31a7a983b18b4
[Running]: BulkLoadState: [Range]: 3 - 4, [Type]: 1, [TransportMethod]: 1, [InjectMethod]: 1, [Phase]: 3, [Folder]: /root/bulkLoadLargeData/3, [DataFiles]: /root/bulkLoadLargeData/3/3_4_71139ad46eecbe1e078bcfde7feda48d-data.sst, [SubmitTime]: 1721444379.184671, [TriggerTime]: 1721444397.012706, [StartTime]: 1721444397.017751, [CompleteTime]: 0.000000, [RestartCount]: 0, [ByteSampleFile]: /root/bulkLoadLargeData/3/3_4_8231c3dc13b1481866977b471be4a260-bytesample.sst, [DataMoveId]: 6c71cd55450ee428d22257ac7c6d0704, [TaskId]: fb98642d309300711c204d0723708783
Finished task count 1 of total 3 tasks

The output shows three bulk load tasks overlapping the input range. The status includes the settings of each task, such as file names and the loading range, and reports its progress. A bulkLoad task has 4 major stages: (1) submitted; (2) triggered; (3) running started; (4) running complete. The status records the time at which each stage ends; if a stage has not ended, the corresponding time is unset (shown as 0.000000 in the output above). Users can use these times to track the progress of a task. The status also includes information for debugging purposes, such as the ID of the data move working on the bulk load task. If a bulk load task is stuck, users can use this data move ID to check the progress of the data move.
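For tracking progress from a script, the per-stage times can be pulled out of a status line and turned into durations. The sketch below is an assumption-laden illustration: the function names are hypothetical, the line format is inferred only from the sample output above, and field values are assumed not to contain ", ". The sample line is an abbreviated copy of the second task in that output.

```python
import re

STAGE_FIELDS = ["SubmitTime", "TriggerTime", "StartTime", "CompleteTime"]


def parse_status_line(line):
    """Split one status line into (label, {field name: raw string value})."""
    m = re.match(r"\[(\w+)\]: BulkLoadState: (.*)$", line)
    if m is None:
        raise ValueError("not a bulk load status line")
    label, rest = m.group(1), m.group(2)
    fields = {}
    for part in rest.split(", "):
        key, _, value = part.partition("]: ")
        fields[key.lstrip("[")] = value
    return label, fields


def stage_durations(fields):
    """Seconds spent between consecutive stages; None if a stage has not ended."""
    times = [float(fields[name]) for name in STAGE_FIELDS]
    durations = {}
    for i in range(len(STAGE_FIELDS) - 1):
        start, end = times[i], times[i + 1]
        name = f"{STAGE_FIELDS[i]}->{STAGE_FIELDS[i + 1]}"
        # An unset time is printed as 0.000000 in the sample output.
        durations[name] = end - start if start > 0 and end > 0 else None
    return durations


sample = (
    "[Running]: BulkLoadState: [Range]: 2 - 3, [Phase]: 3, "
    "[SubmitTime]: 1721444377.225051, [TriggerTime]: 1721444397.024549, "
    "[StartTime]: 1721444397.029248, [CompleteTime]: 0.000000, [RestartCount]: 0"
)
label, fields = parse_status_line(sample)
durations = stage_durations(fields)
print(label, fields["Range"], durations)
```

For this task the wait between submission and triggering is about 19.8 seconds, and the last duration is `None` because the task is still running.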

Acknowledge the completion of a bulk load task

After a bulk load task completes, users can use fdbcli or the transactional API to mark the task as acknowledged in its metadata by providing a task ID and a range. DD erases the metadata of acknowledged tasks in the background.

FDBCLI example to acknowledge completion of a task: bulkload acknowledge 7599f0eadaf7df66222cc540b98bf224 1 2

This command first gets the task status of range [1, 2); if the task ID matches, the task metadata will be erased.
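The check-then-erase behavior of the acknowledge operation can be sketched on an in-memory dictionary. Everything here is illustrative: the dict layout, the function names, and the rule that only a completed task can be acknowledged are assumptions, not the actual implementation.

```python
def acknowledge(task_map, begin, end, task_id):
    """Mark the task on [begin, end) as acknowledged if its ID matches.

    Returns True when the acknowledgement was recorded.
    """
    task = task_map.get((begin, end))
    if task is None or task["task_id"] != task_id:
        return False  # no task on this range, or a different task
    if task["phase"] != "Complete":
        return False  # assumption: only completed tasks can be acknowledged
    task["phase"] = "Acknowledged"
    return True


def erase_acknowledged(task_map):
    """Stand-in for the background cleanup that DD performs."""
    for key in [k for k, t in task_map.items() if t["phase"] == "Acknowledged"]:
        del task_map[key]


tasks = {
    ("1", "2"): {"task_id": "7599f0eadaf7df66222cc540b98bf224", "phase": "Complete"}
}
print(acknowledge(tasks, "1", "2", "some-other-task-id"))  # ID mismatch
print(acknowledge(tasks, "1", "2", "7599f0eadaf7df66222cc540b98bf224"))
erase_acknowledged(tasks)
print(tasks)
```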

Backend design

We aim to develop a bulk loading backend compatible with various storage engines. Therefore, the backend design must be flexible, straightforward, and aligned with the current data distribution architecture. When a user submits a task, it is stored in the bulkLoad system key space as previously described. The Data Distributor (DD) periodically checks for new tasks. Upon detecting a new task, the DD ensures its completion by initiating data moves. These data moves signal storage servers by setting the ServerKey in the system key space. When a storage server (SS) receives a bulk load task, it downloads or copies the necessary files to a local directory and injects the data into the key-value store. Once the data move is completed, the injected data becomes accessible, and the bulk load task completes.

A bulk load task is defined by:

Range (i.e., load any data within this range from the input files)
Loading folder (i.e., the folder that contains all files to inject)
Data files (i.e., the files containing the key-value pairs of the user data to inject)
Bytes sample file (i.e., the file containing bytes sampling data, required by data distribution)
File format (currently SST files only)
File transport method (currently local copy only, for simulation tests; S3 will be supported)
File injection method (currently direct file injection into sharded RocksDB; KV injection and the SQLite engine will be supported).

Running a bulk load task has the following constraints:

When a bulk load task takes effect, it always drops all data that existed on the task range before the version at which keyServer is updated by the data move;
DD marks the range as the bulk loading range. Any shard boundary change is disabled on the loading range;
A DD restart is required to rebuild the estimated DB size (this is achieved by turning off bulkLoadMode, which triggers a DD restart).

Life cycle of a bulk load task (5 stages):

Submitted: A user creates a bulk load task, which is initialized with BulkLoadPhase::Invalid, a random task ID, and all necessary configuration. The task is then persisted to the \xff/bulkLoad/ key space;
Triggered: When DD bulk load mode is enabled, DD periodically picks up tasks from the \xff/bulkLoad/ key space. When DD starts handling a task, it persists the task phase as BulkLoadPhase::Triggered;
Running Started: DD triggers a data move to work on the task. When the data move metadata is persisted, SS starts loading the data for the task. At this point, DD persists the phase as BulkLoadPhase::Running (when updating serverKey and keyServer for the data move);
Running Complete: When all SSes complete the bulk loading, DD persists the phase as BulkLoadPhase::Complete (when updating serverKey and keyServer for the data move). At this point, the data move completes, the corresponding task completes, and the new data is visible to the client;
Acknowledged: When users observe the completion of a task, they acknowledge it, and DD clears the task metadata asynchronously.
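The life cycle above reads naturally as a linear state machine. The following sketch models only the forward path; the Python names are hypothetical, the string values are labels rather than the real enum encoding, and the existence of a dedicated "Acknowledged" phase value is an assumption (the description names it only as a stage).

```python
from enum import Enum


class Phase(Enum):
    INVALID = "Invalid"            # submitted by the user, not yet picked up
    TRIGGERED = "Triggered"        # DD has started handling the task
    RUNNING = "Running"            # data move persisted, SSes are loading
    COMPLETE = "Complete"          # all SSes finished, data is visible
    ACKNOWLEDGED = "Acknowledged"  # user confirmed; metadata can be cleared


# Forward transitions only; restarts and cancellations are not modeled.
NEXT_PHASE = {
    Phase.INVALID: Phase.TRIGGERED,
    Phase.TRIGGERED: Phase.RUNNING,
    Phase.RUNNING: Phase.COMPLETE,
    Phase.COMPLETE: Phase.ACKNOWLEDGED,
}


def advance(phase):
    """Return the phase that follows `phase`, or raise if it is terminal."""
    if phase not in NEXT_PHASE:
        raise ValueError(f"{phase.value} is the final phase")
    return NEXT_PHASE[phase]


path = [Phase.INVALID]
while path[-1] in NEXT_PHASE:
    path.append(advance(path[-1]))
print(" -> ".join(p.value for p in path))
```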

Bulk load tasks are persisted on \xff/bulkLoad/ in the form of a key-range map, so any range has at most one bulk load task at a time. Each task has a task ID that distinguishes different tasks on the same range.
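To see why a key-range map implies at most one task per range, consider the minimal in-memory sketch below. It is a generic illustration of the data structure, not FoundationDB's implementation; the class and method names are assumptions, and how the real system treats partially overlapping submissions is not specified here.

```python
import bisect


class KeyRangeMap:
    """Every key maps to exactly one value; starts[i] begins a range that
    runs up to starts[i + 1] (the last range is open-ended)."""

    def __init__(self, default=None):
        self._starts = [b""]
        self._values = [default]

    def get(self, key):
        i = bisect.bisect_right(self._starts, key) - 1
        return self._values[i]

    def insert(self, begin, end, value):
        """Set [begin, end) to value, overwriting whatever was there."""
        assert begin < end
        value_after = self.get(end)  # must stay in effect from `end` onward
        lo = bisect.bisect_left(self._starts, begin)
        hi = bisect.bisect_right(self._starts, end)
        self._starts[lo:hi] = [begin, end]
        self._values[lo:hi] = [value, value_after]

    def ranges(self):
        """List (begin, end, value) for every range that holds a value."""
        ends = self._starts[1:] + [None]
        return [
            (b, e, v)
            for b, e, v in zip(self._starts, ends, self._values)
            if v is not None
        ]


tasks = KeyRangeMap()
tasks.insert(b"1", b"2", "task-A")
tasks.insert(b"1", b"2", "task-B")  # a newer task replaces task-A on the range
print(tasks.ranges())
```

Because the second insert covers the same range, only `task-B` remains; the task ID is what lets a reader tell that the task on the range has changed.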

Key invariants of updating the bulk load task map:

Only users can set a new bulk load task on a range. When setting a new task, the task ID is always new and the phase is always BulkLoadPhase::Invalid.
Once a new bulk load task is persisted, its internal state is immutable except for the phase and the data move ID. To change the files to load, users must submit a new bulk load task that overwrites the existing task on the same range. This overwriting can happen at any time.
DD makes progress on a bulk load task by issuing data moves for the task. DDQueue guarantees that if a bulk load data move is cancelled, a new data move (possibly triggered for another reason, e.g. an unhealthy data move) takes over the bulk load task.
A task becomes outdated if and only if it is overwritten by a newer task or it is cancelled (task cancellation is included in a separate PR). When a task becomes outdated, the bulk load system is notified right before it updates the task phase metadata on disk, and the outdated task does not proceed.
Before updating and persisting a new phase, DD reads the task in the same transaction and checks whether the task ID matches. A mismatch indicates that the bulk load task served by the current data move is outdated, in which case the data move is cancelled.
When DD restarts, all existing persisted data moves for bulk load tasks are cancelled, and DD triggers new data moves for those tasks.
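The task ID check that guards every phase update can be sketched as follows. This models the check on a plain dict instead of a real transaction; the function name `persist_phase` and the dict layout are hypothetical.

```python
def persist_phase(task_map, task_range, task_id, new_phase):
    """Update the phase only if the persisted task is still the one being served.

    Returns False when the task is outdated, which tells the caller to
    cancel the data move that was working on it.
    """
    current = task_map.get(task_range)
    if current is None or current["task_id"] != task_id:
        return False
    current["phase"] = new_phase
    return True


task_map = {("1", "2"): {"task_id": "task-A", "phase": "Invalid"}}
print(persist_phase(task_map, ("1", "2"), "task-A", "Triggered"))  # still current

# The user overwrites the task on the same range with a new task ID.
task_map[("1", "2")] = {"task_id": "task-B", "phase": "Invalid"}

# The data move serving task-A now fails the check and must be cancelled.
print(persist_phase(task_map, ("1", "2"), "task-A", "Running"))
print(task_map[("1", "2")])
```

In the real system the read and the write happen in one transaction, so the check cannot race with a concurrent overwrite; the dict version only shows the decision logic.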

Key bulk loading mechanisms and invariants:

When DD (DDTracker) observes a bulk load task on a range, it splits shards so that one shard exactly matches the task range, and triggers a data move for this shard with healthy priority.
DD marks the range as the bulk loading range. Any shard boundary change is disabled on the loading range. When the bulk load task completes, boundary changes on the range are re-enabled.
After DD triggers a data move for a bulk load task, DD listens on the completion signal channel of the task and is notified once the task completes.
When DD selects a destination team for the bulk load data move, DD randomly chooses a healthy team that has no overlapping data with the source team. This mechanism guarantees that clients read consistent data at all times. Random team selection provides a certain level of load balancing.
The bulk loading signal is delivered to a destination storage server through the normal data move path, i.e., via changes to the serverKeys space. When an SS receives a bulk loading signal, it reads the bulk load task from the data move metadata and then injects the external files specified by the task. Note that if an SS shard is set to do bulk loading, it must not be a readWrite shard, because of point 4 above.
When DD restarts, bulk loading data moves are cancelled. This guarantees that no bulk load task can be alive across different DDs.
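The destination team rule can be illustrated with a short sketch. It interprets "no overlapping data with the source team" as "shares no storage server with the source team", which is an assumption, as are the function name, the tuple-of-server-IDs team representation, and the definition of a healthy team as one whose servers are all healthy.

```python
import random


def choose_destination_team(teams, source_team, healthy_servers, rng=random):
    """Pick a random healthy team that shares no server with the source team.

    Returns None when no team qualifies.
    """
    source = set(source_team)
    candidates = [
        team
        for team in teams
        if set(team).isdisjoint(source)
        and all(server in healthy_servers for server in team)
    ]
    if not candidates:
        return None
    return rng.choice(candidates)


teams = [
    ("ss1", "ss2", "ss3"),
    ("ss3", "ss4", "ss5"),  # overlaps the source team on ss3
    ("ss4", "ss5", "ss6"),
    ("ss6", "ss7", "ss8"),  # ss8 is unhealthy below
]
healthy = {"ss1", "ss2", "ss3", "ss4", "ss5", "ss6", "ss7"}
dest = choose_destination_team(
    teams, ("ss1", "ss2", "ss3"), healthy, random.Random(0)
)
print(dest)
```

Since no destination server already serves the range, none of them holds a readWrite shard for it, which is the property the bulk loading path relies on.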

Simulation tests

Currently, in a fairly frequent case, the MachineAttrition workload kills so many storage servers that the simulation gets stuck finding remote destination teams when the number of data centers is 6. To avoid this issue, we set generateFearless=false so that the maximum number of data centers is 4. In rare cases, the simulation cannot complete because CC fails to detect a halted DD, so no DD is running. This is caused by a network partition between CC and DD. To avoid this issue, the simulation disables network partition injection.

100k correctness test with 1 external timeout for sharded rocksdb engine: 20240723-173043-zhewang-ef27ca81eb64aea1 compressed=True data_size=37277237 duration=5947713 ended=100000 fail=1 fail_fast=10 max_runs=100000 pass=99999 priority=100 remaining=0 runtime=1:37:50 sanity=False started=100000 stopped=20240723-190833 submitted=20240723-173043 timeout=5400 username=zhewang

100K bulkLoading tests: 20240723-173130-zhewang-494457bfa21d04cb compressed=True data_size=37310972 duration=24308992 ended=100000 fail_fast=10 max_runs=100000 pass=100000 priority=100 remaining=0 runtime=2:24:45 sanity=False started=100000 stopped=20240723-195615 submitted=20240723-173130 timeout=5400 username=zhewang

Feature dependency

For simplicity and performance, this PR relies on the existing infrastructure of physical shard moves and ShardedRocksDB. So, temporarily, the bulk loading feature is available only with the ShardedRocksDB engine and with the following features enabled:

shard_encode_location_metadata = true
dd_physical_shard_move_probability = 1.0

However, this PR sets up a general bulk loading framework for any storage engine. In the future, we will support RocksDB, SQLite, and range-based data moves, requiring only shard_encode_location_metadata = true.

Future work

TODO: (1) Disable user traffic during bulkLoad; (2) Add file checksums; (3) Check whether all data is within the input range when injecting the data into a storage server; (4) Bulk load cancellation (clear the bulk load task metadata and trigger a new data move on the same range, which will be a normal data move without a bulk load task); (5) Allow bulk loading on a readWrite SS shard (optional).

foundationdb-ci commented 5 months ago

Result of foundationdb-pr on Linux CentOS 7

Commit ID: 54f9e73ebcd705b8efcf26b8d9b88c2848799a2a
Duration 0:04:01
Result: :x: FAILED
Error: Error while executing command: if [[ $(git diff --shortstat 2> /dev/null | tail -n1) == "" ]]; then echo "CODE FORMAT CLEAN"; else echo "CODE FORMAT NOT CLEAN"; echo; echo "THE FOLLOWING FILES NEED TO BE FORMATTED"; echo; git ls-files -m; echo; exit 1; fi. Reason: exit status 1

foundationdb-ci commented 5 months ago

Result of foundationdb-pr-clang on Linux CentOS 7

Commit ID: 54f9e73ebcd705b8efcf26b8d9b88c2848799a2a
Duration 0:04:08
Result: :x: FAILED
Error: Error while executing command: if [[ $(git diff --shortstat 2> /dev/null | tail -n1) == "" ]]; then echo "CODE FORMAT CLEAN"; else echo "CODE FORMAT NOT CLEAN"; echo; echo "THE FOLLOWING FILES NEED TO BE FORMATTED"; echo; git ls-files -m; echo; exit 1; fi. Reason: exit status 1

foundationdb-ci commented 5 months ago

Result of foundationdb-pr-clang-ide on Linux CentOS 7

Commit ID: 54f9e73ebcd705b8efcf26b8d9b88c2848799a2a
Duration 0:04:03
Result: :x: FAILED
Error: Error while executing command: if [[ $(git diff --shortstat 2> /dev/null | tail -n1) == "" ]]; then echo "CODE FORMAT CLEAN"; else echo "CODE FORMAT NOT CLEAN"; echo; echo "THE FOLLOWING FILES NEED TO BE FORMATTED"; echo; git ls-files -m; echo; exit 1; fi. Reason: exit status 1

foundationdb-ci commented 5 months ago

Result of foundationdb-pr-cluster-tests on Linux CentOS 7

Commit ID: 54f9e73ebcd705b8efcf26b8d9b88c2848799a2a
Duration 0:04:11
Result: :x: FAILED
Error: Error while executing command: if [[ $(git diff --shortstat 2> /dev/null | tail -n1) == "" ]]; then echo "CODE FORMAT CLEAN"; else echo "CODE FORMAT NOT CLEAN"; echo; echo "THE FOLLOWING FILES NEED TO BE FORMATTED"; echo; git ls-files -m; echo; exit 1; fi. Reason: exit status 1

foundationdb-ci commented 5 months ago

Result of foundationdb-pr-macos on macOS Ventura 13.x

Commit ID: 54f9e73ebcd705b8efcf26b8d9b88c2848799a2a
Duration 0:05:25
Result: :x: FAILED
Error: Error while executing command: if [[ $(git diff --shortstat 2> /dev/null | tail -n1) == "" ]]; then echo "CODE FORMAT CLEAN"; else echo "CODE FORMAT NOT CLEAN"; echo; echo "THE FOLLOWING FILES NEED TO BE FORMATTED"; echo; git ls-files -m; echo; exit 1; fi. Reason: exit status 1

foundationdb-ci commented 5 months ago

Result of foundationdb-pr-macos-m1 on macOS Ventura 13.x

Commit ID: 54f9e73ebcd705b8efcf26b8d9b88c2848799a2a
Duration 0:05:48
Result: :x: FAILED
Error: Error while executing command: if [[ $(git diff --shortstat 2> /dev/null | tail -n1) == "" ]]; then echo "CODE FORMAT CLEAN"; else echo "CODE FORMAT NOT CLEAN"; echo; echo "THE FOLLOWING FILES NEED TO BE FORMATTED"; echo; git ls-files -m; echo; exit 1; fi. Reason: exit status 1

foundationdb-ci commented 5 months ago

Result of foundationdb-pr-clang-ide on Linux CentOS 7

Commit ID: 413590fbed9da9eb16170df7a4d6bbba7798ad20
Duration 0:22:04
Result: :white_check_mark: SUCCEEDED
Error: N/A

foundationdb-ci commented 5 months ago

Result of foundationdb-pr-clang on Linux CentOS 7

Commit ID: 413590fbed9da9eb16170df7a4d6bbba7798ad20
Duration 0:37:20
Result: :x: FAILED
Error: Error while executing command: if python3 -m joshua.joshua list --stopped | grep ${ENSEMBLE_ID} | grep -q 'pass=10[0-9][0-9][0-9]'; then echo PASS; else echo FAIL && exit 1; fi. Reason: exit status 1

foundationdb-ci commented 5 months ago

Result of foundationdb-pr on Linux CentOS 7

Commit ID: 413590fbed9da9eb16170df7a4d6bbba7798ad20
Duration 0:44:19
Result: :x: FAILED
Error: Error while executing command: if python3 -m joshua.joshua list --stopped | grep ${ENSEMBLE_ID} | grep -q 'pass=10[0-9][0-9][0-9]'; then echo PASS; else echo FAIL && exit 1; fi. Reason: exit status 1

foundationdb-ci commented 5 months ago

Result of foundationdb-pr-cluster-tests on Linux CentOS 7

Commit ID: 413590fbed9da9eb16170df7a4d6bbba7798ad20
Duration 0:55:30
Result: :white_check_mark: SUCCEEDED
Error: N/A

foundationdb-ci commented 5 months ago

Result of foundationdb-pr-clang-ide on Linux CentOS 7

Commit ID: 8cd2507b9f8c2b31b9e5aba2acfd33ef26d5bbe2
Duration 0:21:38
Result: :white_check_mark: SUCCEEDED
Error: N/A

foundationdb-ci commented 5 months ago

Result of foundationdb-pr-macos-m1 on macOS Ventura 13.x

Commit ID: 8cd2507b9f8c2b31b9e5aba2acfd33ef26d5bbe2
Duration 0:35:03
Result: :white_check_mark: SUCCEEDED
Error: N/A

foundationdb-ci commented 5 months ago

Result of foundationdb-pr on Linux CentOS 7

Commit ID: 8cd2507b9f8c2b31b9e5aba2acfd33ef26d5bbe2
Duration 0:44:27
Result: :x: FAILED
Error: Error while executing command: if python3 -m joshua.joshua list --stopped | grep ${ENSEMBLE_ID} | grep -q 'pass=10[0-9][0-9][0-9]'; then echo PASS; else echo FAIL && exit 1; fi. Reason: exit status 1

foundationdb-ci commented 5 months ago

Result of foundationdb-pr-macos on macOS Ventura 13.x

Commit ID: 8cd2507b9f8c2b31b9e5aba2acfd33ef26d5bbe2
Duration 0:47:43
Result: :white_check_mark: SUCCEEDED
Error: N/A

foundationdb-ci commented 5 months ago

Result of foundationdb-pr-cluster-tests on Linux CentOS 7

Commit ID: 8cd2507b9f8c2b31b9e5aba2acfd33ef26d5bbe2
Duration 0:55:41
Result: :white_check_mark: SUCCEEDED
Error: N/A

foundationdb-ci commented 5 months ago