feat(rapid): Repopulate the changes from the primary to the secondary

Analytical workloads run on the AP engine. ShannonBase uses a one-system, two-copies architecture: one copy of the data lives in the transactional processing engine (the primary engine), and another copy serves analytical workloads (the secondary engine). Consistency between these two copies is therefore one of the most important concerns. We usually use data freshness to describe the difference between the data in the primary engine and the data in the secondary engine.

The TP engine runs DML statements, which change data in the primary engine. In practice, the question is how to notify the secondary engine and send it those changes, so that it keeps track of the latest data in the primary engine.

In this part, we explore how the primary engine propagates changes to the secondary engine: what mechanism we use, and why we chose it to implement change repopulation between the primary engine and the secondary engine.

Overview

One of the most important issues in HTAP is data synchronization between the transactional processing engine and the analytical processing engine. Without synchronization, a system could not be called an HTAP database. Consider the scenario below:

When a user buys some items, the system inserts records into the transaction table. The manager then wants up-to-date analytical reports on sales, stock, and so on. The traditional solution is to run ETL jobs that move the data from the TP database to the AP database, and then do the analytical processing on the AP database. With HTAP, all of this work can be done in one database instance.

There are two ways to implement HTAP, or more specifically, change propagation: a loosely-coupled architecture and an integrated-coupled architecture. ShannonBase uses the integrated-coupled architecture. In MySQL, we can use either the binlog or the redo log to synchronize changes. Before discussing our solution, we first give some background on the redo log and the binlog.

Redo log

The MySQL documentation describes the redo log as follows:

The redo log is a disk-based data structure used during crash recovery to correct data written by incomplete transactions. During normal operations, the redo log encodes requests to change table data that result from SQL statements or low-level API calls. Modifications that did not finish updating data files before an unexpected shutdown are replayed automatically during initialization and before connections are accepted. For information about the role of the redo log in crash recovery, see Section 15.18.2, “InnoDB Recovery”.

The redo log is physically represented on disk by redo log files. Data that is written to redo log files is encoded in terms of records affected, and this data is collectively referred to as redo. The passage of data through redo log files is represented by an ever-increasing LSN value. Redo log data is appended as data modifications occur, and the oldest data is truncated as the checkpoint progresses.

From the description above, we know that the redo log describes the changes made by SQL statements or API calls. Parsing the redo log is therefore one way to populate changes from InnoDB to the rapid engine, and we can reuse some functions from the recovery module. An LSN, rapid_lsn, describes the point up to which the rapid engine has replayed the redo log; it is persisted in the system tablespace.
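To make the role of rapid_lsn concrete, here is a minimal sketch of how replay progress and data freshness could be tracked with two LSNs. The `ReplayProgress` type, its members, and its method names are hypothetical and are not taken from the ShannonBase code; only the idea of an ever-increasing LSN and a `rapid_lsn` replay position comes from the text above.

```cpp
#include <atomic>
#include <cstdint>

using lsn_t = std::uint64_t;  // assumption: an LSN is a 64-bit byte offset

// Hypothetical helper: tracks how far the secondary engine has replayed
// relative to what the primary engine has logged.
struct ReplayProgress {
  std::atomic<lsn_t> primary_lsn{0};  // latest LSN produced by the primary
  std::atomic<lsn_t> rapid_lsn{0};    // LSN up to which rapid has replayed

  // Record that rapid has applied redo up to `lsn`; never moves backwards.
  void advance_rapid(lsn_t lsn) {
    lsn_t cur = rapid_lsn.load(std::memory_order_acquire);
    while (cur < lsn && !rapid_lsn.compare_exchange_weak(
                            cur, lsn, std::memory_order_acq_rel)) {
      // `cur` was refreshed by the failed exchange; retry while it is older.
    }
  }

  // Amount of redo that rapid still has to replay (0 means fully caught up).
  lsn_t lag() const {
    const lsn_t p = primary_lsn.load(std::memory_order_acquire);
    const lsn_t r = rapid_lsn.load(std::memory_order_acquire);
    return p > r ? p - r : 0;
  }
};
```

In this sketch, `lag()` is one possible numeric proxy for data freshness: the smaller the gap between the two LSNs, the fresher the secondary copy.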

Binlog

The binary log contains “events” that describe database changes such as table creation operations or changes to table data. It also contains events for statements that potentially could have made changes (for example, a DELETE which matched no rows), unless row-based logging is used. The binary log also contains information about how long each statement took that updated data. The binary log has two important purposes: For replication, the binary log on a replication source server provides a record of the data changes to be sent to replicas. The source sends the information contained in its binary log to its replicas, which reproduce those transactions to make the same data changes that were made on the source. See Section 17.2, “Replication Implementation”.

Certain data recovery operations require use of the binary log. After a backup has been restored, the events in the binary log that were recorded after the backup was made are re-executed. These events bring databases up to date from the point of the backup. See Section 7.5, “Point-in-Time (Incremental) Recovery”.

Another way to repopulate changes from InnoDB to the rapid engine is to use the binlog. From our point of view, however, the binlog has some drawbacks as a tool for repopulating changes. The binlog is a logical log: using it involves writing the binlog and the relay log, parsing the binlog, and then executing the statements parsed from it. That is a long call stack and not an efficient path.

Hence, under heavy workloads, the changes may not be repopulated in time.

Which one do we use?

From the brief discussion above, we conclude that the redo log is probably the best choice for repopulating changes from InnoDB to rapid.

4.1 Implementation

4.1.1 Basic concept

/** Redo log - single data structure with state of the redo log system.
In future, one could consider splitting this to multiple data structures. */
struct alignas(ut::INNODB_CACHE_LINE_SIZE) log_t {
...
  /** The recent written buffer.
  Protected by: locking sn not to add. */
  alignas(ut::INNODB_CACHE_LINE_SIZE) Link_buf<lsn_t> recent_written;

  /** Used for pausing the log writer threads.
  When paused, each user thread should write log as in the former version. */
  std::atomic_bool writer_threads_paused;

  /** Some threads waiting for the ready for write lsn by closer_event. */
  lsn_t current_ready_waiting_lsn;

  /** current_ready_waiting_lsn is waited using this sig_count. */
  int64_t current_ready_waiting_sig_count;

  /** The recent closed buffer.
  Protected by: locking sn not to add. */
  alignas(ut::INNODB_CACHE_LINE_SIZE) Link_buf<lsn_t> recent_closed;
  ...
  /** Maximum sn up to which there is free space in both the log buffer
  and the log files. This is limitation for the end of any write to the
  log buffer. Threads, which are limited need to wait, and possibly they
  hold latches of dirty pages making a deadlock possible.
  Protected by: writer_mutex (writes). */
  alignas(ut::INNODB_CACHE_LINE_SIZE) atomic_sn_t buf_limit_sn;

  /** Up to this lsn, data has been written to disk (fsync not required).
  Protected by: writer_mutex (writes). */
  alignas(ut::INNODB_CACHE_LINE_SIZE) atomic_lsn_t write_lsn;
  ...
  alignas(ut::INNODB_CACHE_LINE_SIZE) os_event_t *flush_events;

  /** Number of entries in the array with events. */
  size_t flush_events_size;

  /** This event is in the reset state when a flush is running;
  a thread should wait for this without owning any of redo mutexes,
  but NOTE that to reset this event, the thread MUST own the writer_mutex */
  os_event_t old_flush_event;

  /** Up to this lsn data has been flushed to disk (fsynced). */
  alignas(ut::INNODB_CACHE_LINE_SIZE) atomic_lsn_t flushed_to_disk_lsn;
...
  alignas(ut::INNODB_CACHE_LINE_SIZE) atomic_lsn_t rapid_lsn;
  alignas(ut::INNODB_CACHE_LINE_SIZE) os_event_t *rapid_events;
  ...

In log_t, we add alignas(ut::INNODB_CACHE_LINE_SIZE) atomic_lsn_t rapid_lsn; to record the point up to which data has been repopulated to rapid, and we initialize it in log_sys_create() and log_start().

For a diagram of how the redo log is written, and for more information, refer to the redo log structure.

The log record structure is as follows:

Type  + Space ID + Page Number + Body

We only consider the following redo log types, which correspond to DML operations:

MLOG_REC_INSERT
MLOG_REC_CLUST_DELETE_MARK
MLOG_REC_UPDATE_IN_PLACE
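The type filter can be sketched as follows. This is an illustration only: the numeric values in `RecType` and the value of `kSingleRecFlag` are assumptions made for the example, and the real constants come from InnoDB's `mlog_id_t` enum and `MLOG_SINGLE_REC_FLAG`, which may differ between MySQL versions. The function name `is_dml_record` is hypothetical.

```cpp
#include <cstdint>

// Placeholder values for illustration; the real ones live in InnoDB's
// mlog_id_t enum and may differ between MySQL versions.
enum class RecType : std::uint8_t {
  Insert = 9,            // stands in for MLOG_REC_INSERT
  ClustDeleteMark = 10,  // stands in for MLOG_REC_CLUST_DELETE_MARK
  UpdateInPlace = 13,    // stands in for MLOG_REC_UPDATE_IN_PLACE
};

// Assumption: the top bit of the type byte marks a single-record
// mini-transaction, mirroring MLOG_SINGLE_REC_FLAG.
constexpr std::uint8_t kSingleRecFlag = 0x80;

// Strips the flag bit from the leading type byte of a log record and
// reports whether the record is one of the three DML types of interest.
inline bool is_dml_record(std::uint8_t type_byte) {
  const auto type = static_cast<std::uint8_t>(type_byte & ~kSingleRecFlag);
  switch (static_cast<RecType>(type)) {
    case RecType::Insert:
    case RecType::ClustDeleteMark:
    case RecType::UpdateInPlace:
      return true;
  }
  return false;  // every other record type is ignored by the populator
}
```

A record whose type byte fails this check would simply be skipped, so only insert, delete-mark, and in-place update records reach the rapid engine.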

After that, we reuse the recovery logic to parse the redo log and extract the data changed by the SQL statement, and then apply that data to the rapid engine.

recv_parse_log_recs(), recv_single_rec, recv_multi_rec, etc. are used to parse the redo log during recovery. We can reuse this code to implement our change population logic.

4.1.2 Background thread

A new background worker is launched to do the actual change repopulation. When a new mlog record is generated, this worker is notified so that it parses the record and applies the changes to the rapid engine.
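The notify-and-apply pattern described here can be sketched with standard C++ primitives. This is not the ShannonBase implementation, which uses InnoDB's os_event and thread facilities; the `ChangeWorker` class, its methods, and the use of strings as stand-ins for log records are all hypothetical.

```cpp
#include <condition_variable>
#include <deque>
#include <functional>
#include <mutex>
#include <string>
#include <thread>
#include <utility>

// Hypothetical worker: sleeps until a producer hands it a record, then
// applies the record outside the lock.
class ChangeWorker {
 public:
  using ApplyFn = std::function<void(const std::string &)>;

  explicit ChangeWorker(ApplyFn apply)
      : apply_(std::move(apply)), thread_([this] { run(); }) {}

  ~ChangeWorker() { stop(); }

  // Producer side: enqueue a record and wake the worker.
  void notify(std::string record) {
    {
      std::lock_guard<std::mutex> lock(mutex_);
      pending_.push_back(std::move(record));
    }
    cv_.notify_one();
  }

  // Drains whatever is still pending, then joins the worker thread.
  void stop() {
    {
      std::lock_guard<std::mutex> lock(mutex_);
      stopping_ = true;
    }
    cv_.notify_one();
    if (thread_.joinable()) thread_.join();
  }

 private:
  void run() {
    std::unique_lock<std::mutex> lock(mutex_);
    for (;;) {
      cv_.wait(lock, [this] { return stopping_ || !pending_.empty(); });
      if (pending_.empty()) return;  // only reachable once stopping_ is set
      std::string record = std::move(pending_.front());
      pending_.pop_front();
      lock.unlock();
      apply_(record);  // apply the change without holding the lock
      lock.lock();
    }
  }

  std::mutex mutex_;
  std::condition_variable cv_;
  std::deque<std::string> pending_;
  bool stopping_ = false;
  ApplyFn apply_;
  std::thread thread_;  // declared last so it starts after the other members
};
```

Applying the record outside the lock keeps producers from blocking on a slow apply step, which matters when the producer is on the transaction's write path.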

4.1.3 Persistence of rapid_lsn

rapid_lsn will be written to the trx_sys page of the system tablespace.
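As a rough illustration of persisting an LSN into a page, the sketch below stores a 64-bit value in big-endian byte order, which is the convention InnoDB uses for on-disk integers. The offset `kRapidLsnOffset` and the helper names are hypothetical; the actual location of rapid_lsn within the trx_sys page is not specified here.

```cpp
#include <cstddef>
#include <cstdint>

using lsn_t = std::uint64_t;

// Hypothetical offset inside the page, chosen only for this example.
constexpr std::size_t kRapidLsnOffset = 64;

// Writes `lsn` as 8 big-endian bytes starting at page[offset].
inline void write_lsn(unsigned char *page, std::size_t offset, lsn_t lsn) {
  for (std::size_t i = 0; i < 8; ++i) {
    page[offset + 7 - i] = static_cast<unsigned char>(lsn & 0xFF);
    lsn >>= 8;
  }
}

// Reads back the 8 big-endian bytes starting at page[offset].
inline lsn_t read_lsn(const unsigned char *page, std::size_t offset) {
  lsn_t lsn = 0;
  for (std::size_t i = 0; i < 8; ++i) {
    lsn = (lsn << 8) | page[offset + i];
  }
  return lsn;
}
```

On restart, reading this value back would tell the populator where to resume replay instead of starting from the beginning of the redo log.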

4.2 Implementation details

Background thread

After the secondary_load command is executed, a new thread, called the repopulation thread, is launched immediately.

void Populator::start_change_populate_threads(log_t* log) {
Populator::log_rapid_thread =
  os_thread_create(rapid_populate_thread_key, 0, parse_log_func, log);

ShannonBase::Populate::pop_started = true;
Populator::log_rapid_thread.start();
}

The thread function is defined as below:

static void parse_log_func (log_t *log_ptr) {
current_thd = (current_thd == nullptr) ? new THD(false) : current_thd;
THR_MALLOC = (THR_MALLOC == nullptr) ? &current_thd->mem_root : THR_MALLOC;

os_event_reset(log_ptr->rapid_events[0]);
//here we have a notifier: when checkpoint_lsn/flushed_lsn > rapid_lsn, start to populate
while (pop_started.load(std::memory_order_seq_cst)) {
auto stop_condition = [&](bool wait) {
  if (population_buffer->readAvailable()) {
    return true;
  }

  if (wait) { //do something while waiting
  }

  return false;
};

os_event_wait_for(log_ptr->rapid_events[0], MAX_LOG_POP_SPIN_COUNT,
                  std::chrono::microseconds{100}, stop_condition);

byte* from_ptr = population_buffer->peek();
byte* end_ptr = from_ptr + population_buffer->readAvailable();

uint parsed_bytes = parse_log.parse_redo(from_ptr, end_ptr);
population_buffer->remove(parsed_bytes);
} //while(pop_started)

pop_started.store(false, std::memory_order_seq_cst);
THR_MALLOC = nullptr;
if (current_thd) {
delete current_thd;
current_thd = nullptr;
}
}

The thread waits for the event to be signaled. In the log_buffer_write function, when a new redo log record is written into the redo log buffer, a copy of it is also written into the population buffer. After the write finishes, the population thread is notified to apply the changes.

lsn_t log_buffer_write(log_t &log, const byte *str, size_t str_len,
                       lsn_t start_lsn) {

...

    log_sync_point("log_buffer_write_before_memcpy");

    /* This is the critical memcpy operation, which copies data
    from internal mtr's buffer to the shared log buffer. */
    std::memcpy(ptr, str, len);
    auto type = mlog_id_t(*ptr & ~MLOG_SINGLE_REC_FLAG);
    if (ShannonBase::Populate::pop_started &&
        ShannonBase::Populate::population_buffer && (
        type == MLOG_REC_INSERT )) {
      ShannonBase::Populate::population_buffer->writeBuff(str, len);
      os_event_set(log.rapid_events[0]);
    }
...

Note: the rapid population buffer is a lock-free ring buffer.
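For readers unfamiliar with the idea, here is a minimal sketch of a lock-free byte ring buffer. It assumes exactly one producer thread and one consumer thread, which is a simplification: log_buffer_write can be called from many user threads, so the real buffer may need a different design. The `SpscByteRing` class and its method names are hypothetical and do not mirror the actual population buffer API. Unlike the peek()/remove() pair used in the thread function above, `read()` here copies bytes out, which sidesteps the case where the readable region wraps around the end of the storage.

```cpp
#include <atomic>
#include <cstddef>
#include <cstdint>

// Hypothetical single-producer/single-consumer byte ring buffer.
template <std::size_t Capacity>
class SpscByteRing {
  static_assert(Capacity != 0 && (Capacity & (Capacity - 1)) == 0,
                "Capacity must be a power of two");

 public:
  // Producer only: copies `len` bytes in, or returns false if they do not fit.
  bool write(const std::uint8_t *src, std::size_t len) {
    const std::size_t head = head_.load(std::memory_order_relaxed);
    const std::size_t tail = tail_.load(std::memory_order_acquire);
    if (Capacity - (head - tail) < len) return false;
    for (std::size_t i = 0; i < len; ++i) {
      buf_[(head + i) & (Capacity - 1)] = src[i];
    }
    // Publish the bytes: the consumer's acquire load of head_ sees them.
    head_.store(head + len, std::memory_order_release);
    return true;
  }

  // Consumer only: number of bytes that can currently be read.
  std::size_t read_available() const {
    return head_.load(std::memory_order_acquire) -
           tail_.load(std::memory_order_relaxed);
  }

  // Consumer only: copies out up to `len` bytes and frees their slots.
  std::size_t read(std::uint8_t *dst, std::size_t len) {
    const std::size_t tail = tail_.load(std::memory_order_relaxed);
    const std::size_t avail = head_.load(std::memory_order_acquire) - tail;
    const std::size_t n = len < avail ? len : avail;
    for (std::size_t i = 0; i < n; ++i) {
      dst[i] = buf_[(tail + i) & (Capacity - 1)];
    }
    // Release the slots: the producer's acquire load of tail_ sees them free.
    tail_.store(tail + n, std::memory_order_release);
    return n;
  }

 private:
  std::uint8_t buf_[Capacity];
  std::atomic<std::size_t> head_{0};  // total bytes ever written
  std::atomic<std::size_t> tail_{0};  // total bytes ever consumed
};
```

The two counters only ever grow, and the power-of-two capacity lets the index be computed with a mask, so neither side needs a lock or a compare-and-swap loop under the one-producer, one-consumer assumption.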
