THSS-DB / TDB

Educational Database Management System for Software School of Tsinghua University
Mulan Permissive Software License, Version 2
13 stars 17 forks source link

fix: add force copy ability of class Value; unmatch between IndexScanPhysicalOperator and index_->create_scanner #11

Closed stackByStack closed 4 months ago

stackByStack commented 5 months ago

问题背景:

  1. 我现在想给单列做不等式的索引优化。
  2. 原来的api有这个能力(设定左右边界),但是实现有问题
  3. 具体体现如下
    1. 假设我现在 select * from index_lab where id > 99997;
    2. 那么我对应的有
      IndexScanPhysicalOperator *index_scan_oper = new IndexScanPhysicalOperator(table, index, table_get_oper.readonly(), 
      &value, false /* left_inclusive */, 
      nullptr, false /* right_inclusive */ );

之所以是nullptr是因为BplusTreeScanner::open中就是这样考虑的(右边无边界)

问题在于源码中没有能够将nullptr传递下去而是默认传递其内部的rightvalue.data()(在上述例子的情况下),导致BplusTreeScanner::open失败

修改思路:

添加left_null_right_null_的标识量,修改index_->create_scanner的传参逻辑即可,不影响原来的逻辑。


Background:

  1. I am currently looking to optimize inequality for a single-column index.
  2. The original API has the capability to set left and right boundaries, but there is a problem with the implementation.
  3. The specific issue is as follows:
    1. Assume I am now querying select * from index_lab where id > 99997;
    2. The corresponding C++ code snippet is:
      IndexScanPhysicalOperator *index_scan_oper = new IndexScanPhysicalOperator(table, index, table_get_oper.readonly(), 
      &value, false /* left_inclusive */, 
      nullptr, false /* right_inclusive */ );

The reason why nullptr is used is that this is how BplusTreeScanner::open is set (right boundary is unbounded).

The problem is that the source code does not allow nullptr to be passed down but instead defaults to passing its internal rightvalue.data() (in the aforementioned example), which causes BplusTreeScanner::open to fail.

Proposed Modification:

Add flags for left_null_ and right_null_, and modify the parameter logic in index_->create_scanner accordingly. This should not affect the original logic.

stackByStack commented 4 months ago

问题背景

  1. 我现在想做多列索引的等值查找。
  2. 在等值查找的情况下,IndexScanPhysicalOperator需要传入对应值的Value对象
  3. 在初始化Value对象前需要拼接多列数据,但显然拼接出的数据不是Value的常规类型,需要用传入字符数组的形式保存
  4. 问题在于,拼接数据对应的字符数组几乎不可避免地会存在0,而在Value的实现中,在拷贝字符数组时会通过0的存在判定是否截断,导致我们无法拼接出插入索引时的数据形式
  5. 另外需要修改find_index_by_field以支持multi-field index的查找

修改思路

  1. 添加一个新的初始化函数,这个函数会先将0修改为1(或其他非0数)解决0截断问题,完成拷贝后再将0复原(和插入索引时的数据保持一致)
  2. 需要注意的是,该初始化函数的安全性需由使用者保证,因为它依赖于传入的len决定读取字符数组的长度,同学们有其他办法也可以考虑pr来fix这个workaround。
  3. find_index_by_field修改为查找index.multi_fields(),与单列索引的情况相统一。

Background

  1. I'm currently looking to perform an equality search on a multi-column index.
  2. For an equality search using the IndexScanPhysicalOperator, a Value object corresponding to the required value must be provided.
  3. Before initializing the Value object, I need to concatenate the data from multiple columns. However, this concatenated data will not be in a typical format that the Value object can handle and thus needs to be stored in the form of a character array.
  4. The problem is that the inherent presence of null bytes (0s) in the concatenated character array is almost inevitable, and since the Value implementation decides whether to truncate copying the array based on the presence of null bytes, it’s impossible to concatenate the data in the same format as when it was inserted into the index.
  5. In addition, find_index_by_field needs to be modified to support multi-field index search.

Proposed Modification

  1. Add a new initialization function that will first replace null bytes (0s) with a non-null value (perhaps 1 or another non-zero number) to address the truncation issue. After the copy is completed, it will then revert the null bytes to their original state (thus maintaining consistency with the data format used when inserted into the index).
  2. It is important to note that the safety of this initialization function depends on the users, as it relies on the length parameter (len) provided to determine how much of the character array to read. If you have other solutions, you might consider submitting a PR to fix this workaround.
  3. find_index_by_field is modified to find index.multi_fields(), which is consistent with the case of single-column index.