Open bhaskarshashank99 opened 1 month ago
This is a good suggestion, and I think it's quite complex to add bitmap indexes to sub-columns of flat JSON:
so, it's eed a very detailed discussion, if we want to supported bitmap index on subcolumns
we can use generated column
and bitmap index
to accomplish this requirement, as you can create index on a generated column
@Seaven : Thanks for your reply. I have thought through your concerns and proposing my below solutions for it. Before that let me give some background about setup. Currently we have several TB's of JSON data and 90% percent of JSON column data is partially schematized. So the below solution makes sense for users like me. After doing code browsing, I feel the below approach should mostly solve all the problems you mentioned.
Metadata issues: 1) Defining indexes on a path of JSON column: We need to build an infra to define a bitmap index on a sub-path of JSON column. Meaning during create table, we can add a new type of index for a JSON column and then specialize the sub-path indexes. Let me give a rough example:
CREATE TABLE `t1` (
`key` INT
`properties` JSON,
INDEX flat_index(properties) using FLAT_JSON_INDEX([{"/age", "INT", "BITMAP"}]
)
DUPLICATE KEY(`k1`)
COMMENT "OLAP"
DISTRIBUTED BY HASH(`k1`) BUCKETS 1
PROPERTIES ("replication_num" = "1");
Index Semantics:
---------------------
FLAT_JSON_INDEX[{"path", "data_type", "index_type"}]
When a user defines such a index, there is strict type checking for the corresponding sub-column path. If ingestion payload doesn't contain the corresponding path, we would append default nulls.
Due to the above semantics, the index writer will always create a column writer for the path and would then build the BITMAP index. So whatever paths are defined in the flat_json_index, sort of become the sub-columns and a partial schema.
The predicates that are accessing the columns having flat json index, can be normalized and be applied during segment initialization rather than pushing it for expression evaluation in the chunk iteration phase.
Bitmap iterators Since now we have the metadata of the paths hanging from the parent column, bitmap iterators on the sub-column readers can be initialized appropriately to facilitate predicate evaluation.
@Seaven: It would be helpful to let me know your thoughts on this. I am sure I might have missed some paths.
@bhaskarshashank99 In simple terms, I think your design is sound.
We indeed need support for special syntax in the FE, which will involve two feature: a. Support for flat JSON to specify extracted sub-columns and their types. b. Creating bitmap indexes on sub-columns, which might also need to be considered for support on struct sub-columns in the future.
In the BE, during compaction and reading, we need to handle cases where both flat JSON and raw JSON are present.
BTW, I think you can first use generated columns
and bitmap indexes
to test the effectiveness and check bloom index meets your expectations.
and @wangsimo0 @wyb cc
@Seaven: Thanks for the review. I will check the generated columns feature too.
In the BE, during compaction and reading, we need to handle cases where both flat JSON and raw JSON are present.
With respect to the above comment, can you help with some PR or unit testcases that cover this code path?
Just like the test process:
and step 4 can be executed before either Step 2/3 for segments, it can only be flat JSON or raw JSON, but between segments, there may be a mix of raw JSON and flat JSON.。 means that the handling of indexes for each segment might be different. This may not affect your implementation(I guess, because index only used in segment level), but it might require attention and verification
refs read process:
@Seaven: Thanks for your inputs. I need one suggestion when the user adds an index on existing table. Scenario is as below:
a
is flattened as INT.a
is different, it would be detected as JSON in destination segment C.
What is your suggestion for this? I was thinking in such scenarios, the ALTER JOB
should fail.Potential Data loss Issue in the current 3.3.4 main: I see that in
template <bool REMAIN>
bool JsonFlattener::_flatten_json(const vpack::Slice& value, const JsonFlatPath* root, vpack::Builder* builder,
uint32_t* hit_count)
{
..
auto func = **flat_json::JSON_EXTRACT_FUNC.at(child->second->type);**
...
}
void extract_number(const vpack::Slice* json, NullableColumn* result) {
try {
if (LIKELY(json->isNumber() || json->isString())) {
auto st = get_number_from_vpjson<TYPE>(*json);
if (st.ok()) {
result->null_column()->append(0);
down_cast<RunTimeColumnType<TYPE>*>(result->data_column().get())->append(st.value());
} else {
result->append_nulls(1);
}
} else if (json->isNone() || json->isNull()) {
result->append_nulls(1);
} else if (json->isBool()) {
result->null_column()->append(0);
down_cast<RunTimeColumnType<TYPE>*>(result->data_column().get())->append(json->getBool());
} else {
result->append_nulls(1);
}
} catch (const vpack::Exception& e) {
result->append_nulls(1);
}
}
If the extraction is not successfully, I see in the catch we are appending null as by default. Wouldn't this result in loss of data?
@bhaskarshashank99
Thanks @Seaven. I will go through it.
Feature request
The attached Bitmap indexes on flat json sub-columns.txt document presents the concept of enabling bitmap indexes on subcolumns of a JSON column by extending the Flat JSON feature introduced in version 3.3.4 and beyond.
Describe the solution you'd like By treating the subcolumns identified by FlatJson feature as regular columns, we can leverage this to build bitmap indexes. The attached document highlights similar solution and current first level blockers that need to be fixed.