StarRocks / starrocks

The world's fastest open query engine for sub-second analytics both on and off the data lakehouse. With the flexibility to support nearly any scenario, StarRocks provides best-in-class performance for multi-dimensional analytics, real-time analytics, and ad-hoc queries. A Linux Foundation project.
https://starrocks.io
Apache License 2.0

[Enhancement] Use dynamic batch size for simdjson to parse multiple JSON documents #53056

Open srlch opened 2 days ago

srlch commented 2 days ago

Why I'm doing:

In the current implementation, JsonDocumentStreamParser uses simdjson::ondemand::parser::iterate_many to parse multiple JSON documents. This API requires the caller to pass the maximum size of a single JSON document in the given file (call it max_json_lenght_in_file) so that it can allocate a memory chunk for the parsing process. The problem is that the caller passes the file size instead of max_json_lenght_in_file, so a huge memory chunk is allocated (much of which may never be used), roughly 5 to 6 times the file size. This is a large memory amplification.
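To make the gap concrete, here is a minimal sketch that computes the longest single document in a buffer, which is the quantity iterate_many is meant to receive. It assumes newline-delimited JSON, which this PR does not state, and `longest_document_length` is a hypothetical helper that exists in neither StarRocks nor simdjson.

```cpp
#include <algorithm>
#include <cstddef>
#include <string_view>

// Hypothetical helper (not part of StarRocks or simdjson).
// Assumption: each '\n'-separated line holds exactly one JSON document.
// Returns the length in bytes of the longest line, excluding the newline.
std::size_t longest_document_length(std::string_view data) {
    std::size_t longest = 0;
    std::size_t start = 0;
    while (start < data.size()) {
        std::size_t end = data.find('\n', start);
        if (end == std::string_view::npos) {
            end = data.size();  // last line has no trailing newline
        }
        longest = std::max(longest, end - start);
        start = end + 1;
    }
    return longest;
}
```

For a file made of many small documents, this value can be far below the file size, which is why passing the file size as the batch size over-allocates.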

What I'm doing:

Introduce json_parse_many_batch_size to control the batch_size passed into simdjson::ondemand::parser::iterate_many. If json_parse_many_batch_size > 0, use it as the batch size; otherwise use simdjson::dom::DEFAULT_BATCH_SIZE. For JsonDocumentStreamParser::get_current, parse the document using a relatively small buffer. If an exception is thrown because the buffer is too small, increase the buffer size and retry.
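The selection rule and the retry loop described above can be sketched roughly as follows. This is not the code from this PR: it uses only the standard library, `kDefaultBatchSize` stands in for simdjson::dom::DEFAULT_BATCH_SIZE with an illustrative value, `BufferTooSmall` is a hypothetical error type, and doubling is an assumed growth policy (the description only says the buffer size is increased).

```cpp
#include <algorithm>
#include <cstddef>
#include <stdexcept>

// Stand-in for simdjson::dom::DEFAULT_BATCH_SIZE; the value here is illustrative.
constexpr std::size_t kDefaultBatchSize = 1000000;

// Use the configured batch size when it is positive, otherwise the default.
std::size_t choose_batch_size(long long json_parse_many_batch_size) {
    return json_parse_many_batch_size > 0
               ? static_cast<std::size_t>(json_parse_many_batch_size)
               : kDefaultBatchSize;
}

// Hypothetical error type signalling that the parse buffer was too small.
struct BufferTooSmall : std::runtime_error {
    BufferTooSmall() : std::runtime_error("buffer too small") {}
};

// Calls parse(capacity), starting small and doubling (assumed policy) each
// time BufferTooSmall is thrown. Returns the capacity that succeeded, and
// rethrows once max_capacity has already been tried.
template <typename ParseFn>
std::size_t parse_with_retry(ParseFn&& parse, std::size_t initial_capacity,
                             std::size_t max_capacity) {
    std::size_t capacity = std::max<std::size_t>(initial_capacity, 1);
    for (;;) {
        try {
            parse(capacity);
            return capacity;
        } catch (const BufferTooSmall&) {
            if (capacity >= max_capacity) {
                throw;  // nothing larger left to try
            }
            // Double without overflowing, clamped to max_capacity.
            capacity = (capacity > max_capacity / 2) ? max_capacity : capacity * 2;
        }
    }
}
```

Starting small keeps the common case cheap, and the upper bound prevents a malformed or oversized document from triggering unbounded growth.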

Fixes https://github.com/StarRocks/StarRocksTest/issues/8636

What type of PR is this:

Does this PR entail a change in behavior?

If yes, please specify the type of change:

Checklist:

Bugfix cherry-pick branch check:

github-actions[bot] commented 1 day ago

[Java-Extensions Incremental Coverage Report]

:white_check_mark: pass : 0 / 0 (0%)

github-actions[bot] commented 1 day ago

[FE Incremental Coverage Report]

:white_check_mark: pass : 0 / 0 (0%)

github-actions[bot] commented 1 day ago

[BE Incremental Coverage Report]

:white_check_mark: pass : 28 / 33 (84.85%)

file detail

| | path | covered_line | new_line | coverage | not_covered_line_detail |
| :-: | :- | :-: | :-: | :-: | :- |
| :large_blue_circle: | src/exec/json_parser.h | 0 | 1 | 00.00% | [77] |
| :large_blue_circle: | src/exec/json_parser.cpp | 28 | 32 | 87.50% | [59, 67, 86, 91] |