Should we disable Simdjson's UTF-8 validation?

facebookincubator / velox

A composable and fully extensible C++ execution engine library for data management systems.

https://velox-lib.io/

Apache License 2.0

3.52k stars 1.15k forks source link

Should we disable Simdjson's UTF-8 validation? #10639

Closed PHILO-HE closed 3 months ago

PHILO-HE commented 3 months ago

Description

Simdjson has a build option called SIMDJSON_SKIPUTF8VALIDATION to control whether to check UTF-8 encoding validity for JSON input. It is OFF by default to not allow illegal UTF-8 encoding. But we recently found Spark disregards illegal UTF-8 encoding in JSON string. If Presto behaves same as Spark on this, we can set SIMDJSON_SKIPUTF8VALIDATION=ON at build time, then Simdjson will skip this check at runtime.

@mbasmanova

mbasmanova commented 3 months ago

Spark disregards illegal UTF-8 encoding in JSON string.

@PHILO-HE Would you clarify what that means exactly? Does Spark allow invalid UTF-8 encoding? Do you have an example? We can try the same example in Presto.

PHILO-HE commented 3 months ago

Spark disregards illegal UTF-8 encoding in json string.

@PHILO-HE Would you clarify what that means exactly? Does Spark allow invalid UTF-8 encoding? Do you have an example? We can try the same example in Presto.

@mbasmanova, yes, Spark allows invalid UTF-8 encoding. If input contains invalid utf-8 encoded data, NULL is returned when Gluten + Velox is used. Only if I set SIMDJSON_SKIPUTF8VALIDATION=ON for simdjson build, the result is consistent with Spark.

I tried testing Presto with the same parquet file containing invalid utf-8 encoding. Looks Presto also allows invalid utf-8 encoding in json parsing, like Spark.

select json_extract(c1, '$.c') from tbl;

Could you help verify the result of Presto + Velox? You can use the parquet file unpacked from test.tar.gz. Thanks!

mbasmanova commented 3 months ago

@PHILO-HE In general, I believe there are no guarantees in either Spark or Presto on behavior when input is invalid UTF-8. Hence, I doesn't think we need (or should) try to match such behavior. It might be better to simply say that input is expected to be valid UTF-8 and if not, there will an error or undefined behavior.

See https://prestodb.io/docs/current/functions/string.html#string-functions

CC: @FelixYBW @rui-mo

PHILO-HE commented 3 months ago

@mbasmanova, thanks for your comment! I got your point. Now that we know there is no guaranteed behavior for invalid utf-8 input, should we just let Simdjson skip the check? See code link. I think JSON parser's performance can be improved if we do that.

mbasmanova commented 3 months ago

JSON parser's performance can be improved if we do that.

I feel it is safer to keep the validation and fail loudly if it fails as opposed to returning some "strange" result and have the user guess what's going on. How much of a perf boost do you observe for valid use cases?

PHILO-HE commented 3 months ago

How much of a perf boost do you observe for valid use cases?

@mbasmanova, I just tested. But no evident perf. gain in my test cases after the validation is disabled. Maybe, only some certain cases (e.g., long json input) can be benefited. We have made some change in Gluten. Let's close this issue. Thanks!

mbasmanova commented 3 months ago

We have made some change in Gluten.

@PHILO-HE Curious, what is this change.

PHILO-HE commented 3 months ago

We have made some change in Gluten.

@PHILO-HE Curious, what is this change.

@mbasmanova, we have some users require the output of Gluten + Velox is consistent with Spark. So we just set SIMDJSON_SKIPUTF8VALIDATION=ON to build simdjson lib and then install it.