Closed PHILO-HE closed 3 months ago
Spark disregards illegal UTF-8 encoding in JSON string.
@PHILO-HE Would you clarify what that means exactly? Does Spark allow invalid UTF-8 encoding? Do you have an example? We can try the same example in Presto.
Spark disregards illegal UTF-8 encoding in json string.
@PHILO-HE Would you clarify what that means exactly? Does Spark allow invalid UTF-8 encoding? Do you have an example? We can try the same example in Presto.
@mbasmanova, yes, Spark allows invalid UTF-8 encoding. If input contains invalid utf-8 encoded data, NULL is returned when Gluten + Velox is used. Only if I set SIMDJSON_SKIPUTF8VALIDATION=ON
for simdjson build, the result is consistent with Spark.
I tried testing Presto with the same parquet file containing invalid utf-8 encoding. Looks Presto also allows invalid utf-8 encoding in json parsing, like Spark.
select json_extract(c1, '$.c') from tbl;
Could you help verify the result of Presto + Velox? You can use the parquet file unpacked from test.tar.gz. Thanks!
@PHILO-HE In general, I believe there are no guarantees in either Spark or Presto on behavior when input is invalid UTF-8. Hence, I doesn't think we need (or should) try to match such behavior. It might be better to simply say that input is expected to be valid UTF-8 and if not, there will an error or undefined behavior.
See https://prestodb.io/docs/current/functions/string.html#string-functions
CC: @FelixYBW @rui-mo
@mbasmanova, thanks for your comment! I got your point. Now that we know there is no guaranteed behavior for invalid utf-8 input, should we just let Simdjson skip the check? See code link. I think JSON parser's performance can be improved if we do that.
JSON parser's performance can be improved if we do that.
I feel it is safer to keep the validation and fail loudly if it fails as opposed to returning some "strange" result and have the user guess what's going on. How much of a perf boost do you observe for valid use cases?
How much of a perf boost do you observe for valid use cases?
@mbasmanova, I just tested. But no evident perf. gain in my test cases after the validation is disabled. Maybe, only some certain cases (e.g., long json input) can be benefited. We have made some change in Gluten. Let's close this issue. Thanks!
We have made some change in Gluten.
@PHILO-HE Curious, what is this change.
We have made some change in Gluten.
@PHILO-HE Curious, what is this change.
@mbasmanova, we have some users require the output of Gluten + Velox is consistent with Spark. So we just set SIMDJSON_SKIPUTF8VALIDATION=ON to build simdjson lib and then install it.
Description
Simdjson has a build option called
SIMDJSON_SKIPUTF8VALIDATION
to control whether to check UTF-8 encoding validity for JSON input. It is OFF by default to not allow illegal UTF-8 encoding. But we recently found Spark disregards illegal UTF-8 encoding in JSON string. If Presto behaves same as Spark on this, we can setSIMDJSON_SKIPUTF8VALIDATION=ON
at build time, then Simdjson will skip this check at runtime.@mbasmanova