facebookincubator / velox

A composable and fully extensible C++ execution engine library for data management systems.
https://velox-lib.io/
Apache License 2.0
3.51k stars 1.15k forks source link

json_extract, json_extract_scalar, and json_size should support Jayway style JSON path syntax #7049

Open kevinwilfong opened 1 year ago

kevinwilfong commented 1 year ago

Description

The functions json_extract, json_extract_scalar, and json_size in Presto use the Jayway library to parse JSON paths and handle extraction https://github.com/json-path/JsonPath

Velox's version of these functions uses a JSON path tokenizer based on Presto's JsonPathTokenizer. JsonPathTokenizer supports a much simpler syntax, but is likely faster, Presto uses it if possible and falls back to Jayway's if it can parse the path.

Jayway's parser is quite extensive, supporting various operators, aggregates, regexes, filters.

It would be great if the Velox JSON path parser could support Jayway's syntax.

https://github.com/facebookincubator/velox/blob/e2ee0cad24d5407146d2da08b68c6701ee86e9da/velox/functions/prestosql/json/JsonPathTokenizer.h

mbasmanova commented 1 year ago

CC: @amitkdutta @zacw7 @aditi-pandit

mbasmanova commented 1 year ago

Some references:

kevinwilfong commented 1 year ago

There was an issue opened in simdjson recently for supporting JSONPath, this would get us most of the way there, I think Jayway just extended this with aggregates that can be tacked on the end of the path.

https://github.com/simdjson/simdjson/issues/2070

lemire commented 1 year ago

We will prioritize this feature in simdjson.

lemire commented 11 months ago

Update: This is more ambitious a proposal than I considered at first.

mbasmanova commented 6 months ago

In our production workload, we are seeing the following JSONPaths that are currently not supported.

  1. "foo" - Jayway allows paths that do not start with a '$'. For these paths, it simply prepends the path with '$.' before compiling.
  2. "$.[0].foo" - While JSONPath allows either dot-notation or bracket-notation, Jayway allows a mix.
  3. "$..name" - deep scan.

I believe we can support (1) and (2) with simple changes to JsonPathTokenizer and support (3) using something like "Tree Walking and JSON Element Types" in https://github.com/simdjson/simdjson/blob/master/doc/basics.md#json-path

See https://github.com/prestodb/presto/issues/22589 for additional context.

mbasmanova commented 6 months ago

@rui-mo @PHILO-HE Folks, do you know if Spark also uses JayWay to implement json-extract functions?

PHILO-HE commented 6 months ago

@rui-mo @PHILO-HE Folks, do you know if Spark also uses JayWay to implement json-extract functions?

Hi @mbasmanova, Spark's implementation is based on Jackson. Here are some findings.

  1. "foo" - Jayway allows paths that do not start with a '$'. For these paths, it simply prepends the path with '$.' before compiling.

Spark requires json path starts with "$".

  1. "$.[0].foo" - While JSONPath allows either dot-notation or bracket-notation, Jayway allows a mix.

Spark allows a mix also.

mbasmanova commented 6 months ago

@PHILO-HE Thank you for clarifying.