apache / arrow

Apache Arrow is a multi-language toolbox for accelerated data interchange and in-memory processing
https://arrow.apache.org/
Apache License 2.0

[C++] can we use simdjson to replace rapidjson #35460

Open wanweiqiangintel opened 1 year ago

wanweiqiangintel commented 1 year ago

Describe the enhancement requested

According to the performance results published by the simdjson community, simdjson uses three-quarters fewer instructions than the state-of-the-art parser RapidJSON, and its throughput is much higher than RapidJSON's.

So can we replace RapidJSON with simdjson to implement the JSON parser?

Component(s)

C++

kou commented 1 year ago

If simdjson is faster than RapidJSON for our use case too, I'm OK with this.

Could you try this and share our benchmark result? https://github.com/apache/arrow/blob/main/cpp/src/arrow/json/parser_benchmark.cc

lemire commented 1 year ago

We are available to help.

Note that simdjson is used by Apache Doris and ClickHouse.

kou commented 1 year ago

Great!

mapleFU commented 1 year ago

It seems the writer could keep its original logic, while the parser makes full use of simdjson?

kou commented 1 year ago

Is there any merit in using both RapidJSON and simdjson?

I think that using only one of them, either RapidJSON or simdjson, will reduce our maintenance cost.

pitrou commented 1 year ago

Agreed with @kou, we probably want to avoid depending on two different JSON libraries.

Interested people should try working on a PR.

pitrou commented 1 year ago

I'm skeptical switching to simdjson would improve performance a lot, btw. Parsing is only a small part of the work necessary to convert JSON to Arrow.
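
To make the reasoning above concrete, here is a minimal sketch of Amdahl's law applied to this case. It is not taken from Arrow or simdjson; the function name `OverallSpeedup` and both parameters are hypothetical, and the fraction of time spent parsing is an assumption that would have to be measured with the benchmark linked earlier in this thread.

```cpp
#include <cassert>

// Amdahl's law: overall speedup when only part of the runtime gets faster.
//   parse_fraction: assumed share of total JSON-to-Arrow time spent in the
//                   JSON parser itself, in [0, 1] (hypothetical input).
//   parser_speedup: assumed factor by which the new parser is faster, > 0
//                   (hypothetical input).
double OverallSpeedup(double parse_fraction, double parser_speedup) {
  assert(parse_fraction >= 0.0 && parse_fraction <= 1.0);
  assert(parser_speedup > 0.0);
  // The non-parsing share (1 - parse_fraction) is unchanged; only the
  // parsing share shrinks by parser_speedup.
  return 1.0 / ((1.0 - parse_fraction) + parse_fraction / parser_speedup);
}
```

For illustration only: if parsing were 20% of the total time and the parser became 4x faster, `OverallSpeedup(0.2, 4.0)` evaluates to 1 / 0.85, roughly a 1.18x end-to-end gain. The real fraction for Arrow's JSON reader is not stated in this thread.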