apache / arrow

Apache Arrow is the universal columnar format and multi-language toolbox for fast data interchange and in-memory analytics
https://arrow.apache.org/
Apache License 2.0

[C++] Configurable read-ahead in CSV and JSON readers #28381

Open asfimport opened 3 years ago

asfimport commented 3 years ago

We are compiling Arrow C++ to WebAssembly and ran into the following issue with the CSV reader:

Browsers became very picky about the use of SharedArrayBuffers after the events around Spectre and Meltdown.

As a result, you have to compile Arrow to WebAssembly without threads if you don't want to run your website with very strict cross-origin isolation.

Unfortunately, the CSV reader seems to always spawn a thread for the read-ahead in both the SerialStreamingReader and the SerialTableReader, regardless of whether use_threads is set.

Right now, this effectively means that you cannot use the CSV (and JSON) readers in threadless WebAssembly builds.

https://github.com/apache/arrow/blob/4363fefe46dc357a9013f0f4bcdc235e1e2e8124/cpp/src/arrow/csv/reader.cc#L839

https://github.com/apache/arrow/blob/4363fefe46dc357a9013f0f4bcdc235e1e2e8124/cpp/src/arrow/csv/reader.cc#L913

Reporter: Andre Kohn

Note: This issue was originally created as ARROW-12629. Please see the migration documentation for further details.

asfimport commented 2 years ago

Antoine Pitrou / @pitrou: In both cases (CSV and JSON) this can probably be added to ReadOptions.

asfimport commented 2 years ago

Supun Kamburugamuva: What would be a good option name for this? One option would be read_ahead. But if we introduce this, do we need to change all the readers?

Another option would be not to read ahead if use_threads = false. But that option is specifically for CPU threads.

asfimport commented 2 years ago

Antoine Pitrou / @pitrou: use_readahead = true would sound good to me.
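
To make the proposal concrete, here is a minimal standalone sketch of how such a flag could gate the read-ahead. It does not use Arrow: the ReadOptions struct, BlockSource, and ReadAllBlocks below are hypothetical stand-ins invented for illustration, and only the flag names use_threads and use_readahead come from this thread. With use_readahead = false no thread is ever spawned, which is what a threadless WebAssembly build would need.

```cpp
#include <cstddef>
#include <future>
#include <optional>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for the reader options; not Arrow's actual struct.
struct ReadOptions {
  bool use_threads = true;    // existing flag discussed above: CPU parallelism
  bool use_readahead = true;  // proposed flag: fetch the next block in the background
};

// Hypothetical block source that hands out fixed-size chunks of an in-memory buffer.
class BlockSource {
 public:
  BlockSource(std::string data, std::size_t block_size)
      : data_(std::move(data)), block_size_(block_size) {}

  // Returns the next block, or std::nullopt once the input is exhausted.
  std::optional<std::string> Next() {
    if (pos_ >= data_.size()) return std::nullopt;
    std::string block = data_.substr(pos_, block_size_);
    pos_ += block.size();
    return block;
  }

 private:
  std::string data_;
  std::size_t block_size_;
  std::size_t pos_ = 0;
};

// Collects all blocks. Without read-ahead everything runs on the calling
// thread; with read-ahead the next block is fetched on another thread while
// the current one is being handled.
std::vector<std::string> ReadAllBlocks(BlockSource& source, const ReadOptions& options) {
  std::vector<std::string> blocks;
  if (!options.use_readahead) {
    // Serial path: no thread is created at any point.
    while (auto block = source.Next()) blocks.push_back(std::move(*block));
    return blocks;
  }
  auto fetch = [&source] { return source.Next(); };
  auto pending = std::async(std::launch::async, fetch);
  while (true) {
    std::optional<std::string> block = pending.get();
    if (!block) break;
    // Start fetching the next block before handling the current one.
    pending = std::async(std::launch::async, fetch);
    blocks.push_back(std::move(*block));
  }
  return blocks;
}
```

Both paths return the same blocks in the same order; the flag only changes whether a background thread is involved. Keeping it separate from use_threads matches the concern raised above that use_threads is specifically about CPU threads.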

asfimport commented 2 years ago

Todd Farmer / @toddfarmer: This issue was last updated over 90 days ago, which may be an indication it is no longer being actively worked. To better reflect the current state, the issue is being unassigned. Please feel free to re-take assignment of the issue if it is being actively worked, or if you plan to start that work soon.