[C++] csv::TableReader column names, Read() arguments

apache / arrow

Apache Arrow is the universal columnar format and multi-language toolbox for fast data interchange and in-memory analytics

https://arrow.apache.org/

Apache License 2.0


[C++] csv::TableReader column names, Read() arguments #26221

Open asfimport opened 4 years ago

asfimport commented 4 years ago

Some feature requests:

1. A csv::TableReader column_names method, and/or a schema method. This will (in most cases) require IO to get these from the file, but that's fine. There are use cases (we've seen them in R) where it would help to be able to get the names from the file. For example, column types are specified as a map of column name to type, so you currently can't specify types without also specifying names.
2. Add Read(std::vector), as Feather (and Parquet?) have, so that you don't have to parse and allocate columns you don't want.

cc @pitrou @romainfrancois

Reporter: Neal Richardson / @nealrichardson

Note: This issue was originally created as ARROW-10219. Please see the migration documentation for further details.

asfimport commented 4 years ago

Antoine Pitrou / @pitrou: I'm not sure I understand #1, can you explain a bit more? As for #2, by giving ConvertOptions::include_columns you can already restrict which columns you want to convert.

asfimport commented 4 years ago

Neal Richardson / @nealrichardson: I didn't know about include_columns, thanks.

Here are two use cases for being able to get the column names without reading the whole table:

1. R's various CSV readers all let you specify column types as an unnamed vector of types; column names can also be specified, but via a different argument. The Arrow CSV reader currently can't do this: you can't specify column types while allowing the column names to be read from the file. So in this case, I'd like to be able to instantiate a TableReader with the other given options, query it to get the column names, and then use those to create the fully specified TableReader to call Read on.
2. Some of R's CSV readers let you specify columns to keep in (or exclude from) the resulting data frame, either by integer indices or by some expression (e.g. starts_with("something")). In order to pass those to ConvertOptions::include_columns, I need to get the column names from the CSV so that I can translate those.
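
To make the two use cases concrete, here is a minimal standard-library sketch of the translation step, assuming the column names can be obtained by reading only the header line. Nothing here is Arrow API: ReadHeaderNames, ZipTypes, SelectByIndex and SelectByPrefix are hypothetical helper names, type names are plain placeholder strings rather than Arrow DataType objects, and quoting or escaping in the header is ignored.

```cpp
#include <cstddef>
#include <istream>
#include <sstream>
#include <stdexcept>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical helper: read only the first line of a CSV stream and split it
// on the delimiter. Quoted or escaped fields are deliberately not handled.
std::vector<std::string> ReadHeaderNames(std::istream& in, char delim = ',') {
  std::string line;
  if (!std::getline(in, line)) {
    throw std::runtime_error("empty CSV input");
  }
  if (!line.empty() && line.back() == '\r') line.pop_back();  // tolerate CRLF
  std::vector<std::string> names;
  std::string field;
  std::istringstream fields(line);
  while (std::getline(fields, field, delim)) names.push_back(field);
  return names;
}

// Use case 1: pair an unnamed, positional list of types with the header
// names, giving the name -> type map that a column-types option would need.
// Types are placeholder strings here (an assumption for illustration).
std::unordered_map<std::string, std::string> ZipTypes(
    const std::vector<std::string>& names,
    const std::vector<std::string>& types) {
  if (names.size() != types.size()) {
    throw std::invalid_argument("number of types must match number of columns");
  }
  std::unordered_map<std::string, std::string> out;
  for (std::size_t i = 0; i < names.size(); ++i) out[names[i]] = types[i];
  return out;
}

// Use case 2a: translate 0-based integer indices into column names.
// Throws std::out_of_range for an index past the last column.
std::vector<std::string> SelectByIndex(const std::vector<std::string>& names,
                                       const std::vector<std::size_t>& indices) {
  std::vector<std::string> out;
  for (std::size_t i : indices) out.push_back(names.at(i));
  return out;
}

// Use case 2b: a starts_with-style selection, keeping file order.
std::vector<std::string> SelectByPrefix(const std::vector<std::string>& names,
                                        const std::string& prefix) {
  std::vector<std::string> out;
  for (const auto& name : names) {
    if (name.compare(0, prefix.size(), prefix) == 0) out.push_back(name);
  }
  return out;
}
```

The vectors returned by SelectByIndex or SelectByPrefix are the kind of name list one would then hand to ConvertOptions::include_columns. With a column_names method on the reader, the ReadHeaderNames step would not need to be reimplemented outside the reader.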

asfimport commented 3 years ago

Antoine Pitrou / @pitrou: cc @westonpace

asfimport commented 3 years ago

Weston Pace / @westonpace: It would probably be column_names and not schema. The table reader can do late inference so it may not know the final schema until the final table is read. But column_names should be pretty straightforward to add.
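
As an illustration of why the names are available early while the schema is not, here is a simplified standard-library sketch of per-column type inference. It is not Arrow's actual inference logic: Kind, KindOfCell, Widen and InferKinds are hypothetical names, and only three kinds (integer, floating point, string) are modelled. The point is just that a late row can still change a column's inferred type, whereas the header is fixed after the first line.

```cpp
#include <cstddef>
#include <cstdlib>
#include <string>
#include <vector>

// Ordered from narrowest to widest, so widening is just taking the maximum.
enum class Kind { kInt64, kDouble, kString };

// Narrowest kind that can represent a single cell (simplified rules).
Kind KindOfCell(const std::string& cell) {
  if (cell.empty()) return Kind::kString;
  char* end = nullptr;
  std::strtoll(cell.c_str(), &end, 10);
  if (*end == '\0') return Kind::kInt64;   // the whole cell parsed as integer
  std::strtod(cell.c_str(), &end);
  if (*end == '\0') return Kind::kDouble;  // the whole cell parsed as double
  return Kind::kString;
}

// Widen a running column kind with the kind of one more cell.
Kind Widen(Kind a, Kind b) { return a < b ? b : a; }

// Infer one kind per column from all rows seen so far.
std::vector<Kind> InferKinds(const std::vector<std::vector<std::string>>& rows,
                             std::size_t num_columns) {
  std::vector<Kind> kinds(num_columns, Kind::kInt64);
  for (const auto& row : rows) {
    for (std::size_t i = 0; i < num_columns && i < row.size(); ++i) {
      kinds[i] = Widen(kinds[i], KindOfCell(row[i]));
    }
  }
  return kinds;
}
```

If the first rows of a column hold only integers and a later row holds a value such as "n/a", the result for that column changes from kInt64 to kString once the later row has been seen. Under this simplified model, any schema reported before the last row would be provisional.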

asfimport commented 2 years ago

Todd Farmer / @toddfarmer: This issue was last updated over 90 days ago, which may be an indication it is no longer being actively worked. To better reflect the current state, the issue is being unassigned. Please feel free to re-take assignment of the issue if it is being actively worked, or if you plan to start that work soon.