[C++][CSV] Allow missing columns at end of row

asfimport commented 3 years ago

Test scenario:

I read the same attached CSV file with pandas and with pyarrow to compare the results.

1. pandas reads it into a DataFrame without problems; the result is as follows:


import pandas as pd

df = pd.read_csv('test.csv', names=['col1', 'col2', 'col3', 'col4', 'col5','col6'])

>>> df
      col1   col2    col3  col4  col5  col6
0  20210317  julie   23434  test  data   1.0
1  20210316   adam  232423  test   NaN   NaN

2. With pyarrow csv, I get a parse error:


from pyarrow import csv
import pyarrow as pa

read_options = csv.ReadOptions(column_names=['col1', 'col2', 'col3', 'col4', 'col5', 'col6'])
# `fields` (the list of schema fields) is not defined in this snippet
convert_options = csv.ConvertOptions(column_types=pa.schema(fields))
table = csv.read_csv('test.csv', read_options=read_options, convert_options=convert_options)

ERROR:

Traceback (most recent call last):
 File ".../test_pyarr.py", line 71, in <module>
   table = csv.read_csv('test.csv',
 File "pyarrow/_csv.pyx", line 714, in pyarrow._csv.read_csv
 File "pyarrow/error.pxi", line 122, in pyarrow.lib.pyarrow_internal_check_status
 File "pyarrow/error.pxi", line 84, in pyarrow.lib.check_status
pyarrow.lib.ArrowInvalid: CSV parse error: Expected 6 columns, got 4

Is there a parameter that can be set to fill null values in case the column values are missing for the specified schema?

Reporter: Nithin Kumara Narayanaswamy Teekaramanaa

Related issues:

[C++] Configure a custom handler for rows with incorrect column counts (is related to)
Original Issue Attachments:
test.csv
PRs and other links:
GitHub Pull Request #10202

Note: This issue was originally created as ARROW-12001. Please see the migration documentation for further details.

asfimport commented 3 years ago

Antoine Pitrou / @pitrou: There isn't a parameter for this. It would probably be doable to add one, but would add non-trivial complexity to the CSV reader, so I'm rather reluctant. Which source is the data coming from?

asfimport commented 3 years ago

Nithin Kumara Narayanaswamy Teekaramanaa: Hi Antoine,

In our case the source is a snapshot of a db saved as csv.

asfimport commented 3 years ago

Antoine Pitrou / @pitrou: Do you use a built-in database function? Does it have options to customize the CSV format?

asfimport commented 3 years ago

Nithin Kumara Narayanaswamy Teekaramanaa: This is not possible, as the source CSV files come from another system. But in principle, when a schema is specified, wouldn't it make sense for the reader to fill in null values where data is missing?

asfimport commented 3 years ago

Antoine Pitrou / @pitrou: It may as well be an error in the system producing the CSV files. How do we know? Generally, it's not a good idea to let errors pass silently.

In any case, as I said, this would add complication in the core of the CSV reader, which is why it hasn't been done (yet?).
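
Since the reader has no such parameter, one possible workaround is to pad short rows before handing the data to pyarrow. The following is a minimal sketch, not part of pyarrow: `pad_short_rows` is a hypothetical helper, it uses Python's standard-library `csv` module (distinct from `pyarrow.csv`), and the sample rows are an assumption modelled on the pandas output in the report, since the contents of `test.csv` are not shown. It also returns the indices of the rows it padded, so that short rows are reported instead of passing silently.

```python
import csv
import io


def pad_short_rows(text, n_columns):
    """Pad rows with fewer than n_columns fields using empty strings.

    Returns the padded CSV text and the 0-based indices of the padded rows.
    Rows with more than n_columns fields raise ValueError; blank lines are
    dropped.
    """
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    padded = []
    for i, row in enumerate(csv.reader(io.StringIO(text, newline=""))):
        if not row:
            continue  # skip blank lines
        if len(row) > n_columns:
            raise ValueError(
                f"row {i}: expected {n_columns} columns, got {len(row)}"
            )
        if len(row) < n_columns:
            padded.append(i)
            row = row + [""] * (n_columns - len(row))
        writer.writerow(row)
    return out.getvalue(), padded


# Assumed sample data: the second row is missing its last two columns.
raw = "20210317,julie,23434,test,data,1\n20210316,adam,232423,test\n"
fixed, padded_rows = pad_short_rows(raw, 6)
print(fixed)
print(padded_rows)
```

The padded text could then be passed to pyarrow as a file-like object, for example `io.BytesIO(fixed.encode())`, in place of the file name. Note that, as far as I know, pyarrow converts empty values to null for non-string columns by default, while string columns keep them as empty strings unless `strings_can_be_null=True` is set in `ConvertOptions`.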


[C++][CSV] Allow missing columns at end of row #27832
