apache / arrow

Apache Arrow is the universal columnar format and multi-language toolbox for fast data interchange and in-memory analytics
https://arrow.apache.org/
Apache License 2.0

[C++][CSV] Allow missing columns at end of row #27832

Open asfimport opened 3 years ago

asfimport commented 3 years ago

Test scenario:

I read the same attached CSV file with pandas and with pyarrow to compare the results.

  1. pandas reads it into a DataFrame without problems, and the result is as follows:

    
    import pandas as pd
    
    df = pd.read_csv('test.csv', names=['col1', 'col2', 'col3', 'col4', 'col5','col6'])
    
    >>> df
          col1   col2    col3  col4  col5  col6
    0  20210317  julie   23434  test  data   1.0
    1  20210316   adam  232423  test   NaN   NaN

  2. With pyarrow csv, I get a parse error:

    
    from pyarrow import csv
    import pyarrow as pa
    
    # Note: `fields` (the list of fields used to build the schema) is not defined in this report.
    read_options = csv.ReadOptions(column_names=['col1', 'col2', 'col3', 'col4', 'col5', 'col6'])
    convert_options = csv.ConvertOptions(column_types=pa.schema(fields))
    table = csv.read_csv('test.csv', read_options=read_options, convert_options=convert_options)
    
    ERROR:
    
    Traceback (most recent call last):
     File ".../test_pyarr.py", line 71, in <module>
       table = csv.read_csv('test.csv',
     File "pyarrow/_csv.pyx", line 714, in pyarrow._csv.read_csv
     File "pyarrow/error.pxi", line 122, in pyarrow.lib.pyarrow_internal_check_status
     File "pyarrow/error.pxi", line 84, in pyarrow.lib.check_status
    pyarrow.lib.ArrowInvalid: CSV parse error: Expected 6 columns, got 4

Is there a parameter that can be set to fill null values in case the column values are missing for the specified schema?

Reporter: Nithin Kumara Narayanaswamy Teekaramanaa

Note: This issue was originally created as ARROW-12001. Please see the migration documentation for further details.

asfimport commented 3 years ago

Antoine Pitrou / @pitrou: There isn't a parameter for this. It would probably be doable to add one, but would add non-trivial complexity to the CSV reader, so I'm rather reluctant. Which source is the data coming from?
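
Since no such parameter exists, one possible workaround is to pad short rows before the data reaches the Arrow reader. The following is only a minimal sketch using Python's standard `csv` module. The helper name `pad_short_rows` is hypothetical, and the sample text is an assumption reconstructed from the pandas output above, since the attached file itself is not shown.

```python
import csv as std_csv  # aliased to avoid clashing with `from pyarrow import csv`
import io


def pad_short_rows(text, n_columns, fill=""):
    """Return CSV text where rows with fewer than n_columns fields are padded with `fill`.

    Hypothetical helper for illustration; it is not part of pyarrow.
    """
    out = io.StringIO()
    writer = std_csv.writer(out, lineterminator="\n")
    for row in std_csv.reader(io.StringIO(text)):
        if len(row) > n_columns:
            # Rows that are too long are still treated as an error.
            raise ValueError(f"Expected at most {n_columns} columns, got {len(row)}")
        writer.writerow(row + [fill] * (n_columns - len(row)))
    return out.getvalue()


# Assumed file content, reconstructed from the pandas output in the report.
raw = "20210317,julie,23434,test,data,1\n20210316,adam,232423,test\n"
padded = pad_short_rows(raw, 6)
print(padded)
```

The padded text could then be handed to the reader as a file-like object, for example `csv.read_csv(io.BytesIO(padded.encode()), read_options=read_options)`. Whether the empty fields end up as nulls depends on the column types and on the `ConvertOptions` in use. This approach reads the whole file through Python, so it gives up much of the speed of the Arrow reader.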

asfimport commented 3 years ago

Nithin Kumara Narayanaswamy Teekaramanaa: Hi Antoine,

In our case the source is a snapshot of a database saved as CSV.

asfimport commented 3 years ago

Antoine Pitrou / @pitrou: Do you use a built-in database function? Does it have options to customize the CSV format?

asfimport commented 3 years ago

Nithin Kumara Narayanaswamy Teekaramanaa: This is not possible, as the source CSV files come from another system. But in principle, would it not make sense for the reader to write null values in place of missing data, provided the schema is specified?

asfimport commented 3 years ago

Antoine Pitrou / @pitrou: It may as well be an error in the system producing the CSV files. How do we know? Generally, it's not a good idea to let errors pass silently.

In any case, as I said, this would add complication in the core of the CSV reader, which is why it hasn't been done (yet?).
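
One way to keep such errors visible, rather than padding silently, is a separate checking pass that reports which rows are short before deciding how to handle them. This is a rough sketch with the standard `csv` module. The name `find_short_rows` is hypothetical, and the sample text is again an assumption reconstructed from the pandas output in the report.

```python
import csv as std_csv  # aliased to avoid clashing with `from pyarrow import csv`
import io


def find_short_rows(text, n_columns):
    """Return (line_number, field_count) for every non-empty row
    whose field count differs from n_columns.

    Hypothetical helper for illustration; it is not part of pyarrow.
    """
    problems = []
    reader = std_csv.reader(io.StringIO(text))
    for row in reader:
        # Blank lines yield an empty list and are ignored here.
        if row and len(row) != n_columns:
            problems.append((reader.line_num, len(row)))
    return problems


# Assumed file content, reconstructed from the pandas output in the report.
sample = "20210317,julie,23434,test,data,1\n20210316,adam,232423,test\n"
bad_rows = find_short_rows(sample, 6)
print(bad_rows)
```

A caller can then log or reject the reported rows explicitly, so a malformed export from the upstream system does not go unnoticed.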