apache / arrow


CSV reader cannot parse dates or times #41488

Open davlee1972 opened 3 months ago

davlee1972 commented 3 months ago

Describe the bug, including details regarding any error messages, version, and platform.

I thought I would gather the related issues reporting the same problem below.

Right now, if you have any dates or times in a CSV file, the CSV reader will only successfully parse them if they are in an ISO format like YYYY-MM-DD or HH:MM:SS. Alternative formats like DD-MM-YYYY, MM/DD/YYYY, or H:MM:SS will simply fail. There is a timestamp_parsers option in csv.ConvertOptions, but those formats are only applied to timestamp[x] columns.

The error below is common because only YYYY-MM-DD formatted strings are convertible to dates:

    CSV conversion error to date32[day]: invalid value '01-20-2000'

Adding "%m-%d-%Y %H:%M:%S" or "%m-%d-%Y" to timestamp_parsers doesn't do anything, since this is a date32[day] column.
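
For reference, a minimal reproduction sketch of the date32 failure (the column name trade_date and the inline CSV data are made up for illustration):

    import io

    import pyarrow as pa
    import pyarrow.csv as csv

    # made-up sample: one date column in MM-DD-YYYY format
    data = io.BytesIO(b"trade_date\n01-20-2000\n")

    try:
        csv.read_csv(
            data,
            convert_options=csv.ConvertOptions(
                column_types={"trade_date": pa.date32()},
                # only consulted for timestamp columns, so no effect on date32
                timestamp_parsers=["%m-%d-%Y"],
            ),
        )
    except pa.ArrowInvalid as err:
        # per the report above:
        # CSV conversion error to date32[day]: invalid value '01-20-2000'
        print(err)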

https://github.com/apache/arrow/issues/26224
https://github.com/apache/arrow/issues/28303
https://github.com/apache/arrow/issues/33357
https://github.com/apache/arrow/issues/37180

I think the best solution is to add date_parsers and time_parsers options to csv.ConvertOptions.
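
Purely hypothetical, since neither option exists in pyarrow today, this is roughly what the proposal would look like in use:

    import pyarrow as pa
    import pyarrow.csv as csv

    # hypothetical usage -- date_parsers and time_parsers do NOT exist yet;
    # this only sketches what the proposed options could look like
    convert_options = csv.ConvertOptions(
        column_types={"trade_date": pa.date32(), "trade_time": pa.time32("s")},
        timestamp_parsers=["%m-%d-%Y %H:%M:%S"],
        date_parsers=["%m-%d-%Y", "%d-%m-%Y", "%m/%d/%Y"],  # proposed
        time_parsers=["%H:%M:%S"],                          # proposed
    )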

I have a current hack I implemented to be able to parse dates out of CSV files:

1. If the dataset schema being used to read a CSV file has any column data types that start with "date", change the schema and replace those date columns with a timestamp type.
2. Include the alternative date formats in timestamp_parsers, e.g. "%d-%m-%Y", "%m/%d/%Y", etc.
3. Read the CSV file, which will read the date string values in as timestamps.
4. Convert the timestamp[s] columns of the result back to date32/64[day] using pyarrow.compute.cast().

Code to swap out date columns for timestamp columns in a schema for the dataset API:

        # assumes `import pyarrow as pa`; self.arrow_schema is the original
        # dataset schema that still contains the date32/date64 columns
        new_fields = []
        for name in self.arrow_schema.names:
            field = self.arrow_schema.field(name)
            if str(field.type).startswith("date"):
                # swap date32/date64 for timestamp[s] so timestamp_parsers apply
                new_fields.append(pa.field(name, pa.timestamp("s")))
            else:
                new_fields.append(field)
        new_schema = pa.schema(new_fields)
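
A rough sketch of how the swapped schema and the extra formats might be wired into the dataset API (the directory path and format strings are assumptions for illustration; new_schema is the schema built in the snippet above):

    import pyarrow.csv as csv
    import pyarrow.dataset as ds

    # CsvFileFormat takes the same ConvertOptions as the plain CSV reader
    csv_format = ds.CsvFileFormat(
        convert_options=csv.ConvertOptions(
            timestamp_parsers=["%d-%m-%Y", "%m/%d/%Y", "%m-%d-%Y"]
        )
    )
    dataset = ds.dataset("csv_dir/", format=csv_format, schema=new_schema)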

Expression code to cast the timestamp columns back to dates when reading CSV files using dataset.to_table():

        # convert a plain column list into a dict of column name -> expression
        if isinstance(columns, list):
            columns = {column: dataset.field(column) for column in columns}

        # cast the timestamp expressions back to the date32/date64 types
        # declared in the original schema definition
        columns = {
            column: (
                dataset.field(column).cast(self.arrow_schema.field(column).type)
                if column in self.arrow_schema.names
                and str(self.arrow_schema.field(column).type).startswith("date")
                else expr
            )
            for column, expr in columns.items()
        }
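
Putting the steps together outside of the dataset API, a minimal end-to-end sketch of the same round trip with the plain CSV reader (the column name and inline data are made up):

    import io

    import pyarrow as pa
    import pyarrow.compute as pc
    import pyarrow.csv as csv

    data = io.BytesIO(b"trade_date\n01-20-2000\n02-15-2001\n")

    # read the date column as timestamp[s] so timestamp_parsers is honored
    table = csv.read_csv(
        data,
        convert_options=csv.ConvertOptions(
            column_types={"trade_date": pa.timestamp("s")},
            timestamp_parsers=["%m-%d-%Y"],
        ),
    )

    # cast the parsed timestamps back to date32
    idx = table.schema.get_field_index("trade_date")
    table = table.set_column(idx, "trade_date", pc.cast(table["trade_date"], pa.date32()))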

Component(s)

C++, Python

davlee1972 commented 3 months ago

This issue also extends to filtering. Using something like pc.field("abc") < datetime.now() won't work if the values of field "abc" can't be cast to date32[day], but the same expression filter works if you change the data type of "abc" from a date to a timestamp in the schema.
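
For illustration, a small sketch of such a filter expression (the column name "abc" follows the comment above; applying it assumes the CSV-backed dataset with the swapped schema from the earlier snippets):

    from datetime import datetime
    import pyarrow.dataset as ds

    # "abc" is assumed to be declared as timestamp[s] via the swapped schema
    filter_expr = ds.field("abc") < datetime.now()

    # applied when materializing the dataset built in the earlier sketch, e.g.:
    # table = dataset.to_table(filter=filter_expr)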

davlee1972 commented 3 months ago

time32[x] types also cannot be parsed.

I had to add a hack for this as well:

1. Change any time32[x] types to timestamp[s] in the schema.
2. Add any custom formats to timestamp_parsers; in my case I added "%H:%M:%S".
3. Read in the time column as a timestamp.
4. Cast it back to time32[x] using pyarrow.compute.cast().

In my use case I'm seeing values like "7:55:00" and "7:59:31" in my CSV files. These just won't parse into time32[s] types because they are not zero-padded like "07:55:00" and "07:59:31", and there is no alternative format that can be injected into the CSV reader.

If I change the columns in the schema to timestamp[s] with a "%H:%M:%S" format in timestamp_parsers, the CSV file is parsed without errors, but the values come out as 1900-01-01 07:55:00 and 1900-01-01 07:59:31.
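
A minimal sketch of that read step (the inline CSV data mirrors the values described above and is made up for illustration):

    import io

    import pyarrow as pa
    import pyarrow.csv as csv

    data = io.BytesIO(b"ISSUEID,EFFECTIVETIME\n,7:55:00\n19398776,7:59:31\n")

    # read the time column as timestamp[s]; per the behavior described above,
    # the parsed values come back with a 1900-01-01 date attached
    df = csv.read_csv(
        data,
        convert_options=csv.ConvertOptions(
            column_types={"EFFECTIVETIME": pa.timestamp("s")},
            timestamp_parsers=["%H:%M:%S"],
        ),
    )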

I then have to cast them back to time32[s]:

    >>> df
    pyarrow.Table
    ISSUEID: float
    EFFECTIVETIME: timestamp[s]
    ----
    ISSUEID: [[null,19398776]]
    EFFECTIVETIME: [[1900-01-01 07:55:00,1900-01-01 07:59:31]]
    >>> pc.cast(df['EFFECTIVETIME'], "time32[s]")
    <pyarrow.lib.ChunkedArray object at 0x0000022FCCF34040>
    [
      [
        07:55:00,
        07:59:31
      ]
    ]