Consider making `frictionless extract` a normalization command (e.g. for dates)

jze commented 2 years ago

When exporting extracted data to CSV, fields containing dates are not converted to ISO 8601.

The command

frictionless extract https://opendata.schleswig-holstein.de/dataset/12fb2027-d2d3-42c9-8774-34a70f584c0f/resource/602e974a-bed0-4ffc-b8a1-0f4744b23917/download/windkraftanlagen-2022-07-01.json

shows the correct dates in ISO 8601 format. However, when I export the data to CSV

frictionless extract --csv https://opendata.schleswig-holstein.de/dataset/12fb2027-d2d3-42c9-8774-34a70f584c0f/resource/602e974a-bed0-4ffc-b8a1-0f4744b23917/download/windkraftanlagen-2022-07-01.json

the unconverted dates are returned.

version 5.0.0b9

shashigharti commented 1 year ago

Hi @roll

The date cell value is parsed into a datetime object, and its default string representation is ISO format: https://github.com/python/cpython/blob/3.8/Lib/datetime.py#L976 https://github.com/frictionlessdata/framework/blob/main/frictionless/fields/date.py#L47

In the extract function we use `to_list` to do the field mapping, so it works fine for csv, json and yaml. For the default format, however, `supported_types = None` when `row.to_list()` is called here: https://github.com/frictionlessdata/framework/blob/main/frictionless/helpers.py#L84

As a result, no type conversion takes place, and a datetime object is returned whose str representation is ISO format (as mentioned above): https://github.com/frictionlessdata/framework/blob/main/frictionless/table/row.py#L240

Solution: for the default format, passing `types=[]` will trigger the format check for the fields here: https://github.com/frictionlessdata/framework/blob/main/frictionless/helpers.py#L84

data.append([cell if cell is not None else "" for cell in row.to_list(types=[])])

but I am not sure if that is the right solution so wanted your feedback before making changes. Thanks!
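To make the behaviour described above easier to follow, here is a minimal standalone sketch using only the standard library. `render_cell` is a hypothetical stand-in for the per-cell logic behind `row.to_list(types=...)`, not the framework's actual code, and the field format `%d.%m.%Y` is assumed for illustration.

```python
from datetime import date

# Hypothetical helper for illustration only; this is NOT the frictionless API.
def render_cell(value, field_type, field_format, types=None):
    """Keep the native value unless `types` is given and lacks this field type."""
    if types is None:
        # Mirrors the "supported_types = None" case: no conversion at all.
        return value
    if field_type in types:
        # The target format supports this type natively, so keep the object.
        return value
    if isinstance(value, date):
        # Otherwise write the value as text using the field's own format.
        return value.strftime(field_format)
    return str(value)

cell = date(2018, 1, 1)

# Default output: the date object is kept, and printing it uses date.__str__.
untouched = render_cell(cell, "date", "%d.%m.%Y")

# With types=[] nothing is supported natively, so the field format is applied.
as_text = render_cell(cell, "date", "%d.%m.%Y", types=[])

print(str(untouched))  # 2018-01-01
print(as_text)         # 01.01.2018
```

Under these assumptions, the same cell prints as ISO 8601 when no conversion happens and as the schema's format once conversion is triggered, which would match the difference between the two commands in the report.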

roll commented 1 year ago

Hi @jze,

The difference between the two examples might be due to a typo in the schema (https://opendatarepo.lsh.uni-kiel.de/schema/windkraftanlagen.schema.json): the format is %d.%m.%y instead of %d.%m.%Y

In general, the framework outputs data compatible with the provided schema. So it prints e.g. 01.01.2018 because it's the format of this field.
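A short sketch of why the two format strings differ, using only Python's `datetime` and the sample value 01.01.2018 quoted above. It illustrates the directives themselves; how the framework reacts to a value that does not match the schema format is not shown here.

```python
from datetime import datetime

raw = "01.01.2018"

# %Y expects a four-digit year, so the whole string is consumed.
parsed = datetime.strptime(raw, "%d.%m.%Y").date()

# %y expects a two-digit year: it reads "20" and leaves "18" unparsed,
# which makes strptime raise ValueError.
try:
    datetime.strptime(raw, "%d.%m.%y")
    short_year_ok = True
except ValueError:
    short_year_ok = False

print(parsed.isoformat())           # 2018-01-01 (normalized, ISO 8601)
print(parsed.strftime("%d.%m.%Y"))  # 01.01.2018 (schema-compatible)
print(short_year_ok)                # False
```

The last two prints show the open question in this issue: the same parsed date can be written either in the field's declared format or in a normalized ISO 8601 form.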

I'll rename this issue to make it a feature request. We're going to review the mechanics behind this in v6 as we get closer to providing `frictionless convert`. Currently, it's an open question whether `frictionless extract` should return "normalized" data or not.

frictionlessdata / frictionless-py #1271