frictionlessdata / frictionless-py

Data management framework for Python that provides functionality to describe, extract, validate, and transform tabular data
https://framework.frictionlessdata.io
MIT License

Consider making `frictionless extract` a normalization command (e.g. for dates) #1271

Open jze opened 2 years ago

jze commented 2 years ago

When exporting extracted data to CSV, fields containing dates are not converted to ISO 8601.

The command

frictionless extract https://opendata.schleswig-holstein.de/dataset/12fb2027-d2d3-42c9-8774-34a70f584c0f/resource/602e974a-bed0-4ffc-b8a1-0f4744b23917/download/windkraftanlagen-2022-07-01.json

shows the correct dates in ISO 8601 format. However, when I export the data to CSV

frictionless extract --csv https://opendata.schleswig-holstein.de/dataset/12fb2027-d2d3-42c9-8774-34a70f584c0f/resource/602e974a-bed0-4ffc-b8a1-0f4744b23917/download/windkraftanlagen-2022-07-01.json 

the unconverted dates are returned.

version 5.0.0b9

shashigharti commented 1 year ago

Hi @roll

The date cell value is parsed into a datetime object, and that object's default string representation is ISO format: https://github.com/python/cpython/blob/3.8/Lib/datetime.py#L976 https://github.com/frictionlessdata/framework/blob/main/frictionless/fields/date.py#L47

In the extract function we use 'to_list' to do the field mapping, so for csv, json and yaml it works fine. But for the default format, 'supported_types = None' is passed to 'row.to_list()' here: https://github.com/frictionlessdata/framework/blob/main/frictionless/helpers.py#L84

As a result, type conversion doesn't take place and a datetime object is returned, whose str representation is ISO format, as shown here: https://github.com/frictionlessdata/framework/blob/main/frictionless/table/row.py#L240
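To make the point about the string representation concrete, here is a minimal standard-library sketch. It does not call frictionless at all; the sample value and the `%d.%m.%Y` pattern are taken from this thread, and the variable names are only illustrative. It shows that a parsed date prints as ISO 8601 unless it is explicitly formatted back.

```python
from datetime import datetime

# Parse a source-style value into a native date object.
parsed = datetime.strptime("01.01.2018", "%d.%m.%Y").date()

# str() on a date delegates to isoformat(), so printing it yields ISO 8601.
iso_text = str(parsed)

# Formatting with the original pattern reproduces the source text.
source_text = parsed.strftime("%d.%m.%Y")

print(iso_text, source_text)
```

So any code path that prints the native object shows ISO dates, while a path that writes the cell back through the field format shows the original notation.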

Solution: for the default format, passing 'types=[]' will trigger the format check for the fields here: https://github.com/frictionlessdata/framework/blob/main/frictionless/helpers.py#L84

data.append([cell if cell is not None else "" for cell in row.to_list(types=[])])

However, I am not sure that is the right solution, so I wanted your feedback before making changes. Thanks!
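The difference between the two calls can be sketched with a toy model. This is not frictionless code: `to_list` below is a hypothetical stand-in that only models date cells, and the `formats` list is an assumption standing in for the per-field schema format. It only illustrates the behaviour described above, where `types=None` leaves native values alone and `types=[]` writes every cell back through its field format.

```python
from datetime import date


def to_list(cells, formats, types=None):
    """Toy stand-in for the described behaviour (hypothetical, dates only)."""
    if types is None:
        # No list of supported types given: return native values untouched.
        return list(cells)
    result = []
    for cell, fmt in zip(cells, formats):
        if isinstance(cell, date) and "date" not in types:
            # The target does not support dates natively: write with the field format.
            result.append(cell.strftime(fmt))
        else:
            result.append(cell)
    return result


row = [1, date(2018, 1, 1)]
formats = [None, "%d.%m.%Y"]

native = to_list(row, formats)             # date object kept as-is
written = to_list(row, formats, types=[])  # date written back as text

print(native[1], written[1])
```

In this model the first call prints an ISO date only because of how Python renders the object, while the second call yields the schema-formatted string.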

roll commented 1 year ago

Hi @jze,

The difference between the two examples might be due to a typo in the schema (https://opendatarepo.lsh.uni-kiel.de/schema/windkraftanlagen.schema.json): the format is %d.%m.%y instead of %d.%m.%Y
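A short standard-library sketch of why that one character matters, assuming the schema format is interpreted with Python's `strptime`/`strftime` directives (the sample value is the one used in this thread; the variable names are illustrative):

```python
from datetime import datetime

value = "01.01.2018"

# %Y expects a four-digit year, which matches the sample value.
full_year = datetime.strptime(value, "%d.%m.%Y").date()

# %y consumes only two digits ("20"), leaving "18" unparsed, so strptime raises.
try:
    datetime.strptime(value, "%d.%m.%y")
    two_digit_ok = True
except ValueError:
    two_digit_ok = False

# When writing, %y also truncates the year to two digits.
short_text = full_year.strftime("%d.%m.%y")

print(full_year, two_digit_ok, short_text)
```

With `%y`, a four-digit year cannot be parsed, so such a cell would not round-trip through the declared format.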

In general, the framework outputs data compatible with the provided schema. So it prints e.g. 01.01.2018 because it's the format of this field.

I'll rename this issue to make it a feature request - we're going to review the mechanics behind this in v6 as we get closer to providing frictionless convert. Currently, it's an open question whether frictionless extract should return "normalized" data or not.