frictionlessdata / frictionless-py

Data management framework for Python that provides functionality to describe, extract, validate, and transform tabular data
https://framework.frictionlessdata.io
MIT License

The data resource has an error: properties "path" and "data" is mutually exclusive when calling resource.to_view() #1572

Open fjuniorr opened 1 year ago

fjuniorr commented 1 year ago

Overview

I want to change both the metadata and the data in a transform pipeline, and then export both (I think as in https://github.com/frictionlessdata/frictionless-py/issues/1062). For example:

from frictionless import Resource, steps, Pipeline

pipeline = Pipeline(steps=[
        steps.field_update(name='id', descriptor = {'name': 'pkey', 'title': 'Primary Key'}),
        steps.row_filter(formula='name != "france"'),
        steps.table_write(path='output.csv'),
        steps.resource_update(name='data', descriptor={'path': 'output.csv'}),
    ])

resource = Resource(path='https://raw.githubusercontent.com/frictionlessdata/frictionless-py/d6f2552b4fd950f459130eda9cf80ae0b8b4931e/data/transform.csv')
resource.transform(pipeline)
resource.to_yaml('resource.yaml')

print(f'{resource=}')
print(f'{resource.read_rows()=}')

which gives me what I want:

resource={'name': 'transform',
 'type': 'table',
 'path': 'output.csv',
 'data': [],
 'scheme': '',
 'format': 'inline',
 'mediatype': 'text/csv',
 'extrapaths': [],
 'schema': {'fields': [{'name': 'pkey',
                        'type': 'integer',
                        'title': 'Primary Key'},
                       {'name': 'name', 'type': 'string'},
                       {'name': 'population', 'type': 'integer'}]}}
resource.read_rows()=[{'pkey': 1, 'name': 'germany', 'population': 83}, {'pkey': 3, 'name': 'spain', 'population': 47}]

However, if I run resource.to_view(), I get:

  File "/Users/fjunior/Projects/splor/datapackage-reprex/reprex/20230721T164121/venv/lib/python3.11/site-packages/frictionless/metadata.py", line 177, in from_descriptor
    raise FrictionlessException(error, reasons=errors)
frictionless.exception.FrictionlessException: [resource-error] The data resource has an error: descriptor is not valid (The data resource has an error: properties "path" and "data" is mutually exclusive)

Trying to set data to None in the pipeline

pipeline = Pipeline(steps=[
        steps.field_update(name='id', descriptor = {'name': 'pkey', 'title': 'Primary Key'}),
        steps.row_filter(formula='name != "france"'),
        steps.table_write(path='output.csv'),
        steps.resource_update(name='data', descriptor={'path': 'output.csv', 'data': None}),
    ])

also doesn't help, because I get:

  File "/Users/fjunior/Projects/splor/datapackage-reprex/reprex/20230721T164121/venv/lib/python3.11/site-packages/frictionless/transformer/transformer.py", line 92, in __iter__
    raise FrictionlessException(error) from exception
frictionless.exception.FrictionlessException: [step-error] Step is not valid: "resource_update" raises "'NoneType' object is not iterable" 

In general, is mixing path and data during a pipeline transformation a bad idea? Is there another way to handle the use case of changing and exporting both data and metadata?
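One workaround that might be worth trying is to drop the empty inline data from the descriptor before it is validated again. The sketch below works on a plain dict rather than on frictionless objects. The helper name drop_inline_data is hypothetical, and removing the 'inline' format alongside data is an assumption, not documented library behaviour.

```python
def drop_inline_data(descriptor: dict) -> dict:
    """Return a copy of a descriptor dict without "data" when "path" is set.

    Hypothetical helper for illustration; it is not part of frictionless.
    """
    cleaned = dict(descriptor)  # shallow copy, the input is left untouched
    if cleaned.get("path") and "data" in cleaned:
        # "path" and "data" are reported as mutually exclusive, so keep "path".
        del cleaned["data"]
        # Assumption: "inline" only makes sense together with in-memory data.
        if cleaned.get("format") == "inline":
            del cleaned["format"]
    return cleaned


# Trimmed-down version of the descriptor printed above.
descriptor = {
    "name": "transform",
    "path": "output.csv",
    "data": [],
    "format": "inline",
}
cleaned = drop_inline_data(descriptor)
print(cleaned)
```

The cleaned dict could then be handed to Resource.from_descriptor to build a fresh resource that reads from output.csv. Whether to_view() works on the result has not been checked here.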

fjuniorr commented 1 year ago

Another possibly related behaviour: inferring stats gives correct values for fields and rows but not for hash, which gets the value 'sha256:None', when in theory the data should be coming from path (which is set to 'output.csv') and not from the in-memory data.

from frictionless import Resource, steps, Pipeline

pipeline = Pipeline(steps=[
        steps.field_update(name='id', descriptor = {'name': 'pkey', 'title': 'Primary Key'}),
        steps.row_filter(formula='name != "france"'),
        steps.table_write(path='output.csv'),
        steps.resource_update(name='data', descriptor={'path': 'output.csv'}),
    ])

resource = Resource(path='https://raw.githubusercontent.com/frictionlessdata/frictionless-py/d6f2552b4fd950f459130eda9cf80ae0b8b4931e/data/transform.csv')
resource.transform(pipeline)
resource.infer(stats=True)

print(f'{resource=}')
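
For comparison, the hash that would be expected for the written file can be computed independently with the standard library. This is a minimal sketch: file_sha256 is a hypothetical helper, the 'sha256:' prefix just mirrors the value shown above, and the demo file is a stand-in whose bytes may differ from the real output.csv (for example in line endings).

```python
import hashlib
import tempfile
from pathlib import Path


def file_sha256(path: str, chunk_size: int = 65536) -> str:
    """Return the SHA-256 of a file's bytes as 'sha256:<hexdigest>'."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # Read in chunks so large files are not loaded into memory at once.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return f"sha256:{digest.hexdigest()}"


# Stand-in content; replace demo_path with 'output.csv' to check the real file.
demo_bytes = b"pkey,name,population\n1,germany,83\n3,spain,47\n"
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = Path(tmp_dir) / "output.csv"
    demo_path.write_bytes(demo_bytes)
    demo_hash = file_sha256(str(demo_path))

print(demo_hash)
```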

shashigharti commented 1 year ago

Thanks @fjuniorr for reporting. We will investigate it.