datahq / dataflows

DataFlows is a simple, intuitive, lightweight framework for building data-processing flows in Python.
https://dataflows.org
MIT License
193 stars · 39 forks

Does it work with non-tabular JSON? #157

Closed ColinMaudry closed 3 years ago

ColinMaudry commented 3 years ago

Hello,

I'm trying to load and display a simple JSON file, but I fail to do so.

simple.json:

{
  "marches": [
    {
      "id": "1",
      "name": "Colin"
    },
    {
      "id": "2",
      "name": "Anne Lise"
    }
  ]
}

Command line:

dataflows init simple.json

Writing processing code into simple_json.py
Running simple_json.py
Processing failed, heres the error:
Traceback (most recent call last):
  File "/home/colin/.local/lib/python3.8/site-packages/dataflows/processors/load.py", line 121, in process_datapackage
    return self.safe_process_datapackage(dp)
  File "/home/colin/.local/lib/python3.8/site-packages/dataflows/processors/load.py", line 171, in safe_process_datapackage
    stream: Stream = Stream(self.load_source, **self.options).open()
  File "/home/colin/.local/lib/python3.8/site-packages/tabulator/stream.py", line 425, in open
    self.__extract_headers()
  File "/home/colin/.local/lib/python3.8/site-packages/tabulator/stream.py", line 802, in __extract_headers
    for index, header in list(enumerate(raw_headers)):
TypeError: 'NoneType' object is not iterable

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/colin/.local/lib/python3.8/site-packages/dataflows/base/datastream_processor.py", line 79, in _process
    self.datapackage = self.process_datapackage(self.datapackage)
  File "/home/colin/.local/lib/python3.8/site-packages/dataflows/processors/load.py", line 123, in process_datapackage
    raise SourceLoadError('Failed to load source {!r} and options {!r}: {}'
dataflows.base.exceptions.SourceLoadError: Failed to load source 'simple.json' and options {'format': 'json', 'custom_parsers': {'xml': <class 'dataflows.processors.parsers.xml_parser.XMLParser'>, 'excel-xml': <class 'dataflows.processors.parsers.excel_xml_parser.ExcelXMLParser'>}, 'ignore_blank_headers': True, 'headers': 1}: 'NoneType' object is not iterable

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "simple_json.py", line 21, in <module>
    simple_json()
  File "simple_json.py", line 17, in simple_json
    flow.process()
  File "/home/colin/.local/lib/python3.8/site-packages/dataflows/base/flow.py", line 15, in process
    return self._chain().process()
  File "/home/colin/.local/lib/python3.8/site-packages/dataflows/base/datastream_processor.py", line 118, in process
    ds, _ = self.safe_process()
  File "/home/colin/.local/lib/python3.8/site-packages/dataflows/base/datastream_processor.py", line 114, in safe_process
    self.raise_exception(exception)
  File "/home/colin/.local/lib/python3.8/site-packages/dataflows/base/datastream_processor.py", line 97, in raise_exception
    raise cause
  File "/home/colin/.local/lib/python3.8/site-packages/dataflows/base/datastream_processor.py", line 102, in safe_process
    ds = self._process()
  File "/home/colin/.local/lib/python3.8/site-packages/dataflows/base/datastream_processor.py", line 75, in _process
    datastream = self.source._process()
  File "/home/colin/.local/lib/python3.8/site-packages/dataflows/base/datastream_processor.py", line 75, in _process
    datastream = self.source._process()
  File "/home/colin/.local/lib/python3.8/site-packages/dataflows/base/datastream_processor.py", line 75, in _process
    datastream = self.source._process()
  File "/home/colin/.local/lib/python3.8/site-packages/dataflows/base/datastream_processor.py", line 86, in _process
    self.raise_exception(exception)
  File "/home/colin/.local/lib/python3.8/site-packages/dataflows/base/datastream_processor.py", line 96, in raise_exception
    raise error from cause
dataflows.base.exceptions.ProcessorError: Errored in processor load in position #1: Failed to load source 'simple.json' and options {'format': 'json', 'custom_parsers': {'xml': <class 'dataflows.processors.parsers.xml_parser.XMLParser'>, 'excel-xml': <class 'dataflows.processors.parsers.excel_xml_parser.ExcelXMLParser'>}, 'ignore_blank_headers': True, 'headers': 1}: 'NoneType' object is not iterable

The resulting simple_json.py:

from dataflows import Flow, load, dump_to_path, dump_to_zip, printer, add_metadata
from dataflows import sort_rows, filter_rows, find_replace, delete_fields, set_type, validate, unpivot

def simple_json():
    flow = Flow(
        # Load inputs
        load('simple.json', format='json', ),
        # Process them (if necessary)
        # Save the results
        add_metadata(name='simple_json', title='''simple.json'''),
        printer(),
        dump_to_path('simple_json'),
    )
    flow.process()

if __name__ == '__main__':
    simple_json()

Could you please provide a simple code snippet that shows how I can load and process a JSON file?

Thanks!

akariv commented 3 years ago

Hi @ColinMaudry !

The init script is a bit simplistic (and possibly outdated), but you can load this JSON file through the Python interface, as follows:

% cat > bla.json
{
  "marches": [
    {
      "id": "1",
      "name": "Colin"
    },
    {
      "id": "2",
      "name": "Anne Lise"
    }
  ]
}
% python
Python 3.7.8 (default, Aug 24 2020, 11:26:01)
Type "help", "copyright", "credits" or "license" for more information.
>>> import dataflows as DF
>>> DF.Flow(DF.load('bla.json', property='marches'), DF.printer()).process()
bla:
  #           id  name
       (integer)  (string)
---  -----------  ----------
  1            1  Colin
  2            2  Anne Lise

The key is to specify the exact property you'd like to extract from the JSON file, using the property argument to load.
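To make explicit what the property argument selects, here is a minimal stdlib-only sketch (plain json module, not dataflows itself) of the extraction it performs on the file above:

```python
import json

# The document is an object whose "marches" key holds the list of row dicts.
text = '''
{
  "marches": [
    {"id": "1", "name": "Colin"},
    {"id": "2", "name": "Anne Lise"}
  ]
}
'''

doc = json.loads(text)
rows = doc["marches"]           # the list that property='marches' points at
headers = list(rows[0].keys())  # column names then come from the dict keys

print(headers)
print(rows)
```

Without the property argument, the loader is handed the whole top-level object instead of a list of rows, which is why the original attempt failed with 'NoneType' object is not iterable.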

ColinMaudry commented 3 years ago

That works fine, thanks!

ColinMaudry commented 3 years ago

The property argument isn't documented, right? https://github.com/datahq/dataflows/blob/master/PROCESSORS.md#load

akariv commented 3 years ago

Indeed, we use tabulator as the file handing engine, and most of load's arguments are passed as-is to tabulator. So, instead of re-documenting we rely on tabulator's documentation - e.g. property is documented here: https://github.com/frictionlessdata/tabulator-py#json-read--write