datahq / dataflows

DataFlows is a simple, intuitive, lightweight framework for building data processing flows in Python.
https://dataflows.org
MIT License

How does concatenate work? #161

Closed by ColinMaudry 3 years ago

ColinMaudry commented 3 years ago

Hi!

I'm trying to use concatenate, following the documentation and the tutorial (the example is a bit cryptic :)), but I can't get it to work.

Here is the script: https://github.com/ColinMaudry/decp-table-schema-utils/blob/add-marches-sirene-table/scripts/flow.py

The resources decp and previous-decp have the same columns.

The command

python3 scripts/flow.py

yields the following error:

Téléchargement des données tabulaires précédentes...
Traceback (most recent call last):
  File "/home/colin/.local/lib/python3.8/site-packages/dataflows/base/datastream_processor.py", line 79, in _process
    self.datapackage = self.process_datapackage(self.datapackage)
  File "/home/colin/.local/lib/python3.8/site-packages/dataflows/helpers/datapackage_processor.py", line 15, in process_datapackage
    ret = next(self.dp_processor)
  File "/home/colin/.local/lib/python3.8/site-packages/dataflows/processors/concatenate.py", line 93, in func
    assert not match
AssertionError

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "scripts/flow.py", line 85, in <module>
    decp_processing()
  File "scripts/flow.py", line 44, in decp_processing
    flow.process()
  File "/home/colin/.local/lib/python3.8/site-packages/dataflows/base/flow.py", line 15, in process
    return self._chain().process()
  File "/home/colin/.local/lib/python3.8/site-packages/dataflows/base/datastream_processor.py", line 118, in process
    ds, _ = self.safe_process()
  File "/home/colin/.local/lib/python3.8/site-packages/dataflows/base/datastream_processor.py", line 114, in safe_process
    self.raise_exception(exception)
  File "/home/colin/.local/lib/python3.8/site-packages/dataflows/base/datastream_processor.py", line 97, in raise_exception
    raise cause
  File "/home/colin/.local/lib/python3.8/site-packages/dataflows/base/datastream_processor.py", line 102, in safe_process
    ds = self._process()
  File "/home/colin/.local/lib/python3.8/site-packages/dataflows/base/datastream_processor.py", line 75, in _process
    datastream = self.source._process()
  File "/home/colin/.local/lib/python3.8/site-packages/dataflows/base/datastream_processor.py", line 86, in _process
    self.raise_exception(exception)
  File "/home/colin/.local/lib/python3.8/site-packages/dataflows/base/datastream_processor.py", line 96, in raise_exception
    raise error from cause
dataflows.base.exceptions.ProcessorError: Errored in processor datapackage_processor in position #21: 

I have looked at the source code, but couldn't figure it out, especially the role of the suffix and prefix variables (I'm a beginner in Python).

akariv commented 3 years ago

Hey there @ColinMaudry

With concatenate, all source resources must be consecutive in the datapackage. From reviewing your script, it seems that sans-titulaires sits between decp and prev-decp, which is what triggers the assertion.

The error message is obviously very cryptic; I'll fix that in the next version.
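To illustrate the role of the prefix and suffix variables asked about above, here is a minimal pure-Python sketch of a "resources must be consecutive" check. It is a simplified paraphrase of the idea suggested by the `assert not match` line in the traceback, not the library's actual code: the function name `check_consecutive`, the `ValueError`, and the placeholder target name `"concat"` are assumptions made for illustration, and the resource names are the ones used in the question.

```python
def check_consecutive(resource_names, selected, target_name="concat"):
    """Hypothetical helper: return the resource order after concatenation.

    Raises ValueError if the selected resources are not adjacent.
    """
    prefix = True    # True while we are still before the first selected resource
    suffix = False   # True once we have moved past the run of selected resources
    result = []
    for name in resource_names:
        match = name in selected
        if prefix:
            if match:
                prefix = False          # the run of selected resources starts here
            else:
                result.append(name)     # untouched resource before the run
        elif suffix:
            if match:
                # A selected resource after the run has ended: not consecutive.
                raise ValueError(
                    f"resource {name!r} is separated from the other "
                    "concatenated resources"
                )
            result.append(name)         # untouched resource after the run
        elif not match:
            suffix = True               # the run just ended
            result.append(target_name)  # the merged resource replaces the run
            result.append(name)
    if not suffix:
        result.append(target_name)      # the run reached the end of the package
    return result


if __name__ == "__main__":
    wanted = {"decp", "previous-decp"}
    # Adjacent sources: the two resources collapse into one target.
    print(check_consecutive(["decp", "previous-decp", "sans-titulaires"], wanted))
    # Another resource in between: the check fails.
    try:
        check_consecutive(["decp", "sans-titulaires", "previous-decp"], wanted)
    except ValueError as exc:
        print("error:", exc)
```

In this sketch, reordering the flow so that the resources to merge are loaded one right after the other is enough to make the check pass.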

ColinMaudry commented 3 years ago

Thanks @akariv, I have worked around it, but I'll keep this in mind for next time!