cyberbikepunk opened this issue 7 years ago
It would also be very useful to check that the data is in sync with the schema. When I started writing processors, I thought that I could get away without it, but it causes problems later on: the order of the processors becomes problematic, for example.
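To make the "in sync with the schema" idea concrete, here is a self-contained sketch of what such a check could look like. The helper name and its exact behaviour are assumptions, not part of the current API; it just compares a row's keys against the field names declared in a datapackage-style resource schema.

```python
def check_row_in_sync(row, schema):
    """Raise if a row's keys do not match the schema's declared fields.

    Hypothetical helper: `schema` is a datapackage-style descriptor
    with a `fields` list of `{"name": ...}` mappings.
    """
    declared = {field["name"] for field in schema["fields"]}
    actual = set(row)
    if declared != actual:
        missing = declared - actual
        extra = actual - declared
        raise ValueError(
            f"row out of sync with schema: missing={missing}, extra={extra}"
        )


schema = {"fields": [{"name": "id"}, {"name": "amount"}]}
check_row_in_sync({"id": 1, "amount": 2.0}, schema)  # in sync: passes silently
```

Running such a check on every row (or on the first row of each resource) would surface ordering problems as soon as a processor emits rows its schema no longer describes.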
After a little thinking time, I think that we should go with the explicit solution (`import this`, after all). The decorator solution is a little harder to implement and a little too magic; it smells like the `pytest` API, which is full of gotchas.
Overview
`ingest` and `spew` are awesome little functions, but I found there was still a lot of boilerplate to write for each processor. I suggest writing some additional wrapper code to deal with that. I base my suggestions on the use-cases I've encountered so far.

Assumptions
I assume that most of the time you want to process data row by row, mutate the datapackage, or both. Let's talk about row processing first.
Objectives
For row-by-row processing, the wrapper code would fulfill two purposes:

- provide boilerplate functionality
- pass context to the row processor
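As a self-contained illustration of those two purposes (the `process` name and the context fields are my assumptions here, not the actual `utility.process` implementation): the wrapper owns the nested-iteration boilerplate and hands the row processor each row plus some context.

```python
def process(resources, row_processor, parameters=None):
    """Apply `row_processor` to every row of every resource.

    Hypothetical wrapper: it handles the nested-iteration boilerplate
    and passes a context dict (resource index, row index, parameters)
    to the row processor, which returns the (possibly mutated) row.
    """
    def process_resource(resource, resource_index):
        for row_index, row in enumerate(resource):
            context = {
                "resource_index": resource_index,
                "row_index": row_index,
                "parameters": parameters or {},
            }
            yield row_processor(row, context)

    for resource_index, resource in enumerate(resources):
        yield process_resource(resource, resource_index)


# Usage: double every amount; the row processor never touches iteration.
def double_amount(row, context):
    row["amount"] *= 2
    return row

resources = [[{"amount": 1}, {"amount": 2}]]
processed = [list(rows) for rows in process(resources, double_amount)]
# → [[{"amount": 2}, {"amount": 4}]]
```

The processor function stays a pure row-to-row mapping, which is exactly the part that differs between pipelines.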
API
My first attempt at writing code for that resulted in the `utility.process` function. There's no stats functionality at this stage. The API looks like:

My second attempt (see the code in progress) is a `Processor` class with an API along the lines of:

What I would really like to achieve is:
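The snippets referenced above are not reproduced here, so to keep the discussion concrete, here is a self-contained sketch of what a `Processor` class along those lines could look like. All names and hooks are assumptions on my part, including the built-in stats counter:

```python
class Processor:
    """Hypothetical Processor base class: subclass and override hooks."""

    def __init__(self, parameters=None):
        self.parameters = parameters or {}
        self.stats = {"rows": 0}

    def process_row(self, row, context):
        # Default hook: pass rows through unchanged.
        return row

    def run(self, resources):
        for resource_index, resource in enumerate(resources):
            yield self._process_resource(resource, resource_index)

    def _process_resource(self, resource, resource_index):
        # Boilerplate lives here: iteration, context, stats.
        for row_index, row in enumerate(resource):
            self.stats["rows"] += 1
            context = {"resource_index": resource_index,
                       "row_index": row_index}
            yield self.process_row(row, context)


class MultiplyAmount(Processor):
    def process_row(self, row, context):
        row["total"] = row["amount"] * self.parameters.get("factor", 1)
        return row


p = MultiplyAmount({"factor": 2})
resources = [[{"amount": 100}, {"amount": 5}]]
output = [list(rows) for rows in p.run(resources)]
# → [[{"amount": 100, "total": 200}, {"amount": 5, "total": 10}]]
```

The subclass only states what happens to one row; parameters and stats come along for free.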
And similarly add datapackage mutation like:
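For the datapackage-mutation side, a hedged sketch of what such a hook could look like (again, every name here is an assumption): the wrapper applies a user-supplied mutator to the descriptor before any rows flow.

```python
def mutate_datapackage(datapackage, mutator):
    """Hypothetical hook: apply `mutator` to the datapackage descriptor
    before any rows are processed, returning the mutated descriptor.
    Note: `dict(...)` is only a shallow copy; a real implementation
    would probably deep-copy or mutate in place deliberately."""
    return mutator(dict(datapackage))


def add_resource_titles(dp):
    # Derive a human-readable title for each resource lacking one.
    for resource in dp.get("resources", []):
        resource.setdefault(
            "title", resource["name"].replace("_", " ").title()
        )
    return dp


dp = {"name": "example", "resources": [{"name": "sales_2016"}]}
mutated = mutate_datapackage(dp, add_resource_titles)
# → mutated["resources"][0]["title"] == "Sales 2016"
```

Pairing this with the row-processing hook would cover both of the use-cases assumed above in one wrapper.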
@akariv care to comment?