Closed by roomthily 9 years ago
Because of the tail-first ordering in Luigi (I'm still not sure about it, but I do like the atomicity right now) and its very Hadoop-centric focus, I am going to go with the following pattern:
For a task, the parameters are 1) an upstream_task and 2) a config file.
For a workflow, tasks are chained through those "anonymous" upstream_tasks, and everything required for any given task is in that config.
This is, for now, the only way I can see to keep the tasks small and atomic while keeping them composable into workflows.
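A rough sketch of that pattern, to make the shape concrete. This is plain Python standing in for Luigi, not Luigi's API: the `Task` base class, `build_workflow`, the two example tasks, the `raw_content` field name, and the idea of keying the config by task class name are all assumptions for illustration. The config is shown as an already-parsed dict rather than a file.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Task:
    """Stand-in for a pipeline task: one shared config, one optional upstream task."""
    config: dict
    upstream_task: Optional["Task"] = None

    def settings(self):
        # Assumption: each task finds its own section in the shared config by class name.
        return self.config.get(type(self).__name__, {})

    def process(self, data):
        raise NotImplementedError

    def run(self, seed=None):
        # Tail-first: resolve the upstream task before doing this task's own work.
        data = self.upstream_task.run(seed) if self.upstream_task else seed
        return self.process(data)


class ExtractText(Task):
    def process(self, data):
        # Pull one field out of the input document (field name comes from the config).
        return data[self.settings()["field"]]


class TidyText(Task):
    def process(self, data):
        # Collapse runs of whitespace; optionally lowercase.
        text = " ".join(data.split())
        return text.lower() if self.settings().get("lowercase") else text


def build_workflow(task_classes, config):
    """Chain task classes head-to-tail and return the tail task."""
    task = None
    for cls in task_classes:
        task = cls(config=config, upstream_task=task)
    return task


config = {
    "ExtractText": {"field": "raw_content"},  # hypothetical field name
    "TidyText": {"lowercase": True},
}
tail = build_workflow([ExtractText, TidyText], config)
result = tail.run({"raw_content": "  A  Cleaned   Service Description "})
```

No task names its upstream explicitly; the workflow decides the order, which is what keeps the individual tasks reusable.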
For the initial round of testing, go from a Solr doc through to a cleaned service description or, ideally, RDF triples. Either way, get all the text tidied up.
Two outputs: the RDF triples and a nicely processed bag of words.
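For the bag-of-words output, something along these lines might be a starting point. It is only a sketch: the Solr doc layout, the `raw_content` field, and the tidying rules (strip tags, collapse whitespace, lowercase alphanumeric tokens) are assumptions, not a settled design. The RDF side is left out because it depends on a vocabulary that isn't decided here.

```python
import re
from collections import Counter


def tidy_text(raw):
    """Strip markup tags and collapse whitespace."""
    no_tags = re.sub(r"<[^>]+>", " ", raw)
    return re.sub(r"\s+", " ", no_tags).strip()


def bag_of_words(text):
    """Count lowercase alphanumeric tokens."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return Counter(tokens)


# Hypothetical Solr doc; the real field names may differ.
solr_doc = {
    "id": "example-1",
    "raw_content": "<Title>Sea  Surface Temperature</Title>\n"
                   "<Abstract>Surface temperature grids</Abstract>",
}

cleaned = tidy_text(solr_doc["raw_content"])
bag = bag_of_words(cleaned)
```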