Add text post-processing to luigi pipeline

b-cube / semantics-preprocessing

initial text preprocessors for the triplestore and feature classification

Add text post-processing to luigi pipeline #65

Closed roomthily closed 9 years ago

roomthily commented 9 years ago

For the initial round of testing, go from some solr doc through to a cleaned service description or, ideally, triples as rdf. Either way, get all the text tidied up.

Two outputs: the rdf triples and a nicely processed bag of words.

[ ] rdf pipeline
[ ] identify pipeline for all results
[ ] parse pipeline for bag of words

roomthily commented 9 years ago

Because of the tail-first ordering in luigi (still not sure about it, but I do like the atomicity right now) and its very hadoopy focus, I am going to go with the following pattern:

For a task, the parameters are 1) an upstream_task and 2) a config file.

For a workflow, tasks are chained through those "anonymous" upstream_tasks, and everything any given task requires lives in that config.

This is, for now, the only way I'm seeing to keep atomic little tasks and keep them composable for workflows.
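
A rough sketch of the pattern described above, in plain Python. It deliberately does not use luigi's classes. The names `Task`, `StripMarkup`, `BagOfWords`, and the `min_length` config key are all hypothetical, and the config is parsed from an inline JSON string rather than a file so the example stays self-contained.

```python
import json
import re


class Task:
    """Hypothetical stand-in for a pipeline task; this is not luigi's Task."""

    def __init__(self, upstream_task=None, config=None):
        self.upstream_task = upstream_task  # the "anonymous" upstream task, or None
        self.config = config or {}          # one config shared by the whole workflow

    def settings(self):
        # Each task reads only its own section of the shared config.
        return self.config.get(type(self).__name__, {})

    def run(self, data):
        # Tail-first: running the last task pulls in everything upstream of it.
        if self.upstream_task is not None:
            data = self.upstream_task.run(data)
        return self.process(data)

    def process(self, data):
        raise NotImplementedError


class StripMarkup(Task):
    def process(self, data):
        # Replace anything that looks like a tag with a space.
        return re.sub(r"<[^>]+>", " ", data)


class BagOfWords(Task):
    def process(self, data):
        min_len = self.settings().get("min_length", 1)
        # Unique lowercased tokens of at least min_len characters, sorted.
        return sorted({w.lower() for w in data.split() if len(w) >= min_len})


# Assumed config layout: one section per task class name.
config = json.loads('{"BagOfWords": {"min_length": 4}}')

# The workflow is just the tail task with its upstream tasks nested inside.
workflow = BagOfWords(upstream_task=StripMarkup(config=config), config=config)
words = workflow.run("<p>Weather data service</p>")
```

Each task stays small and knows nothing about its neighbours beyond "there may be an upstream task", so reordering or swapping steps only changes how the workflow is nested, not the tasks themselves.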