dice-group / sask

Projectgroups Search and Extraction
GNU Affero General Public License v3.0

Complex workflows #62

Closed AndreSonntag closed 6 years ago

AndreSonntag commented 6 years ago

With this pull request we add support for complex workflows. A complex workflow is, for example, one in which a single file is the input for two different extractors, or in which the database is connected to more than one extractor. The following picture illustrates this. We use multi-threading for this functionality. Currently, there are three kinds of threads: PullTask, ExtractTask, and StoreTask. These tasks represent the logic classes. The task executor is designed so that the thread pool can be extended with further tasks. Complex workflows are a prerequisite for further functionality such as ensemble learning or filter nodes.

image
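
To make the fan-out case concrete (one input feeding two extractors, both writing to one store), here is a minimal sketch. It is not the code from this pull request: the class name `WorkflowSketch`, the `run` method, the in-memory list standing in for the database, and the use of `CompletableFuture` are all assumptions made for illustration. Only the three roles (pull, extract, store) come from the description above.

```java
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.function.Function;
import java.util.function.Supplier;
import java.util.stream.Collectors;

// Hypothetical sketch; names do not correspond to classes in the SASK code base.
public class WorkflowSketch {

    /** Pulls the input once, fans it out to every extractor, and stores each result. */
    public static List<String> run(Supplier<String> pullTask,
                                   List<Function<String, String>> extractTasks) {
        ExecutorService pool = Executors.newFixedThreadPool(4);
        // Thread-safe list that stands in for the database (assumption).
        List<String> store = new CopyOnWriteArrayList<>();
        try {
            // Pull step: fetch the input a single time.
            CompletableFuture<String> pulled =
                CompletableFuture.supplyAsync(pullTask, pool);

            // Extract and store steps: one chain per extractor, all fed by the same input.
            List<CompletableFuture<Void>> branches = extractTasks.stream()
                .map(extract -> pulled
                    .thenApplyAsync(extract, pool)
                    .thenAcceptAsync(store::add, pool))
                .collect(Collectors.toList());

            // Block until every branch has written its result.
            CompletableFuture.allOf(branches.toArray(new CompletableFuture[0])).join();
            return store;
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        // Two placeholder "extractors"; real ones would call FOX or Cederic.
        Function<String, String> first = text -> "extractor-1: " + text;
        Function<String, String> second = text -> "extractor-2: " + text;

        List<String> stored = run(() -> "some input text", Arrays.asList(first, second));
        // The order of the two entries depends on thread scheduling.
        stored.forEach(System.out::println);
    }
}
```

Because the branches run concurrently, the order in which results reach the store is not fixed. Adding another extractor or a further kind of task only means adding another chain to the list.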


Usage: In the current version, only FOX and the Cederic extractor are supported. Unfortunately, Cederic's output is currently just rubbish. FOX returns its results in Turtle format, which we then convert to N-Triples with the Apache Jena library. The conversion function is located in SASK-commons. Cederic produces N-Triples itself, so no additional parsing is needed. The database expects N-Triples.