freme-project / pipelines

Apache License 2.0

Roundtripping in pipelines #27

Closed jnehring closed 8 years ago

jnehring commented 9 years ago

I was thinking about what happens when one uses a pipeline with input format text/html and output format text/html.

I think that when the pipeline internally uses other input/output formats (e.g. turtle), it will fail in the last step, because the last step has informat=turtle and outformat=text/html, a combination that is not supported.

It should work when every single pipeline step has informat=outformat=text/html. But this means a conversion in every single pipeline step, which performs poorly.
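To make the failure mode concrete, here is a minimal sketch of a check over the format pairs of a pipeline. The `SUPPORTED` set and the function name are hypothetical and only encode the assumption stated above, namely that text/html output can only be produced from text/html input; they are not taken from the FREME code base.

```python
# Hypothetical set of (informat, outformat) pairs that a single step accepts.
# Assumption for illustration only: text/html can be produced solely from
# text/html input, so turtle -> text/html is missing from the set.
SUPPORTED = {
    ("text/html", "text/html"),
    ("text/html", "turtle"),
    ("turtle", "turtle"),
}


def first_unsupported_step(steps):
    """Return the index of the first step with an unsupported pair, or None."""
    for index, (informat, outformat) in enumerate(steps):
        if (informat, outformat) not in SUPPORTED:
            return index
    return None


# Pipeline that switches to turtle internally: the last step is rejected.
mixed = [("text/html", "turtle"), ("turtle", "turtle"), ("turtle", "text/html")]

# Pipeline where every step converts from and to text/html: accepted, but slow.
all_html = [("text/html", "text/html")] * 3

print(first_unsupported_step(mixed))     # 2
print(first_unsupported_step(all_html))  # None
```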

During conversion from the original format to NIF, e-Internationalization generates an additional NIF document that stores the original input document in turtle. This document needs to be kept for the duration of the pipeline execution. At the end of the pipeline, the enrichment information is merged with it to perform the conversion back to the original format.

@ghsnd do you have a solution on how to store the input document in turtle?

ghsnd commented 9 years ago

I think this is indeed a good idea to improve performance. So to summarize: if the input/output format is text/html, then the pipeline service should:

  1. Convert to NIF, and keep the additional original input doc;
  2. Perform the steps in the pipeline using NIF, "overriding" the originally specified input/output parameters in each step;
  3. Take the output of the last step (NIF) and convert it to HTML using the stored original input doc.
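
The three steps above can be sketched as follows. This is a toy illustration, not the pipeline service itself: `run_roundtrip`, `toy_to_nif` and `toy_from_nif` are hypothetical names, and plain strings stand in for NIF and for the stored original input doc (the "skeleton"). The toy converters only handle a single run of text.

```python
import re
from typing import Callable, List, Tuple

PLACEHOLDER = "{{TEXT}}"  # hypothetical marker for where the text belongs


def toy_to_nif(html: str) -> Tuple[str, str]:
    """Stand-in for HTML -> NIF: return (plain text, skeleton with placeholder)."""
    text = re.sub(r"<[^>]+>", "", html)
    skeleton = html.replace(text, PLACEHOLDER, 1)
    return text, skeleton


def toy_from_nif(nif: str, skeleton: str) -> str:
    """Stand-in for NIF -> HTML: merge the processed text back into the skeleton."""
    return skeleton.replace(PLACEHOLDER, nif)


def run_roundtrip(
    document: str,
    steps: List[Callable[[str], str]],
    to_nif: Callable[[str], Tuple[str, str]] = toy_to_nif,
    from_nif: Callable[[str, str], str] = toy_from_nif,
) -> str:
    # 1. Convert once to the internal format and keep the skeleton.
    nif, skeleton = to_nif(document)
    # 2. Every step consumes and produces the internal format,
    #    regardless of the formats the caller asked for.
    for step in steps:
        nif = step(nif)
    # 3. Convert back once, using the stored skeleton.
    return from_nif(nif, skeleton)


result = run_roundtrip("<p>hello</p>", [str.upper, lambda text: text + "!"])
print(result)  # <p>HELLO!</p>
```

The point of the design is that the conversion cost is paid twice per pipeline instead of twice per step.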

Can e-Internationalization be used out of the box to perform the conversions and get the original input doc in NIF?

> @ghsnd do you have a solution on how to store the input document in turtle?

I think it is not that hard to store the input document, be it in a temporary file or using Hibernate...

ghsnd commented 8 years ago

I have a first implementation ready, merged to the main branch. Right now the skeleton file is kept in memory until the pipeline is completed (which works fine). Depending on the input it can grow rather large, but I can write it to a temporary file instead and/or apply compression.
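
As a rough sketch of the temporary-file-plus-compression option, the skeleton could be written gzip-compressed to a temporary file and read back when the pipeline finishes. `SkeletonStore` is a hypothetical helper written for illustration and does not reflect the merged implementation; it uses only the Python standard library.

```python
import gzip
import os
import tempfile


class SkeletonStore:
    """Keep a skeleton document on disk, gzip-compressed, until it is needed."""

    def __init__(self, skeleton: str):
        # mkstemp creates the file securely; close the raw descriptor
        # and reopen the path through gzip in text mode.
        fd, self.path = tempfile.mkstemp(suffix=".ttl.gz")
        os.close(fd)
        with gzip.open(self.path, "wt", encoding="utf-8") as handle:
            handle.write(skeleton)

    def load(self) -> str:
        """Read the skeleton back, decompressing it on the fly."""
        with gzip.open(self.path, "rt", encoding="utf-8") as handle:
            return handle.read()

    def close(self) -> None:
        """Delete the temporary file; safe to call more than once."""
        if os.path.exists(self.path):
            os.remove(self.path)

    def __enter__(self) -> "SkeletonStore":
        return self

    def __exit__(self, exc_type, exc_value, traceback) -> None:
        self.close()


# Usage: the file exists only for the duration of the pipeline run.
with SkeletonStore("<> a <http://example.org/Skeleton> .\n" * 1000) as store:
    skeleton = store.load()  # called once, right before converting back
```

Using the store as a context manager means the temporary file is removed even if a pipeline step raises an exception.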