acoli-repo / conll-rdf

Advanced graph rewriting and LLOD publication for CoNLL and other TSV formats

flush output stream #56

Open chiarcos opened 3 years ago

chiarcos commented 3 years ago

I've set up a workflow that reads natural language from stdin, produces a parse in a CoNLL format, then transforms that via CoNLLStreamExtractor(+CoNLLRDFUpdater)+CoNLLRDFFormatter and writes the result to stdout.

The problem is that CoNLL-RDF writes to stdout only after stdin is closed. This happens in Bash pipelines, with or without CoNLLRDFUpdater, and also with CoNLLRDFManager.

For replication, run

./run.sh CoNLLStreamExtractor '#' ID WORD | ./run.sh CoNLLRDFFormatter

and paste the following data in five steps

  1. copy and paste a table (any table with at least two columns)

    1   1   1   1
    2   2   2   2
    3   3   3   3
    4   4   4   4
  2. (enter empty line, in theory, this should lead to flushing into stdout)

  3. copy and paste another table

    1   1   1   1
    2   2   2   2
    3   3   3   3
    4   4   4   4
  4. (enter empty line)

  5. close stdin, e.g., with <CTRL>+D

At the moment, output is flushed only after step 5. Desired behavior is to flush twice (after 2 and 4).

Note that if this is confirmed, this is a major bug because it contradicts the entire idea of stream processing that CoNLL-RDF is designed for.

leogott commented 3 years ago

This is a bug with CoNLLRDF Formatter's RDF-Loader functionality. I'll have a look.

leogott commented 3 years ago

Apparently the formatter doesn't split sentences on encountering a new block of prefixes, or when encountering an empty line, but only on lines beginning with a # symbol. This problem may be related to #32.

I think I've modified the Formatter to also split on empty lines while working on the CommonsCLI pull request, but I'm not entirely sure.

Interestingly entering an empty line after four "sentences" are in the pipeline results in three of them getting processed by the formatter. I wonder what's up with that...

 1  1   1   1
 2  2   2   2
 3  3   3   3
 4  4   4   4

 1  1   1   1
 2  2   2   2
 3  3   3   3
 4  4   4   4

 1  1   1   1
 2  2   2   2
 3  3   3   3
 4  4   4   4

 1  1   1   1
 2  2   2   2
 3  3   3   3
 4  4   4   4
chiarcos commented 3 years ago

Interestingly entering an empty line after four "sentences" are in the pipeline results in three of them getting processed by the formatter.

I see similar behavior (using a fresh install). Below is part of the original log. This is a CLI that reads a natural language sentence, parses it (that works), and then uses CoNLL-RDF (extractor+updater+formatter) for some extraction task. "Sending" is what is sent to the CoNLL-RDF process (whitespace normalized). The Turtle is its output to stdout (I'm not reading it).

Responses are produced after the third input is sent. Note that the response is empty in this case (but it's the same with full sentences that normally return RDF), so the response is basically the comment and the prefixes, but no triples. When I send Test3, I get Test1 results:

extractor initialized ... ok
reading from stdin, terminate with empty line:
> Test1
sending:
# Test1

1       Test1   test1   X       CD      _       0       root    _       _       _       _

#_END_

.> Test2
sending:
# Test2

1       Test2   test2   NOUN    CD      Number=Sing     0       root    _       _       _       _

#_END_

.> Test3
sending:
# Test3

1       Test3   Test3   NOUN    CD      Number=Sing     0       root    _       _       _       _

#_END_

.> # Test1
@prefix : <file:///home/chiarcos/semantic-parsing/#> .
@prefix powla: <http://purl.org/powla/powla.owl#> .
@prefix conll: <http://ufal.mff.cuni.cz/conll2009-st/task-description.html#> .
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix terms: <http://purl.org/acoli/open-ie/> .
@prefix x: <http://purl.org/acoli/conll-rdf/xml#> .
@prefix nif: <http://persistence.uni-leipzig.org/nlp2rdf/ontologies/nif-core#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .

Test4
sending:
# Test4

1       Test4   Test4   NOUN    CD      Number=Sing     0       root    _       _       _       _

#_END_

.> #_END_       # Test2
@prefix : <file:///home/chiarcos/semantic-parsing/#> .
@prefix powla: <http://purl.org/powla/powla.owl#> .
@prefix conll: <http://ufal.mff.cuni.cz/conll2009-st/task-description.html#> .
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix terms: <http://purl.org/acoli/open-ie/> .
@prefix x: <http://purl.org/acoli/conll-rdf/xml#> .
@prefix nif: <http://persistence.uni-leipzig.org/nlp2rdf/ontologies/nif-core#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
chiarcos commented 3 years ago

I tried to use the behavior above to improvise a workaround (just send multiple padding lines containing _IGN_), but that isn't reliable. In particular, I've been able to send more than three _IGN_ before I got a response, so this number is not fixed. Could this have anything to do with the parallelization within CoNLLRDFUpdater?

leogott commented 3 years ago

With the most recent update, behavior could have changed, but I just confirmed that ./run.sh CoNLLStreamExtractor '#' ID WORD | ./run.sh CoNLLRDFFormatter still doesn't behave as intended.

However it seems like ./run.sh CoNLLStreamExtractor '#' ID WORD properly handles each sentence, and ./run.sh CoNLLRDFFormatter by itself also pushes the old sentence properly, when a new one is encountered.

This could be a limitation of bash piping. It seems like the pipe instruction has a buffer that only gets flushed when it is filled sufficiently?

Of course, this should not apply to a pipeline set up with CoNLLRDF Manager, but I think Bash piping may be out of our hands.

leogott commented 3 years ago

After discussing the issue, we figured out there are three parts to it:

1) Sentence/model splitting works as designed, but the RDF-Loader is designed to wait and see whether the most recent sentence continues. As long as the stream is open, one sentence always stays in the buffer until a new sentence begins with @prefix or a # comment.

2) Sentences are buffered for longer than designed when passed to the Bash pipeline ./run.sh CoNLLStreamExtractor '#' ID WORD | ./run.sh CoNLLRDFFormatter.

3) The same thing as 2), but for the Manager JSON pipeline.

The cleanest way to change 1) might be to modify the streaming between classes to be a Stream<Model> and not a PipedOutputStream. An easier but in many cases less clean method would be to inject special comments like # ---✁--- cut along the line that signify the end of a sentence whenever RDF is read from an input stream.
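The marker-based option can be sketched roughly as follows. This is only an illustration under assumptions, not CoNLL-RDF code: the class `MarkerSplitter` and its methods are hypothetical, and `#_END_` is assumed as the marker line (any comment line that cannot occur in the data would do). The point is that a block is emitted as soon as its end marker arrives, without waiting for the next sentence to begin.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

// Hypothetical sketch, not part of CoNLL-RDF: splits a stream of lines into
// sentence blocks at an explicit end-of-sentence marker line.
final class MarkerSplitter {
    // Assumed marker line; chosen for illustration only.
    static final String END_MARKER = "#_END_";

    private final StringBuilder block = new StringBuilder();
    private final Consumer<String> sink;

    MarkerSplitter(Consumer<String> sink) {
        this.sink = sink;
    }

    // Feed one input line. A marker line completes the current block at once.
    void accept(String line) {
        if (line.trim().equals(END_MARKER)) {
            emit();
        } else {
            block.append(line).append('\n');
        }
    }

    // End of stream: emit whatever is still buffered.
    void close() {
        emit();
    }

    private void emit() {
        if (block.length() > 0) {
            sink.accept(block.toString());
            block.setLength(0);
        }
    }

    public static void main(String[] args) {
        List<String> out = new ArrayList<>();
        MarkerSplitter splitter = new MarkerSplitter(out::add);
        splitter.accept("# Test1");
        splitter.accept(":s1_1 a nif:Word .");
        splitter.accept(END_MARKER); // first block is available immediately
        System.out.println("blocks after first marker: " + out.size());
    }
}
```

With a marker, the consumer never needs to look ahead to the next sentence, at the cost of every producer having to write the marker.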

2) needs to be investigated. I'll add some Logging and will check to confirm or deny my previous statement.

I did not verify 3) yet. Will do that now.

leogott commented 3 years ago

I investigated the behavior of a JSON pipeline, and the buffering issue 2) appears to be absent. Here is the JSON config I used (renamed so GitHub let me upload it). test-flush.json.txt

@chiarcos Please let me know if you run into problem 2) while using a json pipeline. As far as I can tell, your work-around with the injected comments should work there without fail.

leogott commented 3 years ago

The answers to https://unix.stackexchange.com/questions/25372/turn-off-buffering-in-pipe were illuminating. 2) can be worked around by switching the buffering mode for stdio calls to line-buffering. stdbuf -oL -eL ./run.sh CoNLLStreamExtractor '#' ID WORD | ./run.sh CoNLLRDFFormatter appears to work as designed. (Overall performance impact or improvement unknown).

It should be possible to configure the stdio buffer from inside Java, but I haven't yet figured out how to do it.
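One way this could look in Java is a `PrintStream` created with auto-flush enabled, which pushes data to the underlying stream on every `println`. The sketch below is an assumption about how one might approach it, not how CoNLL-RDF is written: `LineFlushDemo` is a hypothetical class, and a `ByteArrayOutputStream` stands in for the pipe so the effect can be observed.

```java
import java.io.BufferedOutputStream;
import java.io.ByteArrayOutputStream;
import java.io.PrintStream;

// Hypothetical demo, not CoNLL-RDF code: compares a buffered PrintStream
// with and without auto-flush.
final class LineFlushDemo {

    // Returns how many bytes have reached the sink right after one println,
    // before any explicit flush() or close().
    static int bytesVisibleAfterPrintln(boolean autoFlush) {
        ByteArrayOutputStream sink = new ByteArrayOutputStream(); // stands in for the pipe
        PrintStream out = new PrintStream(new BufferedOutputStream(sink, 8192), autoFlush);
        out.println("1\tTest1");
        return sink.size();
    }

    public static void main(String[] args) {
        System.out.println("without auto-flush: " + bytesVisibleAfterPrintln(false));
        System.out.println("with auto-flush:    " + bytesVisibleAfterPrintln(true));
    }
}
```

For the real standard output, the same idea would be `System.setOut(new PrintStream(new FileOutputStream(FileDescriptor.out), true))`, assuming the components actually write through `System.out`; components that wrap stdout in their own buffered writer would still need an explicit `flush()`.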

chiarcos commented 3 years ago

Using a fresh install, stdbuf did not fix it for me. I still have a delay of one sentence.

I investigated the behavior of a JSON pipeline, and the buffering issue 2) appears to be absent. Here is the JSON config I used (renamed so GitHub let me upload it). test-flush.json.txt

@chiarcos Please let me know if you run into problem 2) while using a json pipeline.

Not quite:

    $> ~/semantic-parsing/models/conll-rdf$ ./run.sh CoNLLRDFManager -c test-flush.json
    Exception in thread "main" java.lang.NoClassDefFoundError: com/fasterxml/jackson/core/util/JacksonFeature
            at com.fasterxml.jackson.databind.ObjectMapper.<init>(ObjectMapper.java:655)
            at com.fasterxml.jackson.databind.ObjectMapper.<init>(ObjectMapper.java:558)
            at org.acoli.conll.rdf.CoNLLRDFManager.readConfig(CoNLLRDFManager.java:55)
            at org.acoli.conll.rdf.CoNLLRDFManagerFactory.buildFromCLI(CoNLLRDFManagerFactory.java:22)
            at org.acoli.conll.rdf.CoNLLRDFManager.main(CoNLLRDFManager.java:43)
    Caused by: java.lang.ClassNotFoundException: com.fasterxml.jackson.core.util.JacksonFeature
            at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
            at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
            at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:349)
            at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
            ... 5 more

;)

I strongly suspect 1) to be the reason. In fact I remember that when writing the first version of the code, there was a design decision to aggregate

leogott commented 3 years ago

Uh, interesting. I didn't catch that one beforehand.

Hotfix incoming. Hopefully later this evening.

chiarcos commented 3 years ago

I might have found it. CoNLL2RDF, line 180:

change

        for(String p : argsProperties)
            out.write(p.trim()+"\n");
        out.flush();            
    }

to

        for(String p : argsProperties)
            out.write(p.trim()+"\n");
    }
    out.flush();            

If that is the source of the problem, the error arises because CoNLLStreamExtractor appends a newline before calling CoNLL2RDF.conll2ttl(), but flush() is called only if the last line is not a newline.
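The effect of moving flush() out of the block can be reproduced in isolation. The following is a sketch under assumptions, not the code in CoNLL2RDF: `FlushPlacementDemo` is hypothetical, and the `endsWith("\n")` test merely stands in for whatever condition guards the real block. It shows that a flush placed inside a branch is skipped whenever the branch is not taken, so the text stays in the writer's buffer.

```java
import java.io.BufferedWriter;
import java.io.PrintWriter;
import java.io.StringWriter;

// Hypothetical demo, not CoNLL2RDF code: flush() inside a conditional block
// versus flush() after it.
final class FlushPlacementDemo {

    // Writes one block and returns how many characters reached the sink.
    static int visibleChars(String block, boolean flushInsideBranch) {
        StringWriter sink = new StringWriter(); // stands in for stdout
        PrintWriter out = new PrintWriter(new BufferedWriter(sink));
        out.write(block);
        if (!block.endsWith("\n")) { // stand-in for the real condition
            out.write("\n");
            if (flushInsideBranch) {
                out.flush(); // old placement: only reached inside the branch
            }
        }
        if (!flushInsideBranch) {
            out.flush(); // new placement: always reached
        }
        return sink.getBuffer().length();
    }

    public static void main(String[] args) {
        // Block already ends with a newline, so the branch is skipped.
        System.out.println(visibleChars("1\tTest1\n", true));  // nothing flushed
        System.out.println(visibleChars("1\tTest1\n", false)); // block flushed
    }
}
```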

chiarcos commented 3 years ago

I committed the change, @leogott : please double-check that it works and close issue ;)

leogott commented 3 years ago

Given a Manager pipeline or a line-buffered Bash pipeline of StreamExtractor and CoNLLRDF Formatter, paste the following data in five steps:

  1. copy and paste a table (any table with at least two columns)
  2. (enter empty line, in theory, this should lead to flushing into stdout)

StreamExtractor outputs the sentence, and the Formatter receives it, waiting for a new comment or prefix to tell it the sentence is complete. (Behavior unchanged)

  3. copy and paste another table
  4. (enter empty line) The second sentence is passed from the StreamExtractor to the Formatter, which causes the latter to output the first sentence and wait to see whether there is more to the second sentence.
  5. close stdin, e.g., with <CTRL>+D. At this point the StreamExtractor terminates with success. The pipe to the Formatter is closed, causing it to output the second sentence and terminate with success. (Behavior unchanged)

At the moment, output is flushed only after step 5. Desired behavior is to flush twice (after 2 and 4).

The Components expecting a CoNLL-RDF Stream currently continue to hold on to the last sentence they received, delaying the output by one each.
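The one-sentence delay follows directly from look-ahead splitting, which a small sketch can make concrete. This is an illustration only, not the actual RDF-Loader: `LookaheadSplitter` is a hypothetical class that treats a line starting with `#` or `@prefix` as the start of a new sentence once the current block holds content.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

// Hypothetical sketch, not the CoNLL-RDF loader: a block is only known to be
// complete when the header of the NEXT block arrives.
final class LookaheadSplitter {
    private final StringBuilder block = new StringBuilder();
    private boolean hasContent = false; // true once the block holds a non-header line
    private final Consumer<String> sink;

    LookaheadSplitter(Consumer<String> sink) {
        this.sink = sink;
    }

    void accept(String line) {
        boolean header = line.startsWith("#") || line.startsWith("@prefix");
        // A header after content is the only end-of-sentence signal, so the
        // previous block is emitted one sentence late.
        if (header && hasContent) {
            emit();
        }
        if (!header && !line.trim().isEmpty()) {
            hasContent = true;
        }
        block.append(line).append('\n');
    }

    // End of stream: the held-back block is finally released.
    void close() {
        emit();
    }

    private void emit() {
        if (block.length() > 0) {
            sink.accept(block.toString());
        }
        block.setLength(0);
        hasContent = false;
    }

    public static void main(String[] args) {
        List<String> out = new ArrayList<>();
        LookaheadSplitter splitter = new LookaheadSplitter(out::add);
        splitter.accept("# Test1");
        splitter.accept(":s1_1 a nif:Word .");
        splitter.accept("");
        System.out.println("after sentence 1: " + out.size());
        splitter.accept("# Test2");
        System.out.println("after header of sentence 2: " + out.size());
    }
}
```

Chaining several components of this kind adds one held-back sentence per component, which is consistent with the delay growing along the pipeline.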

leogott commented 3 years ago

The core of your issue-report was the trickle-down delay, if I'm not mistaken? In that case I'd do the following before we close this issue:

chiarcos commented 3 years ago

The Components expecting a CoNLL-RDF Stream currently continue to hold on to the last sentence they received, delaying the output by one each.

Yes, but now this can be managed by sending a pseudo-sentence. That works for CoNLLStreamExtractor with -u, at least, so we get the core functionality to apply SPARQL to CoNLL input data. So, we're getting closer ...