Could some last "special" characters lead to lost tree-tagger processes?

GoogleCodeExporter commented 9 years ago

Hello,

When I run tree-tagger with the Chinese parameter file provided by S. Sharoff
on a file tokenized by the Chinese Segmenter provided by E. Peterson
(mandarintools.com):

  tree-tagger -quiet -no-unknown -sgml -token -lemma /usr/local/share/tree-tagger/lib/zh.par zh-test.txt

it produces the output in the attached file zh-test.res.

However, I seem to have the following problem with tt4j:
the tree-tagger process doesn't end even though every token has been processed
successfully!

Could it be due to a jump to skipToken in the method removeProblematicTokens of 
the class TreeTaggerWrapper?

Thanks in advance,
Jérôme Rocheteau

PS I use tt4j within the following UIMA wrapper:

http://code.google.com/p/ttc-project/source/browse/trunk/modules/uima-tree-tagger-wrapper/sources/fr/univnantes/lina/uima/engines/TreeTaggerWrapper.java

It could have some bugs.

Original issue reported on code.google.com by jerome.rocheteau on 21 Oct 2011 at 3:06

Attachments:

GoogleCodeExporter commented 9 years ago

The file attached to this comment contains the logs of the previous run.
It is missing the last line: "INFO: Stop Treetagger"

That's the bug I would like to fix!

Thanks in advance,
Jérôme R

Original comment by jerome.rocheteau on 21 Oct 2011 at 3:11

Attachments:

zh-test.dbg

GoogleCodeExporter commented 9 years ago

Hi Jérôme,

I am not sure if I understand your problem. I gather that you get the expected 
output, but you notice that in the end the tree-tagger process is still running.
If this is your problem, then it's a feature in TT4J and a bug in your wrapper. 
Override the "destroy()" method in your UIMA wrapper and invoke 
TreeTaggerWrapper.destroy() there to stop the background process.
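
As a rough illustration of that advice, here is a minimal sketch of the delegation pattern. It deliberately does not use the real TT4J or UIMA classes: `TaggerStandIn` and `AnnotatorStandIn` are hypothetical stand-ins, and a non-daemon thread plays the role of the background tree-tagger process. The only point is that the component owning the wrapper has to forward its own `destroy()` call, otherwise the background worker keeps running.

```java
import java.util.concurrent.atomic.AtomicBoolean;

class DestroySketch {

    /** Hypothetical stand-in for a wrapper that keeps a background worker alive. */
    static class TaggerStandIn {
        private final AtomicBoolean running = new AtomicBoolean(true);
        private final Thread worker;

        TaggerStandIn() {
            // A non-daemon thread plays the role of the external process:
            // it keeps running (and keeps the JVM alive) until it is stopped.
            worker = new Thread(() -> {
                while (running.get()) {
                    try {
                        Thread.sleep(10);
                    } catch (InterruptedException e) {
                        return; // interrupted by destroy()
                    }
                }
            });
            worker.start();
        }

        boolean isAlive() {
            return worker.isAlive();
        }

        /** Stops the background worker and waits until it has terminated. */
        void destroy() {
            running.set(false);
            worker.interrupt();
            try {
                worker.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }
    }

    /** Hypothetical stand-in for the component that owns the wrapper. */
    static class AnnotatorStandIn {
        private final TaggerStandIn tagger = new TaggerStandIn();

        void process(String text) {
            // A real component would hand tokens to the tagger here.
        }

        /** The lifecycle hook: forward the call so the worker is stopped. */
        void destroy() {
            tagger.destroy();
        }

        boolean taggerAlive() {
            return tagger.isAlive();
        }
    }

    public static void main(String[] args) {
        AnnotatorStandIn annotator = new AnnotatorStandIn();
        annotator.process("some text");
        System.out.println("alive before destroy: " + annotator.taggerAlive());
        annotator.destroy();
        System.out.println("alive after destroy: " + annotator.taggerAlive());
    }
}
```

If `AnnotatorStandIn.destroy()` did not forward the call, the worker thread in this sketch would never stop, which mirrors a tagger process that is left running after processing has finished.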

Also, a comprehensive implementation of a UIMA integration for TreeTagger with
TT4J can be found here:

http://code.google.com/p/dkpro-core-asl/source/browse/de.tudarmstadt.ukp.dkpro.core-asl/trunk/de.tudarmstadt.ukp.dkpro.core.treetagger

Maybe you want to use that instead of writing the whole thing again from scratch.

-- Richard

Original comment by richard.eckart on 21 Oct 2011 at 3:24

GoogleCodeExporter commented 9 years ago

Hi Richard,

It's not a bug in the UIMA wrapper. Attached you'll find a CLI for tt4j;
the problem remains the same. I turned on trace mode (see the attached
zh-test.dbg).

The fact is that the tt4j reader doesn't receive the ENDOFTEXT tag
"<This-is-the-end-of-the-text />" although it has been sent by the tt4j writer!

Thanks in advance
Jérôme

PS: I wouldn't have written another UIMA wrapper for tree-tagger if I had known
about yours before :) It looks great.

Original comment by jerome.rocheteau on 24 Oct 2011 at 3:41

Attachments:

GoogleCodeExporter commented 9 years ago

Thank you for your investigation of the issue. I'll have a look at it as soon as
possible. Meanwhile, if you are inclined to continue investigating the issue, I
suggest you try adding more ".\n" to the flush sequence in
http://code.google.com/p/tt4j/source/browse/tt4j/trunk/org.annolab.tt4j/src/main/java/org/annolab/tt4j/DefaultModel.java
Since the data in your zh-test.dbg shows that input and output remain in sync
until the end, increasing the length of the flush sequence is a good candidate
for fixing the problem. Or maybe a different flush sequence is required for
Chinese.

Original comment by richard.eckart on 24 Oct 2011 at 5:23

Changed state: Accepted

GoogleCodeExporter commented 9 years ago

Ok. Setting up a test was faster than I thought ;) The problem is the flush
sequence. It seems that tree-tagger ignores the "." which I usually use to 
flush the output. When I change the flush sequence to 
".\n.\n.\n.\n.\n.\n.\n(\n)\n"  it works fine. For the other languages that I 
have tests for so far, that also works out, so I think I'll just change the 
default flush sequence.
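
To make the effect easier to picture, here is a toy model of it. It is only a sketch under stated assumptions, not how tree-tagger or tt4j actually work internally: the model assumes the tagger holds back its most recent `lookahead` tokens until later tokens arrive, and that it silently drops one `ignored` token. `FlushSketch`, `run` and `markerSeen` are hypothetical names; only the end-of-text marker and the two flush sequences come from this thread.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Deque;
import java.util.List;

class FlushSketch {
    static final String END_OF_TEXT = "<This-is-the-end-of-the-text />";

    /**
     * Toy tagger (assumption): a token is only emitted once `lookahead`
     * later tokens have arrived. Tokens equal to `ignored` are dropped
     * and therefore never push anything out of the buffer.
     */
    static List<String> run(List<String> input, int lookahead, String ignored) {
        Deque<String> pending = new ArrayDeque<>();
        List<String> output = new ArrayList<>();
        for (String token : input) {
            if (token.equals(ignored)) {
                continue;
            }
            pending.addLast(token);
            if (pending.size() > lookahead) {
                output.add(pending.removeFirst());
            }
        }
        return output;
    }

    /** Sends tokens, the end marker and the flush sequence; reports whether the marker came back. */
    static boolean markerSeen(List<String> tokens, String flush, int lookahead, String ignored) {
        List<String> input = new ArrayList<>(tokens);
        input.add(END_OF_TEXT);
        input.addAll(Arrays.asList(flush.split("\n")));
        return run(input, lookahead, ignored).contains(END_OF_TEXT);
    }

    public static void main(String[] args) {
        List<String> tokens = Arrays.asList("a", "b");
        String dotsOnly = ".\n.\n.\n";
        String dotsAndParens = ".\n.\n.\n.\n.\n.\n.\n(\n)\n";
        // "." is ignored by the toy tagger, so a dots-only flush never releases the marker.
        System.out.println("dots only:       " + markerSeen(tokens, dotsOnly, 2, "."));
        // The trailing "(" and ")" are not ignored and push the marker out.
        System.out.println("dots and parens: " + markerSeen(tokens, dotsAndParens, 2, "."));
    }
}
```

In this model a reader waiting for the end marker would block forever with the dots-only flush, which matches the symptom reported above; whether tree-tagger buffers in exactly this way is an assumption.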

Original comment by richard.eckart on 24 Oct 2011 at 5:37

GoogleCodeExporter commented 9 years ago

The changed flush sequence is in release 1.0.16 which should arrive in an hour 
or so on Maven Central. It worked for me in a test case that I set up with the 
DKPro Core TreeTagger wrapper. It should work for you as well.

Original comment by richard.eckart on 24 Oct 2011 at 6:43

GoogleCodeExporter commented 9 years ago

Thank you very much, Richard.
It works fine for me too :)

Original comment by jerome.rocheteau on 25 Oct 2011 at 7:54

GoogleCodeExporter commented 9 years ago

Original comment by richard.eckart on 25 Oct 2011 at 8:01

Changed state: Fixed

alishia / tt4j

Could some last "special" characters lead to lost tree-tagger processes? #6