codeaudit / dkpro-core-asl

Automatically exported from code.google.com/p/dkpro-core-asl

Exception when using TreeTagger in multi threaded environment #68

Closed GoogleCodeExporter closed 9 years ago

GoogleCodeExporter commented 9 years ago
When using the TreeTaggerPosLemmaTT4J annotator, we get some exceptions.

These exceptions are raised after the pipeline has been running for a while 
(~1000 documents) with multiple threads. Once the first exception occurs, 
almost all subsequent extractions raise one of these exceptions too.

dkpro version: 1.3.0
TreeTagger version: latest
OS: Windows Server 2010 64bit, or Windows 7 64bit

Any clue would be helpful.
Thanks,
Johannes

Original issue reported on code.google.com by Hense.Johannes on 6 Jun 2012 at 8:45

Attachments:

GoogleCodeExporter commented 9 years ago
TreeTaggerPosLemmaTT4J should be thread-safe, although we haven't done 
intensive testing recently. Stack traces would be helpful. Did you try running 
the same pipeline single-threaded? Alternatively, there may be problematic 
characters in one of the documents. Adding TT4J 1.1.0 as a dependency to 
override the older version used in DKPro Core 1.3.0 might help as well.

Original comment by richard.eckart on 6 Jun 2012 at 8:51

GoogleCodeExporter commented 9 years ago
Sorry, I didn't see the attachment. The exception message

   Token stream out of sync.

hints that some bad character caused the communication between TT4J and 
TreeTagger to run out of sync. TT4J 1.1.0 should be better able to handle this. 
Please try upgrading to TT4J 1.1.0.

Original comment by richard.eckart on 6 Jun 2012 at 9:02

GoogleCodeExporter commented 9 years ago
Thank you for the quick response.
I'll try using TT4J 1.1.0.

What are these bad characters? Is this documented somewhere?
Our input is partially XML. Could this be the cause?

Original comment by Hense.Johannes on 6 Jun 2012 at 12:27

GoogleCodeExporter commented 9 years ago
It depends a bit on which encoding your TT model uses. There is no 
documentation except for the TT4J source code:

http://code.google.com/p/tt4j/source/browse/tt4j/trunk/org.annolab.tt4j/src/main/java/org/annolab/tt4j/TreeTaggerWrapper.java#678

Briefly:
- Cannot deal with Unicode characters wider than 16 bits if not in Unicode mode
- Cannot deal with control characters (c >= 0x0000 && c <= 0x001B)

XML should not be a problem, but you will (probably) not get any annotations on 
XML tokens. We need to set up a test case for text with XML markup.
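
For anyone who wants to check their input before it reaches the tagger, the two 
restrictions listed above can be sketched as a small pre-check. This is only an 
illustration under assumptions: the class TaggerInputCheck and its methods are 
hypothetical names, not part of TT4J or DKPro Core, and the code is not taken 
from TreeTaggerWrapper. It only encodes the two ranges quoted above.

```java
/**
 * Hypothetical helper (not part of TT4J or DKPro Core) that flags tokens
 * containing the character classes described above.
 */
final class TaggerInputCheck {
    private TaggerInputCheck() {}

    /** True for the control-character range quoted above (0x0000 to 0x001B). */
    static boolean isProblematicControl(int codePoint) {
        return codePoint >= 0x0000 && codePoint <= 0x001B;
    }

    /** True for code points that need more than 16 bits (outside the BMP). */
    static boolean isBeyond16Bit(int codePoint) {
        return Character.isSupplementaryCodePoint(codePoint);
    }

    /**
     * Returns true if the token contains none of the problematic characters.
     * The unicodeMode flag is an assumption of this sketch: when it is true,
     * characters beyond 16 bits are accepted.
     */
    static boolean isSafeToken(String token, boolean unicodeMode) {
        int i = 0;
        while (i < token.length()) {
            int cp = token.codePointAt(i);
            if (isProblematicControl(cp)) {
                return false;
            }
            if (!unicodeMode && isBeyond16Bit(cp)) {
                return false;
            }
            // Advance by one or two chars depending on the code point width.
            i += Character.charCount(cp);
        }
        return true;
    }
}
```

A caller could log or skip any token for which isSafeToken(token, false) 
returns false. What to do with such tokens (drop, replace, or escape them) is 
left open here.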

Original comment by richard.eckart on 6 Jun 2012 at 12:33

GoogleCodeExporter commented 9 years ago
I tried it again with TT4J 1.1.0 but got similar exceptions (see attachment).

I use a simple test document, sent by four threads with a 500 ms delay.
The exceptions in the attachment are listed in the order in which they occurred 
during the test.

If a single thread executes the same test, everything is fine.

If I send a single document first, the next documents go well (even 
multi-threaded). But after a few documents I get the same exceptions as before.

Original comment by Hense.Johannes on 6 Jun 2012 at 3:09

Attachments:

GoogleCodeExporter commented 9 years ago
May I ask how you do your multi-threading? From the stacktraces, I see that you 
are using uimaFIT's SimplePipeline - that's single-threaded - and you seem to 
be creating your JCas instances yourself. We did our multi-threading 
experiments so far using UIMA's CPE (requires a reader). If you want to try 
that instead of doing all the multi-threading yourself, you should have a look 
at

http://code.google.com/p/dkpro-lab/source/browse/de.tudarmstadt.ukp.dkpro.lab/de.tudarmstadt.ukp.dkpro.lab.uima.engine.cpe/src/main/java/de/tudarmstadt/ukp/dkpro/lab/uima/engine/cpe/CpeBuilder.java

The "Token stream out of sync." seems to be gone now, so that's good. The 
exceptions you get now could happen because you are accessing the same CAS 
object from multiple threads or because different threads concurrently trigger 
internal initialization routines in UIMA that are not thread-safe. Getting the 
threading right yourself (that is without using CPE or UIMA-AS) might prove a 
little tricky. uimaFIT's SimplePipeline is not proven to work in multi-threaded 
environments. It is possible that the CpeBuilder ends up in uimaFIT at some 
point.
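
If you do keep the threading in your own code, one general way to avoid sharing 
a mutable object between threads is to give each worker thread its own 
instance. The sketch below shows that idea with plain Java only. Workspace is a 
hypothetical stand-in for a non-thread-safe object such as a CAS, and 
PerThreadWorkspaces is likewise a made-up name. No UIMA or uimaFIT API is used 
here, and this does not address any non-thread-safe initialization inside UIMA 
itself.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

/** Hypothetical stand-in for a mutable, non-thread-safe object such as a CAS. */
final class Workspace {
    private final StringBuilder buffer = new StringBuilder();

    void reset() { buffer.setLength(0); }
    void setText(String text) { buffer.append(text); }
    String getText() { return buffer.toString(); }
}

/** Hypothetical example class: every thread works on its own Workspace. */
final class PerThreadWorkspaces {
    // Each thread lazily creates its own Workspace and reuses it afterwards.
    private static final ThreadLocal<Workspace> LOCAL =
            ThreadLocal.withInitial(Workspace::new);

    private PerThreadWorkspaces() {}

    /** Placeholder "processing": upper-cases the text via the thread's own Workspace. */
    static String process(String document) {
        Workspace ws = LOCAL.get();
        ws.reset();
        ws.setText(document);
        return ws.getText().toUpperCase();
    }

    /** Processes docs documents on a pool of the given size; returns the number of wrong results. */
    static int countMismatches(int threads, int docs) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        List<Future<String>> results = new ArrayList<>();
        for (int i = 0; i < docs; i++) {
            final String doc = "doc" + i;
            Callable<String> task = () -> process(doc);
            results.add(pool.submit(task));
        }
        int mismatches = 0;
        try {
            for (int i = 0; i < docs; i++) {
                if (!results.get(i).get().equals("DOC" + i)) {
                    mismatches++;
                }
            }
        } catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException(e);
        } finally {
            pool.shutdown();
        }
        return mismatches;
    }
}
```

Calling PerThreadWorkspaces.countMismatches(4, 200) runs 200 placeholder 
documents on four threads. Because no Workspace is ever visible to more than one 
thread, the buffer needs no locking.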

Original comment by richard.eckart on 6 Jun 2012 at 3:27

GoogleCodeExporter commented 9 years ago
I added a test case for text with markup. The markup is not tagged: by default, 
TT4J runs TreeTagger with the -sgml option, so XML tags are ignored.

Original comment by richard.eckart on 7 Jun 2012 at 12:08

GoogleCodeExporter commented 9 years ago
I'm getting the same errors ("Token stream out of sync"; see attached) with a 
single-threaded application.  I can provide further details on request.

Original comment by tristan.miller@nothingisreal.com on 11 Jun 2012 at 3:01

Attachments:

GoogleCodeExporter commented 9 years ago
@frettchen.ii: which TT4J version do you use? If it is 1.1.0, can you please 
set 

   TreeTaggerWrapper.TRACE = true;

That should provide some more information. 

Original comment by richard.eckart on 11 Jun 2012 at 3:49

GoogleCodeExporter commented 9 years ago
I'm using TT4J 1.1.0.  The problem happens only intermittently but I'll modify 
the code as suggested and post the output here if/when it recurs.

Original comment by tristan.miller@nothingisreal.com on 12 Jun 2012 at 8:40

GoogleCodeExporter commented 9 years ago
I've opened a separate issue 71 for the "token stream out of sync" issue, since 
this should be a separate bug from whatever multi-threading problems there may 
be: 

http://code.google.com/p/dkpro-core-asl/issues/detail?id=71

Original comment by richard.eckart on 12 Jun 2012 at 9:20

GoogleCodeExporter commented 9 years ago
The multi-threading is done by our application layer. We have a web service 
that may be called by many clients at the same time.

I've now built a minimal wrapper class around TreeTaggerPosLemmaTT4J to 
synchronize the execution of the tree tagger (see attachment).

Our multi-threaded test case works great now, and since the tree tagger is one 
of the smaller tasks in our pipeline, it doesn't hurt much to synchronize it.
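
For readers without access to the attachment, the general shape of such a 
synchronizing wrapper might look like the sketch below. This is not the 
attached class. Tagger is a hypothetical interface standing in for the wrapped 
annotator, and SynchronizedTagger is a made-up name. The real wrapper would 
delegate to the UIMA annotator instead.

```java
import java.util.List;

/** Hypothetical interface standing in for the wrapped annotator; not a DKPro or UIMA type. */
interface Tagger {
    List<String> tag(List<String> tokens);
}

/** Serializes all calls to a delegate that may not be safe to call concurrently. */
final class SynchronizedTagger implements Tagger {
    private final Tagger delegate;
    // Private lock object so that outside code cannot interfere with the locking.
    private final Object lock = new Object();

    SynchronizedTagger(Tagger delegate) {
        this.delegate = delegate;
    }

    @Override
    public List<String> tag(List<String> tokens) {
        // Only one thread at a time may enter the delegate.
        synchronized (lock) {
            return delegate.tag(tokens);
        }
    }
}
```

All callers have to share the same SynchronizedTagger instance for the lock to 
have any effect. The cost is that tagging requests are processed one at a time, 
which matches the trade-off described above.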

Thanks for your help.

Original comment by Hense.Johannes on 13 Jun 2012 at 9:43

Attachments:

GoogleCodeExporter commented 9 years ago
Thanks for the feedback. We'll test this again with the CPE at some point and 
leave the issue open at least until we figure out if that works properly or not.

Original comment by richard.eckart on 13 Jun 2012 at 9:47

GoogleCodeExporter commented 9 years ago

Original comment by richard.eckart on 13 Oct 2012 at 6:31

GoogleCodeExporter commented 9 years ago

Original comment by richard.eckart on 16 Feb 2013 at 11:00

GoogleCodeExporter commented 9 years ago
Never heard of this again... There were various fixes related to 
multi-threading in the recent UIMA 2.6.x releases, so possibly this has been 
resolved.

Original comment by richard.eckart on 6 Aug 2014 at 8:37