Closed GoogleCodeExporter closed 9 years ago
TreeTaggerPosLemmaTT4J should be thread-safe. We didn't do intensive testing
recently though. Stacktraces would be helpful. Did you try running the same
pipeline single-threaded? Alternatively, there may be a problematic character
in one of the documents. Adding TT4J 1.1.0 as a dependency to override the
older version used in DKPro Core 1.3.0 might help as well.
Original comment by richard.eckart
on 6 Jun 2012 at 8:51
Sorry, I didn't see the attachment. The exception message
Token stream out of sync.
hints that some bad character caused the communication between TT4J and
TreeTagger to run out of sync. TT4J 1.1.0 should be better able to handle this.
Please try upgrading to TT4J 1.1.0.
Original comment by richard.eckart
on 6 Jun 2012 at 9:02
Thank you for the quick response.
I'll try using TT4J 1.1.0.
What are these bad characters? Is this documented somewhere?
Our input is partially XML. Could this be the cause?
Original comment by Hense.Johannes
on 6 Jun 2012 at 12:27
It depends a bit on which encoding your TT model uses. There is no
documentation except for the TT4J source code:
http://code.google.com/p/tt4j/source/browse/tt4j/trunk/org.annolab.tt4j/src/main/java/org/annolab/tt4j/TreeTaggerWrapper.java#678
Briefly:
- Cannot deal with Unicode > 16 bit if not in Unicode mode
- Cannot deal with control characters (c >= 0x0000 && c <= 0x001B)
XML should not be a problem, but you will (probably) not get any annotations on
XML tokens. We need to set up a test case for text with XML markup.
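
For illustration, the two restrictions listed above could be checked before text is handed to the tagger. This is only a minimal sketch and not TT4J's actual code: the class `TokenSanitizer` and its methods are hypothetical names, and the control-character range is simply copied from the list above.

```java
// Hypothetical helper, not part of TT4J. It mirrors the two restrictions
// listed above: control characters in 0x0000..0x001B, and code points
// beyond 16 bit when the tagger is not running in Unicode mode.
final class TokenSanitizer {
    private TokenSanitizer() {}

    // True if the single code point falls under one of the two restrictions.
    static boolean isProblematic(int codePoint, boolean unicodeMode) {
        if (codePoint <= 0x001B) {
            return true; // control-character range from the list above
        }
        return !unicodeMode && codePoint > 0xFFFF; // needs more than 16 bit
    }

    // True if any code point in the token is problematic.
    static boolean isProblematic(String token, boolean unicodeMode) {
        for (int i = 0; i < token.length(); ) {
            int cp = token.codePointAt(i);
            if (isProblematic(cp, unicodeMode)) {
                return true;
            }
            i += Character.charCount(cp);
        }
        return false;
    }

    // Returns a copy of the token with all problematic code points dropped.
    static String clean(String token, boolean unicodeMode) {
        StringBuilder out = new StringBuilder(token.length());
        for (int i = 0; i < token.length(); ) {
            int cp = token.codePointAt(i);
            if (!isProblematic(cp, unicodeMode)) {
                out.appendCodePoint(cp);
            }
            i += Character.charCount(cp);
        }
        return out.toString();
    }
}
```

Note that the range as quoted also covers tab and newline, so a check like this would belong on individual tokens rather than on whole documents.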
Original comment by richard.eckart
on 6 Jun 2012 at 12:33
I tried it again with TT4J 1.1.0 but got similar exceptions (see attachment).
I use a simple test document, sent by four threads with a 500ms delay.
The exceptions in the attachment are in the original sequence as they appear
during the test.
If a single thread executes the same test, everything is fine.
If I send a single document first, the next documents go well (even
multi-threaded). But after a few documents I get the same exceptions as before.
Original comment by Hense.Johannes
on 6 Jun 2012 at 3:09
Attachments:
May I ask how you do your multi-threading? From the stacktraces, I see that you
are using uimaFIT's SimplePipeline - that's single-threaded - and you seem to
be creating your JCas instances yourself. We did our multi-threading
experiments so far using UIMA's CPE (requires a reader). If you want to try
that instead of doing all the multi-threading yourself, you should have a look
at
http://code.google.com/p/dkpro-lab/source/browse/de.tudarmstadt.ukp.dkpro.lab/de.tudarmstadt.ukp.dkpro.lab.uima.engine.cpe/src/main/java/de/tudarmstadt/ukp/dkpro/lab/uima/engine/cpe/CpeBuilder.java
The "Token stream out of sync." seems to be gone now, so that's good. The
exceptions you get now could happen because you are accessing the same CAS
object from multiple threads or because different threads concurrently trigger
internal initialization routines in UIMA that are not thread-safe. Getting the
threading right yourself (that is, without using CPE or UIMA-AS) might prove a
little tricky. uimaFIT's SimplePipeline is not proven to work in multi-threaded
environments. It is possible that the CpeBuilder ends up in uimaFIT at some
point.
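
As a rough illustration of the first possible cause (one working object touched by several threads), one common way to avoid sharing is to give each thread its own instance. The sketch below is an assumption-laden example and does not use UIMA: `WorkBuffer` is a hypothetical stand-in for a per-document object such as a CAS, and `PerThreadBuffers` is a hypothetical holder built on `ThreadLocal`.

```java
// Hypothetical stand-in for a per-document working object such as a CAS.
// It is not thread-safe by itself, which is the point of the example.
final class WorkBuffer {
    private final StringBuilder text = new StringBuilder();

    // Replaces the buffer content with the given document text.
    void setText(String documentText) {
        text.setLength(0);
        text.append(documentText);
    }

    String getText() {
        return text.toString();
    }
}

// Hypothetical holder: every thread lazily gets its own WorkBuffer,
// so no two threads ever touch the same instance.
final class PerThreadBuffers {
    private static final ThreadLocal<WorkBuffer> BUFFERS =
            ThreadLocal.withInitial(WorkBuffer::new);

    private PerThreadBuffers() {}

    // Returns the buffer that belongs to the calling thread.
    static WorkBuffer current() {
        return BUFFERS.get();
    }
}
```

A worker thread would call `PerThreadBuffers.current()`, fill the buffer, and process it without ever handing the instance to another thread. This does not address the second possible cause, concurrent initialization inside the framework.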
Original comment by richard.eckart
on 6 Jun 2012 at 3:27
I added a test case for text with markup. The markup is not tagged. By default,
TT4J runs TreeTagger with the -sgml option, so XML tags are not tagged.
Original comment by richard.eckart
on 7 Jun 2012 at 12:08
I'm getting the same errors ("Token stream out of sync"; see attached) with a
single-threaded application. I can provide further details on request.
Original comment by tristan.miller@nothingisreal.com
on 11 Jun 2012 at 3:01
Attachments:
@frettchen.ii: which TT4J version do you use? If it is 1.1.0, can you please
set
TreeTaggerWrapper.TRACE = true;
That should provide some more information.
Original comment by richard.eckart
on 11 Jun 2012 at 3:49
I'm using TT4J 1.1.0. The problem happens only intermittently but I'll modify
the code as suggested and post the output here if/when it recurs.
Original comment by tristan.miller@nothingisreal.com
on 12 Jun 2012 at 8:40
I've opened a separate issue 71 for the "token stream out of sync" issue, since
this should be a separate bug from whatever multi-threading problems there may
be:
http://code.google.com/p/dkpro-core-asl/issues/detail?id=71
Original comment by richard.eckart
on 12 Jun 2012 at 9:20
The multi-threading is done by our application layer. We have a web service,
which may be called by many clients at the same time.
I've now built a minimalistic wrapper class around the TreeTaggerPosLemmaTT4J
to synchronize the execution of the tree tagger (see attachment).
Our multi-threaded test case works great now, and since the tree tagger is one
of the smaller tasks in our pipeline, it doesn't hurt much to synchronize it.
Thanks for your help.
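
The attached wrapper is not reproduced in this thread. As a rough idea of the approach only, a wrapper of this kind can serialize all calls to the wrapped component through one lock. The sketch below is not the attached class and does not use UIMA or DKPro types: `DocumentProcessor` is a hypothetical stand-in for the annotator, and `SynchronizedProcessor` is a hypothetical name.

```java
// Hypothetical stand-in for a component that must not run concurrently
// (in this thread, the tree tagger annotator).
interface DocumentProcessor {
    String process(String document);
}

// Wraps a processor and lets only one thread at a time call into it.
final class SynchronizedProcessor implements DocumentProcessor {
    private final DocumentProcessor delegate;
    private final Object lock = new Object();

    SynchronizedProcessor(DocumentProcessor delegate) {
        this.delegate = delegate;
    }

    @Override
    public String process(String document) {
        // All callers queue up here; the delegate never sees two threads at once.
        synchronized (lock) {
            return delegate.process(document);
        }
    }
}
```

The trade-off is the one described above: throughput of the wrapped step drops to one document at a time, which is acceptable only if that step is a small part of the pipeline.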
Original comment by Hense.Johannes
on 13 Jun 2012 at 9:43
Attachments:
Thanks for the feedback. We'll test this again with the CPE at some point and
leave the issue open at least until we figure out if that works properly or not.
Original comment by richard.eckart
on 13 Jun 2012 at 9:47
Original comment by richard.eckart
on 13 Oct 2012 at 6:31
Original comment by richard.eckart
on 16 Feb 2013 at 11:00
Never heard of this again... There were various fixes related to
multi-threading in the recent UIMA 2.6.x releases, so possibly this has been
resolved.
Original comment by richard.eckart
on 6 Aug 2014 at 8:37
Original issue reported on code.google.com by
Hense.Johannes
on 6 Jun 2012 at 8:45
Attachments: