Hi HeidelTime team,
I'm a Master's student at the University of Mannheim and am currently building a
Temporal Information Extraction system that uses HeidelTime as its temporal tagger.
I encountered a bug in the StanfordPOSTaggerWrapper UIMA component.
What steps will reproduce the problem?
1. Check the attached file "Breaking_Sample.txt"; it's a plain text version of
Apple's Wikipedia article.
2. Apply de.unihd.dbs.uima.annotator.stanfordtagger.StanfordPOSTaggerWrapper on
it
3. Check the JCas sentence annotations, i.e. the sentence text you get when
taking substrings of the document text at each annotation's "begin" and "end" indexes.
What is the expected output? What do you see instead?
Expected: Sentences as shown in "Output_MyStanfordPOSTaggerWrapper.txt"
Actual: Sentences as shown in "Output_StanfordPOSTaggerWrapper.txt"
The issue starts with sentence 117.
What version of the product are you using? On what operating system?
1.7
OS X
Please provide any additional information below.
The results of my analysis are as follows:
The weakness of the current implementation is that it calculates its own offset
value and relies on searching the document text with
".indexOf(thisWord, offset)" to locate each token.
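To illustrate how this kind of search can go wrong, here is a minimal sketch. It is not the actual wrapper code: the class name, the method, and the sample text are hypothetical, and it assumes a tokenizer that emits a token whose form differs from the document text (here "..." for the single character "…"). Once one search lands on a later occurrence, every following offset is lost.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class OffsetDriftDemo {

    /**
     * Recovers token begin offsets by searching the text, moving a
     * self-maintained offset forward after every hit (hypothetical sketch).
     */
    static List<Integer> searchedOffsets(String text, List<String> tokens) {
        List<Integer> begins = new ArrayList<>();
        int offset = 0;
        for (String token : tokens) {
            int begin = text.indexOf(token, offset);
            begins.add(begin);
            if (begin >= 0) {
                // A wrong hit further down the text pushes the offset too far.
                offset = begin + token.length();
            }
        }
        return begins;
    }

    public static void main(String[] args) {
        // The text contains the one-character ellipsis, the token list "...".
        String text = "Wait\u2026 ok. Done...";
        List<String> tokens = Arrays.asList("Wait", "...", "ok", ".", "Done", "...");

        // The true begin offsets are 0, 4, 6, 8, 10, 14.
        System.out.println(searchedOffsets(text, tokens));
        // prints [0, 14, -1, -1, -1, -1]
    }
}
```

The second token is matched at the end of the text instead of at position 4, so all later searches start past their real positions and fail.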
To fit my needs, I copied and reimplemented your component; the code can be found
in "MyStanfordPOSTaggerWrapper.java".
From my perspective, this implementation is more robust because it reuses the
offsets calculated by the Stanford tokenizer instead of searching for them again.
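The idea can be sketched as follows. This is not the code from the attached file: the Token class below is a hypothetical stand-in for a tokenizer token that carries its own character offsets (Stanford's CoreLabel exposes such offsets through beginPosition() and endPosition()). A sentence span is then simply the first token's begin and the last token's end, with no text search involved.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class TokenOffsetSpans {

    /** Hypothetical token that keeps the offsets reported by the tokenizer. */
    static final class Token {
        final String form;
        final int begin;
        final int end;

        Token(String form, int begin, int end) {
            this.form = form;
            this.begin = begin;
            this.end = end;
        }
    }

    /** Returns one {begin, end} pair per non-empty sentence. */
    static List<int[]> sentenceSpans(List<List<Token>> sentences) {
        List<int[]> spans = new ArrayList<>();
        for (List<Token> sentence : sentences) {
            if (sentence.isEmpty()) {
                continue;
            }
            int begin = sentence.get(0).begin;
            int end = sentence.get(sentence.size() - 1).end;
            spans.add(new int[] {begin, end});
        }
        return spans;
    }

    public static void main(String[] args) {
        String text = "Wait\u2026 ok. Done...";
        // Token forms may differ from the text; the offsets still point at it.
        List<Token> first = Arrays.asList(
            new Token("Wait", 0, 4), new Token("...", 4, 5),
            new Token("ok", 6, 8), new Token(".", 8, 9));
        List<Token> second = Arrays.asList(
            new Token("Done", 10, 14), new Token("...", 14, 17));

        for (int[] span : sentenceSpans(Arrays.asList(first, second))) {
            System.out.println(text.substring(span[0], span[1]));
        }
    }
}
```

Because the offsets come from the tokenizer, a token whose form was normalized no longer affects where the sentence annotation begins or ends.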
If you have further questions, please do not hesitate to contact me.
Best regards,
Norman
Original issue reported on code.google.com by norman.w...@gmail.com on 23 Jul 2014 at 2:31