HeidelTime / heideltime

A multilingual, cross-domain temporal tagger developed at the Database Systems Research Group at Heidelberg University.
GNU General Public License v3.0

HeidelTime has not found any sentence tokens in this document. #40

Closed powpos360 closed 8 years ago

powpos360 commented 8 years ago

I tried to reproduce the evaluation results on WikiWars. Following the wiki, I can reproduce the same results using v2.1. However, when I followed the same steps with other versions (I tried 1.3, 1.6, 1.7, and 1.8), I received the following message every time: [de.unihd.dbs.uima.annotator.heideltime.HeidelTime] HeidelTime has not found any sentence tokens in this document. HeidelTime needs sentence tokens tagged by a preprocessing UIMA analysis engine to do its work. Please check your UIMA workflow and add an analysis engine that creates these sentence tokens. I have changed the .bash_profile accordingly. Are there any other adjustments I should have made when setting up the experiment? Thanks a lot.

powpos360 commented 8 years ago

Alright, it turns out that when creating the UIMA flow, I need to add the TreeTaggerWrapper annotator BEFORE the HeidelTime annotator. This turns out to be crucial.

JannikStroetgen commented 8 years ago

Hi, Good to hear that you figured out the issue. If you run into any other problems, please let us know, too. Maybe we can reply faster next time and really provide some help ;-) Cheers, Jannik

parisni commented 7 years ago

Hi,

I ran into the same issue. The point is that I use an external sentence annotator (DKPro/OpenNLP). As a result, I get de.tudarmstadt.ukp.dkpro.core.api.segmentation.type.Sentence instead of de.unihd.dbs.uima.types.heideltime.Sentence.

Is there a clean way to make HeidelTime work with such a sentence annotator?

Thanks in advance

kno10 commented 7 years ago

HeidelTime includes classes to translate sentence annotations e.g. from CoreNLP and TreeTagger into HeidelTime annotations. It should be fairly easy to add a similar translation for other taggers: https://github.com/HeidelTime/heideltime/blob/master/src/de/unihd/dbs/uima/annotator/stanfordtagger/StanfordPOSTaggerWrapper.java In addition to sentences, you will also want to translate POS when available, as this can help remove some false positives. I don't use UIMA, so I can't tell you how to invoke this in UIMA.

JannikStroetgen commented 7 years ago

In the early days of HeidelTime, we included an AnnotationTranslator in the UIMA kit, which took one kind of annotation, e.g., DKPro's sentence annotations, and created HeidelTime's sentence annotations, but we removed it. You can download an old HeidelTime UIMA kit version and check the details, e.g., version 1.9: https://github.com/HeidelTime/heideltime/releases/tag/VERSION1.9
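For illustration, the core of such a translation step is just copying offsets from one sentence type to the other. The following is a minimal plain-Java sketch of that idea, not the removed AnnotationTranslator itself. `SourceSentence` and `TargetSentence` are hypothetical stand-ins for the DKPro and HeidelTime sentence types so that the snippet runs without UIMA on the classpath, and numbering the sentences sequentially is an assumption.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for the external sentence type, e.g.
// de.tudarmstadt.ukp.dkpro.core.api.segmentation.type.Sentence.
class SourceSentence {
    final int begin;
    final int end;

    SourceSentence(int begin, int end) {
        this.begin = begin;
        this.end = end;
    }
}

// Hypothetical stand-in for de.unihd.dbs.uima.types.heideltime.Sentence.
class TargetSentence {
    final int begin;
    final int end;
    final int sentenceId;

    TargetSentence(int begin, int end, int sentenceId) {
        this.begin = begin;
        this.end = end;
        this.sentenceId = sentenceId;
    }
}

class SentenceTranslator {
    // Copies the character offsets of every source sentence into a new
    // target sentence. The sequential sentence id is an assumption made
    // for this sketch.
    static List<TargetSentence> translate(List<SourceSentence> source) {
        List<TargetSentence> result = new ArrayList<>();
        int id = 0;
        for (SourceSentence s : source) {
            result.add(new TargetSentence(s.begin, s.end, id));
            id++;
        }
        return result;
    }
}
```

In a real UIMA annotator, the same loop would read the source annotations from the JCas and call `addToIndexes()` on each newly created HeidelTime sentence, as in the CoreNLP snippet further down in this thread.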

parisni commented 7 years ago

Can you please confirm that the only mapping I have to do is:

@kno10

In addition to sentences, you will also want to translate POS when available, as this can help remove some false positives.

My pipeline produces POS annotations from: https://dkpro.github.io/dkpro-core/releases/1.7.0/apidocs/de/tudarmstadt/ukp/dkpro/core/api/lexmorph/type/pos/package-summary.html If I understand the code in StanfordPOSTaggerWrapper.java correctly, the POS tags are just Strings that are set on a Token annotation object. Does that mean I can just push "NN" or "ADJ" into those strings and HeidelTime will understand them out of the box?

Thanks for your help

kno10 commented 7 years ago

The code I use (I don't use DKPro, and only as little UIMA as necessary to run HeidelTime) simply does this to convert the annotations for HeidelTime:

      // Copy every CoreNLP sentence into a HeidelTime Sentence annotation.
      for(CoreMap sentence : corenlp.sentences()) {
        Sentence sent = new Sentence(jcas);
        sent.setBegin(sentence.get(CoreNLPAnalyzer.DocumentStartOffsetAnnotation.class));
        sent.setEnd(sentence.get(CoreNLPAnalyzer.DocumentEndOffsetAnnotation.class));
        sent.setSentenceId(sentence.get(CoreNLPAnalyzer.DocumentSentenceAnnotation.class));
        // Copy the tokens of this sentence, including their POS tags.
        for(CoreLabel label : sentence.get(TokensAnnotation.class)) {
          Token t = new Token(jcas);
          t.setBegin(label.get(CoreNLPAnalyzer.DocumentStartOffsetAnnotation.class));
          t.setEnd(label.get(CoreNLPAnalyzer.DocumentEndOffsetAnnotation.class));
          t.setPos(label.get(CoreAnnotations.PartOfSpeechAnnotation.class));
          // Register the token with the CAS indexes.
          t.addToIndexes();
        }
        // Register the sentence with the CAS indexes.
        sent.addToIndexes();
      }

By providing offsets you get the correct offsets from Heideltime. I don't know if the sentence is currently used by released Heideltime - my modified version uses it for resolving ambiguous dates.

I have been considering adding an abstraction layer in my branch, which would allow HeidelTime to operate directly on CoreNLP annotations, so that I don't have to perform this copying. But that requires considerable effort.

Make sure to double-check sentence splitter quality. For example CoreNLP without workarounds will split "Der 3. November" into two sentences, because it is a bit overoptimized for English.
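To make the "Der 3. November" problem concrete, here is a toy splitter in plain Java. It is only an illustration: the splitting rule and the ordinal workaround are assumptions made for this sketch and do not reflect how CoreNLP or HeidelTime's preprocessing actually work. The class name `OrdinalAwareSplitter` is hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

class OrdinalAwareSplitter {
    // Splits after '.', '!' or '?' when the terminator is followed by
    // whitespace or ends the text. With keepOrdinals set, a '.' that
    // directly follows a digit and is not the last character is treated
    // as part of an ordinal number (as in German "3. November").
    static List<String> split(String text, boolean keepOrdinals) {
        List<String> sentences = new ArrayList<>();
        int start = 0;
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            boolean terminator = c == '.' || c == '!' || c == '?';
            boolean atEnd = i + 1 == text.length();
            boolean beforeSpace = !atEnd && Character.isWhitespace(text.charAt(i + 1));
            if (!terminator || !(atEnd || beforeSpace)) {
                continue;
            }
            boolean ordinal = c == '.' && !atEnd && i > 0
                    && Character.isDigit(text.charAt(i - 1));
            if (keepOrdinals && ordinal) {
                continue;
            }
            sentences.add(text.substring(start, i + 1).trim());
            start = i + 1;
        }
        // Keep any trailing text that has no terminator.
        String rest = text.substring(start).trim();
        if (!rest.isEmpty()) {
            sentences.add(rest);
        }
        return sentences;
    }
}
```

Calling `split("Der 3. November war kalt.", false)` yields two sentences, while passing `true` keeps the text as one. The workaround has an obvious cost: a sentence that really ends in a number, followed by another sentence, is merged with its successor, which is why the splitter output is worth checking on your own data.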

JannikStroetgen commented 7 years ago

HeidelTime makes use of the following preprocessing information

Temporal expressions across sentence boundaries won't be detected. The issue with wrong sentence splitting, which Erich pointed out, is the reason why we included several modifications for the sentence splitting process, e.g., for German and French.

parisni commented 7 years ago

I have been able to use HeidelTime in my own pipeline. I also created a simple mapper annotator. This mapper may be extended for other tasks (creating, updating, deleting, and merging annotations); we will see.

I may post the details of how to put HeidelTime into one's own pipeline at some point.

Thanks guys.