Several optimizations, refactoring

HeidelTime / heideltime

A multilingual, cross-domain temporal tagger developed at the Database Systems Research Group at Heidelberg University.

GNU General Public License v3.0

342 stars 67 forks source link

Several optimizations, refactoring #61

Open kno10 opened 7 years ago

kno10 commented 7 years ago

As I'm repeatedly running this on the entire Wikipedia(s) for experiments, it was worth the effort to optimize this thoroughly. Sorry, I cannot easily break this apart into smaller pull requests, because optimizations and code refactorings depend on each other; and I mostly did all of this because I needed it to become a lot faster... but I would appreciate if you test and benchmark this, as I believe this improves both the code and the runtime for everybody and not just for me. ;-)

kno10 commented 7 years ago

Some (preliminary) results from a benchmark that is still running, involving a pipeline with CoreNLP, a huge entity dictionary to match named entities, and HeidelTime: Results after 1.000.000 Wikipedia documents.

with HeidelTime 2.2.1: 7 hours 13 minutes with above patches: 4 hours 55 minutes

When the process has finished I can also compare result sizes etc. (as e.g. writing much less to the database will also be faster). But from these numbers, I do think my branch is faster. ;-)

I've just added code to (hopefully) be able to break down runtime into CoreNLP, Entity Dictionary, HeidelTime with more detail on the next run. Because right now I cannot tell how much (or little) of the remaining 4:55 is due to HeidelTime.

kno10 commented 7 years ago

Final runtime: with HeidelTime 2.2.1: 28 hours 19 minutes with above patches: 25 hours 23 minutes

longliveenduro commented 6 years ago

I created a fork and merged your PR but unfortunately then I get a lot of "dates" like this XXXX-XX-XX. This doesn't happen with the original version from master branch. My fork: https://github.com/longliveenduro/heideltime

kno10 commented 6 years ago

Do you have an example document? Such dates often are relative dates ("three days earlier") without a reference. There should be relative information available, even when the date cannot be formated as a calendar date.

longliveenduro commented 6 years ago

@kno10 yes, the test document I've used is

"Anfang 2018: Gestern war ein schöner Tag. Nächste Woche wird es besser werden. Am kommenden Montag kommt ein weiteres Tief auf uns zu. Das Sturmtief wird gegen Montag Nachmittag um ca. 15.00 Uhr in Bayern ankommen. Mehr in 60 Minuten! Zweimal im Jahr!"

Works pretty good with the master version of heideltime when choosing "News" as Document Type and supplying a document date.

FYI: I do use Heideltime Standalone with Treetagger:

val heidelTimeStandalone = new HeidelTimeStandalone(Language.GERMAN, DocumentType.NEWS, OutputType.TIMEML)
val timeMlResult = heidelTimeStandalone.process(doc.text, doc.publishingDate.toDate)

kno10 commented 6 years ago

Document type and publishing date certainly will have an effect. The XXXX-XX-XX annotations look as if the publishing date is not used here anymore. This may be that my code considers "Anfang 2018" to be a stronger temporal anchor than the documents publishing date. But if you try to anchor "Gestern" with "Anfang 2018", you cannot make this a date, so it probably behaves a bit more similar to Heideltime narrative mode in this particular case.

longliveenduro commented 6 years ago

I think I had the same issue with a simpler version without the "Anfang 2018".

"Gestern war ein schöner Tag. Nächste Woche wird es besser werden. Am kommenden Montag kommt ein weiteres Tief auf uns zu. Das Sturmtief wird gegen Montag Nachmittag um ca. 15.00 Uhr in Bayern ankommen."

Ok unfortunately I think this makes your performance improvements for our news articles quite unusable, because the publishing date should be a very strong anchor for news. Also the project owner should be aware of this when he might consider merging this PR.

kno10 commented 6 years ago

IIRC I had too many false anchorings with the old logic. But yes, this needs a full regression testing, but unfortunately there is no testsuite with well-defined behavior yet. Many of my changes are to make such behavior can be more cleanly controlled, and I could not retain undocumented behavior in all cases.

kno10 commented 6 years ago

Resolving using the document creation time is trivially disabled in this line:

https://github.com/kno10/heideltime/blob/799e611b20d3bcae3be9cb1bd8127901379f921c/src/de/unihd/dbs/uima/annotator/heideltime/ResolveAmbiguousValues.java#L131

so that is as easy as it can be to "fix". I think the original motivation was to always use the document creation time if provided - if you do not want to use it for resolving, just do not provide it. So the line should probably be ... = ParsedDct.read(jcas);.

But that is a design decision. I am not a fan of the except-narrative hidden rule.