HeidelTime / heideltime

A multilingual, cross-domain temporal tagger developed at the Database Systems Research Group at Heidelberg University.
GNU General Public License v3.0

python wrapper + speed #77

Open AlJohri opened 6 years ago

AlJohri commented 6 years ago

Hi @JannikStroetgen, I'm trying to wrap HeidelTime for use in Python, but I'm running into speed issues. Invoking it via the CLI is very slow, on the order of 7-8 seconds per document. The script is rather short, as you can see here:

https://github.com/AlJohri/heideltime-python/blob/master/heideltime.py

two questions:

1) Do you notice anything wrong in my invocation that would slow it down considerably?

2) Do you have a web API version somewhere that powers the online demo? Perhaps the majority of the time is just spent starting the JVM repeatedly. Hitting an API where everything is already loaded and ready to go may be much faster.

thanks!

kno10 commented 6 years ago

It is not just the startup cost of the JVM (but even that would already hurt if you care about performance). Stanford CoreNLP loads a huge language model. Loading this again and again for every document is likely where most of the time goes. I doubt you will be able to "fix" this - NLP just requires large models. So avoid loading them repeatedly.
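
A minimal sketch of the load-once idea, assuming a hypothetical `HeavyModel` class that stands in for a large language model (it is not part of Stanford CoreNLP or HeidelTime). The model is built lazily on first use and then reused for every document; a counter makes the number of loads visible.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical stand-in for a large model; not a real CoreNLP or HeidelTime class.
final class HeavyModel {
    static final AtomicInteger LOADS = new AtomicInteger();

    HeavyModel() {
        // A real model would read a large file from disk here.
        LOADS.incrementAndGet();
    }

    // Placeholder "annotation": wraps the trimmed text in brackets.
    String tag(String text) {
        return "[" + text.trim() + "]";
    }
}

// Holds one shared model for the lifetime of the process.
final class ModelHolder {
    private static HeavyModel model;

    // Loads the model on first use, then returns the same instance.
    static synchronized HeavyModel get() {
        if (model == null) {
            model = new HeavyModel();
        }
        return model;
    }
}

final class LoadOnceDemo {
    // Tags every document with the shared model; nothing is reloaded per document.
    static List<String> tagAll(List<String> docs) {
        List<String> out = new ArrayList<>();
        for (String doc : docs) {
            out.add(ModelHolder.get().tag(doc));
        }
        return out;
    }
}
```

Calling `LoadOnceDemo.tagAll` on any number of documents constructs `HeavyModel` only once per process, which is the behaviour a long-running service would want from the real tagger.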

AlJohri commented 6 years ago

makes sense, thanks @kno10.

Do you know where the code that powers the online demo lives?

kno10 commented 6 years ago

I don't know.

AlJohri commented 6 years ago

@JannikStroetgen @kno10 I switched to writing an API in java.

The Stanford models are still getting loaded each time the process method is called.

Here is my heideltime wrapper factory:

package com.washpost.heideltime.heideltimeapi;

import java.util.Date;
import java.text.DateFormat;
import java.text.ParseException;
import java.text.SimpleDateFormat;
import de.unihd.dbs.heideltime.standalone.*;
import de.unihd.dbs.heideltime.standalone.exceptions.*;
import de.unihd.dbs.uima.annotator.heideltime.resources.Language;

public class HeideltimeFactory {

    private static HeidelTimeStandalone ht = new HeidelTimeStandalone(
        Language.ENGLISH,
        DocumentType.NEWS,
        OutputType.XMI, // or OutputType.TIMEML
        "src/main/resources/config.props",
        POSTagger.STANFORDPOSTAGGER); // POSTagger.TREETAGGER, POSTagger.NO;

    public static String process(String text, String dctString) throws DocumentCreationTimeMissingException, ParseException {
        DateFormat df = new SimpleDateFormat("yyyy-MM-dd");
        Date dct = df.parse(dctString);

        ht.process(text, dct);
        ht.process(text, dct);
        ht.process(text, dct);

        return ht.process(text, dct);
    }

}

As you can see, it uses the same ht to process the text four times in a row (three extra calls plus the one whose result is returned) for test purposes.

2018-07-23 22:59:14.548  INFO 44987 --- [nio-8080-exec-1] HeidelTimeStandalone                     : HeidelTimeStandalone initialized with language english
2018-07-23 22:59:14.549  INFO 44987 --- [nio-8080-exec-1] HeidelTimeStandalone                     : trying to read in file src/main/resources/config.props
2018-07-23 22:59:17.367  INFO 44987 --- [nio-8080-exec-1] HeidelTimeStandalone                     : HeidelTime initialized
2018-07-23 22:59:17.481  INFO 44987 --- [nio-8080-exec-1] HeidelTimeStandalone                     : JCas factory initialized
2018-07-23 22:59:17.484  INFO 44987 --- [nio-8080-exec-1] HeidelTimeStandalone                     : Processing started
Loading default properties from tagger src/main/resources/english-bidirectional-distsim.tagger
2018-07-23 22:59:22.361  INFO 44987 --- [nio-8080-exec-1] HeidelTimeStandalone                     : Processing finished
2018-07-23 22:59:22.505  INFO 44987 --- [nio-8080-exec-1] HeidelTimeStandalone                     : Result formatted
2018-07-23 22:59:22.505  INFO 44987 --- [nio-8080-exec-1] HeidelTimeStandalone                     : Processing started
Loading default properties from tagger src/main/resources/english-bidirectional-distsim.tagger
2018-07-23 22:59:25.935  INFO 44987 --- [nio-8080-exec-1] HeidelTimeStandalone                     : Processing finished
2018-07-23 22:59:25.962  INFO 44987 --- [nio-8080-exec-1] HeidelTimeStandalone                     : Result formatted
2018-07-23 22:59:25.962  INFO 44987 --- [nio-8080-exec-1] HeidelTimeStandalone                     : Processing started
Loading default properties from tagger src/main/resources/english-bidirectional-distsim.tagger
2018-07-23 22:59:28.753  INFO 44987 --- [nio-8080-exec-1] HeidelTimeStandalone                     : Processing finished
2018-07-23 22:59:28.768  INFO 44987 --- [nio-8080-exec-1] HeidelTimeStandalone                     : Result formatted
2018-07-23 22:59:28.769  INFO 44987 --- [nio-8080-exec-1] HeidelTimeStandalone                     : Processing started
Loading default properties from tagger src/main/resources/english-bidirectional-distsim.tagger
2018-07-23 22:59:31.975  INFO 44987 --- [nio-8080-exec-1] HeidelTimeStandalone                     : Processing finished
2018-07-23 22:59:31.995  INFO 44987 --- [nio-8080-exec-1] HeidelTimeStandalone                     : Result formatted

The logs show it's taking about 3 seconds to process each document, perhaps because it is re-initializing the StanfordPOSTaggerWrapper each time: the line "Loading default properties from tagger src/main/resources/english-bidirectional-distsim.tagger" appears on every call.
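
To make the per-call cost concrete, here is a small sketch that computes the time between each "Processing started" and "Processing finished" pair. The timestamps are copied from the log above; the class and method names are only illustrative.

```java
import java.time.Duration;
import java.time.LocalTime;

final class LogTimings {
    // "Processing started" / "Processing finished" pairs copied from the log.
    static final String[][] RUNS = {
        {"22:59:17.484", "22:59:22.361"},
        {"22:59:22.505", "22:59:25.935"},
        {"22:59:25.962", "22:59:28.753"},
        {"22:59:28.769", "22:59:31.975"},
    };

    // Milliseconds between two HH:mm:ss.SSS timestamps on the same day.
    static long millisBetween(String start, String end) {
        return Duration.between(LocalTime.parse(start), LocalTime.parse(end)).toMillis();
    }

    // Duration of each process() call, in log order.
    static long[] durations() {
        long[] out = new long[RUNS.length];
        for (int i = 0; i < RUNS.length; i++) {
            out[i] = millisBetween(RUNS[i][0], RUNS[i][1]);
        }
        return out;
    }
}
```

The four calls come out at 4877, 3430, 2791 and 3206 ms. None of the later calls gets much cheaper than the first, which is consistent with the tagger being loaded again on every call rather than only once.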

Is there any way to prevent reloading the Stanford models?

madimov commented 5 years ago

@AlJohri did you have any luck with this?

madimov commented 5 years ago

@AlJohri in that case, if you don't mind me asking, did you find a decent alternative?

AlJohri commented 5 years ago

@madimov we ended up switching projects, so I didn't pursue it much further. I think you can do what @kno10 suggested and try to prevent the models from loading on each iteration by digging into the Java code. The TreeTagger also works quickly enough. Alternatively, you can check out:

If you're working in Python, my colleague found that using jpype is a good alternative to talking to a constantly running JAR if there are issues deploying a REST API (https://github.com/AlJohri/heideltime-api/).

madimov commented 5 years ago

@AlJohri thanks a lot for taking the time and the detailed response. I've actually been looking at the last one you listed, python-sutime, and it seems to be quite good. I'll be sure to check out all the rest as well and get back to you. Much appreciated

kno10 commented 5 years ago

I have been handling the Stanford NLP in my own code for other reasons, and only running HeidelTime on the already annotated document. I've been annotating hundreds of documents per second this way. I'm not a big fan of nesting libraries too deep, exactly because of such issues: when to reload a GB-sized language model, and when to allow it to be garbage collected, is not a decision to "outsource" to a library, but something you need to control in the "driver".
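
This division of responsibilities can be sketched roughly as follows. All class names (`PosModel`, `TemporalTagger`, `Driver`) are hypothetical placeholders, not the CoreNLP or HeidelTime API: the driver loads the model once, hands already annotated input to the downstream tagger, and decides when the model may be garbage collected.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for a GB-sized POS model.
final class PosModel {
    static int loads = 0;

    PosModel() {
        loads++; // a real model would be read from disk here
    }

    // Placeholder "annotation": whitespace tokenization.
    List<String> tokens(String text) {
        List<String> out = new ArrayList<>();
        for (String t : text.trim().split("\\s+")) {
            out.add(t);
        }
        return out;
    }
}

// Hypothetical downstream tagger: works on tokens it is given and never loads a model.
final class TemporalTagger {
    // Toy rule: counts tokens that look like four-digit years.
    int countYears(List<String> tokens) {
        int n = 0;
        for (String t : tokens) {
            if (t.matches("\\d{4}")) {
                n++;
            }
        }
        return n;
    }
}

// The driver owns the model and controls its lifetime.
final class Driver implements AutoCloseable {
    private PosModel model = new PosModel(); // loaded once, by the driver
    private final TemporalTagger tagger = new TemporalTagger();

    int process(String doc) {
        return tagger.countYears(model.tokens(doc));
    }

    @Override
    public void close() {
        model = null; // drop the reference so the model can be garbage collected
    }
}
```

Used in a try-with-resources block, one `Driver` can process many documents with a single model load, and the point at which the model is released is explicit in the calling code rather than hidden in a library.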