Erikvl87 / docker-languagetool

Dockerfile for LanguageTool server - configurable
https://hub.docker.com/r/erikvl87/languagetool
GNU Lesser General Public License v2.1
449 stars 57 forks

Version 5.3 and above always crash (fasttext) #25

Open Aculeasis opened 3 years ago

Aculeasis commented 3 years ago

erikvl87/languagetool:5.2 works fine:

The following configuration is passed to LanguageTool:
fasttextBinary=/fasttext/fasttext
fasttextModel=/fasttext/lid.176.bin
languageModel=/ngrams
+ java -Xms512m -Xmx2g -cp languagetool-server.jar org.languagetool.server.HTTPServer --port 8010 --public --allow-origin '*' --config config.properties
2021-10-15 21:53:06 +0000 INFO  org.languagetool.server.DatabaseAccess Not setting up database access, dbDriver is not configured
2021-10-15 21:53:06 +0000 WARNING: running in HTTP mode, consider running LanguageTool behind a reverse proxy that takes care of encryption (HTTPS)
2021-10-15 21:53:06 +0000 WARNING: running in public mode, LanguageTool API can be accessed without restrictions!
2021-10-15 21:53:07 +0000 INFO  org.languagetool.language.LanguageIdentifier Started fasttext process for language identification: Binary /fasttext/fasttext with model @ /fasttext/lid.176.bin
2021-10-15 21:53:07 +0000 Setting up thread pool with 10 threads
2021-10-15 21:53:07 +0000 Starting LanguageTool 5.2 (build date: 2020-12-30 14:55, eb572bf) server on http://localhost:8010...
2021-10-15 21:53:07 +0000 Server started

But newer versions already crash :(

fasttextBinary=/fasttext/fasttext
fasttextModel=/fasttext/lid.176.bin
languageModel=/ngrams
+ java -Xms512m -Xmx2g -cp languagetool-server.jar org.languagetool.server.HTTPServer --port 8010 --public --allow-origin '*' --config config.properties
2021-10-15 22:01:37.371 +0000 INFO  org.languagetool.server.DatabaseAccess Not setting up database access, dbDriver is not configured
2021-10-15 22:01:37 +0000 WARNING: running in HTTP mode, consider running LanguageTool behind a reverse proxy that takes care of encryption (HTTPS)
2021-10-15 22:01:37 +0000 WARNING: running in public mode, LanguageTool API can be accessed without restrictions!
Exception in thread "main" java.lang.RuntimeException: Could not start LanguageTool HTTP server on localhost, port 8010
    at org.languagetool.server.HTTPServer.main(HTTPServer.java:153)
Caused by: org.languagetool.server.PortBindingException: LanguageTool HTTP server could not be started on host "null", port 8010.
Maybe something else is running on that port already?
    at org.languagetool.server.HTTPServer.<init>(HTTPServer.java:119)
    at org.languagetool.server.HTTPServer.main(HTTPServer.java:147)
Caused by: java.lang.RuntimeException: Could not start fasttext process for language identification @ /fasttext/fasttext with model @ /fasttext/lid.176.bin
    at org.languagetool.language.LanguageIdentifier.enableFasttext(LanguageIdentifier.java:118)
    at org.languagetool.server.TextChecker.<init>(TextChecker.java:109)
    at org.languagetool.server.V2TextChecker.<init>(V2TextChecker.java:45)
    at org.languagetool.server.LanguageToolHttpHandler.<init>(LanguageToolHttpHandler.java:74)
    at org.languagetool.server.HTTPServer.<init>(HTTPServer.java:105)
    ... 1 more
Caused by: java.io.IOException: Cannot run program "/fasttext/fasttext": error=2, No such file or directory
    at java.base/java.lang.ProcessBuilder.start(ProcessBuilder.java:1128)
    at java.base/java.lang.ProcessBuilder.start(ProcessBuilder.java:1071)
    at org.languagetool.language.FastText.<init>(FastText.java:43)
    at org.languagetool.language.LanguageIdentifier.enableFasttext(LanguageIdentifier.java:115)
    ... 5 more
Caused by: java.io.IOException: error=2, No such file or directory
    at java.base/java.lang.ProcessImpl.forkAndExec(Native Method)
    at java.base/java.lang.ProcessImpl.<init>(ProcessImpl.java:340)
    at java.base/java.lang.ProcessImpl.start(ProcessImpl.java:271)
    at java.base/java.lang.ProcessBuilder.start(Process

I built fasttext from here and downloaded lid.176.bin, probably from here. My docker run command:

docker run -d --name="Languagetool" \
-p 8081:8010/tcp \
-e Java_Xms=512m \
-e Java_Xmx=2g \
-e langtool_languageModel=/ngrams \
-e langtool_fasttextModel=/fasttext/lid.176.bin \
-e langtool_fasttextBinary=/fasttext/fasttext \
-v "/mnt/hdd1/languagetool/ngrams":"/ngrams" \
-v "/mnt/hdd1/languagetool/fasttext":"/fasttext" \
--restart=unless-stopped \
erikvl87/languagetool:5.2

docker version:

Client:
 Version:           20.10.7
 API version:       1.41
 Go version:        go1.13.8
 Git commit:        20.10.7-0ubuntu1~20.04.2
 Built:             Fri Oct  1 14:07:06 2021
 OS/Arch:           linux/amd64
 Context:           default
 Experimental:      true

Server:
 Engine:
  Version:          20.10.7
  API version:      1.41 (minimum version 1.12)
  Go version:       go1.13.8
  Git commit:       20.10.7-0ubuntu1~20.04.2
  Built:            Fri Oct  1 03:27:17 2021
  OS/Arch:          linux/amd64
  Experimental:     false
 containerd:
  Version:          1.5.2-0ubuntu1~20.04.3
  GitCommit:        
 runc:
  Version:          1.0.0~rc95-0ubuntu1~20.04.2
  GitCommit:        
 docker-init:
  Version:          0.19.0
  GitCommit:        

So, what am I doing wrong?

dprothero commented 2 years ago

I'm working my way through this and haven't gotten all the way there yet, but I did resolve the "No such file or directory" issue: the fasttext binary has to be built on Alpine Linux to work inside this image. I'll post my complete setup once I get it working. Now I'm getting a java.lang.OutOfMemoryError loading the ngram data for language identification.
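For anyone hitting the same thing: "error=2, No such file or directory" for a binary that clearly exists usually means its dynamic loader is missing — a fasttext binary compiled against glibc can't start inside this musl-based Alpine image. A quick way to confirm (a sketch, assuming the host paths from the run command above; the image's shell is sh and Alpine's busybox provides ldd):

```shell
# Mount the host's fasttext directory and inspect the binary from inside
# the container. "not found" entries from ldd indicate glibc libraries
# that don't exist on the musl-based Alpine image.
docker run --rm --entrypoint sh \
  -v "/mnt/hdd1/languagetool/fasttext":"/fasttext" \
  erikvl87/languagetool -c 'ldd /fasttext/fasttext'
```

If ldd reports missing glibc libraries (or refuses to run the binary at all), rebuilding fasttext on Alpine, as in the Dockerfile below, fixes it.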

dprothero commented 2 years ago

If you create a Dockerfile in an empty folder with these contents:

FROM alpine as ftbuild

RUN apk update && apk add \
        build-base \
        wget \
        git \
        unzip \
        && rm -rf /var/cache/apk/*

RUN git clone https://github.com/facebookresearch/fastText.git /tmp/fastText && \
  rm -rf /tmp/fastText/.git* && \
  mv /tmp/fastText/* / && \
  cd / && \
  make

RUN wget https://dl.fbaipublicfiles.com/fasttext/supervised-models/lid.176.bin

RUN wget https://languagetool.org/download/ngram-lang-detect/model_ml50_new.zip

FROM erikvl87/languagetool

COPY --chown=languagetool --from=ftbuild /fasttext .
COPY --chown=languagetool --from=ftbuild /model_ml50_new.zip .
COPY --chown=languagetool --from=ftbuild /lid.176.bin .

ENV Java_Xms=512m
ENV Java_Xmx=1500m
ENV langtool_fasttextBinary=/LanguageTool/fasttext
ENV langtool_ngramLangIdentData=/LanguageTool/model_ml50_new.zip
ENV langtool_fasttextModel=/LanguageTool/lid.176.bin

You can then build it with:

docker build -t docker-languagetool-fasttext .

And then you would run it like so (this is based on the command you provided above):

docker run -d --name="Languagetool" \
-p 8081:8010/tcp \
-e langtool_languageModel=/ngrams \
-v "/mnt/hdd1/languagetool/ngrams":"/ngrams" \
--restart=unless-stopped \
docker-languagetool-fasttext

Aculeasis commented 2 years ago
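Once the container is up, a quick smoke test against the v2 API confirms the server started and fasttext-based language detection didn't crash it (port 8081 as mapped above; the text is just an example):

```shell
# Send a check request to the LanguageTool HTTP API; a healthy server
# returns a JSON document containing a "matches" array.
curl -s --data "language=en-US&text=This is a example." \
  http://localhost:8081/v2/check
```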

Yes, it starts, and I have the same problem with java.lang.OutOfMemoryError:

java.lang.OutOfMemoryError: Java heap space
    at org.apache.lucene.util.fst.FST.<init>(FST.java:387)
    at org.apache.lucene.util.fst.FST.<init>(FST.java:313)
    at org.apache.lucene.codecs.blocktree.FieldReader.<init>(FieldReader.java:91)
    at org.apache.lucene.codecs.blocktree.BlockTreeTermsReader.<init>(BlockTreeTermsReader.java:231)
    at org.apache.lucene.codecs.lucene50.Lucene50PostingsFormat.fieldsProducer(Lucene50PostingsFormat.java:446)
    at org.apache.lucene.codecs.perfield.PerFieldPostingsFormat$FieldsReader.<init>(PerFieldPostingsFormat.java:261)
    at org.apache.lucene.codecs.perfield.PerFieldPostingsFormat.fieldsProducer(PerFieldPostingsFormat.java:341)
    at org.apache.lucene.index.SegmentCoreReaders.<init>(SegmentCoreReaders.java:104)
    at org.apache.lucene.index.SegmentReader.<init>(SegmentReader.java:65)
    at org.apache.lucene.index.StandardDirectoryReader$1.doBody(StandardDirectoryReader.java:58)
    at org.apache.lucene.index.StandardDirectoryReader$1.doBody(StandardDirectoryReader.java:50)
    at org.apache.lucene.index.SegmentInfos$FindSegmentsFile.run(SegmentInfos.java:731)
    at org.apache.lucene.index.StandardDirectoryReader.open(StandardDirectoryReader.java:50)
    at org.apache.lucene.index.DirectoryReader.open(DirectoryReader.java:63)
    at org.languagetool.languagemodel.LuceneSingleIndexLanguageModel$LuceneSearcher.<init>(LuceneSingleIndexLanguageModel.java:241)
    at org.languagetool.languagemodel.LuceneSingleIndexLanguageModel$LuceneSearcher.<init>(LuceneSingleIndexLanguageModel.java:229)
    at org.languagetool.languagemodel.LuceneSingleIndexLanguageModel.getCachedLuceneSearcher(LuceneSingleIndexLanguageModel.java:182)
    at org.languagetool.languagemodel.LuceneSingleIndexLanguageModel.addIndex(LuceneSingleIndexLanguageModel.java:118)
    at org.languagetool.languagemodel.LuceneSingleIndexLanguageModel.<init>(LuceneSingleIndexLanguageModel.java:95)
    at org.languagetool.languagemodel.LuceneLanguageModel.<init>(LuceneLanguageModel.java:65)
    at org.languagetool.Language.initLanguageModel(Language.java:180)
    at org.languagetool.language.English.getLanguageModel(English.java:144)
    at org.languagetool.JLanguageTool.activateLanguageModelRules(JLanguageTool.java:566)
    at org.languagetool.server.Pipeline.activateLanguageModelRules(Pipeline.java:121)
    at org.languagetool.server.PipelinePool.createPipeline(PipelinePool.java:204)
    at org.languagetool.server.PipelinePool.getPipeline(PipelinePool.java:180)
    at org.languagetool.server.TextChecker.getPipelineResults(TextChecker.java:757)
    at org.languagetool.server.TextChecker.getRuleMatches(TextChecker.java:711)
    at org.languagetool.server.TextChecker.access$000(TextChecker.java:56)
    at org.languagetool.server.TextChecker$1.call(TextChecker.java:427)
    at org.languagetool.server.TextChecker$1.call(TextChecker.java:420)
    at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
Erikvl87 commented 2 years ago

Sorry that I've kept you waiting. I unfortunately haven't had the time yet to look into this. I'll do my best to take a look soon. Meanwhile, would the provided solution from @dprothero work in combination with increasing the memory options?

You can do this by increasing the Java_Xms and Java_Xmx variables. In the Dockerfile example given above, that means increasing these lines (e.g. to 1g and 2g respectively):

ENV Java_Xms=512m
ENV Java_Xmx=1500m

Alternatively, take a look at the Java heap size settings explained over here: https://github.com/Erikvl87/docker-languagetool#java-heap-size
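The heap variables can also be overridden at run time instead of baking them into the Dockerfile — a sketch based on the run command earlier in this thread, with the image name from @dprothero's build step:

```shell
# Override the Java heap settings via environment variables at run time
# (Java_Xms / Java_Xmx are the variables this image's entrypoint reads).
docker run -d --name="Languagetool" \
  -p 8081:8010/tcp \
  -e Java_Xms=1g \
  -e Java_Xmx=2g \
  -e langtool_languageModel=/ngrams \
  -v "/mnt/hdd1/languagetool/ngrams":"/ngrams" \
  --restart=unless-stopped \
  docker-languagetool-fasttext
```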

Erikvl87 commented 2 years ago

@Aculeasis, the provided solution from @dprothero seems to work here as well.

I think the example above is useful to include in the README.md so I will keep this ticket open until I've updated the readme file.

Aculeasis commented 2 years ago

Sorry for the delay. I set 1g and 2g; it works but crashes sometimes. So I set 2g and 4g and it works well. But isn't 4 GB too much?

Erikvl87 commented 2 years ago

@Aculeasis That should be a question for the official LanguageTool developers. From what I could find, they don't have an official set of requirements regarding memory configuration:

There's no general rule, it depends on the number of languages being used, the concurrent requests, the text length etc. 2600MB should be enough for most use cases, if you don't have that much, try with less and see how that works.

Source: https://github.com/languagetool-org/languagetool/issues/902#issuecomment-366427622

FarisZR commented 2 years ago

Is there a reason this can't be included in the docker image?