cuzzo / node-stanford-postagger

A client for the Stanford Part of Speech Tagger XMLRPC server.
BSD 2-Clause "Simplified" License

Dockerfile cuzzo/stanford-pos-tagger #3

Open blindsteal opened 8 years ago

blindsteal commented 8 years ago

Hi, I was wondering if there is any chance you could make the Dockerfile for your image available on GitHub or in the Docker repo? From what I see here, it mainly installs deps and runs your server script. I would like to make the dictionary configurable for other languages (an env variable probably works best for that). Currently this only works by using your image as a base and setting a new entrypoint. Greetings and nice work dockerizing this :+1:

cuzzo commented 8 years ago

Hey @blindsteal,

That sounds like an awesome idea. I'm working on Dockerizing CoreNLP right now, actually. That being said, do you have any idea how to set the language? I've only ever used English, but I'd love to add support for others.

Thanks!

blindsteal commented 8 years ago

AFAIK you only need to change the -model parameter to point at the correct model (e.g. german-fast.tagger for German). Making that configurable in your Dockerfile with an env variable should be enough; then we could run with something like docker run -e MODEL=german-fast.tagger [...]. Of course, this means we also need a way to include the corresponding model files (either by allowing a volume to be mounted or by including the "full" distribution in your image). Can you provide the current Dockerfile so I can build the image myself and see if it works?
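A minimal sketch of how an entrypoint script could pick the model from the environment. It assumes the image keeps its models under /home/cuzzo/stanford/models and that run-server.sh takes the model path and the port as arguments (both inferred from the image history quoted later in this thread). The `resolve_model` function and the `MODEL_DIR` variable are hypothetical names, not part of the existing image:

```shell
#!/bin/sh
# Hypothetical entrypoint helper: choose the tagger model from the MODEL env variable.
# MODEL_DIR and the default file name are assumptions, not taken from the real image.

# Print the full path of the model to load, falling back to the English default.
resolve_model() {
    model_dir="${MODEL_DIR:-/home/cuzzo/stanford/models}"
    model_name="${MODEL:-left3words-distsim-wsj-0-18.tagger}"
    printf '%s/%s\n' "$model_dir" "$model_name"
}

# A real entrypoint would then hand over to the server script, for example:
#   exec /home/cuzzo/stanford/run-server.sh "$(resolve_model)" 9000
```

With something like this in place, docker run -e MODEL=german-fast.tagger [...] would only change the file name, so the German model file would still have to be present in the image or mounted as a volume.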

When you say CoreNLP, you mean the CoreNLP dedicated server? I would love to see this. If you've got a repo somewhere I would be glad to help.

cuzzo commented 8 years ago

@blindsteal,

Unfortunately, I think the Dockerfile is lost. I thought it was a part of this repo, but considering I don't have the computer I wrote it on anymore, I'm doubtful I'll find it. Supposedly, there's a way to generate a Dockerfile from an image, but I don't think it'll be very helpful here.

But as far as CoreNLP goes, there's a Python server that makes most of it pretty easy. It uses this awful command line parsing logic to get the results rather than using RPC, probably because StanfordNLP doesn't support that by default.

The cool thing about this is that it used a plugin to get the results via RPC, so it was a lot more efficient. The downside is that it only works with POS tagging. This Scala RPC service seems mighty promising. It's already Dockerized. Dunno if it's configurable, but if it isn't, I'd love to make it so and document it!

Do you have a Gmail? Would be nice to chat about this. Seems like you know more about CoreNLP than me [=

Cheers,

blindsteal commented 8 years ago

@cuzzo

Sorry for the delay, work is keeping me busy... I probably know less about CoreNLP than you. I actually just found it a couple of days ago, but I know Docker pretty well.

After having a quick look, I noticed they have their own dedicated server exposing a RESTful interface (which should be easy enough to dockerize). Is there a reason why you prefer RPC?

Concerning the Dockerfile for your image: my first comment contains a link to one of the sites mentioned in the SO thread, and after looking at it again I think it should actually be pretty easy to reconstruct it from there (you can see all commands, including installed packages etc). Quick C&P:

# Base image is a guess.
FROM ubuntu:latest
RUN sed 's/main$/main universe/' -i /etc/apt/sources.list
RUN apt-get update && apt-get install -y software-properties-common python-software-properties
RUN add-apt-repository ppa:webupd8team/java -y
# No sudo needed: build steps already run as root.
RUN apt-get update
RUN echo oracle-java7-installer shared/accepted-oracle-license-v1-1 select true | /usr/bin/debconf-set-selections
RUN apt-get install -y oracle-java7-installer
RUN mkdir -p /home/cuzzo/stanford
ADD . /home/cuzzo/stanford
EXPOSE 9000
# Ignored while ENTRYPOINT is in shell form; kept as shown in the image history.
CMD --port 9000
# Shell form already runs through /bin/sh -c, so the explicit prefix is dropped.
ENTRYPOINT /home/cuzzo/stanford/run-server.sh /home/cuzzo/stanford/models/left3words-distsim-wsj-0-18.tagger 9000

I do have Gmail too: philipp.guenther.lpz(at)gmail(dot)com. On a side note, I just noticed CoreNLP is not free for commercial projects which sadly makes it a lot less attractive for the project I'm working on... I'm still interested in running a dockerized version for testing purposes tho.

cuzzo commented 8 years ago

Hey @blindsteal,

Awesome find.

This definitely didn't exist when I started using CoreNLP, but now that it does, I'm going to take advantage of it. The default server even lets you specify the annotators on the fly, which is really cool. I haven't seen any other servers that let you do that.
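As a rough illustration of choosing annotators per request: the CoreNLP server reads a `properties` URL parameter containing JSON, with the text to annotate sent as the POST body. The sketch below only builds that URL. The `corenlp_url` helper and the `CORENLP_URL` variable are hypothetical names, and localhost:9000 is assumed as the server address:

```shell
#!/bin/sh
# Hypothetical helper: build a CoreNLP server URL for a given annotator list.
# CORENLP_URL is an assumed variable name; it defaults to a local server on port 9000.

# $1 is a comma-separated annotator list, e.g. tokenize,ssplit,pos
corenlp_url() {
    annotators="$1"
    base="${CORENLP_URL:-http://localhost:9000}"
    printf '%s/?properties={"annotators":"%s","outputFormat":"json"}\n' "$base" "$annotators"
}

# Usage (needs a running server, so it is left commented out):
#   wget --post-data 'The quick brown fox jumped over the lazy dog.' \
#        "$(corenlp_url tokenize,ssplit,pos)" -O -
```

Swapping the argument for a different annotator list changes what the server runs for that request, without restarting the container.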

I'm working on a new (much smaller) Docker image to just use the standard CoreNLP server.

Thanks!