Georgetown-IR-Lab / OpenNIR

An end-to-end neural ad-hoc ranking pipeline.
https://opennir.net
MIT License

Interest in Docker? #24

Closed by scottpchow23 4 years ago

scottpchow23 commented 4 years ago

I plan on dockerizing OpenNIR to attempt to reproduce the CEDR_KNRM results on XSEDE's Comet compute cluster as part of a class project. Are there any potential pitfalls with dockerizing this application? I'm rather new to machine learning and information retrieval in general, but I don't see any obvious problems with this.

I'm planning on dockerizing with the following parameters:

Also, are you open to a PR if I get this working?

seanmacavaney commented 4 years ago

Hi Scott,

Ubuntu 18.04 / Python 3.6 should be a good setup. I'm no Docker expert, but I would not expect any problems.

If you need to make changes to get it working with Docker, I'd be happy to review+approve a PR!

- sean

scottpchow23 commented 4 years ago

So I figured out there's actually one more dependency: a Java JDK.

Java 11 seems to work fine; can you confirm if that is acceptable?

seanmacavaney commented 4 years ago

Looks like my machine is running Java 1.8. I'm not sure whether running with Java 11 will cause issues -- you should check compatibility with Anserini and pyjnius. From this paper, there appears to be some difference in results between Java 8 and 11, but it appears to be minimal.

- sean
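
For anyone comparing environments as discussed above, a minimal sketch of how to tell which major Java version is on the PATH. The helper name `java_major` is hypothetical and not part of OpenNIR; it only handles the two common version formats, the legacy `1.8.0_x` scheme and the newer `11.0.x` scheme.

```shell
#!/usr/bin/env bash
# java_major is a hypothetical helper, not part of OpenNIR.
# It prints the major Java version (8, 11, ...) found in one line of
# `java -version` output, or returns non-zero if no version is found.
java_major() {
  local line="$1" version
  # Pull out the quoted version string, e.g. 1.8.0_241 or 11.0.6
  version=$(printf '%s\n' "$line" | sed -n 's/.*version "\([^"]*\)".*/\1/p')
  case "$version" in
    '') return 1 ;;
    1.*) version=${version#1.} ;;  # legacy "1.x" scheme: the major version is the second field
  esac
  # Keep only what precedes the first '.', '_' or '-'
  printf '%s\n' "${version%%[._-]*}"
}

# Usage (java prints its version banner to stderr):
#   java_major "$(java -version 2>&1 | head -n 1)"
```

Running the usage line both on the host and inside the container shows quickly whether the two environments differ.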

scottpchow23 commented 4 years ago

So here's a preview of the Dockerfile I have so far:

FROM python:3.6-buster

WORKDIR /workspace

# Copy openNIR files into /workspace
COPY . .

# Install python dependencies
RUN pip install -r requirements.txt

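# Install Java 11 (update and install in one layer so the package index is never stale)
RUN apt-get update -y && \
    apt-get install -y openjdk-11-jdk && \
    rm -rf /var/lib/apt/lists/*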

CMD ["/bin/bash"]

While this unfortunately doesn't respect the Java 8 and Ubuntu 18.04 dependencies that you have (it uses Java 11 and Debian 10), I can confirm that I'm able to begin training in the container with the following command:

scripts/pipeline.sh config/conv_knrm config/antique
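
A rough sketch of how the image could be built and the command above launched from the host, assuming the Dockerfile sits at the repository root. The tag `opennir` and the function names are assumptions chosen for illustration; only `docker build -t`, `docker run -it`, and the pipeline command itself are taken as given.

```shell
#!/usr/bin/env bash
# IMAGE_TAG is a hypothetical tag; pick any name you like.
IMAGE_TAG="${IMAGE_TAG:-opennir}"
# Set DOCKER=echo to print the docker arguments instead of running them.
DOCKER="${DOCKER:-docker}"

# Build the image from the Dockerfile in the current directory.
build_image() {
  "$DOCKER" build -t "$IMAGE_TAG" .
}

# Start an interactive container and run the training pipeline in it.
# The relative script path resolves against the image's WORKDIR (/workspace).
run_pipeline() {
  "$DOCKER" run -it "$IMAGE_TAG" scripts/pipeline.sh config/conv_knrm config/antique
}
```

Call `build_image` once from the repository root, then `run_pipeline`. Passing a command to `docker run` replaces the image's default `CMD`, so the container goes straight to training instead of dropping into a shell.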

If you're still interested in having this added to the repo, I'd love to open a PR with the Dockerfile as well as instructions/caveats on how to run OpenNIR in Docker.

Fun note: I didn't realize how memory hungry loading vectors could be! Loading vectors into memory can easily take up 12-15 GB of RAM, and I had to raise the memory available to my Docker containers to keep that portion of the training from erroring out.
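
One way to catch the memory problem before training starts is to check how much RAM the container can see. The sketch below is an illustration, not something OpenNIR ships: the function names are hypothetical, and the 15 GiB threshold in the usage line simply echoes the upper end of the figure quoted above. Note that `/proc/meminfo` inside a container reports the memory of the host (or of the Docker Desktop VM), not a limit set with `docker run --memory`.

```shell
#!/usr/bin/env bash
# Hypothetical helpers, not part of OpenNIR.

# Print total memory in whole GiB from a meminfo-style file
# (MemTotal is reported in kB; 1048576 kB = 1 GiB).
mem_total_gib() {
  awk '/^MemTotal:/ { print int($2 / 1048576) }' "${1:-/proc/meminfo}"
}

# Succeed only if at least $1 GiB is visible.
# $2 optionally overrides the meminfo path (useful for testing).
has_enough_memory() {
  local total
  total=$(mem_total_gib "${2:-/proc/meminfo}")
  [ -n "$total" ] && [ "$total" -ge "$1" ]
}

# Usage inside the container, before starting the pipeline:
#   has_enough_memory 15 || echo "less than 15 GiB visible; raise Docker's memory setting" >&2
```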

seanmacavaney commented 4 years ago

Awesome, thanks! Yeah, go ahead and make a PR with the Dockerfile and instructions on how to use it. Others will likely find this helpful.

RE: loading vectors: there's probably a better way I could do this that's less resource intensive :)