Helsinki-NLP / OpusFilter

OpusFilter - Parallel corpus processing toolkit
MIT License
Installing on fedora 40 #72

Closed: thfrkielikone closed this issue 3 months ago

thfrkielikone commented 5 months ago

Hi. I am trying to install opus-filter on Fedora 40 in a Docker container and have had to apply the following workarounds to install it (here's the Dockerfile for reproducibility):

FROM fedora:40
RUN dnf install -y gcc g++ git cmake make boost-devel swig ninja-build python3-devel glib2-devel re2-devel cld2-devel clang
# clang is needed by opus-fast-mosestokenizer, which is hardcoded to try to build boost (of all things) with clang
RUN python3 -m ensurepip

# Install fastText from source; the PyPI version doesn't install (core dep)

RUN git clone https://github.com/facebookresearch/fastText.git
WORKDIR /fastText
RUN pip3 install .
WORKDIR /

# mosestokenizer's buildscripts can't find pybind11
RUN CMAKE_PREFIX_PATH=/usr/local/lib/python3.12/site-packages/pybind11/share/cmake/pybind11/ pip install opus-fast-mosestokenizer

# install opusfilter itself
RUN git clone https://github.com/Helsinki-NLP/OpusFilter.git
WORKDIR /OpusFilter

# Simply patch out pycld2; I just couldn't make it build (patch below)

ADD patch-opus.patch patch-opus.patch
RUN git apply patch-opus.patch

RUN pip install .
WORKDIR /

# I happen to need eflomal for what I am doing, but the version on PyPI is far out of date compared to the one in the repo
# (opus-filter needs a version that's newer than the one on PyPI)

RUN git clone https://github.com/robertostling/eflomal.git
WORKDIR /eflomal
RUN pip install .

# mosestokenizer can't find its .so-files
ENV LD_LIBRARY_PATH=/usr/local/lib64/python3.12/site-packages/mosestokenizer/lib
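
The hard-coded pybind11 path above ties the workaround to Python 3.12 and to this image's directory layout. As a hedged sketch (not tested against this image), pybind11 provides a `--cmakedir` option that prints its CMake config directory, so the same workaround can be written without the literal path. It assumes pybind11 is installed for the same interpreter; the explicit `pip3 install pybind11` line is an addition that is not in the original Dockerfile.

```dockerfile
# Assumption: pybind11 must be importable by python3 before the build starts.
RUN pip3 install pybind11

# `python3 -m pybind11 --cmakedir` prints the directory holding pybind11's
# CMake config files; the shell substitutes it into CMAKE_PREFIX_PATH.
RUN CMAKE_PREFIX_PATH="$(python3 -m pybind11 --cmakedir)" pip3 install opus-fast-mosestokenizer
```

This only restates the workaround in a more portable form; whether the build scripts can be changed to find pybind11 on their own is a separate question.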

patch-opus.patch:

diff --git a/setup.py b/setup.py
index e5ca787..d1f8b55 100644
--- a/setup.py
+++ b/setup.py
@@ -14,7 +14,7 @@ install_requires = [
     "morfessor",
     "opus-fast-mosestokenizer>=0.0.8.5",
     "pandas>=1.0.0",
-    "pycld2",
+#    "pycld2",
     "xxhash>=3.2.0",
     "sentence-splitter",
     "rapidfuzz",

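My issues are thus:

1. How does one properly build opus-fast-mosestokenizer such that it finds pybind11 and works without LD_LIBRARY_PATH hacks? Can this be fixed upstream? (Issues are disabled in the repo, so I am mentioning this here.)
2. Is fastText a relevant dep for the non-extra collection of deps? The git repo has been archived so it seems somewhat dead: https://github.com/facebookresearch/fastText
3. Is cld2 a relevant dep for the non-extra collection of deps? It doesn't look as dead, but at least I couldn't get the obvious ways to install it or to compile it to work. (I can of course ask upstream about how it should be installed or compiled.)
4. Can eflomal's PyPI package be updated? The library itself works perfectly when built from the repo, so this seems to be a strictly packaging-related problem.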

svirpioj commented 5 months ago

Hi,

> How does one properly build opus-fast-mosestokenizer such that it finds pybind11 and works without LD_LIBRARY_PATH hacks? Can this be fixed upstream? (Issues are disabled in the repo, so I am mentioning this here.)

I actually wasn't aware that issues were not enabled for our opus-fast-mosestokenizer fork. I have enabled them now; could you move the related discussion there?

> Is fastText a relevant dep for the non-extra collection of deps? The git repo has been archived so it seems somewhat dead: https://github.com/facebookresearch/fastText

A reasonable suggestion. I think we can move fasttext to extras, especially now that lingua is also supported, so there's another good option for language detection.

> Is cld2 a relevant dep for the non-extra collection of deps? It doesn't look as dead, but at least I couldn't get the obvious ways to install it or to compile it to work. (I can of course ask upstream about how it should be installed or compiled.)

Indeed, the same applies here.

> Can eflomal's PyPI package be updated? The library itself works perfectly when built from the repo, so this seems to be a strictly packaging-related problem.

PyPI does have the latest version of eflomal. I guess your problem is that it's tagged as a pre-release version (1.0.0b1), while the old version is tagged as 0.1, so the latter is installed if no additional restrictions are set. The packaging update changed quite a lot of things (see details here), and I wasn't sure whether that would be the final shape of things, so I didn't want to make a final 1.0.0 release.
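
pip skips pre-release versions by default, which is why 0.1 is chosen over 1.0.0b1. A minimal sketch of two ways to request the pre-release in a Dockerfile, assuming 1.0.0b1 is the version wanted (the state of PyPI described in this comment); use one of the two lines, not both:

```dockerfile
# Alternative 1: let pip consider pre-release versions for this install.
# Note that --pre also applies to the package's dependencies.
RUN pip3 install --pre eflomal

# Alternative 2: name the pre-release in the version specifier, which
# opts in to pre-releases for this requirement only.
# RUN pip3 install "eflomal>=1.0.0b1"
```

The second form is usually the narrower choice, since it leaves pre-release handling for all other packages unchanged.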

thfrkielikone commented 5 months ago

Thanks for taking the time to answer. I detailed the opus-fast-mosestokenizer issues in the appropriate repo. The eflomal update was merged over a year ago; is there a current reason to keep it in the pre-release state? (I should have looked at the exact version when installing, though.)

svirpioj commented 4 months ago

Eflomal now has version 2.0.0 on PyPI. (This was a bit confusing, but 1.0.0 was actually tagged before 1.0.0b1, and as the new version is incompatible, I had to increase the major version to 2.)

svirpioj commented 4 months ago

The fasttext and pycld2 libraries have now been changed to optional. @thfrkielikone, can you confirm if everything works now? The changes are in the develop branch.

thfrkielikone commented 4 months ago

eflomal installs:

FROM fedora:40
RUN python3 -m ensurepip
RUN dnf install -y gcc g++ git cmake make
RUN pip3 install eflomal

opus-filter installs:

FROM fedora:40
RUN python3 -m ensurepip
RUN dnf install -y gcc g++ git cmake make
RUN git clone https://github.com/Helsinki-NLP/OpusFilter.git
WORKDIR /OpusFilter
RUN git checkout develop
RUN pip3 install .

And both seem to work in practice in my environment. Again, thanks for fixing this.