chaquo / chaquopy

Chaquopy: the Python SDK for Android
https://chaquo.com/chaquopy/
MIT License

Please update tokenizers and transformers version #607

Open ominfowave opened 2 years ago

ominfowave commented 2 years ago

Please add tokenizers version 0.11.1. It is required by some recent Python modules, such as indic-punct.

mhsmith commented 2 years ago

We currently offer the following tokenizers versions:

Both of these are currently only available for Python 3.8. To change the Python version of your app, see here.

As you can see from its setup.py file, indic-punct pins all of its requirements to specific versions. With packages that do this, it's sometimes possible to get them working by specifying whatever is the closest version available in the Chaquopy repository:

    install "indic-punct"
    install "torch==1.8.1"
    install "torchvision==0.9.1"
    install "transformers==4.15.0"
    install "tokenizers==0.7.0"

In this case I've used the closest newer version of each requirement, but sometimes you might need to use the closest older one.
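As a rough illustration of the "closest version" idea, here is a minimal Python sketch. It is not part of Chaquopy or pip; the function names and the list of available versions are hypothetical, and it only handles plain dotted versions such as `1.8.1`.

```python
def parse(version):
    """Turn a simple dotted version such as '1.8.1' into a tuple of ints."""
    return tuple(int(part) for part in version.split("."))


def closest_version(pinned, available, prefer="newer"):
    """Return the available version nearest to `pinned`.

    An exact match wins. Otherwise the nearest version on the preferred side
    is used, falling back to the nearest one on the other side. Returns None
    if `available` is empty.
    """
    if pinned in available:
        return pinned
    target = parse(pinned)
    newer = sorted((v for v in available if parse(v) > target), key=parse)
    older = sorted((v for v in available if parse(v) < target), key=parse, reverse=True)
    first, second = (newer, older) if prefer == "newer" else (older, newer)
    for candidates in (first, second):
        if candidates:
            return candidates[0]
    return None


# Hypothetical repository contents, for illustration only.
available = ["1.4.0", "1.8.1"]
print(closest_version("1.7.0", available))                  # closest newer
print(closest_version("1.7.0", available, prefer="older"))  # closest older
```

Whichever version this picks would then go into the `install` line in place of the pinned one. Whether the package actually works with it still has to be tested.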

mhsmith commented 2 years ago

Unfortunately, the current version of indic-punct (2.1.4) also has a native requirement which Chaquopy doesn't support at all (pynini). It's possible that one of the older versions of indic-punct doesn't have this requirement, but the release history is confusing (8 releases in one day, and no tags on GitHub), so that's something you'd have to look into by yourself.

See also #608.

mhsmith commented 2 years ago

We're not planning to update this package in the near future, but if you'd like to try building the new version yourself, follow the instructions here. However, our package build tool doesn't currently have working support for Rust – see #1030 for details.

If anyone else needs a newer version of tokenizers, please click the thumbs up button above, and post a comment explaining why you need it.

Benoit-W commented 1 year ago

Hello, I am trying to use some recent models from transformers which require a more recent tokenizers version (transformers 4.23.1 or higher, which requires tokenizers!=0.11.3,<0.14,>=0.11.1), but as I saw in #608 it seems to be a bit complicated because of Rust. I would like to know if an update to the tokenizers library is planned soon.
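To make the constraint above concrete, here is a small Python sketch that checks a few tokenizers versions mentioned in this thread against `!=0.11.3,<0.14,>=0.11.1`. It is a simplified stand-in for real specifier handling (the `packaging` library's `SpecifierSet` is the usual tool); the helper names are made up for this example, and only plain dotted versions are supported.

```python
import re


def parse(version):
    """Turn a simple dotted version such as '0.13.3' into a tuple of ints."""
    return tuple(int(part) for part in version.split("."))


def pad(a, b):
    """Pad the shorter tuple with zeros so that 0.14 compares equal to 0.14.0."""
    width = max(len(a), len(b))
    return a + (0,) * (width - len(a)), b + (0,) * (width - len(b))


OPERATORS = {
    "==": lambda a, b: a == b,
    "!=": lambda a, b: a != b,
    ">=": lambda a, b: a >= b,
    "<=": lambda a, b: a <= b,
    ">": lambda a, b: a > b,
    "<": lambda a, b: a < b,
}


def satisfies(version, specifier):
    """Check a version against comma-separated clauses; all must hold."""
    for clause in specifier.split(","):
        match = re.fullmatch(r"\s*(==|!=|>=|<=|>|<)\s*([0-9.]+)\s*", clause)
        if match is None:
            raise ValueError(f"unsupported clause: {clause!r}")
        operator, bound = match.groups()
        left, right = pad(parse(version), parse(bound))
        if not OPERATORS[operator](left, right):
            return False
    return True


spec = "!=0.11.3,<0.14,>=0.11.1"
for version in ["0.7.0", "0.10.3", "0.11.3", "0.13.3"]:
    print(version, satisfies(version, spec))
```

Of the four versions tried, only 0.13.3 falls inside the range, which is why pip rejects the older builds for these transformers releases.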

mhsmith commented 1 year ago

Sorry, we have no update planned in the near future. But if you'd like to try updating it yourself, see the links in my previous comment.

Our current tokenizers versions are listed in my comment above. If none of those would work for your project, please post a comment explaining why.

melink14 commented 10 months ago

Looks like I also need an updated version of the tokenizers package for working with manga-ocr (requires transformers >= 4.25.0):

Failed to install tokenizers<0.15,>=0.14 from https://files.pythonhosted.org/packages/b2/b9/bf025d763bbdd333cb88bedb23426f932c5b4a6ce6f033c498517fad5b90/tokenizers-0.14.1.tar.gz#sha256=ea3b3f8908a9a5b9d6fc632b5f012ece7240031c44c6d4764809f33736534166 (from transformers>=4.25.0->manga-ocr).

I've added my thumbs up and might look at the instructions to build it myself later if I have time.

mhsmith commented 10 months ago

Thanks – I haven't checked, but you may be able to work around this by using an older version of manga-ocr.

pcrwebdesign commented 5 months ago

In my case I need version 0.13.3 because it is a requirement of faster-whisper. In case it helps others I have made some progress updating it myself by:

  1. Building my own versions of openssl for each ABI (mimicking cryptography's approach) and setting OPENSSL_LIB_DIR and OPENSSL_INCLUDE_DIR to the resulting directories.
  2. Setting RUSTUP_TOOLCHAIN to 1.72.1 to avoid errors caused by the stricter, newer Rust compiler. See stackoverflow answer.
  3. Modifying the generated Cargo.toml to lower the version of the clap dependency (to 4.4.18), because the existing one requires a higher version of rustc (see point 2).
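Steps 1 and 2 come down to setting three environment variables before the build runs. Below is a minimal Python sketch of that setup. The function name, the `/opt/openssl/arm64-v8a` prefix, and the assumption that each OpenSSL build has `lib` and `include` subdirectories are all hypothetical; step 3 (editing Cargo.toml) is not covered.

```python
import os


def rust_build_env(openssl_prefix, toolchain="1.72.1", base_env=None):
    """Return a copy of the environment with the OpenSSL and rustup settings.

    `openssl_prefix` is assumed to contain `lib/` and `include/` directories
    from a per-ABI OpenSSL build.
    """
    env = dict(os.environ if base_env is None else base_env)
    env["OPENSSL_LIB_DIR"] = os.path.join(openssl_prefix, "lib")
    env["OPENSSL_INCLUDE_DIR"] = os.path.join(openssl_prefix, "include")
    env["RUSTUP_TOOLCHAIN"] = toolchain
    return env


# Hypothetical location of an OpenSSL build for one ABI.
env = rust_build_env("/opt/openssl/arm64-v8a")
for name in ("OPENSSL_LIB_DIR", "OPENSSL_INCLUDE_DIR", "RUSTUP_TOOLCHAIN"):
    print(name, "=", env[name])
```

The resulting dictionary can be passed as `env=` to `subprocess.run` when launching the build, so the settings do not leak into the rest of the shell session.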

However, I am blocked because the build-wheel.py script sets env["_PYTHON_HOST_PLATFORM"] = f"linux_{ABIS[self.abi].uname_machine}". This overrides sysconfig.get_platform() so that it returns a value without a dash, which causes setuptools_rust.build.get_dylib_ext_path to crash.

I wonder if someone knows the reasoning for setting that env variable and/or the consequences of unsetting it or setting it to a different value that conforms to the usual {osname}-{release}-{machine}.
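The effect of that variable can be reproduced outside the build with a short Python sketch. On POSIX systems `sysconfig.get_platform()` returns the value of `_PYTHON_HOST_PLATFORM` when it is set. The `arch_from_platform` helper is only an assumption about how a consumer might extract the machine name (splitting on the last dash); it is not copied from setuptools_rust. The value `linux_aarch64` is an example of what the f-string above might produce.

```python
import os
import subprocess
import sys


def platform_with_override(value):
    """Ask a child interpreter what sysconfig.get_platform() reports when
    _PYTHON_HOST_PLATFORM is set to `value`."""
    env = dict(os.environ, _PYTHON_HOST_PLATFORM=value)
    result = subprocess.run(
        [sys.executable, "-c", "import sysconfig; print(sysconfig.get_platform())"],
        env=env,
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.strip()


def arch_from_platform(platform):
    """Assumed consumer behaviour: the machine name follows the last dash."""
    parts = platform.rsplit("-", 1)
    if len(parts) != 2:
        raise ValueError(f"no dash in platform string: {platform!r}")
    return parts[1]


for value in ("linux_aarch64", "linux-aarch64"):
    reported = platform_with_override(value)
    try:
        print(reported, "->", arch_from_platform(reported))
    except ValueError as exc:
        print(reported, "->", exc)
```

On a POSIX host the underscore form has no dash to split on, while the dashed form yields the machine name. On Windows the variable is ignored, so both runs report the real platform.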

mhsmith commented 5 months ago

I don't remember exactly why we added that variable; you can probably find out from the Git history. But going by the sysconfig.get_platform documentation, I agree it should use a dash rather than an underscore, but without a version number on Linux.

choyuansu commented 5 months ago

I needed a module in a more recent version of transformers, which requires tokenizers>=0.14.

I tried building a wheel for tokenizers==0.15.2 following this README and met this error:

Error log

```
warning: esaxx-rs@0.1.10: src/esaxx.cpp:620:10: fatal error: 'cstdint' file not found
warning: esaxx-rs@0.1.10: #include <cstdint>
warning: esaxx-rs@0.1.10:          ^~~~~~~~~
warning: esaxx-rs@0.1.10: 1 error generated.
error: failed to run custom build command for `esaxx-rs v0.1.10`

Caused by:
  process didn't exit successfully: `/home/alex/Documents/chaquopy/server/pypi/packages/tokenizers/build/0.15.2/cp38-cp38-android_21_arm64_v8a/src/bindings/python/target/release/build/esaxx-rs-43d93e9b64a75770/build-script-build` (exit status: 1)
  --- stdout
  TARGET = Some("x86_64-unknown-linux-gnu")
  OPT_LEVEL = Some("3")
  HOST = Some("x86_64-unknown-linux-gnu")
  cargo:rerun-if-env-changed=CXX_x86_64-unknown-linux-gnu
  CXX_x86_64-unknown-linux-gnu = None
  cargo:rerun-if-env-changed=CXX_x86_64_unknown_linux_gnu
  CXX_x86_64_unknown_linux_gnu = None
  cargo:rerun-if-env-changed=HOST_CXX
  HOST_CXX = None
  cargo:rerun-if-env-changed=CXX
  CXX = Some("/home/alex/Documents/chaquopy/server/pypi/packages/tokenizers/build/0.15.2/cp38-cp38-android_21_arm64_v8a/wrappers/aarch64-linux-android21-clang++")
  cargo:rerun-if-env-changed=CRATE_CC_NO_DEFAULTS
  CRATE_CC_NO_DEFAULTS = None
  DEBUG = Some("false")
  cargo:rerun-if-env-changed=CXXFLAGS_x86_64-unknown-linux-gnu
  CXXFLAGS_x86_64-unknown-linux-gnu = None
  cargo:rerun-if-env-changed=CXXFLAGS_x86_64_unknown_linux_gnu
  CXXFLAGS_x86_64_unknown_linux_gnu = None
  cargo:rerun-if-env-changed=HOST_CXXFLAGS
  HOST_CXXFLAGS = None
  cargo:rerun-if-env-changed=CXXFLAGS
  CXXFLAGS = Some("")
  running: "/home/alex/Documents/chaquopy/server/pypi/packages/tokenizers/build/0.15.2/cp38-cp38-android_21_arm64_v8a/wrappers/aarch64-linux-android21-clang++" "-O3" "-ffunction-sections" "-fdata-sections" "-fPIC" "--target=x86_64-unknown-linux-gnu" "-I" "src" "-std=c++11" "-o" "/home/alex/Documents/chaquopy/server/pypi/packages/tokenizers/build/0.15.2/cp38-cp38-android_21_arm64_v8a/src/bindings/python/target/release/build/esaxx-rs-5858a4f309d526f4/out/src/esaxx.o" "-c" "src/esaxx.cpp"
  cargo:warning=src/esaxx.cpp:620:10: fatal error: 'cstdint' file not found
  cargo:warning=#include <cstdint>
  cargo:warning=         ^~~~~~~~~
  cargo:warning=1 error generated.
  exit status: 1

  --- stderr

  error occurred: Command "/home/alex/Documents/chaquopy/server/pypi/packages/tokenizers/build/0.15.2/cp38-cp38-android_21_arm64_v8a/wrappers/aarch64-linux-android21-clang++" "-O3" "-ffunction-sections" "-fdata-sections" "-fPIC" "--target=x86_64-unknown-linux-gnu" "-I" "src" "-std=c++11" "-o" "/home/alex/Documents/chaquopy/server/pypi/packages/tokenizers/build/0.15.2/cp38-cp38-android_21_arm64_v8a/src/bindings/python/target/release/build/esaxx-rs-5858a4f309d526f4/out/src/esaxx.o" "-c" "src/esaxx.cpp" with args "aarch64-linux-android21-clang++" did not execute successfully (status code exit status: 1).

warning: build failed, waiting for other jobs to finish...
💥 maturin failed
  Caused by: Failed to build a native library through cargo
  Caused by: Cargo build finished with "exit status: 101": `env -u CARGO PYO3_ENVIRONMENT_SIGNATURE="cpython-3.8-64bit" PYO3_PYTHON="/home/alex/Documents/chaquopy/server/pypi/packages/tokenizers/build/0.15.2/cp38-cp38-android_21_arm64_v8a/env/bin/python" PYTHON_SYS_EXECUTABLE="/home/alex/Documents/chaquopy/server/pypi/packages/tokenizers/build/0.15.2/cp38-cp38-android_21_arm64_v8a/env/bin/python" "cargo" "rustc" "--features" "pyo3/extension-module" "--message-format" "json-render-diagnostics" "--manifest-path" "/home/alex/Documents/chaquopy/server/pypi/packages/tokenizers/build/0.15.2/cp38-cp38-android_21_arm64_v8a/src/bindings/python/Cargo.toml" "--release" "--lib"`
Error: command ['maturin', 'pep517', 'build-wheel', '-i', '/home/alex/Documents/chaquopy/server/pypi/packages/tokenizers/build/0.15.2/cp38-cp38-android_21_arm64_v8a/env/bin/python', '--compatibility', 'off'] returned non-zero exit status 1
build-wheel: Error: Backend subprocess exited when trying to invoke build_wheel
```

Not sure how to proceed from here. Any help is appreciated.

mhsmith commented 5 months ago

This appears to be caused by the --target option, which is unnecessary because the target is already encoded into the compiler launcher. You'd have to examine the build system to work out how to remove the option, but unfortunately I don't know any more than that.
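The mismatch can be spotted mechanically from the logged command line. The Python sketch below compares the triple embedded in the compiler wrapper's file name with any `--target=` argument. The function name is made up for this example, the wrapper path is shortened from the log above, and the naming pattern `<triple><api-level>-clang++` is inferred from that one log line rather than from any documented rule.

```python
import re


def target_mismatch(argv):
    """Return (wrapper_triple, target) if --target disagrees with the triple
    in the wrapper's file name, otherwise None."""
    wrapper = argv[0].rsplit("/", 1)[-1]
    # e.g. "aarch64-linux-android21-clang++" -> "aarch64-linux-android"
    match = re.fullmatch(r"(.+?)\d*-clang(?:\+\+)?", wrapper)
    if match is None:
        return None  # not a triple-prefixed clang wrapper
    wrapper_triple = match.group(1)
    for arg in argv[1:]:
        if arg.startswith("--target="):
            target = arg[len("--target="):]
            if not target.startswith(wrapper_triple):
                return wrapper_triple, target
    return None


# Command abbreviated from the error log in this thread.
command = [
    "wrappers/aarch64-linux-android21-clang++",
    "-O3",
    "-fPIC",
    "--target=x86_64-unknown-linux-gnu",
    "-std=c++11",
    "-c",
    "src/esaxx.cpp",
]
print(target_mismatch(command))
```

Here the wrapper is built for an aarch64 Android target while the flag asks for an x86_64 Linux one, which matches the `TARGET = Some("x86_64-unknown-linux-gnu")` line in the build script's output.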

choyuansu commented 5 months ago

@mhsmith Thanks for the hint. I now switched to building inside a docker container, and I'm getting a different error: build-wheel: Error: /workdir/chaquopy/server/pypi/packages/tokenizers/build/0.14.1/cp38-cp38-android_21_arm64_v8a/fix_wheel/tokenizers/tokenizers.so is linked against unknown library 'libstdc++.so.6'.

Here's the Dockerfile and docker-compose.yaml I used, and some other changes to help reproduce the error:

Dockerfile

```
FROM python:3.8.18-slim-bookworm

RUN apt update && apt install -y \
    patch \
    patchelf \
    unzip \
    curl \
    build-essential \
    wget

WORKDIR /workdir

COPY server/pypi/requirements.txt /workdir
RUN pip install -r requirements.txt

RUN curl https://sh.rustup.rs -sSf | bash -s -- -y

ENV ANDROID_HOME=/workdir/chaquopy/server/pypi/android-sdk
```
docker-compose.yaml

```
services:
  build-wheel:
    build:
      context: .
      dockerfile: Dockerfile
    volumes:
      - .:/workdir/chaquopy
    command:
      - bash
      - -ecl
      - |
        # download target if not exist
        cd /workdir/chaquopy
        if [ ! -d /workdir/chaquopy/maven/com/chaquo/python/target/3.8.18-0 ]; then
          target/download-target.sh maven/com/chaquo/python/target/3.8.18-0
        fi

        # build wheel
        cd /workdir/chaquopy/server/pypi
        ./build-wheel.py --python 3.8 --abi arm64-v8a tokenizers
```
Other changes

```
diff --git a/.dockerignore b/.dockerignore
index 775ef4ae..10b58a74 100644
--- a/.dockerignore
+++ b/.dockerignore
@@ -12,6 +12,7 @@
 !server/pypi/pkgtest
 !server/pypi/dist
 !server/pypi/piptest
+!server/pypi/requirements.txt
 
 **/.gradle/
 **/.idea/
diff --git a/server/pypi/packages/tokenizers/meta.yaml b/server/pypi/packages/tokenizers/meta.yaml
index 9d4b96f8..3932e56f 100644
--- a/server/pypi/packages/tokenizers/meta.yaml
+++ b/server/pypi/packages/tokenizers/meta.yaml
@@ -1,7 +1,7 @@
 package:
   name: tokenizers
-  version: "0.10.3"
+  version: "0.15.2"
 
 requirements:
   build:
-    - setuptools-rust 0.11.6
\ No newline at end of file
+    - setuptools-rust 0.11.6
diff --git a/server/pypi/packages/tokenizers/patches/chaquopy.patch b/server/pypi/packages/tokenizers/patches/chaquopy.patch
deleted file mode 100644
index 50b3601b..00000000
--- a/server/pypi/packages/tokenizers/patches/chaquopy.patch
+++ /dev/null
@@ -1,51 +0,0 @@
---- src-original/setup.py 2020-04-17 16:57:37.000000000 +0000
-+++ src/setup.py 2021-01-12 23:57:10.005615920 +0000
-@@ -1,6 +1,38 @@
- from setuptools import setup
- from setuptools_rust import Binding, RustExtension
- 
-+
-+# BEGIN Chaquopy additions
-+import os
-+from os.path import abspath, dirname, exists
-+from subprocess import check_call
-+import sys
-+
-+triplet = os.environ["CHAQUOPY_TRIPLET"]
-+rust_toolchain = open("rust-toolchain").read().strip()
-+check_call(["rustup", "toolchain", "install", rust_toolchain])
-+check_call(["rustup", "target", "add", "--toolchain", rust_toolchain, triplet])
-+
-+os.environ["CARGO_BUILD_TARGET"] = triplet
-+sysroot = abspath(f"{dirname(os.environ['CC'])}/../sysroot")
-+py_version = "{}.{}".format(*sys.version_info[:2])
-+os.environ["PYO3_CROSS_INCLUDE_DIR"] = f"{sysroot}/usr/include/python{py_version}"
-+os.environ["PYO3_CROSS_LIB_DIR"] = f"{sysroot}/usr/lib"
-+
-+os.makedirs(".cargo", exist_ok=True)
-+config_filename = ".cargo/config.toml"
-+config = f"""\
-+[target.{triplet}]
-+ar = "{os.environ['AR']}"
-+linker = "{os.environ['CC']}"
-+"""
-+if exists(config_filename) and open(config_filename).read() != config:
-+    raise Exception(f"{config_filename} exists with different content")
-+with open(config_filename, "w") as config_file:
-+    config_file.write(config)
-+# END Chaquopy additions
-+
-+
- extras = {}
- extras["testing"] = ["pytest"]
- 
-@@ -15,7 +47,8 @@
-     author_email="anthony@huggingface.co",
-     url="https://github.com/huggingface/tokenizers",
-     license="Apache License 2.0",
--    rust_extensions=[RustExtension("tokenizers.tokenizers", binding=Binding.PyO3, debug=False)],
-+    rust_extensions=[RustExtension("tokenizers.tokenizers", binding=Binding.PyO3,
-+                                   rustc_flags=[f"-lpython{py_version}"])],  # Chaquopy
-     extras_require=extras,
-     classifiers=[
-         "Development Status :: 5 - Production/Stable",
```

I also found this comment suggesting adding the -stdlib=libstdc++ option, but I'm not sure where to add that.

Hope you can help me solve this error. Thanks!

mhsmith commented 5 months ago

Sorry, I don't have time to look into this in any detail. But libstdc++.so.6 is a Linux library name which should never appear in an Android build, so this is probably caused by the build using a mixture of Android and Linux elements.
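One way to catch this kind of mix-up early is to list the libraries a built `.so` depends on and flag any with a versioned name such as `libstdc++.so.6`, since Android's own libraries use unversioned names like `libc.so`. The Python sketch below parses text in the format printed by `readelf -d`. The helper names and the sample output are made up for this example, and the versioned-name check is only a heuristic, not the check that build-wheel performs.

```python
import re


def needed_libraries(readelf_output):
    """Extract library names from the (NEEDED) entries of `readelf -d` output."""
    return re.findall(r"\(NEEDED\)\s+Shared library: \[([^\]]+)\]", readelf_output)


def versioned_sonames(names):
    """Return names ending in a Linux-style version suffix, e.g. '.so.6'."""
    return [name for name in names if re.search(r"\.so\.\d+", name)]


# Hypothetical `readelf -d tokenizers.so` excerpt, for illustration only.
sample = """\
 0x0000000000000001 (NEEDED)             Shared library: [libstdc++.so.6]
 0x0000000000000001 (NEEDED)             Shared library: [libpython3.8.so]
 0x0000000000000001 (NEEDED)             Shared library: [libc.so]
"""

names = needed_libraries(sample)
print("needed:", names)
print("suspicious:", versioned_sonames(names))
```

Any name that shows up in the second list suggests that a host Linux compiler or linker was used somewhere in the build instead of the Android toolchain.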

divyanshluthra commented 4 months ago

Hey, hope you are doing well. I am facing issues while trying to pip install anthropic, which depends on tokenizers>=0.13. I tried version 0.13, but I get the errors in the attached log. Could you please advise how we can work around this issue? Regards, Divyansh. Attachment: tokenizer.log

mhsmith commented 4 months ago

You could try using an older version of anthropic. Looking back through the blame of anthropic's pyproject.toml, the last version which didn't require such a new version of tokenizers was anthropic 0.2.10. That came out less than a year ago, but this is obviously a fast-moving package, so I don't know if that would be acceptable for you.

divyanshluthra commented 4 months ago

> You could try using an older version of anthropic. Looking back through the blame of anthropic's pyproject.toml, the last version which didn't require such a new version of tokenizers was anthropic 0.2.10. That came out less than a year ago, but this is obviously a fast-moving package, so I don't know if that would be acceptable for you.

Luckily, tokenizers version 0.10.3 has worked with the latest anthropic package so far. I decided to test it regardless of the incompatibility error during build and run, and it worked. Older anthropic versions are not available to newer users according to their API docs, because of the large changes and improvements in their latest offering, "opus". So far so good.