bblanchon / pdfium-binaries

📰 Binary distribution of PDFium
908 stars 195 forks source link

Conda package #119

Closed mara004 closed 1 year ago

mara004 commented 1 year ago

Hi @bblanchon,

There have been some requests (e.g. from doctr) that pypdfium2 should have a conda package. Basically the pre-requisite for this would be a conda package for pdfium (see below).

The Readme says:

I can provide packages for your favorite package manager, but I need help from someone who knows the format.

Unfortunately, my knowledge about conda is very limited as I don't use it personally, but this would be my idea of the workflow:

Presumably we need the action conda-incubator/setup-miniconda.


Why we need this

Initially I intended to bundle the binaries in pypdfium2's conda packages (as we do for PyPI wheels), but it turns out this is not really the way to go:

mara004 commented 1 year ago

CC @frgfm @felixT2K (https://github.com/mindee/doctr/issues/113)

bblanchon commented 1 year ago

Hi @mara004,

I see that you already created the package 👍 https://anaconda.org/anaconda/pdfium-binaries

Let me know if you want to add it to the CI.

Best regards, Benoit

mara004 commented 1 year ago

Hi @bblanchon,

oh, but that was not me! 🙃 When I searched before posting this issue I couldn't find that package yet. I think it was @boldorider4, who mentioned he's packaging pypdfium2 as part of a customer project, but (unfortunately) intends to make the result private. However, apparently he publishes any dependencies of pypdfium2 that also need to be packaged, at least for the time being ... ?

However, his builds require manual interaction, and you can see from the https://anaconda.org/main/pdfium-binaries/files tab that he's not building the whole intersection of platforms supported by pdfium-binaries and conda, e.g. linux-32, linux-armv7l, win-32 and win-arm64 are missing.

So I think it would still be good to integrate conda into pdfium-binaries CI. I can give this a try but am afraid my knowledge (and opinion) of bash scripts is poor.

bblanchon commented 1 year ago

Don't worry, I can handle the CI part 😎; I just need to know what to do. @boldorider4, could you give us some direction on how to build this package?

mara004 commented 1 year ago

boldorider4, could you give us some direction on how to build this package?

@bblanchon I've just made a draft PR with packaging except for the CI part: https://github.com/bblanchon/pdfium-binaries/pull/120 Unfortunately I'm not sure how to test it, the workflow doesn't want to run from a non-master branch in a fork...

bblanchon commented 1 year ago

I'll integrate this into the build and get back to you.

bblanchon commented 1 year ago

The first batch of Conda packages is available: https://github.com/bblanchon/pdfium-binaries/actions/runs/5986613758#artifacts How should we test them?

mara004 commented 1 year ago

Thanks a lot for your effort! We definitely need to think about testing. However, even before that we can notice that something must be wrong with the osx packages, they're much smaller and just don't contain the binaries.

bblanchon commented 1 year ago

Nice catch! I’ll fix the glob.

mara004 commented 1 year ago

Also, while the windows packages do contain their binaries, I think they're not properly following conda conventions yet.

https://docs.conda.io/projects/conda-build/en/latest/user-guide/environment-variables.html says

Unix-style packages on Windows, which are usually statically linked to executables, are built in a special Library directory under the build prefix.

See also the table from https://github.com/bblanchon/pdfium-binaries/issues/121#issuecomment-1691998538. (Not sure if this is strictly required, though.)

bblanchon commented 1 year ago

That part didn’t make any sense so I decided to use the same layout and see if it works.

mara004 commented 1 year ago

Separate thread: when trying to install the linux package locally using conda install "./linux-64/pdfium-binaries-118.0.5961-0.tar.bz2" it breaks with conda.auxlib.exceptions.ValidationError: Invalid value 0 for build_number.

mara004 commented 1 year ago

That part didn’t make any sense so I decided to use the same layout and see if it works.

Hmm, I think that's just some conda convention to put stuff in a Library/ subdirectory on Windows. I'm not sure if conda install maybe handles that specially. Personally I'd just try to lay out packages as similar to conda build as possible for robustness in case receiver tools rely on these peculiarities. But if we can test it and it works I'm of course OK with it too.

bblanchon commented 1 year ago

I downloaded the packages for lxml and they share the same layout: https://repo.anaconda.com/pkgs/main/win-64/lxml-4.9.2-py39h2bbff1b_0.tar.bz2 https://repo.anaconda.com/pkgs/main/linux-64/lxml-4.9.2-py39h5eee18b_0.tar.bz2 See? No Library\lib nonsense.

However, they don't have the same layout as ours. For example, here are the locations of the shared lib: lib\python3.9\site-packages\lxml\etree.cpython-311-x86_64-linux-gnu.so Lib\site-packages\lxml\etree.cp39-win_amd64.pyd

PS: I just started another build that should fix the missing dylib and the build_number

mara004 commented 1 year ago

I downloaded the packages for lxml and they share the same layout: See? No Library\lib nonsense.

Well, lxml is a python package (with API level bindings). I'm under the impression conda python packages might have a bit of an own layout. It would be interesting to find another "C library only" windows package and see what layout it uses.

PS: I just started another build that should fix the missing dylib and the build_number

Thanks!

mara004 commented 1 year ago

and the build_number

I just tried with https://github.com/bblanchon/pdfium-binaries/actions/runs/5991210170, but am still getting the same crazy error (full log attached: error.txt). I'm starting to fear conda might not allow installing local packages, or needs the special channel files as conda build generates them? 😖

mara004 commented 1 year ago

Wait I think there's just something wrong with my local conda; I get this error also with other commands and packages. I'll try to create a fresh env.

mara004 commented 1 year ago

OK conda clean did the trick. The build_number error might have been for unrelated reasons, I'm not sure. However, I now get a new installation error:

pdfium-binaries-118. | ############################################################################################################################################################################################################# | 100% 

# >>>>>>>>>>>>>>>>>>>>>> ERROR REPORT <<<<<<<<<<<<<<<<<<<<<<

    Traceback (most recent call last):
      File "/usr/lib/python3.11/site-packages/conda/exceptions.py", line 1114, in __call__
        return func(*args, **kwargs)
               ^^^^^^^^^^^^^^^^^^^^^
      File "/usr/lib/python3.11/site-packages/conda/cli/main.py", line 86, in main_subshell
        exit_code = do_call(args, p)
                    ^^^^^^^^^^^^^^^^
      File "/usr/lib/python3.11/site-packages/conda/cli/conda_argparse.py", line 90, in do_call
        return getattr(module, func_name)(args, parser)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
      File "/usr/lib/python3.11/site-packages/conda/cli/main_install.py", line 20, in execute
        install(args, parser, 'install')
      File "/usr/lib/python3.11/site-packages/conda/cli/install.py", line 177, in install
        explicit(args_packages, prefix, verbose=not context.quiet)
      File "/usr/lib/python3.11/site-packages/conda/misc.py", line 97, in explicit
        assert not any(spec_pcrec[1] is None for spec_pcrec in specs_pcrecs)
    AssertionError
bblanchon commented 1 year ago

Can you add a pdb.set_trace() and analyze what's going on?

mara004 commented 1 year ago

I'm not used to pdb, but I've now added the trace before the assert and then did p specs_pcrecs. It is ([MatchSpec("packages/linux-64::pdfium-binaries==118.0.5961.0=1"), None],) (with packages/ being the parent folder, and the invocation being conda install --offline "./linux-64/pdfium-binaries-118.0.5961.0-1.tar.bz2"). I tried to take a look at the surrounding code. PackageCacheData.query_all(spec) apparently fails by returning an empty iterator. But I have no idea what this code is supposed to achieve.

That said, I don't think it leads us anywhere, and this is not very nice to debug because the code is in /usr.

mara004 commented 1 year ago

FWIW when creating a (python-specifc) conda package for pypdfium2 using conda build, I am able to install it locally. This is what the directory layout created by conda build looks like: image

And this is how specs_pcrecs looks:

([MatchSpec("out/linux-64::pypdfium2==5.0.0b0=py311h9bf148f_0"), PackageCacheRecord(_hash=6400806049712209450, name='pypdfium2', version='5.0.0b0', build='py311h9bf148f_0', build_number=0, channel=Channel("out/linux-64"), subdir='linux-64', fn='pypdfium2-5.0.0b0-py311h9bf148f_0.tar.bz2', url='file:///home/mara/projects/pypdfium2/conda/out/linux-64/pypdfium2-5.0.0b0-py311h9bf148f_0.tar.bz2', sha256='39784aa3711fa095225bd783251b7526d880d194e95146108136edceb79f087e', arch='x86_64', platform='linux', depends=('libgcc-ng >=11.2.0', 'python >=3.11,<3.12.0a0'), constrains=(), license='Apache-2.0 or BSD-3-Clause', timestamp=1693155445.834, size=2962164, package_tarball_full_path='/home/mara/.conda/pkgs/pypdfium2-5.0.0b0-py311h9bf148f_0.tar.bz2', extracted_package_dir='/home/mara/.conda/pkgs/pypdfium2-5.0.0b0-py311h9bf148f_0', md5='db707ac8e6713558904680088ca367bc')],)
mara004 commented 1 year ago

OK, no, the surroundings don't matter. When isolating the pypdfium2 package it is still installable. The docs also say it should be possible to install a downloaded package: https://docs.anaconda.com/free/anaconda/packages/install-packages/#installing-packages-on-a-non-networked-air-gapped-computer

bblanchon commented 1 year ago

With the same command, I get:

The current version of conda is too old to install this package. (This version only supports paths.json schema version 1.) Please update conda to install this package.

It turns out, it expects an undocumented paths_version at the root of paths.json 🙄

bblanchon commented 1 year ago

After adding paths_version, the package installs correctly, however, I'm a bit surprised by the location of the installed files. On my Docker setup, it installs files in /opt/conda with all the other conda files. It's acceptable for /opt/conda/lib/libpdfium.so but the .h files are mixed with hundreds of other ones.

bblanchon commented 1 year ago

New batch: https://github.com/bblanchon/pdfium-binaries/actions/runs/5997229179#artifacts

mara004 commented 1 year ago

Interesting. Unfortunately I'm still stuck with the same error desipte the new build. If you're able to install the same package it might just be something wrong with my setup. I installed conda from fedora 37 repos (not sure if it's up-to-date), think I should better have used the official miniconda installer.

bblanchon commented 1 year ago

I tried again with the Docker images continuumio/anaconda3 and continuumio/miniconda3 and it seems to work.

C:\> docker run --rm -v %CD%:/host continuumio/miniconda3 conda install /host/pdfium-binaries-118.0.5975.0-1.tar.bz2

Downloading and Extracting Packages

Downloading and Extracting Packages

Preparing transaction: ...working... done
Verifying transaction: ...working... done
Executing transaction: ...working... done
mara004 commented 1 year ago

Nice.

I will uninstall distro conda (actually more commands randomly fail) and reinstall miniconda cleanly, then test the linux package again. It might take some time. You don't need to wait for me if you think this is ready.

PS: The full version (e.g. 118.0.5975.0) seems inconvenient for pinning or specifying bounds. The build tag (e.g. 5975) is already unique of its own and easier to work with. Could the conda packages just use that as version, similar to this repo's tags? IIRC the full version is a bit complicated to get, so changing this would aid integration with pypdfium2.

bblanchon commented 1 year ago

Can't you pin to ==118.* or >=118?

mara004 commented 1 year ago

That often stays the same across different builds. I probably have to pin to an exact build because that's what the bindings file is tied to. Also I don't currently have the full version in pypdfium2 and am not sure how to get/integrate that nicely.

mara004 commented 1 year ago

Oh, and I think we might need to name the package differently (e.g. pdfium-binaries-bblanchon) to avoid collision with @boldorider4's builds.

mara004 commented 1 year ago

I pulled in a fresh miniconda. Effectively the same error, just with a new description: error.txt

bblanchon commented 1 year ago

I'd really prefer if the Conda package had the same numbering scheme as other packages. Since pypdfium2 is independent of pdfium's version, why do you want to "pin" the version?

We must take over the pdfium-binaries, or people will always pick the wrong package. @boldorider4, can you transfer ownership to bblanchon? Here is what I found in the documentation: Transferring a package to a new owner.

@mara004, you're on Fedora, right? How did you install miniconda?

mara004 commented 1 year ago

@mara004, you're on Fedora, right? How did you install miniconda?

Yes. I installed it as documented here: https://docs.conda.io/en/latest/miniconda-install.html https://docs.conda.io/en/latest/miniconda.html

Outdated > I'd really prefer if the Conda package had the same numbering scheme as other packages. Since pypdfium2 is independent of pdfium's version, why do you want to "pin" the version? Is it possible to match against the full version with just the build tag? Then it would be fine as is, it's just not obvious to me how to do that. Unfortunately pypdfium2 is not independent of pdfium version, strictly speaking. The bindings file is generated from headers for a particular version. If e.g. struct fields changed and the bindings mismatch, I think we can get undefined behavior / crashes. In practice it might often work out since pdfium's public API is very stable and ctypesgen wraps the library `.get()` calls with if guards so new functions are not a problem, but the bindings file is only formally safe with the version of pdfium it was generated from. I'm not yet sure if I will actually pin it, but would like to be able to do so in case of necessity.
mara004 commented 1 year ago

Also the shorthand is easy to retrieve through the tags with git ls-remote (that's what we currently use to get release URLs). The full version we'd probably need to parse out of the tarball's version file or the release notes heading, which is more complicated (doable, but not very nice).

mara004 commented 1 year ago

OK nevermind about version, I think we can handle this. I agree the full version is visually nicer and consistency matters. Sorry for the fuss. (Extracting the full version and translating to that from a shorthand version should actually be pretty straightforward. Moreover it's an enhancement to also expose the full version to the user.)

bblanchon commented 1 year ago

I tested again with a Fedora image and it works.

Here is the Dockerfile:

FROM fedora:37

RUN mkdir -p ~/miniconda3
RUN curl -L https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh -o ~/miniconda3/miniconda.sh
RUN bash ~/miniconda3/miniconda.sh -b -u -p ~/miniconda3
RUN rm -rf ~/miniconda3/miniconda.sh
RUN ~/miniconda3/bin/conda init bash

Here is the command I ran:

# conda install --offline pdfium-binaries-118.0.5975.0-1.tar.bz2 

Downloading and Extracting Packages

Downloading and Extracting Packages

Preparing transaction: done
Verifying transaction: done
Executing transaction: done
boldorider4 commented 1 year ago

@bblanchon @mara004,

my apologies for my silence. This thread is getting very long very quickly. Could you give me a brief summary what you guys are trying to do? To my understanding, you packaged pdfium-binaries-118.0.5975 yourself, supposedly using a recipe of your own, and then now you're trying to install it locally (so not pulling the package from the default repo of anaconda, but rather from your locally built package). Is my understanding correct?

I may also package that using our standard recipe methods so you can simply pull it from main and don't have to install it --offline. But I want to be sure that there isn't any ABI-/API-breakage with pypdfium2 (trying to save myself from more effort).

Answering a few questions:

@mara004 Could the conda packages just use that as version, similar to this repo's tags?

you could specify in the conda recipe something along the lines of: source: git_url: git@github.com:bblanchon/pdfium-binaries.git git_rev: "chromium/5975" but it is not really encouraged, it's more condiac to rely on a pre-tarballed source archive. source: url: https://github.com/bblanchon/pdfium-binaries/releases/download/chromium%2F5975/pdfium-v8-mac-arm64.tgz # [osx and arm64] sha256: <sha256> # [osx and arm64]

@bblanchon can you transfer ownership to bblanchon?

If you're referring to the feedstock where we keep our recipes, unfortunately I cannot. Those recipes are meant to be managed by the anaconda organization (they're not community repos). If you really do wish to proceed with your own flavor of pdfium-binaries, I'd suggest to upload it to the conda-forge channel. But in that case you need to open a conda-forge repo. You should not upload privately built packages to an official channel like main or conda-forge.

The good news is that I am very close to patching the setup.py and setupsrc scripts of pypdfium2 so that the setup process seamlessly integrates into the conda eco system. In order not to transform this into maintenance nightmare, @mara004 could have a look at the recipe and related patch (the patch will be part of the same feedstock) and implement a switch into the setup.py that allows switching from the self-contained setup (the one that packages libpdfium.[dylib|so|dll] into a portable package) and the setup that is meant to integrate when running conda build or conda install. I will comment as much as possible into the patch to make this transition easier.

mara004 commented 1 year ago

I tested again with a Fedora image and it works.

@bblanchon OK. This is good because it suggests the packages are fine. Maybe there is some broken leftover from my former distro install of conda. I'll check.

@boldorider4 Thanks for joining this thread (sorry it grew so long). I'll reply to the questions directed at me later when I have more time.

mara004 commented 1 year ago

Answer part 1:

Could you give me a brief summary what you guys are trying to do?

TLDR We're trying to integrate conda packaging into pdfium-binaries CI directly to avoid dependence on a third party and necessity of manual interaction, and to package for all supported platforms.

supposedly using a recipe of your own

@bblanchon wrote a custom script (not a recipe) because of problems with conda build/convert. See the master...conda2 diff.

I may also package that using our standard recipe methods so you can simply pull it from main and don't have to install it --offline.

Due to the direct integration we're trying to achieve, it is not possible to "merge" this into anaconda-recipes/pdfium-binaries-feedstock. (FWIW pdfium-binaries-feedstock does a lot more CI work than necessary and still ends up supporting less platforms, not to mention the very extensive dependence on native hosts.)

and then now you're trying to install it locally (so not pulling the package from the default repo of anaconda, but rather from your locally built package). Is my understanding correct?

Local installation works fine for @bblanchon but fails for me. I'm currently investigating why. This may not be overly relevant for the purposes of this thread, though.

you could specify in the conda recipe something along the lines of: [...]

We were takling about the version specifier in pdfium-binaries and a possible pin in the upcoming requirements section in pypdfium2's recipe, not about the source section of a third-party style recipe. 😅 However, this is settled, I think I know now how to pin, should the necessity arise. FWIW your first suggestion were definitely not applicable, noone wants to build the binaries again, but of course use the pre-built archives. 🤪

But I want to be sure that there isn't any ABI-/API-breakage with pypdfium2 (trying to save myself from more effort).

See the second paragraph of the Outdated tab of https://github.com/bblanchon/pdfium-binaries/issues/119#issuecomment-1696018361. This is still my current state of knowledge, just not relevant for this thread anymore. TLDR If not pinning/bounding pdfium-binaries in pypdfium2's recipe, ABI breakage (or just a degree of mismatch that renders the bindings unusable) is possible in principle, but comparatively unlikely due to pdfium's API stability practices.

boldorider4 commented 1 year ago

OK, I see what you're doing. I'm afraid my work on the feedstock recipe cannot be of help here. Nonetheless, when my patch for setup.py and setupsrc is ready, I'd like you to have a quick review of it and let me know your thoughts. Ultimately, it would be really cool if you could merge it to the master of pypdfium2 by protecting the code with some sort of switch. Something like:

if os.environ['CONDA_BUILD_ENABLED']:
    # do what my patch does
else:
    # do what setup_base.py used to do, pulling all prebuilt binaries
     from pdfium-binaries and bundling them in the pypdfium package

Where CONDA_BUILD_ENABLED is a new switch settable somewhere, maybe at the toml level. I used an env var because that would be relatively easy to enable in our recipe.

mara004 commented 1 year ago

@boldorider4 You may submit your patch as draft PR, then I can take a look (I'll leave it up to you whether to target main or devel_v5).

However, I doubt if this will actually be necessary. You can already set $PDFIUM_PLATFORM=none[^1] as described here to opt out of pypdfium2's bundling. One could generate the bindings file with correct runtime libdirs externally and copy it to src/pypdfium2/raw.py[^1] ahead of the pip install.

That said, as I plan to do the conda packaging myself properly anyway, I'm not particularly enthusiastic to aid a simultaneous paid third-party packaging project that will probably not even make the results public. I think I'd only merge a patch if it's good and also needed for our purposes.

[^1]: rsp. PDFIUM_BINARY and raw_unsafe.py in v5

bblanchon commented 1 year ago

If you're referring to the feedstock where we keep our recipes, unfortunately I cannot. Those recipes are meant to be managed by the anaconda organization (they're not community repos).

I was talking about the package named "pdfium-binaries" on Conda, not about the GitHub repo.

If you really do wish to proceed with your own flavor of pdfium-binaries, I'd suggest to upload it to the conda-forge channel.

But then which package conda install pdfium-binaries will install?

But in that case you need to open a conda-forge repo. You should not upload privately built packages to an official channel like main or conda-forge.

Do you mean we cannot build the package in a GitHub workflow?

mara004 commented 1 year ago

But then which package conda install pdfium-binaries will install?

If -c $CHANNELNAME is specified, it would install from that channel. Otherwise, I believe it would depend on the user's channel configuration, with anaconda main being the default choice. (This is a guess, please correct me if I'm wrong.)

But in that case you need to open a conda-forge repo. You should not upload privately built packages to an official channel like main or conda-forge.

Do you mean we cannot build the package in a GitHub workflow?

I believe conda-forge, unlike anaconda, does not allow custom uploads, but requires everything to go through the feedstock repos. So we cannot use that. If we will get access to main/pdfium-binaries on anaconda, we'll use that. If we don't, this is bad, but then we can still upload under a different name (see above for suggestions) so dependent recipes can request the packages built here and not leave it up to the end user's channel config.

mara004 commented 1 year ago

BTW, this SO thread sounds related to my offline install problems. I once ran conda clean --all in the previous install, but @kalefranz writes

But take note: if you ever want to do any type of --offline operations, don't use --all; be more selective.

mara004 commented 1 year ago

RE https://github.com/bblanchon/pdfium-binaries/issues/119#issuecomment-1695224193:

however, I'm a bit surprised by the location of the installed files. On my Docker setup, it installs files in /opt/conda with all the other conda files. It's acceptable for /opt/conda/lib/libpdfium.so but the .h files are mixed with hundreds of other ones.

@bblanchon That's a good point. I guess we could prevent this by nesting the headers in an own pdfium subdirectory in the tarball, so the layout would be include/pdfium/fpdf_*.h instead of include/fpdf_*.h?

Also, a loose cpp/ dir in the top-level conda include dir is problematic. I guess that's why you were reluctant to include it in the first place?

bblanchon commented 1 year ago

@boldorider4, what's the status of this package? @mara004 and I spent hours on this, I think we deserve some closure.

mara004 commented 1 year ago

@bblanchon I have already been wondering what we are waiting for. What exactly do you want from @boldorider4? The ownership question, I suppose? As to that, I think he already indicated it's out of his hands. I'd suggest we just go ahead and upload the builds to a custom anaconda channel...

bblanchon commented 1 year ago

I, too, wondered what we were waiting for. I don't see the point of uploading to a custom Anaconda channel. If @boldorider4 provides the "official" package, why should we create another one?

mara004 commented 1 year ago

If @boldorider4 provides the "official" package, why should we create another one?

Well, for the reasons stated in https://github.com/bblanchon/pdfium-binaries/issues/119#issuecomment-1691426942: it isn't build regularly and not for all platforms, it's not in our control (e.g. might go non-public or stop being updated in the future), and not implemented elegantly.