Closed mara004 closed 1 year ago
CC @frgfm @felixT2K (https://github.com/mindee/doctr/issues/113)
Hi @mara004,
I see that you already created the package 👍 https://anaconda.org/anaconda/pdfium-binaries
Let me know if you want to add it to the CI.
Best regards, Benoit
Hi @bblanchon,
oh, but that was not me! 🙃 When I searched before posting this issue I couldn't find that package yet. I think it was @boldorider4, who mentioned he's packaging pypdfium2 as part of a customer project, but (unfortunately) intends to make the result private. However, apparently he publishes any dependencies of pypdfium2 that also need to be packaged, at least for the time being ... ?
However, his builds require manual interaction, and you can see from the https://anaconda.org/main/pdfium-binaries/files tab that he's not building the whole intersection of platforms supported by pdfium-binaries and conda, e.g. linux-32, linux-armv7l, win-32 and win-arm64 are missing.
So I think it would still be good to integrate conda into pdfium-binaries CI. I can give this a try but am afraid my knowledge (and opinion) of bash scripts is poor.
Don't worry, I can handle the CI part 😎; I just need to know what to do. @boldorider4, could you give us some direction on how to build this package?
boldorider4, could you give us some direction on how to build this package?
@bblanchon I've just made a draft PR with packaging except for the CI part: https://github.com/bblanchon/pdfium-binaries/pull/120 Unfortunately I'm not sure how to test it, the workflow doesn't want to run from a non-master branch in a fork...
I'll integrate this into the build and get back to you.
The first batch of Conda packages is available: https://github.com/bblanchon/pdfium-binaries/actions/runs/5986613758#artifacts How should we test them?
Thanks a lot for your effort! We definitely need to think about testing. However, even before that we can notice that something must be wrong with the osx packages, they're much smaller and just don't contain the binaries.
Nice catch! I’ll fix the glob.
Also, while the windows packages do contain their binaries, I think they're not properly following conda conventions yet.
https://docs.conda.io/projects/conda-build/en/latest/user-guide/environment-variables.html says
Unix-style packages on Windows, which are usually statically linked to executables, are built in a special Library directory under the build prefix.
See also the table from https://github.com/bblanchon/pdfium-binaries/issues/121#issuecomment-1691998538. (Not sure if this is strictly required, though.)
That part didn’t make any sense so I decided to use the same layout and see if it works.
Separate thread: when trying to install the linux package locally using conda install "./linux-64/pdfium-binaries-118.0.5961-0.tar.bz2"
it breaks with conda.auxlib.exceptions.ValidationError: Invalid value 0 for build_number
.
That part didn’t make any sense so I decided to use the same layout and see if it works.
Hmm, I think that's just some conda convention to put stuff in a Library/
subdirectory on Windows. I'm not sure if conda install
maybe handles that specially. Personally I'd just try to lay out packages as similar to conda build
as possible for robustness in case receiver tools rely on these peculiarities. But if we can test it and it works I'm of course OK with it too.
I downloaded the packages for lxml and they share the same layout:
https://repo.anaconda.com/pkgs/main/win-64/lxml-4.9.2-py39h2bbff1b_0.tar.bz2
https://repo.anaconda.com/pkgs/main/linux-64/lxml-4.9.2-py39h5eee18b_0.tar.bz2
See? No Library\lib
nonsense.
However, they don't have the same layout as ours.
For example, here are the locations of the shared lib:
lib\python3.9\site-packages\lxml\etree.cpython-311-x86_64-linux-gnu.so
Lib\site-packages\lxml\etree.cp39-win_amd64.pyd
PS: I just started another build that should fix the missing dylib and the build_number
I downloaded the packages for lxml and they share the same layout: See? No Library\lib nonsense.
Well, lxml is a python package (with API level bindings). I'm under the impression conda python packages might have a bit of an own layout. It would be interesting to find another "C library only" windows package and see what layout it uses.
PS: I just started another build that should fix the missing dylib and the build_number
Thanks!
and the build_number
I just tried with https://github.com/bblanchon/pdfium-binaries/actions/runs/5991210170, but am still getting the same crazy error (full log attached: error.txt). I'm starting to fear conda might not allow installing local packages, or needs the special channel files as conda build generates them? 😖
Wait I think there's just something wrong with my local conda; I get this error also with other commands and packages. I'll try to create a fresh env.
OK conda clean did the trick. The build_number error might have been for unrelated reasons, I'm not sure. However, I now get a new installation error:
pdfium-binaries-118. | ############################################################################################################################################################################################################# | 100%
# >>>>>>>>>>>>>>>>>>>>>> ERROR REPORT <<<<<<<<<<<<<<<<<<<<<<
Traceback (most recent call last):
File "/usr/lib/python3.11/site-packages/conda/exceptions.py", line 1114, in __call__
return func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "/usr/lib/python3.11/site-packages/conda/cli/main.py", line 86, in main_subshell
exit_code = do_call(args, p)
^^^^^^^^^^^^^^^^
File "/usr/lib/python3.11/site-packages/conda/cli/conda_argparse.py", line 90, in do_call
return getattr(module, func_name)(args, parser)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/lib/python3.11/site-packages/conda/cli/main_install.py", line 20, in execute
install(args, parser, 'install')
File "/usr/lib/python3.11/site-packages/conda/cli/install.py", line 177, in install
explicit(args_packages, prefix, verbose=not context.quiet)
File "/usr/lib/python3.11/site-packages/conda/misc.py", line 97, in explicit
assert not any(spec_pcrec[1] is None for spec_pcrec in specs_pcrecs)
AssertionError
Can you add a pdb.set_trace()
and analyze what's going on?
I'm not used to pdb
, but I've now added the trace before the assert and then did p specs_pcrecs
.
It is ([MatchSpec("packages/linux-64::pdfium-binaries==118.0.5961.0=1"), None],)
(with packages/
being the parent folder, and the invocation being conda install --offline "./linux-64/pdfium-binaries-118.0.5961.0-1.tar.bz2"
).
I tried to take a look at the surrounding code. PackageCacheData.query_all(spec)
apparently fails by returning an empty iterator. But I have no idea what this code is supposed to achieve.
That said, I don't think it leads us anywhere, and this is not very nice to debug because the code is in /usr
.
FWIW when creating a (python-specifc) conda package for pypdfium2 using conda build
, I am able to install it locally.
This is what the directory layout created by conda build
looks like:
And this is how specs_pcrecs
looks:
([MatchSpec("out/linux-64::pypdfium2==5.0.0b0=py311h9bf148f_0"), PackageCacheRecord(_hash=6400806049712209450, name='pypdfium2', version='5.0.0b0', build='py311h9bf148f_0', build_number=0, channel=Channel("out/linux-64"), subdir='linux-64', fn='pypdfium2-5.0.0b0-py311h9bf148f_0.tar.bz2', url='file:///home/mara/projects/pypdfium2/conda/out/linux-64/pypdfium2-5.0.0b0-py311h9bf148f_0.tar.bz2', sha256='39784aa3711fa095225bd783251b7526d880d194e95146108136edceb79f087e', arch='x86_64', platform='linux', depends=('libgcc-ng >=11.2.0', 'python >=3.11,<3.12.0a0'), constrains=(), license='Apache-2.0 or BSD-3-Clause', timestamp=1693155445.834, size=2962164, package_tarball_full_path='/home/mara/.conda/pkgs/pypdfium2-5.0.0b0-py311h9bf148f_0.tar.bz2', extracted_package_dir='/home/mara/.conda/pkgs/pypdfium2-5.0.0b0-py311h9bf148f_0', md5='db707ac8e6713558904680088ca367bc')],)
OK, no, the surroundings don't matter. When isolating the pypdfium2 package it is still installable. The docs also say it should be possible to install a downloaded package: https://docs.anaconda.com/free/anaconda/packages/install-packages/#installing-packages-on-a-non-networked-air-gapped-computer
With the same command, I get:
The current version of conda is too old to install this package. (This version only supports paths.json schema version 1.) Please update conda to install this package.
It turns out, it expects an undocumented paths_version
at the root of paths.json
🙄
After adding paths_version
, the package installs correctly, however, I'm a bit surprised by the location of the installed files.
On my Docker setup, it installs files in /opt/conda
with all the other conda files.
It's acceptable for /opt/conda/lib/libpdfium.so
but the .h
files are mixed with hundreds of other ones.
Interesting. Unfortunately I'm still stuck with the same error desipte the new build. If you're able to install the same package it might just be something wrong with my setup. I installed conda from fedora 37 repos (not sure if it's up-to-date), think I should better have used the official miniconda installer.
I tried again with the Docker images continuumio/anaconda3
and continuumio/miniconda3
and it seems to work.
C:\> docker run --rm -v %CD%:/host continuumio/miniconda3 conda install /host/pdfium-binaries-118.0.5975.0-1.tar.bz2
Downloading and Extracting Packages
Downloading and Extracting Packages
Preparing transaction: ...working... done
Verifying transaction: ...working... done
Executing transaction: ...working... done
Nice.
I will uninstall distro conda (actually more commands randomly fail) and reinstall miniconda cleanly, then test the linux package again. It might take some time. You don't need to wait for me if you think this is ready.
PS: The full version (e.g. 118.0.5975.0
) seems inconvenient for pinning or specifying bounds. The build tag (e.g. 5975
) is already unique of its own and easier to work with. Could the conda packages just use that as version, similar to this repo's tags? IIRC the full version is a bit complicated to get, so changing this would aid integration with pypdfium2.
Can't you pin to ==118.*
or >=118
?
That often stays the same across different builds. I probably have to pin to an exact build because that's what the bindings file is tied to. Also I don't currently have the full version in pypdfium2 and am not sure how to get/integrate that nicely.
Oh, and I think we might need to name the package differently (e.g. pdfium-binaries-bblanchon) to avoid collision with @boldorider4's builds.
I pulled in a fresh miniconda. Effectively the same error, just with a new description: error.txt
I'd really prefer if the Conda package had the same numbering scheme as other packages. Since pypdfium2 is independent of pdfium's version, why do you want to "pin" the version?
We must take over the pdfium-binaries
, or people will always pick the wrong package.
@boldorider4, can you transfer ownership to bblanchon?
Here is what I found in the documentation: Transferring a package to a new owner.
@mara004, you're on Fedora, right? How did you install miniconda?
@mara004, you're on Fedora, right? How did you install miniconda?
Yes. I installed it as documented here: https://docs.conda.io/en/latest/miniconda-install.html https://docs.conda.io/en/latest/miniconda.html
Also the shorthand is easy to retrieve through the tags with git ls-remote (that's what we currently use to get release URLs). The full version we'd probably need to parse out of the tarball's version file or the release notes heading, which is more complicated (doable, but not very nice).
OK nevermind about version, I think we can handle this. I agree the full version is visually nicer and consistency matters. Sorry for the fuss. (Extracting the full version and translating to that from a shorthand version should actually be pretty straightforward. Moreover it's an enhancement to also expose the full version to the user.)
I tested again with a Fedora image and it works.
Here is the Dockerfile
:
FROM fedora:37
RUN mkdir -p ~/miniconda3
RUN curl -L https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh -o ~/miniconda3/miniconda.sh
RUN bash ~/miniconda3/miniconda.sh -b -u -p ~/miniconda3
RUN rm -rf ~/miniconda3/miniconda.sh
RUN ~/miniconda3/bin/conda init bash
Here is the command I ran:
# conda install --offline pdfium-binaries-118.0.5975.0-1.tar.bz2
Downloading and Extracting Packages
Downloading and Extracting Packages
Preparing transaction: done
Verifying transaction: done
Executing transaction: done
@bblanchon @mara004,
my apologies for my silence. This thread is getting very long very quickly. Could you give me a brief summary what you guys are trying to do?
To my understanding, you packaged pdfium-binaries-118.0.5975
yourself, supposedly using a recipe of your own, and then now you're trying to install it locally (so not pulling the package from the default repo of anaconda, but rather from your locally built package). Is my understanding correct?
I may also package that using our standard recipe methods so you can simply pull it from main
and don't have to install it --offline
. But I want to be sure that there isn't any ABI-/API-breakage with pypdfium2
(trying to save myself from more effort).
Answering a few questions:
@mara004 Could the conda packages just use that as version, similar to this repo's tags?
you could specify in the conda recipe something along the lines of:
source:
git_url: git@github.com:bblanchon/pdfium-binaries.git
git_rev: "chromium/5975"
but it is not really encouraged, it's more condiac to rely on a pre-tarballed source archive.
source:
url: https://github.com/bblanchon/pdfium-binaries/releases/download/chromium%2F5975/pdfium-v8-mac-arm64.tgz # [osx and arm64]
sha256: <sha256> # [osx and arm64]
@bblanchon can you transfer ownership to bblanchon?
If you're referring to the feedstock where we keep our recipes, unfortunately I cannot. Those recipes are meant to be managed by the anaconda organization (they're not community repos).
If you really do wish to proceed with your own flavor of pdfium-binaries
, I'd suggest to upload it to the conda-forge
channel. But in that case you need to open a conda-forge
repo. You should not upload privately built packages to an official channel like main
or conda-forge
.
The good news is that I am very close to patching the setup.py
and setupsrc
scripts of pypdfium2
so that the setup process seamlessly integrates into the conda eco system. In order not to transform this into maintenance nightmare, @mara004 could have a look at the recipe and related patch (the patch will be part of the same feedstock) and implement a switch into the setup.py
that allows switching from the self-contained setup (the one that packages libpdfium.[dylib|so|dll]
into a portable package) and the setup that is meant to integrate when running conda build
or conda install
. I will comment as much as possible into the patch to make this transition easier.
I tested again with a Fedora image and it works.
@bblanchon OK. This is good because it suggests the packages are fine. Maybe there is some broken leftover from my former distro install of conda. I'll check.
@boldorider4 Thanks for joining this thread (sorry it grew so long). I'll reply to the questions directed at me later when I have more time.
Answer part 1:
Could you give me a brief summary what you guys are trying to do?
TLDR We're trying to integrate conda packaging into pdfium-binaries CI directly to avoid dependence on a third party and necessity of manual interaction, and to package for all supported platforms.
supposedly using a recipe of your own
@bblanchon wrote a custom script (not a recipe) because of problems with conda build/convert
.
See the master...conda2
diff.
I may also package that using our standard recipe methods so you can simply pull it from main and don't have to install it --offline.
Due to the direct integration we're trying to achieve, it is not possible to "merge" this into anaconda-recipes/pdfium-binaries-feedstock
. (FWIW pdfium-binaries-feedstock
does a lot more CI work than necessary and still ends up supporting less platforms, not to mention the very extensive dependence on native hosts.)
and then now you're trying to install it locally (so not pulling the package from the default repo of anaconda, but rather from your locally built package). Is my understanding correct?
Local installation works fine for @bblanchon but fails for me. I'm currently investigating why. This may not be overly relevant for the purposes of this thread, though.
you could specify in the conda recipe something along the lines of: [...]
We were takling about the version specifier in pdfium-binaries and a possible pin in the upcoming requirements section in pypdfium2's recipe, not about the source section of a third-party style recipe. 😅 However, this is settled, I think I know now how to pin, should the necessity arise. FWIW your first suggestion were definitely not applicable, noone wants to build the binaries again, but of course use the pre-built archives. 🤪
But I want to be sure that there isn't any ABI-/API-breakage with pypdfium2 (trying to save myself from more effort).
See the second paragraph of the Outdated
tab of https://github.com/bblanchon/pdfium-binaries/issues/119#issuecomment-1696018361. This is still my current state of knowledge, just not relevant for this thread anymore.
TLDR If not pinning/bounding pdfium-binaries in pypdfium2's recipe, ABI breakage (or just a degree of mismatch that renders the bindings unusable) is possible in principle, but comparatively unlikely due to pdfium's API stability practices.
OK, I see what you're doing.
I'm afraid my work on the feedstock recipe cannot be of help here.
Nonetheless, when my patch for setup.py
and setupsrc
is ready, I'd like you to have a quick review of it and let me know your thoughts. Ultimately, it would be really cool if you could merge it to the master
of pypdfium2
by protecting the code with some sort of switch. Something like:
if os.environ['CONDA_BUILD_ENABLED']:
# do what my patch does
else:
# do what setup_base.py used to do, pulling all prebuilt binaries
from pdfium-binaries and bundling them in the pypdfium package
Where CONDA_BUILD_ENABLED
is a new switch settable somewhere, maybe at the toml
level. I used an env var because that would be relatively easy to enable in our recipe.
@boldorider4 You may submit your patch as draft PR, then I can take a look (I'll leave it up to you whether to target main
or devel_v5
).
However, I doubt if this will actually be necessary. You can already set $PDFIUM_PLATFORM=none
[^1] as described here to opt out of pypdfium2's bundling. One could generate the bindings file with correct runtime libdirs externally and copy it to src/pypdfium2/raw.py
[^1] ahead of the pip install.
That said, as I plan to do the conda packaging myself properly anyway, I'm not particularly enthusiastic to aid a simultaneous paid third-party packaging project that will probably not even make the results public. I think I'd only merge a patch if it's good and also needed for our purposes.
[^1]: rsp. PDFIUM_BINARY
and raw_unsafe.py
in v5
If you're referring to the feedstock where we keep our recipes, unfortunately I cannot. Those recipes are meant to be managed by the anaconda organization (they're not community repos).
I was talking about the package named "pdfium-binaries" on Conda, not about the GitHub repo.
If you really do wish to proceed with your own flavor of pdfium-binaries, I'd suggest to upload it to the conda-forge channel.
But then which package conda install pdfium-binaries
will install?
But in that case you need to open a conda-forge repo. You should not upload privately built packages to an official channel like main or conda-forge.
Do you mean we cannot build the package in a GitHub workflow?
But then which package conda install pdfium-binaries will install?
If -c $CHANNELNAME
is specified, it would install from that channel. Otherwise, I believe it would depend on the user's channel configuration, with anaconda main being the default choice. (This is a guess, please correct me if I'm wrong.)
But in that case you need to open a conda-forge repo. You should not upload privately built packages to an official channel like main or conda-forge.
Do you mean we cannot build the package in a GitHub workflow?
I believe conda-forge
, unlike anaconda
, does not allow custom uploads, but requires everything to go through the feedstock repos. So we cannot use that. If we will get access to main/pdfium-binaries
on anaconda, we'll use that. If we don't, this is bad, but then we can still upload under a different name (see above for suggestions) so dependent recipes can request the packages built here and not leave it up to the end user's channel config.
BTW, this SO thread sounds related to my offline install problems.
I once ran conda clean --all
in the previous install, but @kalefranz writes
But take note: if you ever want to do any type of --offline operations, don't use --all; be more selective.
RE https://github.com/bblanchon/pdfium-binaries/issues/119#issuecomment-1695224193:
however, I'm a bit surprised by the location of the installed files. On my Docker setup, it installs files in /opt/conda with all the other conda files. It's acceptable for /opt/conda/lib/libpdfium.so but the .h files are mixed with hundreds of other ones.
@bblanchon That's a good point. I guess we could prevent this by nesting the headers in an own pdfium subdirectory in the tarball, so the layout would be include/pdfium/fpdf_*.h
instead of include/fpdf_*.h
?
Also, a loose cpp/
dir in the top-level conda include dir is problematic. I guess that's why you were reluctant to include it in the first place?
@boldorider4, what's the status of this package? @mara004 and I spent hours on this, I think we deserve some closure.
@bblanchon I have already been wondering what we are waiting for. What exactly do you want from @boldorider4? The ownership question, I suppose? As to that, I think he already indicated it's out of his hands. I'd suggest we just go ahead and upload the builds to a custom anaconda channel...
I, too, wondered what we were waiting for. I don't see the point of uploading to a custom Anaconda channel. If @boldorider4 provides the "official" package, why should we create another one?
If @boldorider4 provides the "official" package, why should we create another one?
Well, for the reasons stated in https://github.com/bblanchon/pdfium-binaries/issues/119#issuecomment-1691426942: it isn't build regularly and not for all platforms, it's not in our control (e.g. might go non-public or stop being updated in the future), and not implemented elegantly.
Hi @bblanchon,
There have been some requests (e.g. from doctr) that pypdfium2 should have a conda package. Basically the pre-requisite for this would be a conda package for pdfium (see below).
The Readme says:
Unfortunately, my knowledge about conda is very limited as I don't use it personally, but this would be my idea of the workflow:
conda/recipe/meta.yaml
, spec)version: {{ environ["VERSION"] }}
)build.sh
/bld.bat
scripts.conda build conda/ --output-folder conda/out/
in each runner after the binary was builtconda convert conda/out/$HOST_PLATFORM/*.tar.bz2 -p $TARGET_PLATFORM -o conda/out/
(for a list of conda platform names, see the choices for-p
inconda convert --help
)actions/upload-artifact
,conda/out/$TARGET_PLATFORM/*.tar.gz
)ANACONDA_API_TOKEN=... anaconda upload $PACKAGES_DIR/*/*.tar.bz2
- not sure, probably something like this)Presumably we need the action
conda-incubator/setup-miniconda
.Why we need this
Initially I intended to bundle the binaries in pypdfium2's conda packages (as we do for PyPI wheels), but it turns out this is not really the way to go:
n_py_versions - 1
unnecessary packages. If pdfium is packaged separately, pypdfium2 can just depend on that and cleanly specifynoarch: python
.