bblanchon / pdfium-binaries

📰 Binary distribution of PDFium
908 stars 195 forks source link

DO NOT MERGE: WebAssembly standalone support #103

Open jerbob92 opened 1 year ago

jerbob92 commented 1 year ago

@bblanchon I promised to come back to you when I finished my WebAssembly implementation, so here it is! It's not ready to merge this in, since it will break support for the web build, but it is possible to use this as a base to create 2 WASM builds: web and standalone.

This contains the following changes:

The thing that I'm not really happy with is how the Emscripten patch is applied, but I couldn't get it to work in another way, since emsdk/upstream/emscripten is not a git repo.

bblanchon commented 1 year ago

Hi @jerbob92,

Thank you very much for sharing this!

I didn't expect that we'd have to patch emscriptem and add a new features PDF. I think this is out of the scope of this repository: the goal here is to provide a pre-built PDFium, not extend PDFium. Do you think these changes could be integrated upstream?

I'll try to cherry-pick the changes you recommend right away. I originally downgraded emscriptem in f09df4a12c517135b5ead6f3c5962776610a5acb to work around the error "Uncaught RuntimeError: unreachable"; hopefully, this bug is gone.

Best regards, Benoit

jerbob92 commented 1 year ago

I think Emscripten is not going to accept the patch in this state because it only implements what PDFium uses, nothing else.

And regarding extending PDFium: I don't think there is a way to work around this, to use function pointers in WebAssembly, the pointer needs to exist during compile time or you need to add the pointer to a method during runtime, which many runtimes don't support, that's why I had to implement various callbacks in the PDFium patch and then implement the callback logic on the runtime side. So I wouldn't really see it as extending PDFium, this is just a way to make all functionality work on in WebAssembly.

I originally downgraded emscriptem in https://github.com/bblanchon/pdfium-binaries/commit/f09df4a12c517135b5ead6f3c5962776610a5acb to work around the error "Uncaught RuntimeError: unreachable"; hopefully, this bug is gone.

I think this was related to the memory allocator that has since been disabled.

jerbob92 commented 1 year ago

@bblanchon small warning, free and malloc don't seem to export by default anymore in the latest Emscripten, I had to add them to EXPORTED_FUNCTIONS myself. Just in case you want to upgrade Emscripten in the future. The build works fine but people using the Webassembly build will need access to free and malloc, although I'm not sure if that's also the case for JS users.

jerbob92 commented 1 year ago

@bblanchon I have been working on making an embind implementation for Go/Wazero here: https://github.com/jerbob92/wazero-emscripten-embind

Would you consider merging this If I convert the changes from patches/wasm/callbacks.patch into Embind? That way the JS implementations can also benefit from it (a lot), since it makes it a lot easier to use/access C things from JS (and in my case Go).

bblanchon commented 1 year ago

My main concern with this PR is that it not only patches PDFium, but it also patches emscriptem. There are roughly 1500 lines of patches in the change set!

I am concerned that this build will fail often and require frequent maintenance. Also, I'm afraid the patches are so complex that no one would be able to take over the maintenance if you were to stop working on it. Lastly, I don't see a lot of demand for the build, so it looks like a lot of work for very few users.

I'm okay with adding a standalone build or modifying the existing wasm build to support both, but I don't want this project to be a maintenance nightmare.

jerbob92 commented 1 year ago

Understandable, I was not talking about the emscripten patch though, I'm still working on getting that merged into Emscripten, but the current discussion there is whether they are just going to include wasi-libc into Emscripten (most of the stuff in my Emscripten patch comes from wasi-libc).

I was specifically asking about the embind integration, which is a patch I'm currently working on (in favor of the callbacks patch).

bblanchon commented 1 year ago

Honestly, I'm not a big fan of this callback patch, probably because I don't understand it. It looks like you trying to overcome the limitations of the Wazero runtime, but that has never been my goal. This repo is supposed to provide prebuilt binaries of PDFium, not to modify the library.

jerbob92 commented 1 year ago

I'm starting to feel like I'm the only one using the WebAssembly build :D I will try to explain:

sungaila commented 1 year ago

I'm starting to feel like I'm the only one using the WebAssembly build :D

There are at least two. I have two GitHub pages hosted right now using these PDFium WebAssembly builds: https://www.sungaila.de/PDFtoImage/ and https://www.sungaila.de/PDFtoZPL/

My biggest issue right now is the WASM NuGet package. I need specific builds with specific Emscripten versions and the NuGet package would need to include one build per version. Each .NET release depends on a single Emscripten release.

I didn't find a way an easy way to modify the existing GitHub workflows to make this happen (without rewritting most bash scripts and workflow YMLs). So instead I use my fork to build with user-defined Emscripten version: https://github.com/sungaila/pdfium-binaries/blob/3213c3f882bd75891f7da8b9644111708a3fa370/.github/workflows/build-one.yml#L44C7-L48C26

jerbob92 commented 1 year ago

There are at least two. I have two GitHub pages hosted right now using these PDFium WebAssembly builds: https://www.sungaila.de/PDFtoImage/ and https://www.sungaila.de/PDFtoZPL/

Ah cool! But is there anyone using the JS WASM version?

It looks like the .NET runtime is already doing a lot for you in terms of data conversion and function pointer handling. So basically it looks like they built the Embind thing inside the runtime.

My biggest issue right now is the WASM NuGet package. I need specific builds with specific Emscripten versions and the NuGet package would need to include one build per version. Each .NET release depends on a single Emscripten release.

Honestly, this sounds like a major design flaw in the .NET WASM support? But seeing how they integrate the library (with the .a file and not the .wasm file), it makes sense that it works like this.

sungaila commented 1 year ago

Honestly, this sounds like a major design flaw in the .NET WASM support? But seeing how they integrate the library (with the .a file and not the .wasm file), it makes sense that it works like this.

They transpile the .NET code into C++ and then into Wasm. Might explain why they want .a files but I still despise that Emscripten SDK version lock-in.

jerbob92 commented 1 year ago

Yeah it's probably not something they like to do, but it works like that because they include the .a file in the compilation process, which does allow them to use the WebAssembly more freely, but comes with the downside that the compiled .a file can only be used by that Emscripten version, since the .a file is not a build result, but a intermediary file during the build process to WebAssembly.

I don't see that changing without .NET changing the way how it integrates the WebAssembly.

bblanchon commented 1 year ago

I do not like the currently patch either, which is why I asked if you would accept it if it was written using Embind, which is native to Emscripten

@jerbob92, I didn't realize that Embind would allow you to get rid of callback.patch. I'm okay with generating binding with Embind and including them in the wasm build. You're right; this is not a modification of PDFium and doesn't seem to require a lot of maintenance, so please go for it!

jerbob92 commented 1 year ago

I have started implementing this but it's starting to look like I can't export everything from pdfium directly using Embind because it doesn't support everything, for example incomplete types, which most types are, and pointers to native types, so I have been hitting some roadblocks.

I'm currently in contact with Emscripten to see how we could implement this, but if it requires wrapping every function with an Embind function this whole endeavor seems kinda pointless. I'll keep you updated.