coolwanglu / pdf2htmlEX

Convert PDF to HTML without losing text or format.
http://coolwanglu.github.com/pdf2htmlEX/

Running on Amazon Lambda #694

Open JayVem opened 7 years ago

JayVem commented 7 years ago

Is it possible to run pdf2htmlEX on Amazon Lambda? Amazon uses its own Amazon Linux on the compute instances that back Lambda. I believe this is a great use case for distributed processing, especially for large PDF documents: a conversion that takes up to 15 minutes on a regular i7 desktop processor could be done within a minute on Lambda.

davidhedley commented 7 years ago

Unlikely ever to happen. Lambda supports only Node.js, Java and Python. pdf2htmlEX is C++ and relies on a whole bunch of supporting libraries.

JayVem commented 7 years ago

Yes, I got it working by compiling it on Amazon Linux and packaging the executable with the Lambda function.

JayVem commented 7 years ago

The function itself is very simple: it just invokes pdf2htmlEX as an external system process from Node.js. The difficult part was compiling pdf2htmlEX on Amazon Linux.

fasiha commented 7 years ago

Emscripten?

zeckli commented 7 years ago

Any progress or alternative solution?

careerlister commented 7 years ago

@JayVem Can you point me to instructions, or provide instructions, on how you were able to compile pdf2htmlEX and its dependencies for Lambda? That would be greatly appreciated, as we are attempting to do so without success.

ardcore commented 6 years ago

@JayVem I'd also be interested. I got to the point where I managed to compile pdf2htmlEX on Amazon Linux, but not as a static binary, and I think a static binary will be required here. Are you able to share some insights or the binary itself, or is it proprietary?

dengelke commented 6 years ago

@careerlister @ardcore I've managed to get it working on Lambda (finally). The approach I eventually took was:

  1. Build all dependent libraries and pdf2htmlEX from source inside a lambci/lambda:build-nodejs6.10 Docker build image locally, to deal with differences between the Lambda Linux environment and the host OS (in my case OS X). I initially tried a remote Amazon Linux EC2 instance, but the output didn't work on Lambda.
  2. Move the required libraries and binaries into a Lambda function directory on your host machine.
  3. Run the function on an image that replicates your desired Lambda environment. My test script was basically: docker run -v \"$PWD\":/var/task lambci/lambda:nodejs6.10 index.handler '{"some":"event"}', with a child process inside Node.js calling pdf2htmlEX.
  4. Adjust and repeat steps 1, 2 and 3 until the function works on the local Docker image.
  5. Deploy to Lambda, ensuring your total unzipped deployment size is less than 250 MB (see here for an explanation). Because I tested locally on the Docker image, I didn't have any issues with differing operating environments when migrating over.
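
The size check in step 5 can be automated before deploying. The following is a rough sketch, not part of dengelke's workflow: the helper names `dirSize` and `checkBundle` are made up for illustration, and it assumes "250mb" means 250 MiB and that only regular files count (symbolic links are not followed).

```javascript
"use strict";
const fs = require("fs");
const path = require("path");

// Assumption: the 250 MB unzipped limit from step 5, interpreted as MiB.
const LIMIT_BYTES = 250 * 1024 * 1024;

// Recursively sum the sizes of regular files under dir (symlinks are skipped).
function dirSize(dir) {
  let total = 0;
  for (const entry of fs.readdirSync(dir, { withFileTypes: true })) {
    const full = path.join(dir, entry.name);
    if (entry.isDirectory()) {
      total += dirSize(full);
    } else if (entry.isFile()) {
      total += fs.statSync(full).size;
    }
  }
  return total;
}

// Report whether the unpacked bundle fits under the limit.
function checkBundle(dir, limit = LIMIT_BYTES) {
  const bytes = dirSize(dir);
  return { bytes, limit, ok: bytes <= limit };
}
```

For example, `checkBundle("./my-function")` (a hypothetical directory name) returns an object whose `ok` field is `false` once the binaries and libraries from step 2 exceed the limit.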

gutitrombotto commented 5 years ago

@JayVem Can you tell me how you installed pdf2htmlEX on Amazon Linux? I can't install it. I'm on Ubuntu 18.