gettalong / hexapdf

Versatile PDF creation and manipulation for Ruby
https://hexapdf.gettalong.org

Do we have mechanism to read PDF S3 URL? #194

Closed userrails closed 1 year ago

userrails commented 1 year ago

Hi @gettalong,

I want to use the PDF merge feature to merge PDFs that are read from S3 URLs. Is there a mechanism for that?

I'm expecting something like this:

HexaPDF::Document.open(s3_url)

I've read this issue: https://github.com/gettalong/hexapdf/issues/136. I wanted to confirm whether such a feature has been added recently.

Thank you.

gettalong commented 1 year ago

Using PDFs stored in S3 is orthogonal to this library. HexaPDF has the general mechanism that you can provide an IO object to read a PDF from. So what you need to do is, get an IO object for the S3 URL and pass it to HexaPDF via HexaPDF::Document.new(io: s3url_io).

It doesn't matter to HexaPDF whether the PDF is stored as file, in a database, in S3, ... or just in a StringIO. All it cares about is that it gets an IO object.

gettalong commented 1 year ago

I don't know what an AWS::S3::Object is (I have never used S3). You either have to get an IO object or a binary string with the data and wrap that then in a StringIO.
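The two options mentioned here (an existing IO object, or a binary string wrapped in a StringIO) can be sketched as follows. This is only an illustration under stated assumptions: the helper name `to_pdf_io` is hypothetical and not part of HexaPDF, and the only HexaPDF call relied on is the `HexaPDF::Document.new(io: ...)` form given above.

```ruby
require 'stringio'

# Hypothetical helper (not part of HexaPDF): returns an object that can be
# passed as `io:` to HexaPDF::Document.new.
def to_pdf_io(source)
  # Anything that already behaves like a readable IO is passed through as is.
  return source if source.respond_to?(:read)

  # Otherwise treat the argument as raw PDF bytes. String#b returns a copy
  # tagged as binary (ASCII-8BIT), so the bytes are not interpreted as UTF-8.
  StringIO.new(source.to_str.b)
end

# Usage, assuming the hexapdf gem is installed and `pdf_bytes` holds a PDF:
#   require 'hexapdf'
#   doc = HexaPDF::Document.new(io: to_pdf_io(pdf_bytes))
```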

userrails commented 1 year ago

It's resolved now. Thank you.

gettalong commented 1 year ago

@userrails Great - could you write a comment describing how you solved it so that others may find the solution? Thank you!

userrails commented 1 year ago

@gettalong Hope this feedback will be helpful to everyone.

Fetch the PDF from the S3 URL as a binary string:

require 'net/http'
require 'stringio'
require 'hexapdf'

binary_string = Net::HTTP.get(URI.parse("https://example.s3.amazonaws.com/uploads/document.pdf?155463860"))

Wrap the string in a StringIO so that it can be used like a file object:

file = StringIO.new(binary_string)

Create a new PDF document by reading from the provided IO:

doc = HexaPDF::Document.new(io: file)

Write the document to the given file name:

doc.write('document.pdf', optimize: true)

userrails commented 1 year ago

@gettalong

I'm able to combine multiple PDFs using the following code.

@target = HexaPDF::Document.new

bills.each do |inv|
  path = inv.file.url
  t = Tempfile.new([inv.id, '.pdf'])
  t.write(open(path).read.force_encoding(Encoding::UTF_8))
  t.close

  pdf = HexaPDF::Document.open(t.path)
  pdf.pages.each { |page| @target.pages << @target.import(page) }
end

@target.write('combined.pdf', optimize: true)

Is it possible for the HexaPDF::Document.open() method to accept a StringIO object or binary string data, so that the PDF does not have to be saved to a temporary location as in the code above?

The bills object holds multiple PDFs that I want to merge into one PDF. I'm also curious whether HexaPDF offers other ways to merge multiple PDFs.

gettalong commented 1 year ago

> @gettalong Hope this feedback will be helpful to everyone.

Thanks, yes!

> Fetch the PDF from the S3 URL as a binary string:
>
> binary_string = Net::HTTP.get(URI.parse("https://example.s3.amazonaws.com/uploads/document.pdf?155463860"))

Ah, so you are just using the URI without going through the S3 client gem. I think there should be a way to do this with the client gem since it has to be possible to get the contents of some blob stored in there.
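For readers who do go through the client gem, a minimal sketch might look like the following. It assumes the aws-sdk-s3 gem, whose `Aws::S3::Client#get_object` returns a response with an IO-like `body`; the helper name `s3_pdf_io` and the bucket and key values are hypothetical and do not come from this thread. The helper is written against any object with a compatible `#get_object` method, so it needs nothing beyond the standard library on its own.

```ruby
require 'stringio'

# Hypothetical helper. `client` is assumed to be an Aws::S3::Client from the
# aws-sdk-s3 gem, or any object with a compatible #get_object method.
def s3_pdf_io(client, bucket:, key:)
  response = client.get_object(bucket: bucket, key: key)
  # Copy the body into a binary StringIO so it can be handed to HexaPDF.
  StringIO.new(response.body.read.b)
end

# Usage, assuming the aws-sdk-s3 and hexapdf gems plus valid credentials
# (bucket and key are placeholders):
#   require 'aws-sdk-s3'
#   require 'hexapdf'
#   io = s3_pdf_io(Aws::S3::Client.new, bucket: 'example', key: 'uploads/document.pdf')
#   doc = HexaPDF::Document.new(io: io)
```

Compared with fetching the public URL over HTTP, this route uses the configured AWS credentials, so it should also work for objects that are not publicly readable.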

gettalong commented 1 year ago

> @gettalong
>
> I'm able to combine multiple PDFs using the following code.
>
> @target = HexaPDF::Document.new
>
> bills.each do |inv|
>   path = inv.file.url
>   t = Tempfile.new([inv.id, '.pdf'])
>   t.write(open(path).read.force_encoding(Encoding::UTF_8))
>   t.close
>
>   pdf = HexaPDF::Document.open(t.path)
>   pdf.pages.each { |page| @target.pages << @target.import(page) }
> end
>
> @target.write('combined.pdf', optimize: true)
>
> Is it possible for the HexaPDF::Document.open() method to accept a StringIO object or binary string data, so that the PDF does not have to be saved to a temporary location as in the code above?

The HexaPDF::Document.open method only accepts a file name. If you already have an IO object, just use HexaPDF::Document.new(io: io_object).

What you should be able to do is the following:

@target = HexaPDF::Document.new

bills.each do |inv|
  open(inv.file_url) do |io|
    pdf = HexaPDF::Document.new(io: io)
    pdf.pages.each {|page| @target.pages << @target.import(page) }
  end
end

@target.write('combined.pdf', optimize: true)

So if you can use open to read inv.file_url, you should be able to pass the created IO object directly to HexaPDF.
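The page-copying loop can also be pulled into a small helper, sketched below. The name `append_pages` is hypothetical; it relies only on the `#pages` and `#import` calls already used in this thread. Two assumptions are worth flagging in the usage notes: on Ruby 3.0 and later, `Kernel#open` no longer opens URLs, so `URI.open` from open-uri is used instead, and the source IOs are kept open until after `write` as a precaution, in case HexaPDF reads some page data lazily from its source.

```ruby
# Appends every page of each source document to `target` and returns `target`.
# Both arguments are expected to behave like HexaPDF::Document objects,
# that is, to provide #pages and (for the target) #import.
def append_pages(target, sources)
  sources.each do |source|
    source.pages.each { |page| target.pages << target.import(page) }
  end
  target
end

# Usage, assuming hexapdf is installed and `urls` is a list of PDF URLs:
#   require 'open-uri'
#   require 'hexapdf'
#   ios = urls.map { |url| URI.open(url) }
#   target = append_pages(HexaPDF::Document.new,
#                         ios.map { |io| HexaPDF::Document.new(io: io) })
#   target.write('combined.pdf', optimize: true)
#   ios.each(&:close) # closed only after writing, as a precaution
```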

userrails commented 1 year ago

@gettalong

The combine_pdf gem has a mechanism to read the PDF from the object without downloading it: the to_pdf(options = {}) method, see https://github.com/boazsegev/combine_pdf/blob/master/lib/combine_pdf/pdf_public.rb. Does HexaPDF have a similar feature?

I want to allow users to download the PDF directly from the browser.

gettalong commented 1 year ago

I'm not really sure what you mean. My guess is you mean rendering to a string instead of a file? If so, then yes, this is possible, just supply a StringIO object when using HexaPDF::Document#write.

As for showing it in the browser without downloading: this has nothing to do with HexaPDF; you need to do this in the web framework.

userrails commented 1 year ago

In my previous comment, I meant that I want to allow users to download the PDF directly from the browser.

@gettalong write() writes to an IO object or a string, which is fine and works perfectly for me. But this way I need to write the PDF to an IO object or string first and then read it again using other tools. This is what works for me:

@target = HexaPDF::Document.new

bills.each do |inv|
  open(inv.file_url) do |io|
    pdf = HexaPDF::Document.new(io: io)
    pdf.pages.each {|page| @target.pages << @target.import(page) }
  end
end

If the @target object already holds all of the combined PDFs in some form, is it possible to read this object and convert it to a binary string with the PDF content, without first writing it to an IO object or string?

When I inspect it, I see the following object:

> @target
=> <HexaPDF::Document:2256675220>

gettalong commented 1 year ago

@target is a HexaPDF::Document which holds the internal representation of a PDF file. You are adding pages to it from other PDF files, so yes, it contains those pages.

To get the on-disk representation you need to invoke @target.write. If you want to have a binary string with the contents you need to do io = StringIO.new(''.b); @target.write(io); result = io.string.
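Wrapped into a method, the one-liner above might look like the sketch below. The helper name `pdf_to_string` is hypothetical, and the `send_data` call in the usage note assumes a Rails controller, which the thread does not actually confirm.

```ruby
require 'stringio'

# Serializes a document into a binary string without touching the disk.
# `doc` is expected to respond to #write(io), as HexaPDF::Document does.
def pdf_to_string(doc)
  io = StringIO.new(''.b) # binary buffer, as in the snippet above
  doc.write(io)
  io.string
end

# Usage in a Rails controller action (an assumption; the thread does not name
# the web framework):
#   send_data pdf_to_string(@target), filename: 'combined.pdf',
#             type: 'application/pdf', disposition: 'attachment'
```

The document is still written, only into memory instead of to a file, so no separate read step is needed afterwards.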

userrails commented 1 year ago

Okay, so the document write() step is mandatory. Thanks for your feedback.