dotemacs / pdfboxing

Nice wrapper of PDFBox in Clojure
BSD 3-Clause "New" or "Revised" License
180 stars 38 forks source link

COSStream has been closed and cannot be read. #57

Open mladvladimir opened 4 years ago

mladvladimir commented 4 years ago

Hi,

This code snippet:

(-> (split/split-pdf :input "test/pdfs/multi-page.pdf")
    (nth <some-idx>)
    text/extract)

often throws: Execution error (IOException) at org.apache.pdfbox.cos.COSStream/checkClosed (COSStream.java:83). COSStream has been closed and cannot be read. Perhaps its enclosing PDDocument has been closed?

Thanks for lib :)

dotemacs commented 4 years ago

Hi @mladvladimir

sorry about the delay with this.

The issue is a bit annoying due to the following problem:

The way split-pdf was implemented, when it opened a document, it wouldn't close it at the end when it was done with it. So with this commit, the document was closed:

https://github.com/dotemacs/pdfboxing/commit/2ed58248360788cbdd4a4b5dbc95eadaddd9c767#diff-4c66c23479d6f9f33283c8cf88c003d0

Which works as expected if you're only using split-pdf.

But if you intend on passing that opened & split PDF document to a next function, like in your example with the threading macro, then the issue comes up.

So I knew about this, but I'm not sure what would be the right approach to solve this.

This is my current thinking: when I originally implemented those functions, I just did them all in isolation. Then somebody created a PR which allowed you to pass the opened PDF stream from one function to the next.

So if I implement a solution which will allow for an open stream to be passed onto the next function, I'd probably have to introduce a breaking change. Something that tells a function, not to close the open PDF stream. Like this:

(-> (split/split-pdf :input "test/pdfs/multi-page.pdf" :stream true)
    (nth <some-idx>)
    text/extract)

Where the new change is the :stream true.

Since you're using this, what are your thoughts on this?

Do you have some other suggestion or approach that I should consider?

Thanks

chenj7 commented 2 years ago

Still not very good at clojure despite loving it, as well as your wrapper library, but was really needing to just save out the split PDFs and this issue was blocking me. Not sure if this is in line with clojure idioms, but I wound up exposing a :callback option where you can pass a function receiving the results so that you can continue to do stuff with it, e.g. saving the docs.

(defn split-pdf
  "Split pdf into pages"
  [& {:keys [input start end split callback]}]
  (with-open [doc (common/obtain-document input)]
    (let [splitter (Splitter.)]
      (when start (.setStartPage splitter start))
      (when end (.setEndPage splitter end))
      (when split (.setSplitAtPage splitter split))
      (let [result (into [] (.split splitter doc))]
        (if callback
          (callback result)
          result)))))

Then when I went to use:

(split-pdf :input "big_doc.pdf"
           :start 1
           :end 1
           :callback (fn [[result]]
                        (.save result (path-for folder "pg1.pdf"))))