documentcloud / docsplit

Break Apart Documents into Images, Text, Pages and PDFs
http://documentcloud.github.com/docsplit/

Break PDFs into chunks bigger than 1 page? #128

Open AbeHandler opened 9 years ago

AbeHandler commented 9 years ago

I just got a very large PDF. I want to break it into smaller PDFs, but not thousands and thousands of them. Would you be open to a pull request that adds this feature to the pages command? Something like:

$ docsplit pages big.pdf --pages 1-1000 --numoutput 1  # breaks the first 1000 pages into a single file

page_extractor.rb

# Burst a list of pdfs into single pages, as `pdfname_pagenumber.pdf`.
def extract(pdfs, opts)
  extract_options opts
  [pdfs].flatten.each do |pdf|
    pdf_name = File.basename(pdf, File.extname(pdf))
    page_path = ESCAPE[File.join(@output, "#{pdf_name}")] + "_%d.pdf"
    FileUtils.mkdir_p @output unless File.exist?(@output)

    cmd = if DEPENDENCIES[:pdftailor] # prefer pdftailor, but keep pdftk for backwards compatibility
      "pdftailor unstitch --output #{page_path} #{ESCAPE[pdf]} 2>&1"
    else
      "pdftk #{ESCAPE[pdf]} burst output #{page_path} 2>&1"
    end
    result = `#{cmd}`.chomp
    FileUtils.rm('doc_data.txt') if File.exist?('doc_data.txt')
    raise ExtractionFailed, result if $? != 0
    result
  end
end

AbeHandler commented 9 years ago

Seems like pdftk can do this: https://charlieharvey.org.uk/page/howto_breaking_pdfs_up_into_mutiple_pages
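
For reference, a minimal Ruby sketch of that pdftk approach, using pdftk's cat operation to copy a page range into one file. The helper name pdftk_range_command and the output file name are assumptions made for illustration; nothing like this exists in Docsplit today. The sketch only builds the command string, so it can be inspected before anything is run.

```ruby
require 'shellwords'

# Hypothetical helper (not part of Docsplit): build a pdftk command that
# copies pages first_page..last_page of `pdf` into a single output file.
# This sketch only accepts ascending ranges that start at page 1 or later.
def pdftk_range_command(pdf, first_page, last_page, output)
  raise ArgumentError, 'page numbers start at 1' if first_page < 1
  raise ArgumentError, 'last page is before first page' if last_page < first_page

  # shelljoin escapes each argument so unusual file names stay intact.
  ['pdftk', pdf, 'cat', "#{first_page}-#{last_page}", 'output', output].shelljoin
end

puts pdftk_range_command('big.pdf', 1, 1000, 'big_1-1000.pdf')
```

The resulting string could be run the same way extract runs its burst command, with backticks and a check of the exit status.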

knowtheory commented 9 years ago

Hey @AbeHandler,

Yep, you're right. Adding page ranges to Docsplit will also require adding them to PDFtailor, since PDFtailor just splits a PDF into all of its constituent pages. If you are interested in adding page ranges to PDFtailor, a pull request would be more than welcome!

Although I'm not so down with the --numoutput. My feeling is that if you want pages, use the pages subcommand; if you want PDFs, we should be talking about the pdf command.

I'm more comfortable with something like docsplit pdf source.pdf --pages 1-5 10-20 30-37, or maybe even docsplit pdf source.pdf --split 1-5 10-20 30-37
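
A rough sketch of how range arguments like those could be parsed and mapped to one output file per range. The names parse_page_ranges and range_output_names are hypothetical, the output naming is only a guess modelled on the existing pdfname_pagenumber.pdf pattern, and neither a pdf command with --pages nor --split exists in Docsplit as far as this thread shows.

```ruby
# Hypothetical parser (not an existing Docsplit option): turn range specs
# such as "1-5" or a bare "7" into Ruby Range objects.
def parse_page_ranges(specs)
  specs.map do |spec|
    match = /\A(\d+)(?:-(\d+))?\z/.match(spec)
    raise ArgumentError, "invalid page range: #{spec.inspect}" unless match

    first = match[1].to_i
    last  = (match[2] || match[1]).to_i # a bare page number is a one-page range
    if first < 1 || last < first
      raise ArgumentError, "invalid page range: #{spec.inspect}"
    end

    first..last
  end
end

# Assumed naming scheme: one output file per range, `pdfname_first-last.pdf`.
def range_output_names(pdf, ranges)
  pdf_name = File.basename(pdf, File.extname(pdf))
  ranges.map { |range| "#{pdf_name}_#{range.first}-#{range.last}.pdf" }
end

ranges = parse_page_ranges(%w[1-5 10-20 30-37])
p ranges
p range_output_names('source.pdf', ranges)
```

Rejecting malformed or reversed specs up front keeps a bad argument from reaching the underlying PDF tool.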

pickhardt commented 1 year ago

Hi, just checking if page ranges have been added? I want to be able to do Docsplit.extract_text(filepath, start_page: 20, end_page: 25) for example.