documentcloud / docsplit

Break Apart Documents into Images, Text, Pages and PDFs
http://documentcloud.github.io/docsplit/
Other
831 stars 214 forks source link

DocSplit fails to extract text on Windows when filenames have spaces #41

Open overview opened 12 years ago

overview commented 12 years ago

Docsplit invokes pdftotext to extract text, escaping spaces in the filename with \ to construct a command line. On Windows, \ does not escape spaces.

In my instrumented test, docsplit attempts to execute the following command:

pdftotext -enc UTF-8 test-docs/Ideology\ and\ Climate\ Change.pdf extracted-text/Ideology\ and\ Climate\ Change.txt 2>&1

This fails with the error message below. The following command works:

pdftotext -enc UTF-8 "test-docs/Ideology and Climate Change.pdf" "extracted-text/Ideology and Climate Change.txt" 2>&1

You will need poppler on Windows to reproduce, which is available here: http://www.compgeom.com/~piyush/scripts/scripts.html

Full error message follows:

C:/Ruby193/lib/ruby/gems/1.9.1/gems/docsplit-0.6.3/lib/docsplit/text_extractor.r b:99:in run': pdftotext version 0.16.6 (Docsplit::ExtractionFailed) Copyright 2005-2011 The Poppler Developers - http://poppler.freedesktop.org Copyright 1996-2004 Glyph & Cog, LLC Usage: pdftotext [options] <PDF-file> [<text-file>] -f <int> : first page to convert -l <int> : last page to convert -r <fp> : resolution, in DPI (default is 72) -x <int> : x-coordinate of the crop area top left corner -y <int> : y-coordinate of the crop area top left corner -W <int> : width of crop area in pixels (default is 0) -H <int> : height of crop area in pixels (default is 0) -layout : maintain original physical layout -raw : keep strings in content stream order -htmlmeta : generate a simple HTML file, including the meta informatio n -enc <string> : output text encoding name -listenc : list available encodings -eol <string> : output end-of-line convention (unix, dos, or mac) -nopgbrk : don't insert page breaks between pages -bbox : output bounding box for each word and page size to html. Sets -htmlmeta -opw <string> : owner password (for encrypted files) -upw <string> : user password (for encrypted files) -q : don't print any messages or errors -v : print copyright and version info -h : print usage information -help : print usage information --help : print usage information -? : print usage information from C:/Ruby193/lib/ruby/gems/1.9.1/gems/docsplit-0.6.3/lib/docsplit/tex t_extractor.rb:106:inextract_full' from C:/Ruby193/lib/ruby/gems/1.9.1/gems/docsplit-0.6.3/lib/docsplit/tex t_extractor.rb:54:in extract_from_pdf' from C:/Ruby193/lib/ruby/gems/1.9.1/gems/docsplit-0.6.3/lib/docsplit/tex t_extractor.rb:38:inblock in extract' from C:/Ruby193/lib/ruby/gems/1.9.1/gems/docsplit-0.6.3/lib/docsplit/tex t_extractor.rb:32:in each' from C:/Ruby193/lib/ruby/gems/1.9.1/gems/docsplit-0.6.3/lib/docsplit/tex t_extractor.rb:32:inextract' from C:/Ruby193/lib/ruby/gems/1.9.1/gems/docsplit-0.6.3/lib/docsplit.rb: 51:in extract_text' from overview-prototype/docloader/docloader.rb:75:inprocessFile' from overview-prototype/docloader/docloader.rb:150:in block in <main>' from overview-prototype/docloader/docloader.rb:50:incall' from overview-prototype/docloader/docloader.rb:50:in block in scanDir' from overview-prototype/docloader/docloader.rb:42:inforeach' from overview-prototype/docloader/docloader.rb:42:in scanDir' from overview-prototype/docloader/docloader.rb:150:in

'

Natim commented 12 years ago

This is a Windows CLI related error, you should always use " for a filename containing spaces.