deanmalmgren / textract

extract text from any document. no muss. no fuss.
http://textract.readthedocs.io
MIT License

Whitespace is parsed differently on Linux and macOS #388

Open shzy2012 opened 3 years ago

shzy2012 commented 3 years ago

Whitespace is rendered differently on Linux and macOS. textract's output for a "line break" or a "space" is clearly different between the two systems. On Linux, a "line break" is parsed as multiple \n\n and a "space" is parsed as \n\n. On macOS, a "line break" is parsed as \n\n and a "space" is parsed as \n.

linux

Python 3.8.4 (default, Jul 14 2020, 02:56:59) 
[GCC 8.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import textract
>>> path = 'textract.pdf'
>>> context = textract.process(path, encoding='utf-8', extension='.pdf')
>>> context.decode('utf-8')
'textract\n\nAs undesireable as it might be, more often than not there\n\nis extremely useful information embedded in Word\n\ndocuments, PowerPoint presentations, PDFs,\n\netc—so-called “dark data”—that would be valuable for\n\nfurther textual analysis and visualization. While several\n\npackages exist for extracting content from each of\n\nthese formats on their own, this package provides a\n\nsingle interface for extracting content from any type of\n\nfile, without any irrelevant markup.\n\n\x0c'

mac

Python 3.8.4 (v3.8.4:dfa645a65e, Jul 13 2020, 10:45:06) 
[Clang 6.0 (clang-600.0.57)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import textract
>>> path = 'textract.pdf'
>>> context = textract.process(path, encoding='utf-8', extension='.pdf')
>>> context.decode('utf-8')
'textract\nAs undesireable as it might be, more often than not there\nis extremely useful information embedded in Word\ndocuments, PowerPoint presentations, PDFs,\netc—so-called “dark data”—that would be valuable for\nfurther textual analysis and visualization. While several\npackages exist for extracting content from each of\nthese formats on their own, this package provides a\nsingle interface for extracting content from any type of\nfile, without any irrelevant markup.\n\n\x0c'
>>> 
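To make the difference between the two sessions concrete, one option is to count how long each run of consecutive newlines is. This is only a sketch: the helper name `newline_runs` is hypothetical, and the two sample strings are shortened copies of the outputs shown above, not the full text.

```python
import re
from collections import Counter


def newline_runs(text):
    """Map each run length of consecutive newline characters to its count."""
    return Counter(len(run) for run in re.findall(r"\n+", text))


# Shortened samples of the two outputs reported above (assumption: the
# rest of each string follows the same pattern as these first lines).
linux_out = (
    "textract\n\nAs undesireable as it might be, more often than not there"
    "\n\nis extremely useful\n\n\x0c"
)
mac_out = (
    "textract\nAs undesireable as it might be, more often than not there"
    "\nis extremely useful\n\n\x0c"
)

print(dict(newline_runs(linux_out)))  # {2: 3}
print(dict(newline_runs(mac_out)))    # {1: 2, 2: 1}
```

In these samples every break on Linux is a double newline, while on macOS only the final break before the form feed (\x0c) is doubled.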
deanmalmgren commented 3 years ago

This sounds like a problem with pdftotext rather than an issue with textract. Can you confirm that pdftotext has the same behavior on both systems?
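One way to run that check is to call pdftotext directly, bypassing textract, and print the raw output with repr() on each machine. The following is a minimal sketch, assuming pdftotext is installed and on PATH and that textract.pdf (the file from the report) is in the current directory. The helper names `build_pdftotext_cmd` and `raw_pdftotext` are hypothetical and not part of textract.

```python
import os
import shutil
import subprocess


def build_pdftotext_cmd(pdf_path):
    """Build a pdftotext command that writes UTF-8 text to stdout ('-')."""
    return ["pdftotext", "-enc", "UTF-8", pdf_path, "-"]


def raw_pdftotext(pdf_path):
    """Run pdftotext directly and return its decoded output."""
    result = subprocess.run(
        build_pdftotext_cmd(pdf_path), capture_output=True, check=True
    )
    return result.stdout.decode("utf-8")


if __name__ == "__main__":
    pdf = "textract.pdf"  # file name taken from the report above
    if shutil.which("pdftotext") is None or not os.path.exists(pdf):
        print("need pdftotext on PATH and", pdf, "in the current directory")
    else:
        # pdftotext -v prints its version banner to stderr
        version = subprocess.run(["pdftotext", "-v"], capture_output=True)
        print(version.stderr.decode("utf-8", errors="replace").strip())
        # repr() makes every \n and \x0c visible for side-by-side comparison
        print(repr(raw_pdftotext(pdf)))
```

If the repr() output already differs between the two systems, the difference comes from pdftotext (for example, different versions or builds) rather than from textract. Printing the version banner on both machines helps confirm that.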