Whitespace is extracted differently on Linux and macOS
textract's output for a "line break" or a "space" in the same PDF differs noticeably between Linux and macOS.
On Linux, a "line break" is extracted as multiple \n\n sequences, and a "space" is extracted as \n\n.
On macOS, a "line break" is extracted as \n\n, and a "space" is extracted as \n.
Linux
Python 3.8.4 (default, Jul 14 2020, 02:56:59)
[GCC 8.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import textract
>>> path = 'textract.pdf'
>>> context = textract.process(path, encoding='utf-8', extension='.pdf')
>>> context.decode('utf-8')
'textract\n\nAs undesireable as it might be, more often than not there\n\nis extremely useful information embedded in Word\n\ndocuments, PowerPoint presentations, PDFs,\n\netc—so-called “dark data”—that would be valuable for\n\nfurther textual analysis and visualization. While several\n\npackages exist for extracting content from each of\n\nthese formats on their own, this package provides a\n\nsingle interface for extracting content from any type of\n\nfile, without any irrelevant markup.\n\n\x0c'
macOS
Python 3.8.4 (v3.8.4:dfa645a65e, Jul 13 2020, 10:45:06)
[Clang 6.0 (clang-600.0.57)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import textract
>>> path = 'textract.pdf'
>>> context = textract.process(path, encoding='utf-8', extension='.pdf')
>>> context.decode('utf-8')
'textract\nAs undesireable as it might be, more often than not there\nis extremely useful information embedded in Word\ndocuments, PowerPoint presentations, PDFs,\netc—so-called “dark data”—that would be valuable for\nfurther textual analysis and visualization. While several\npackages exist for extracting content from each of\nthese formats on their own, this package provides a\nsingle interface for extracting content from any type of\nfile, without any irrelevant markup.\n\n\x0c'
>>>
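Until the cause is understood, a possible workaround is to normalize the decoded text before comparing or processing it. The sketch below is only an illustration: `normalize_whitespace` is a hypothetical helper, not part of textract, and it assumes that collapsing every run of newlines into a single \n is acceptable (this also discards any real paragraph breaks). The sample strings are shortened from the two outputs above.

```python
import re


def normalize_whitespace(text: str) -> str:
    """Hypothetical helper (not part of textract): collapse newline runs."""
    # Remove the form feed (\x0c) that appears at the end of both outputs.
    text = text.replace("\x0c", "")
    # Replace each newline plus any whitespace that follows it with one newline.
    text = re.sub(r"\n\s*", "\n", text)
    # Drop leading and trailing whitespace left over at the edges.
    return text.strip()


# Shortened versions of the Linux and macOS outputs shown above.
linux_sample = "textract\n\nAs undesireable as it might be\n\nfile, without any irrelevant markup.\n\n\x0c"
mac_sample = "textract\nAs undesireable as it might be\nfile, without any irrelevant markup.\n\n\x0c"

linux_clean = normalize_whitespace(linux_sample)
mac_clean = normalize_whitespace(mac_sample)
print(linux_clean == mac_clean)
```

With real data, the same helper would be applied to `context.decode('utf-8')` on each platform. It hides the difference rather than explaining it, so the underlying question still stands.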