deanmalmgren / textract

extract text from any document. no muss. no fuss.
http://textract.readthedocs.io
MIT License

command line interface is broken on windows #313

Open KamarajuKusumanchi opened 4 years ago

KamarajuKusumanchi commented 4 years ago

Describe the bug

The command line interface of textract is broken on Windows. Even a simple command like "textract -h" raises an exception.

$ textract -h
Traceback (most recent call last):
  File "C:/ProgramData/Continuum/Anaconda/envs/test_textract/Scripts/textract", line 33, in <module>
    main()
  File "C:/ProgramData/Continuum/Anaconda/envs/test_textract/Scripts/textract", line 22, in main
    parser = get_parser()
  File "C:\ProgramData\Continuum\Anaconda\envs\test_textract\lib\site-packages\textract\cli.py", line 67, in get_parser
    choices=_get_available_extensions(),
  File "C:\ProgramData\Continuum\Anaconda\envs\test_textract\lib\site-packages\textract\parsers\__init__.py", line 89, in _get_available_extensions
    ext_re = re.compile(glob_filename.replace('*', r"(?P<ext>\w+)"))
  File "C:\ProgramData\Continuum\Anaconda\envs\test_textract\lib\re.py", line 234, in compile
    return _compile(pattern, flags)
  File "C:\ProgramData\Continuum\Anaconda\envs\test_textract\lib\re.py", line 286, in _compile
    p = sre_compile.compile(pattern, flags)
  File "C:\ProgramData\Continuum\Anaconda\envs\test_textract\lib\sre_compile.py", line 764, in compile
    p = sre_parse.parse(p, flags)
  File "C:\ProgramData\Continuum\Anaconda\envs\test_textract\lib\sre_parse.py", line 924, in parse
    p = _parse_sub(source, pattern, flags & SRE_FLAG_VERBOSE, 0)
  File "C:\ProgramData\Continuum\Anaconda\envs\test_textract\lib\sre_parse.py", line 420, in _parse_sub
    not nested and not items))
  File "C:\ProgramData\Continuum\Anaconda\envs\test_textract\lib\sre_parse.py", line 501, in _parse
    code = _escape(source, this, state)
  File "C:\ProgramData\Continuum\Anaconda\envs\test_textract\lib\sre_parse.py", line 402, in _escape
    raise source.error("bad escape %s" % escape, len(escape))
re.error: bad escape \P at position 2

To Reproduce

  1. Install textract in a test environment and activate it.
$ cat /h/work/myrepos/rutils/python3/envs/env_test_textract.yml
name: test_textract
channels:
  - defaults
dependencies:
  - python=3.7
  - pip
  - pip:
    - textract

$ conda env create -f env_test_textract.yml

$ source activate test_textract
  2. Run textract
    $ textract -h
    (traceback identical to the one shown above, ending with)
    re.error: bad escape \P at position 2

Expected behavior

The command should print the help page for textract and should not raise an exception.

Additional context

As the exception indicates, the problem lies in _get_available_extensions() at https://github.com/deanmalmgren/textract/blob/master/textract/parsers/__init__.py#L89

The relevant code is

    parsers_dir = os.path.join(os.path.dirname(__file__))
    glob_filename = os.path.join(parsers_dir, "*" + _FILENAME_SUFFIX + ".py")
    ext_re = re.compile(glob_filename.replace('*', r"(?P<ext>\w+)"))

The __file__ and glob_filename are evaluated as

C:\ProgramData\Continuum\Anaconda\envs\test_textract\lib\site-packages\textract\parsers\__init__.py
C:\ProgramData\Continuum\Anaconda\envs\test_textract\lib\site-packages\textract\parsers\*_parser.py

I am able to reproduce the exception using these values as follows:

$ python
Python 3.7.5 (default, Oct 31 2019, 15:18:51) [MSC v.1916 64 bit (AMD64)] :: Anaconda, Inc. on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> import re
>>> glob_filename = 'C:\ProgramData\Continuum\Anaconda\envs\test_textract\lib\site-packages\textract\parsers\*_parser.py'
>>> ext_re = re.compile(glob_filename.replace('*', r"(?P<ext>\w+)"))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\ProgramData\Continuum\Anaconda\envs\test_textract\lib\re.py", line 234, in compile
    return _compile(pattern, flags)
  File "C:\ProgramData\Continuum\Anaconda\envs\test_textract\lib\re.py", line 286, in _compile
    p = sre_compile.compile(pattern, flags)
  File "C:\ProgramData\Continuum\Anaconda\envs\test_textract\lib\sre_compile.py", line 764, in compile
    p = sre_parse.parse(p, flags)
  File "C:\ProgramData\Continuum\Anaconda\envs\test_textract\lib\sre_parse.py", line 924, in parse
    p = _parse_sub(source, pattern, flags & SRE_FLAG_VERBOSE, 0)
  File "C:\ProgramData\Continuum\Anaconda\envs\test_textract\lib\sre_parse.py", line 420, in _parse_sub
    not nested and not items))
  File "C:\ProgramData\Continuum\Anaconda\envs\test_textract\lib\sre_parse.py", line 501, in _parse
    code = _escape(source, this, state)
  File "C:\ProgramData\Continuum\Anaconda\envs\test_textract\lib\sre_parse.py", line 402, in _escape
    raise source.error("bad escape %s" % escape, len(escape))
re.error: bad escape \P at position 2

I think it is expecting forward slashes instead of backward slashes in glob_filename. Something like 'C:/ProgramData/Continuum/Anaconda/envs/test_textract/lib/site-packages/textract/parsers/*_parser.py'
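
One way to sidestep the separator question, sketched here under assumptions rather than taken from textract's actual fix, is to pass the literal parts of the path through re.escape so that backslashes are matched as ordinary characters. The function name build_ext_re is hypothetical, the value of _FILENAME_SUFFIX is inferred from the glob_filename shown above, pdf_parser.py is just an example file name, and ntpath is used only so the Windows-style join behaves the same on any platform.

```python
import ntpath
import re

# Assumed value, inferred from the "*_parser.py" glob in the report.
_FILENAME_SUFFIX = "_parser"


def build_ext_re(parsers_dir):
    """Return a regex that captures the extension from a parser file path.

    Hypothetical helper: only the capture group is regex syntax; the
    directory and suffix are escaped so they match literally.
    """
    # Joining with "" appends a trailing path separator to the directory.
    prefix = re.escape(ntpath.join(parsers_dir, ""))
    suffix = re.escape(_FILENAME_SUFFIX + ".py")
    return re.compile(prefix + r"(?P<ext>\w+)" + suffix)


# Directory from the report, written as a raw string.
parsers_dir = (r"C:\ProgramData\Continuum\Anaconda\envs\test_textract"
               r"\lib\site-packages\textract\parsers")
ext_re = build_ext_re(parsers_dir)
match = ext_re.match(parsers_dir + "\\pdf_parser.py")
print(match.group("ext"))
```

Escaping the path keeps the native separators intact, so the compiled pattern can still be matched against whatever file names the operating system returns.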

jpweytjens commented 4 years ago

Thank you for the extensive bug report.

On Windows, the paths built by os.path.join contain backslashes; print(repr(__file__)) shows them in escaped (doubled) form. The problem seems to come from the following changes in the re module. Copying from the official documentation:

Changed in Python version 3.6: Unknown escapes in pattern consisting of '\' and an ASCII letter now are errors.
Changed in Python version 3.7: Unknown escapes in repl consisting of '\' and an ASCII letter now are errors.

Matching one literal backslash therefore requires escaping it twice: once for the Python string literal, which doubles the backslashes, and once more for the re module, for a total of 4 backslashes per backslash in a pattern written as an ordinary (non-raw) string literal. I'll post a fix for the Git version of textract. A larger upcoming update of textract will include this in a more complete way.
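
The backslash counting can be checked with a small, self-contained sketch. It is only an illustration of the escaping rules, not the fix that went into textract, and the variable names are made up for the example.

```python
import re

# One literal backslash, written with a string-literal escape.
path = "C:\\ProgramData"
assert len(path) == 14

# Ordinary literal: four backslashes in source become two in the pattern,
# which the re module reads as one literal backslash.
four = re.compile("C:\\\\ProgramData")

# Raw literal: the string layer is skipped, so two backslashes suffice.
two = re.compile(r"C:\\ProgramData")

# re.escape adds the regex-level escaping to a string built at runtime.
auto = re.compile(re.escape(path))

results = [bool(p.fullmatch(path)) for p in (four, two, auto)]
print(results)
```

For a path that is computed at runtime, as in _get_available_extensions(), only the regex-level escaping applies, which is what re.escape provides.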

KamarajuKusumanchi commented 4 years ago

Thanks for the reply, Johannes Weytjens!

Are the 3.6 and 3.7 you mentioned above Python versions? I am getting a different exception with Python 3.5.6:

$ textract -h
Traceback (most recent call last):
  File "C:/ProgramData/Continuum/Anaconda/envs/test_textract/Scripts/textract", line 33, in <module>
    main()
  File "C:/ProgramData/Continuum/Anaconda/envs/test_textract/Scripts/textract", line 22, in main
    parser = get_parser()
  File "C:\ProgramData\Continuum\Anaconda\envs\test_textract\lib\site-packages\textract\cli.py", line 67, in get_parser
    choices=_get_available_extensions(),
  File "C:\ProgramData\Continuum\Anaconda\envs\test_textract\lib\site-packages\textract\parsers\__init__.py", line 89, in _get_available_extensions
    ext_re = re.compile(glob_filename.replace('*', r"(?P<ext>\w+)"))
  File "C:\ProgramData\Continuum\Anaconda\envs\test_textract\lib\re.py", line 224, in compile
    return _compile(pattern, flags)
  File "C:\ProgramData\Continuum\Anaconda\envs\test_textract\lib\re.py", line 293, in _compile
    p = sre_compile.compile(pattern, flags)
  File "C:\ProgramData\Continuum\Anaconda\envs\test_textract\lib\sre_compile.py", line 536, in compile
    p = sre_parse.parse(p, flags)
  File "C:\ProgramData\Continuum\Anaconda\envs\test_textract\lib\sre_parse.py", line 834, in parse
    raise source.error("unbalanced parenthesis")
sre_constants.error: unbalanced parenthesis at position 99
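
The reported position appears to line up with the closing parenthesis of the named group, which suggests (this is an inference, not something confirmed in the thread) that the backslash in front of the substituted group turns its opening parenthesis into a literal character, leaving the closing one unmatched. The sketch below checks both reported positions against the path from the report and reproduces the second message with a shortened pattern that contains no unknown escapes; the variable names are illustrative.

```python
import re

# Path from the report as a raw string, so each backslash is one character.
glob_filename = (r"C:\ProgramData\Continuum\Anaconda\envs\test_textract"
                 r"\lib\site-packages\textract\parsers\*_parser.py")
pattern = glob_filename.replace("*", r"(?P<ext>\w+)")

# Offsets of the first backslash-letter pair and of the closing parenthesis.
bad_escape_pos = pattern.index("\\P")
close_paren_pos = pattern.index(")")
print(bad_escape_pos, close_paren_pos)

# Same shape, shortened: the separator escapes the opening parenthesis.
short = r"parsers\(?P<ext>\w+)_parser.py"
try:
    re.compile(short)
    error = None
except re.error as exc:
    error = exc
    print(exc.msg, exc.pos)
```

The two offsets can be compared with "position 2" in the Python 3.7 traceback and "position 99" in the Python 3.5 one.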

My conda environment file to reproduce the exception

$ cat env_test_textract.yml
name: test_textract
channels:
  - defaults
dependencies:
  - python=3.5
  - pip
  - pip:
    - textract
$ python --version
Python 3.5.6 :: Anaconda, Inc.

jpweytjens commented 4 years ago

Yes, 3.6 and 3.7 are Python version numbers.

I can't immediately reproduce this issue with Python 3.5.4 on Windows 10. Could you try again with the Git version of textract? It includes a fix for the issues in Python 3.6 and above. You can install it with the following command:

pip install git+https://github.com/deanmalmgren/textract

KamarajuKusumanchi commented 4 years ago

The latest GitHub version gives an ImportError.

$ textract -h
Traceback (most recent call last):
  File "C:/ProgramData/Continuum/Anaconda/envs/test_textract/Scripts/textract", line 11, in <module>
    from textract.cli import get_parser
  File "C:\ProgramData\Continuum\Anaconda\envs\test_textract\lib\site-packages\textract\cli.py", line 17, in <module>
    from .parsers import DEFAULT_ENCODING, _get_available_extensions
ImportError: cannot import name 'DEFAULT_ENCODING' from 'textract.parsers' (C:\ProgramData\Continuum\Anaconda\envs\test_textract\lib\site-packages\textract\parsers\__init__.py)

DEFAULT_ENCODING was defined in 1.6.1 (https://github.com/deanmalmgren/textract/blob/v1.6.1/textract/parsers/__init__.py#L25). I think it was renamed to DEFAULT_OUTPUT_ENCODING in the latest version (https://github.com/deanmalmgren/textract/blob/master/textract/parsers/__init__.py#L25), but not all the old references were cleaned up.

But even after changing all those DEFAULT_ENCODING occurrences in cli.py to DEFAULT_OUTPUT_ENCODING, I still get the same exception when I run 'textract -h'.
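
As a rough illustration of tolerating such a rename, assuming the constant really was renamed as guessed above, a caller could look the value up under either name. Everything in this sketch is hypothetical: resolve_default_encoding is not a textract function, the SimpleNamespace objects merely stand in for textract.parsers, and "utf_8" is a placeholder value.

```python
import types


def resolve_default_encoding(module):
    """Return the default encoding constant under its new or old name."""
    for name in ("DEFAULT_OUTPUT_ENCODING", "DEFAULT_ENCODING"):
        if hasattr(module, name):
            return getattr(module, name)
    raise AttributeError("no default encoding constant found")


# Stand-ins for the parsers module before and after the assumed rename.
old_style = types.SimpleNamespace(DEFAULT_ENCODING="utf_8")
new_style = types.SimpleNamespace(DEFAULT_OUTPUT_ENCODING="utf_8")

same = resolve_default_encoding(old_style) == resolve_default_encoding(new_style)
print(same)
```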

jpweytjens commented 4 years ago

Thank you for pointing out that I missed changing DEFAULT_ENCODING everywhere. Nevertheless, even after fixing this, I can't reproduce the issue you encountered. I will look into the problem more this weekend.