Open KamarajuKusumanchi opened 4 years ago
Thank you for the extensive bug report. os.path.join escapes backward slashes, but printing these paths doesn't show them. You can look at print(repr(__file__)) to verify that they're there. The problem seems to be coming from the following changes in the re module. Copying from the official documentation:
Changed in Python version 3.6: Unknown escapes in pattern consisting of '\' and an ASCII letter now are errors.
Changed in Python version 3.7: Unknown escapes in repl consisting of '\' and an ASCII letter now are errors.
This requires escaping the backslashes twice: once to handle backslashes in the string literal, which doubles their number, and a second time so as not to confuse the re module, for a final count of 4 backslashes for each backslash needed in a regex pattern. I'll post a fix for the Git version of textract. A larger upcoming update of textract will include this in a more complete way.
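The backslash counting described above can be checked with a small sketch. The path and variable names here are made up for illustration and are not taken from textract; re.escape is shown only as one way to avoid counting by hand.

```python
import re

# One real backslash, as it would appear in a Windows path.
path = "C:\\temp"            # 7 characters: C : \ t e m p

# Non-raw literal: four backslashes in the source become two characters
# in the string, which the re module then reads as one literal backslash.
pattern_plain = "C:\\\\temp"

# Raw literal: the string layer is skipped, so two backslashes suffice.
pattern_raw = r"C:\\temp"

# re.escape builds an equivalent pattern without manual counting.
pattern_escaped = re.escape(path)

for pattern in (pattern_plain, pattern_raw, pattern_escaped):
    print(repr(pattern), bool(re.fullmatch(pattern, path)))
```

All three patterns should match the single-backslash path; the raw-string form is usually the easiest to read.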
Thanks for the reply, Johannes Weytjens!
Are the 3.6 and 3.7 you mentioned above Python versions? I am getting a different exception with Python 3.5.6:
$ textract -h
Traceback (most recent call last):
  File "C:/ProgramData/Continuum/Anaconda/envs/test_textract/Scripts/textract", line 33, in <module>
    main()
  File "C:/ProgramData/Continuum/Anaconda/envs/test_textract/Scripts/textract", line 22, in main
    parser = get_parser()
  File "C:\ProgramData\Continuum\Anaconda\envs\test_textract\lib\site-packages\textract\cli.py", line 67, in get_parser
    choices=_get_available_extensions(),
  File "C:\ProgramData\Continuum\Anaconda\envs\test_textract\lib\site-packages\textract\parsers\__init__.py", line 89, in _get_available_extensions
    ext_re = re.compile(glob_filename.replace('*', r"(?P<ext>\w+)"))
  File "C:\ProgramData\Continuum\Anaconda\envs\test_textract\lib\re.py", line 224, in compile
    return _compile(pattern, flags)
  File "C:\ProgramData\Continuum\Anaconda\envs\test_textract\lib\re.py", line 293, in _compile
    p = sre_compile.compile(pattern, flags)
  File "C:\ProgramData\Continuum\Anaconda\envs\test_textract\lib\sre_compile.py", line 536, in compile
    p = sre_parse.parse(p, flags)
  File "C:\ProgramData\Continuum\Anaconda\envs\test_textract\lib\sre_parse.py", line 834, in parse
    raise source.error("unbalanced parenthesis")
sre_constants.error: unbalanced parenthesis at position 99
My conda environment file to reproduce the exception
$ cat env_test_textract.yml
name: test_textract
channels:
  - defaults
dependencies:
  - python=3.5
  - pip
  - pip:
    - textract
$ python --version
Python 3.5.6 :: Anaconda, Inc.
Yes, 3.6 and 3.7 are Python version numbers.
I can't immediately reproduce this issue with Python 3.5.4 on Windows 10. Could you try again with the Git version of textract? It includes a fix for the issues in Python 3.6 and above. You can install it with the following command.
pip install git+https://github.com/deanmalmgren/textract
The latest GitHub version gives an ImportError.
$ textract -h
Traceback (most recent call last):
  File "C:/ProgramData/Continuum/Anaconda/envs/test_textract/Scripts/textract", line 11, in <module>
    from textract.cli import get_parser
  File "C:\ProgramData\Continuum\Anaconda\envs\test_textract\lib\site-packages\textract\cli.py", line 17, in <module>
    from .parsers import DEFAULT_ENCODING, _get_available_extensions
ImportError: cannot import name 'DEFAULT_ENCODING' from 'textract.parsers' (C:\ProgramData\Continuum\Anaconda\envs\test_textract\lib\site-packages\textract\parsers\__init__.py)
The DEFAULT_ENCODING was defined in 1.6.1 https://github.com/deanmalmgren/textract/blob/v1.6.1/textract/parsers/__init__.py#L25 . I think it was renamed to DEFAULT_OUTPUT_ENCODING in the latest version https://github.com/deanmalmgren/textract/blob/master/textract/parsers/__init__.py#L25 but not all the old references were cleaned up.
But even after changing all those DEFAULT_ENCODING occurrences in cli.py to DEFAULT_OUTPUT_ENCODING, I still get the same exception when I run 'textract -h'.
Thank you for pointing out that I missed changing DEFAULT_ENCODING everywhere. However, even after fixing this I can't reproduce the issue you encountered. I will look into the problem further this weekend.
Describe the bug
The command-line interface of textract is broken on Windows. Even a simple command like "textract -h" raises an exception.
To Reproduce
Expected behavior
The command should print the help page for textract and should not raise an exception.
Desktop (please complete the following information):
Additional context
As the exception indicates, the problem lies in _get_available_extensions() of https://github.com/deanmalmgren/textract/blob/master/textract/parsers/__init__.py#L89
The relevant code is
The __file__ and glob_filename are evaluated as
I am able to reproduce the exception using these values as follows:
I think it is expecting forward slashes instead of backward slashes in glob_filename. Something like 'C:/ProgramData/Continuum/Anaconda/envs/test_textract/lib/site-packages/textract/parsers/*_parser.py'
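A minimal sketch of the failure and one possible workaround follows. The shortened path and the helper names build_naive and build_escaped are hypothetical; only the replace('*', ...) substitution is taken from the traceback. In the Python 3.5.6 traceback, position 99 is the closing parenthesis of the substituted group, which is consistent with the path separator just before it turning "\(" into an escaped literal parenthesis. On Python 3.7 and later the same pattern is expected to fail earlier, on an unknown escape. Escaping the literal parts with re.escape is an assumption about how this could be fixed, not necessarily what textract does.

```python
import re

# Hypothetical Windows-style value, modelled on the paths in the traceback.
glob_filename = r"C:\envs\test_textract\lib\site-packages\textract\parsers\*_parser.py"


def build_naive(glob_filename):
    # Same substitution as in the traceback: the path is used as a regex verbatim.
    return re.compile(glob_filename.replace('*', r"(?P<ext>\w+)"))


def build_escaped(glob_filename):
    # Escape the literal text on either side of the wildcard, then insert the group.
    head, _, tail = glob_filename.partition('*')
    return re.compile(re.escape(head) + r"(?P<ext>\w+)" + re.escape(tail))


# With backslashes, the naive pattern does not compile.
try:
    build_naive(glob_filename)
    naive_error = None
except re.error as exc:
    naive_error = exc
print("naive pattern error:", naive_error)

# With the literal parts escaped, the extension can be extracted.
filename = r"C:\envs\test_textract\lib\site-packages\textract\parsers\pdf_parser.py"
match = build_escaped(glob_filename).match(filename)
print("escaped pattern ext:", match.group("ext"))

# With forward slashes, the naive substitution compiles and matches as well.
forward_re = build_naive(glob_filename.replace("\\", "/"))
forward_match = forward_re.match(filename.replace("\\", "/"))
print("forward-slash ext:", forward_match.group("ext"))
```

Escaping has the side benefit that the "." in ".py" is matched literally rather than as a wildcard.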