deanmalmgren / textract

extract text from any document. no muss. no fuss.
http://textract.readthedocs.io
MIT License

error: unbalanced parenthesis #168

Open hongtaicao opened 7 years ago

hongtaicao commented 7 years ago

I passed a file with an unsupported format to textract using the following call: textract.process('./test.pyc')

and I got the following error:

Exception raised:
    Traceback (most recent call last):
      File "C:\decodertextract.py", line 70, in __decoder_textract
        print textract.process(pathname)
      File "C:\Program Files\Python27\lib\site-packages\textract\parsers\__init__.py", line 72, in process
        raise exceptions.ExtensionNotSupported(ext)
      File "C:\Program Files\Python27\lib\site-packages\textract\exceptions.py", line 21, in __init__
        for e in _get_available_extensions():
      File "C:\Program Files\Python27\lib\site-packages\textract\parsers\__init__.py", line 89, in _get_available_extensions
        ext_re = re.compile(glob_filename.replace('*', "(?P<ext>\w+)"))
      File "C:\Program Files\Python27\lib\re.py", line 194, in compile
        return _compile(pattern, flags)
      File "C:\Program Files\Python27\lib\re.py", line 251, in _compile
        raise error, v # invalid expression
    error: unbalanced parenthesis

Python 2.7.11 (v2.7.11:6d1b6a68f775, Dec 5 2015, 20:40:30) [MSC v.1500 64 bit (AMD64)] on win32

VERSION = '1.6.1'

I expected textract to raise its ExtensionNotSupported exception, but it crashed inside the re module while constructing that exception. Could you please look into it?

deanmalmgren commented 7 years ago

I am unable to reproduce this issue:

# TESTING ON COMMAND LINE INTERFACE
[bash]$ touch test.pyc
[bash]$ textract test.pyc
The filename extension .pyc is not yet supported by
textract. Please suggest this filename extension here:

    https://github.com/deanmalmgren/textract/issues

Available extensions include: .csv, .doc, .docx, .eml, .epub, .gif, .htm, .html, .jpeg, .jpg, .json, .log, .mp3, .msg, .odt, .ogg, .pdf, .png, .pptx, .ps, .psv, .rtf, .tff, .tif, .tiff, .tsv, .txt, .wav, .xls, .xlsx
[bash]$ textract ./test.pyc
The filename extension .pyc is not yet supported by
textract. Please suggest this filename extension here:

    https://github.com/deanmalmgren/textract/issues

Available extensions include: .csv, .doc, .docx, .eml, .epub, .gif, .htm, .html, .jpeg, .jpg, .json, .log, .mp3, .msg, .odt, .ogg, .pdf, .png, .pptx, .ps, .psv, .rtf, .tff, .tif, .tiff, .tsv, .txt, .wav, .xls, .xlsx

# TESTING WITH PYTHON SCRIPT
[bash]$ echo "import textract" > blah.py
[bash]$ echo "textract.process('./test.pyc')" >> blah.py
[bash]$ python blah.py
The filename extension .pyc is not yet supported by
textract. Please suggest this filename extension here:

    https://github.com/deanmalmgren/textract/issues

Available extensions include: .csv, .doc, .docx, .eml, .epub, .gif, .htm, .html, .jpeg, .jpg, .json, .log, .mp3, .msg, .odt, .ogg, .pdf, .png, .pptx, .ps, .psv, .rtf, .tff, .tif, .tiff, .tsv, .txt, .wav, .xls, .xlsx

Does the actual filename in question have parentheses in it or something?

hongtaicao commented 7 years ago

Hi Dean, thank you so much for investigating. I tried what you did, i.e. processing an empty .pyc file, and I still got the same error. I suspect this issue only happens on Windows. I will investigate further and provide more details later.

AntonLocal commented 6 years ago

Hi! The re library raises this error on Windows because Windows paths use the backslash ("\") as a separator. The following workaround works for me (file: parsers\__init__.py):

    # from filenames
    parsers_dir = os.path.join(os.path.dirname(__file__))
    glob_filename = os.path.join(parsers_dir, "*" + _FILENAME_SUFFIX + ".py")
    # WORKAROUND: normalize Windows backslashes before building the regex
    glob_filename = glob_filename.replace("\\", "/")
    ext_re = re.compile(glob_filename.replace('*', r"(?P<ext>\w+)"))
    for filename in glob.glob(glob_filename):
        # WORKAROUND: normalize the globbed path the same way so it still matches
        filename = filename.replace("\\", "/")
        ext_match = ext_re.match(filename)
        ext = ext_match.groups()[0]
        extensions.append(ext)
        extensions.append('.' + ext)
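
For anyone wondering why the message is specifically "unbalanced parenthesis": on Windows the path separator sits right before the "*", so after the replacement the pattern contains `\(`. The regex engine reads that as a literal parenthesis, which leaves the closing ")" unmatched. Below is a minimal, standalone sketch of the failure and of an alternative to replacing backslashes, namely escaping the literal parts of the path with re.escape. The helper names (naive_compiles, build_ext_regex) and the example paths are hypothetical and not part of textract; this is only an illustration, not the project's actual fix.

```python
import re


def naive_compiles(glob_pattern):
    """Return True if the glob-to-regex replacement used above compiles."""
    try:
        re.compile(glob_pattern.replace("*", r"(?P<ext>\w+)"))
        return True
    except re.error:
        return False


def build_ext_regex(glob_pattern):
    """Hypothetical helper: escape the literal text around a single '*'
    so that path separators are never interpreted as regex syntax."""
    prefix, suffix = glob_pattern.split("*", 1)
    return re.compile(re.escape(prefix) + r"(?P<ext>\w+)" + re.escape(suffix))


# A backslash directly before '*' turns "(" into an escaped literal,
# so the pattern fails to compile; a forward slash does not.
print(naive_compiles("parsers\\*_parser.py"))  # False
print(naive_compiles("parsers/*_parser.py"))   # True

# With the literal parts escaped, a Windows-style path works unchanged.
# The path below is made up for illustration.
ext_re = build_ext_regex("C:\\site-packages\\textract\\parsers\\*_parser.py")
match = ext_re.match("C:\\site-packages\\textract\\parsers\\pdf_parser.py")
print(match.group("ext"))  # pdf
```

Compared with the backslash replacement, this approach leaves the filenames returned by glob untouched, at the cost of assuming the glob contains exactly one "*".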