Open hongtaicao opened 7 years ago
I am unable to reproduce this issue:
# TESTING ON COMMAND LINE INTERFACE
[bash]$ touch test.pyc
[bash]$ textract test.pyc
The filename extension .pyc is not yet supported by
textract. Please suggest this filename extension here:
https://github.com/deanmalmgren/textract/issues
Available extensions include: .csv, .doc, .docx, .eml, .epub, .gif, .htm, .html, .jpeg, .jpg, .json, .log, .mp3, .msg, .odt, .ogg, .pdf, .png, .pptx, .ps, .psv, .rtf, .tff, .tif, .tiff, .tsv, .txt, .wav, .xls, .xlsx
[bash]$ textract ./test.pyc
The filename extension .pyc is not yet supported by
textract. Please suggest this filename extension here:
https://github.com/deanmalmgren/textract/issues
Available extensions include: .csv, .doc, .docx, .eml, .epub, .gif, .htm, .html, .jpeg, .jpg, .json, .log, .mp3, .msg, .odt, .ogg, .pdf, .png, .pptx, .ps, .psv, .rtf, .tff, .tif, .tiff, .tsv, .txt, .wav, .xls, .xlsx
# TESTING WITH PYTHON SCRIPT
[bash]$ echo "import textract" > blah.py
[bash]$ echo "textract.process('./test.pyc')" >> blah.py
[bash]$ python blah.py
The filename extension .pyc is not yet supported by
textract. Please suggest this filename extension here:
https://github.com/deanmalmgren/textract/issues
Available extensions include: .csv, .doc, .docx, .eml, .epub, .gif, .htm, .html, .jpeg, .jpg, .json, .log, .mp3, .msg, .odt, .ogg, .pdf, .png, .pptx, .ps, .psv, .rtf, .tff, .tif, .tiff, .tsv, .txt, .wav, .xls, .xlsx
Does the actual filename in question have parentheses in it or something?
Hi Dean, thank you so much for investigating. I tried what you did, i.e. processing an empty .pyc file, and I still got the same error. I suspect this issue only happens on Windows. I will investigate further and provide more details later.
Hi!
The re library raises an error on Windows because Windows paths contain backslashes ("\").
The following workaround works for me (file: parsers\__init__.py):
# from filenames
parsers_dir = os.path.join(os.path.dirname(__file__))
glob_filename = os.path.join(parsers_dir, "*" + _FILENAME_SUFFIX + ".py")
glob_filename = glob_filename.replace("\\", "/")  # <--- THIS
ext_re = re.compile(glob_filename.replace('*', "(?P<ext>\w+)"))
for filename in glob.glob(glob_filename):
    filename = filename.replace("\\", "/")  # <--- THIS
    ext_match = ext_re.match(filename)
    ext = ext_match.groups()[0]
    extensions.append(ext)
    extensions.append('.' + ext)
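For anyone who wants to see the failure without a Windows machine, here is a minimal sketch of why the backslashes break the pattern. It is not the textract source: the install path and the "_parser" suffix are made up for illustration, and ntpath (the Windows flavour of os.path) is used so it runs on any OS. The re.escape variant at the end is an alternative I am assuming would also work, not the workaround posted above.

```python
import ntpath  # Windows flavour of os.path, importable on any OS
import re

# Hypothetical install location and suffix, chosen only for illustration.
parsers_dir = r"C:\site-packages\textract\parsers"
suffix = "_parser"

glob_filename = ntpath.join(parsers_dir, "*" + suffix + ".py")

# Naive pattern: every backslash in the path is read as a regex escape.
# Depending on the Python version this fails on an unknown escape such as
# "\p", or on the ")" left unbalanced once "\(" is read as a literal paren.
naive_pattern = glob_filename.replace("*", r"(?P<ext>\w+)")
try:
    re.compile(naive_pattern)
    naive_error = None
except re.error as exc:
    naive_error = exc

# Alternative: escape the literal parts of the path, then splice in the group.
prefix, tail = glob_filename.split("*")
ext_re = re.compile(re.escape(prefix) + r"(?P<ext>\w+)" + re.escape(tail) + "$")

match = ext_re.match(ntpath.join(parsers_dir, "pdf" + suffix + ".py"))
ext = match.group("ext")
```

Replacing the backslashes with forward slashes, as in the workaround, avoids the problem in a different way: the pattern then contains no backslashes other than the intended "\w".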
I passed a file with an unsupported format to textract using the following call:
textract.process('./test.pyc')
and I got the following error:
Python 2.7.11 (v2.7.11:6d1b6a68f775, Dec 5 2015, 20:40:30) [MSC v.1500 64 bit (AMD64)] on win32
VERSION = '1.6.1'
I believe the program was trying to raise an exception for the unsupported extension, but it actually crashed inside the re module. Could you please look into it?