deanmalmgren / textract

extract text from any document. no muss. no fuss.
http://textract.readthedocs.io
MIT License

The extension .pptx is not supported. I know ppt is not but why .pptx? In the available extensions it shows .pptx but it is not working. #333

Closed anirudhpnbb closed 4 years ago

anirudhpnbb commented 4 years ago

I am attaching the error it throws:

Traceback (most recent call last):
  File "/usr/local/lib/python3.7/dist-packages/pptx/parts/image.py", line 11, in <module>
    from PIL import Image as PIL_Image
  File "/usr/lib/python3/dist-packages/PIL/Image.py", line 60, in <module>
    from . import _imaging as core
ImportError: cannot import name '_imaging' from 'PIL' (/usr/lib/python3/dist-packages/PIL/__init__.py)

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/lib/python3.7/dist-packages/textract/parsers/__init__.py", line 69, in process
    rel_module, 'textract.parsers'
  File "/usr/lib/python3.7/importlib/__init__.py", line 127, in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
  File "<frozen importlib._bootstrap>", line 1006, in _gcd_import
  File "<frozen importlib._bootstrap>", line 983, in _find_and_load
  File "<frozen importlib._bootstrap>", line 967, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 677, in _load_unlocked
  File "<frozen importlib._bootstrap_external>", line 728, in exec_module
  File "<frozen importlib._bootstrap>", line 219, in _call_with_frames_removed
  File "/usr/local/lib/python3.7/dist-packages/textract/parsers/pptx_parser.py", line 1, in <module>
    import pptx
  File "/usr/local/lib/python3.7/dist-packages/pptx/__init__.py", line 14, in <module>
    from pptx.api import Presentation  # noqa
  File "/usr/local/lib/python3.7/dist-packages/pptx/api.py", line 15, in <module>
    from .package import Package
  File "/usr/local/lib/python3.7/dist-packages/pptx/package.py", line 14, in <module>
    from .parts.image import Image, ImagePart
  File "/usr/local/lib/python3.7/dist-packages/pptx/parts/image.py", line 13, in <module>
    import Image as PIL_Image
ModuleNotFoundError: No module named 'Image'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/anirudh/Desktop/pdo_ext/text_process.py", line 15, in <module>
    main()
  File "/home/anirudh/Desktop/pdo_ext/text_process.py", line 8, in main
    startprocess('/home/anirudh/Desktop/new_mock')
  File "/home/anirudh/Desktop/pdo_ext/extraction.py", line 23, in extraction_process
    textex(allfiles, alltext) # Starting the text extraction on all the files available.
  File "/home/anirudh/Desktop/pdo_ext/textext.py", line 21, in text_extraction
    output = textract.process(file_name).decode('utf-8')
  File "/usr/local/lib/python3.7/dist-packages/textract/parsers/__init__.py", line 72, in process
    raise exceptions.ExtensionNotSupported(ext)
textract.exceptions.ExtensionNotSupported: The filename extension .pptx is not yet supported by textract. Please suggest this filename extension here:

https://github.com/deanmalmgren/textract/issues

Available extensions include: .csv, .doc, .docx, .eml, .epub, .gif, .htm, .html, .jpeg, .jpg, .json, .log, .mp3, .msg, .odt, .ogg, .pdf, .png, .pptx, .ps, .psv, .rtf, .tff, .tif, .tiff, .tsv, .txt, .wav, .xls, .xlsx

wajdikhattel commented 4 years ago

Well, .pptx is supported. If you read the traceback carefully, the actual error is ModuleNotFoundError: No module named 'Image'. The misleading message comes from textract/parsers/__init__.py, where a try/except only tests whether the parser was imported correctly, not whether the parser exists. Any import failure inside the parser therefore gets reported as ExtensionNotSupported.
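To illustrate the masking, here is a minimal sketch and not textract's actual code. `ExtensionNotSupported`, `load_parser_masking`, `load_parser_strict`, and the throwaway `demo_pptx_parser` module are all hypothetical names made up for the demo. It assumes the loader catches `ImportError` broadly, which is what the chained traceback suggests.

```python
import importlib
import sys
import tempfile
from pathlib import Path


class ExtensionNotSupported(Exception):
    """Hypothetical stand-in for textract.exceptions.ExtensionNotSupported."""


def load_parser_masking(module_name):
    """Report every import failure as an unsupported extension."""
    try:
        return importlib.import_module(module_name)
    except ImportError:
        raise ExtensionNotSupported(module_name)


def load_parser_strict(module_name):
    """Report an unsupported extension only if the parser module itself is missing."""
    try:
        return importlib.import_module(module_name)
    except ModuleNotFoundError as exc:
        if exc.name == module_name:
            raise ExtensionNotSupported(module_name) from exc
        raise  # a dependency of the parser is missing: surface the real error


def demo():
    """Import a throwaway parser whose dependency is missing, using both loaders."""
    with tempfile.TemporaryDirectory() as tmp:
        # Hypothetical parser module that depends on a package that is not installed.
        Path(tmp, "demo_pptx_parser.py").write_text("import demo_missing_dependency\n")
        sys.path.insert(0, tmp)
        importlib.invalidate_caches()
        try:
            outcomes = []
            for loader in (load_parser_masking, load_parser_strict):
                try:
                    loader("demo_pptx_parser")
                except Exception as exc:
                    outcomes.append(type(exc).__name__)
            return outcomes
        finally:
            sys.path.remove(tmp)
```

With the masking loader the missing dependency shows up as an unsupported extension. The strict loader lets the underlying ModuleNotFoundError through, so the message points at the package that actually needs installing.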

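So basically, if you install the missing Image module, this should be solved for you. Note that the first traceback shows PIL itself failing to import (cannot import name '_imaging'), so reinstalling Pillow, which provides PIL, may be what is actually needed.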
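Before installing anything, it may help to find out which import really fails on your machine. Going by the traceback, python-pptx first tries `from PIL import Image` and only then falls back to `import Image`. The helper below is a rough diagnostic sketch; `check_imports` is a hypothetical name and not part of textract or python-pptx.

```python
import importlib


def check_imports(module_names):
    """Try each import on its own and record the error, if any."""
    report = {}
    for name in module_names:
        try:
            importlib.import_module(name)
            report[name] = "ok"
        except ImportError as exc:
            # ModuleNotFoundError is a subclass of ImportError, so both are caught.
            report[name] = f"{type(exc).__name__}: {exc}"
    return report


# Example call, using the module names that appear in the traceback:
#   check_imports(["PIL.Image", "Image", "pptx"])
```

If `PIL.Image` is the entry that fails, the problem probably lies with the Pillow installation rather than with textract's .pptx support.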