deanmalmgren / textract

extract text from any document. no muss. no fuss.
http://textract.readthedocs.io
MIT License
3.89k stars, 599 forks

not able extract text from file using python package #298

Closed swamyaddala closed 5 years ago

swamyaddala commented 5 years ago

While using the python package to extract text from a file, I get an error:

error: syntax error near unexpected token `('

code:

import textract
text = textract.process('root/Desktop/from.pdf')

Also, how do I change the language being identified from English to another language?

jpweytjens commented 5 years ago

Could you provide more information please? Specifically, your python version, the version of textract you're using and the full error log.


swamyaddala commented 5 years ago

python version: Python 3.7.4
textract version: textract 1.6.1

error:

./swamy.py: line 2: syntax error near unexpected token `('
./swamy.py: line 2: `text = textract.process('root/Desktop/from.pdf')'

swamyaddala commented 5 years ago

> Could you provide more information please? Specifically, your python version, the version of textract you're using and the full error log.

Sir, I have replied with the error log, please give me a solution.

jpweytjens commented 5 years ago

> sir I have replied with error log, please give me solution.

Please keep in mind that this project is maintained in the free time of the maintainers. I do my best to reply to all issues and find solutions. Solving these issues works better when you provide clear information on the issue and how to reproduce it, and when you discuss the problem in a respectful way.

I'm not able to reproduce this error on my end. It seems like there's a syntax error unrelated to textract. Is it possible to provide the full error log, i.e. the complete traceback formatted as code? (This can be done by selecting the code in the edit window and surrounding it with backticks.) Right now, it is impossible to understand what is causing this issue as you're only showing the error type and not the traceback. As it seems like a syntax error, please also provide the entire swamy.py file.

Please also tell me how you're running your swamy.py file. Are you running it in an editor, in the terminal, ...? How you invoke this file can be the cause of the unexpected token you're seeing.

swamyaddala commented 5 years ago

How can I use textract to extract text from file of another language other than english ?

I have attached screenshots of the terminal:

[Screenshot from 2019-08-24 19-07-35]
[Screenshot from 2019-08-24 19-08-55]

I have renamed swamy.py to swamy.txt because I am not able to upload the swamy.py file: swamy.txt

After running the command, it creates a file named textract that contains only a screenshot of the terminal, and then shows the error.

jpweytjens commented 5 years ago

There are a few things happening here.

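In the first screenshot, you are running the python script directly as ./swamy.py. Without a shebang, the shell tries to interpret the file as a shell script, which is what produces the syntax error. Running the script this way only works if you add a shebang to it. The following, for example, should work: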

swamy.py

#!/usr/bin/env python
import textract
text = textract.process('from.pdf')

This script can be run as ./swamy.py. The added first line tells the system to run this script with python.

The second screenshot runs the script with python and shows no errors except for the file not being found. Since you're running the script from ~/Desktop, it's sufficient to just specify the filename as shown in the example above. If you want to run the script from a different location than where the file is located, I recommend looking into the pathlib module to specify the path to the file. Do note, however, that with textract version 1.6.3 and below, you will need to convert the pathlib path to a string. (An upcoming version of textract will be able to parse pathlib paths correctly without the need to convert to a string.)
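As a rough illustration of the pathlib suggestion above, here is a minimal sketch. The helper names desktop_file and extract are hypothetical, and the ~/Desktop location is only assumed from the screenshots. The path is converted with str() before it is handed to textract.process, as described for version 1.6.3 and below.

```python
from pathlib import Path


def desktop_file(filename, base=None):
    """Build the full path to a file (hypothetical helper).

    `base` defaults to ~/Desktop, which is an assumption taken from
    the screenshots in this thread.
    """
    base = Path.home() / "Desktop" if base is None else Path(base)
    return base / filename


def extract(path):
    """Extract text from `path` using textract's default method."""
    # Imported here so the path helper above works without textract installed.
    import textract

    # textract 1.6.3 and below expects a string, not a pathlib.Path.
    return textract.process(str(path))
```

With this in place, extract(desktop_file("from.pdf")) should find the file regardless of the directory the script is started from.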

Lastly, from the screenshot it looks like from.pdf is a 'normal' pdf containing plain text. This can be extracted with textract, regardless of language, with the pdfminer and pdftotext methods. These are the default methods. Specifying the language is only required if the pdf contains no plain text and instead contains images, hand-written text, scanned text, and so on. To parse these as English text, you need to do the following:

text = textract.process("from.pdf", method="tesseract", language="eng")

All the options are described in more detail in the documentation. This requires tesseract to be installed, along with the language files tesseract needs. Neither of these comes with textract automatically. To install them, have a look at the tesseract documentation or this unofficial guide.
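For a language other than English, the same call should work with a different tesseract language code. The sketch below is only an illustration: ocr_options and extract_text are hypothetical helper names, "fra" (French) is just an example code, and the matching tesseract language file is assumed to be installed. The result is decoded as UTF-8 on the assumption that textract's default output encoding is used.

```python
def ocr_options(language=None):
    """Return keyword arguments for textract.process (hypothetical helper).

    With no language, an empty dict is returned so textract uses its
    default method; otherwise OCR with tesseract is requested.
    """
    if language is None:
        return {}
    return {"method": "tesseract", "language": language}


def extract_text(path, language=None):
    """Extract text from `path`, optionally via tesseract OCR."""
    # Imported here so ocr_options() can be used without textract installed.
    import textract

    raw = textract.process(str(path), **ocr_options(language))
    # Assumption: the output is a UTF-8 encoded byte string.
    return raw.decode("utf-8")
```

For example, extract_text("from.pdf", language="fra") would request OCR for French text, while extract_text("from.pdf") keeps the default method for pdfs that already contain plain text.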

jpweytjens commented 5 years ago

Closing this issue as it should be solved. If you still encounter problems, I'll happily reopen this issue.