deanmalmgren / textract

extract text from any document. no muss. no fuss.
http://textract.readthedocs.io
MIT License

numbers ignored #305

Closed sylwiaoz closed 5 years ago

sylwiaoz commented 5 years ago

Hello,

When extracting content from docx files, numbers are ignored. I mean not only section numbers but also those that are part of sentences. Is there a way to overcome this problem? I don't mind losing section numbers, but the rest is important for me to keep.

I would appreciate it if you could help with that.

Cheers Sylwia

jpweytjens commented 5 years ago

Textract uses python-docx2txt, which can't extract numbers from section titles. Numbers in plain text should be extracted, though. Can you provide the following information to further investigate the issue?

An upcoming version of textract will add pandoc as a docx parsing method. Pandoc can extract text that python-docx2txt cannot, such as footnotes.

sylwiaoz commented 5 years ago

OS: Ubuntu 18.04 LTS (on Windows 10)
Textract version: 1.6.3
Python version: 3.6.8
Virtual environment: no
The docx file: cannot provide it, sorry.

Just to give an example, the sequence

micro-switch S2.2 de l’EkrProCom 50 correspondant (it's French)

is extracted as

micro-switch S . de l’EkrProCom correspondant..

jpweytjens commented 5 years ago

I cannot reproduce this issue with any of the Word files that I have access to. Is it possible to share a small section of the document? For example, a copy of the original file with only the sentence "micro-switch S2.2 de l’EkrProCom 50 correspondant"? Can you also share the code that you use to process the docx file? A screenshot is also fine.

I think the issue might be related to the docx file (or its encoding) or to how the extracted text is displayed, which is why I am asking for a small section of the file or a screenshot.
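One way to rule out a display problem is to inspect the decoded string itself rather than how a terminal or notebook renders it. The sketch below is only an illustration: the byte string is a stand-in for what textract.process would return, and describe_digits is a hypothetical helper, not part of textract.

```python
import unicodedata

# Stand-in for the bytes returned by textract.process(...); an assumption for illustration.
raw = "micro-switch S2.2 de l’EkrProCom 50 correspondant".encode("utf-8")


def describe_digits(data, encoding="utf-8"):
    """Decode the bytes and list every digit character with its Unicode name."""
    text = data.decode(encoding)
    return [(ch, unicodedata.name(ch)) for ch in text if ch.isdigit()]


# repr() shows the exact characters in the string, independent of fonts or rendering.
print(repr(raw.decode("utf-8")))
# A non-empty list means the digits are present in the extracted text.
print(describe_digits(raw))
```

If the digits show up here but not in the final result, the extraction and the encoding are probably fine and the loss happens later.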

sylwiaoz commented 5 years ago

Here's a docx with the sentence. Never mind the fact that the sentence is split in two in my example. It's due to a sentence recognition function. When I run textract on the docx containing only the original sentence, I still get a result without numbers: ... micro-switch S. de l’EkrProCom correspondant.

test.docx

import os
import textract

# remove_noise is a user-defined helper that is not shown in this thread
def extract_text_self(path):

    basename = os.path.basename(path)
    fileName, fileExtension = os.path.splitext(path)

    # text extraction
    T = []
    if '#' not in basename and '~' not in basename:
        if fileExtension == '.docx':
            text =  textract.process(path,method='tesseract')
            app_l = remove_noise(text.decode("utf-8"))
            app = app_l.replace("\xa0", "").replace("--", "").replace("… ", " ").replace(" WA  ", "").replace(".             ", ". ").replace(".        .", ".").replace(".         .", ".").replace(".           .", ".").replace(" .", ".").replace(".   ..", ".").replace(" ..", "").replace(".    .   .", ". ")
            T.append(app)
        if fileExtension == '.pdf':
            text =  textract.process(path,method='pdftotext')
            app_l = remove_noise(text.decode("utf-8"))
            app = app_l.replace("\xa0", "").replace("                                      ", "").replace("   ", " ").replace(". D \x0c", ".").replace(".             ", ". ").replace("                                      ", ".").replace(".         .", ".").replace(".           .", ".").replace(" .", ".").replace("             ", " ").replace(" ..", "").replace("                                      ", ". ").replace(".. ", ". ").replace(". ", " ")
            T.append(app)
    return T

jpweytjens commented 5 years ago

Thank you for all the extra information. Even with the test file, I can't reproduce the issue on Windows or Linux.

Could you try running textract without the remove_noise function? I can't see where textract or the underlying dependencies would be causing this issue.

import textract
text = textract.process("test.docx").decode("utf8")
print(text)

(Also, you provide method="tesseract" for the docx extraction, but this gets ignored for docx files as the method is only available for pdf files.)

sylwiaoz commented 5 years ago

Hi Johannes,

Thank you for your help so far.

Running textract alone gives the expected result:

Si le moteur se déplace dans le mauvais sens, il faut inverser le micro-switch S2.2 de l’EkrProCom 50 correspondant.

sylwiaoz commented 5 years ago

What I do not understand, though, is that if I run textract using the function below, I get a string without numbers...

text = extract_text_self('test.docx')
text
['Si le moteur se déplace dans le mauvais sens, il faut inverser le micro-switch S. de l’EkrProCom correspondant.']

image

jpweytjens commented 5 years ago

Are you sure the output is coming from extract_text_self as shown in your screenshot? Your output is a single-item list ['Si le moteur se déplace dans le mauvais sens, il faut inverser le micro-switch S. de l’EkrProCom correspondant.'], while the function in the screenshot defines the list T but returns text, which is a string.

If the simple test code in my previous comment works, then, from what I can tell, the problem lies with one of the functions you use after textract has processed the docx file.
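A rough way to find the responsible step is to run the post-processing functions one at a time and check after each whether digits have disappeared. The sketch below rests on assumptions: the real remove_noise is not shown in this thread, so the version here is a hypothetical stand-in that strips digits, and collapse_spaces and first_step_dropping_digits are made-up helper names.

```python
import re


def remove_noise(text):
    # Hypothetical stand-in: the real remove_noise is not shown in this thread.
    # This version deletes every digit, which reproduces the reported symptom.
    return re.sub(r"\d+", "", text)


def collapse_spaces(text):
    # Hypothetical second cleaning step: squeeze runs of spaces into one.
    return re.sub(r" {2,}", " ", text)


def count_digits(text):
    return sum(ch.isdigit() for ch in text)


def first_step_dropping_digits(text, steps):
    """Apply the steps in order and return the name of the first one
    that reduces the number of digits, or None if no step does."""
    for step in steps:
        cleaned = step(text)
        if count_digits(cleaned) < count_digits(text):
            return step.__name__
        text = cleaned
    return None


# Stand-in for the decoded textract output.
raw = "il faut inverser le micro-switch S2.2 de l’EkrProCom 50 correspondant."
print(first_step_dropping_digits(raw, [collapse_spaces, remove_noise]))
```

With these stand-ins the check points at remove_noise; with the real functions, whichever name it reports is the place to look.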

sylwiaoz commented 5 years ago

Yes the output is coming from extract_text_self. I just forgot to remove T when I removed the lines related to remove_noise.

Thank you for your help again. As the issue does not seem to be related to textract after all, I am closing it.

Cheers Sylwia