numbers ignored - GitHub issues

sylwiaoz commented 5 years ago

Hello,

When extracting content from docx files, numbers are ignored. I mean not only section numbers but also those that are part of sentences. Is there a way to overcome this problem? I don't mind losing section numbers, but the rest is important for me to keep.

I would appreciate it if you could help with that.

Cheers Sylwia

jpweytjens commented 5 years ago

Textract uses python-docx2txt, which can't extract numbers from section titles. Numbers in plain text should be extracted, though. Can you provide the following information so I can investigate the issue further?

OS: [e.g. Windows 10]
Textract version [e.g. 1.6.3]
Python version [e.g. 3.7]
Virtual environment (yes/no)
The docx file (if possible)

An upcoming version of textract will add pandoc as a docx parsing method. Pandoc can extract text that python-docx2txt cannot, such as footnotes.
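To illustrate the point about section numbers: automatic heading and list numbers in a docx are normally not stored as text. The paragraph only references a numbering definition through `w:numPr`, while digits typed into a sentence sit inside `w:t` text runs. The sketch below uses only the standard library and a hypothetical, stripped-down `document.xml` (not a complete docx, and not python-docx2txt's actual code) to show which text a run-based extractor can see.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML namespace used by word/document.xml
W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"

# Hypothetical minimal document body: one auto-numbered heading and one plain
# paragraph. The heading number itself appears nowhere as text.
DOCUMENT_XML = f"""<w:document xmlns:w="{W}">
  <w:body>
    <w:p>
      <w:pPr><w:numPr><w:ilvl w:val="0"/><w:numId w:val="1"/></w:numPr></w:pPr>
      <w:r><w:t>Réglage du moteur</w:t></w:r>
    </w:p>
    <w:p>
      <w:r><w:t>micro-switch S2.2 de l\u2019EkrProCom 50 correspondant</w:t></w:r>
    </w:p>
  </w:body>
</w:document>"""


def build_fake_docx() -> bytes:
    """Pack the XML into an in-memory zip at the path a real docx uses."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        zf.writestr("word/document.xml", DOCUMENT_XML.encode("utf-8"))
    return buf.getvalue()


def paragraph_texts(docx_bytes: bytes) -> list:
    """Return the concatenated w:t text of every paragraph."""
    with zipfile.ZipFile(io.BytesIO(docx_bytes)) as zf:
        root = ET.fromstring(zf.read("word/document.xml"))
    paragraphs = []
    for p in root.iter(f"{{{W}}}p"):
        paragraphs.append("".join(t.text or "" for t in p.iter(f"{{{W}}}t")))
    return paragraphs


texts = paragraph_texts(build_fake_docx())
for line in texts:
    print(line)
```

The heading comes back without any number, while "S2.2" and "50" survive because they are ordinary characters in a text run.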

sylwiaoz commented 5 years ago

OS: Ubuntu 18.04 LTS (on Windows 10)
Textract version: 1.6.3
Python version: 3.6.8
Virtual environment: no
The docx file: cannot provide it, sorry.

Just to give an example, the sequence

"micro-switch S2.2 de l’EkrProCom 50 correspondant" (it's French)

is extracted as

micro-switch S . de l’EkrProCom correspondant..

jpweytjens commented 5 years ago

I cannot reproduce this issue with any of the Word files that I have access to. Is it possible to share a small section of the document? For example, a copy of the original file with only the sentence "micro-switch S2.2 de l’EkrProCom 50 correspondant"? Can you also share the code that you use to process the docx file? A screenshot is also fine.

I think the issue might be related to the encoding of the docx file or to how the extracted text is displayed, which is why I ask for a small section of the file or a screenshot.
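One way to rule out a display problem is to look at the decoded string with `repr` and list its digits and non-ASCII characters explicitly. This is only a sketch: `describe_text` is a hypothetical helper, and the sample bytes stand in for whatever `textract.process` returns for the real file.

```python
import unicodedata


def describe_text(raw: bytes, encoding: str = "utf-8") -> dict:
    """Decode raw bytes and report what is really in the string."""
    text = raw.decode(encoding)
    return {
        # repr() shows escapes and hidden characters instead of rendering them
        "repr": repr(text),
        # every digit character that survived extraction, in order
        "digits": [ch for ch in text if ch.isdigit()],
        # non-ASCII code points, which often explain odd-looking output
        "non_ascii": sorted(
            {f"U+{ord(ch):04X} {unicodedata.name(ch, '?')}" for ch in text if ord(ch) > 127}
        ),
    }


# Stand-in for the bytes returned by the extractor (assumption for this sketch)
sample = "micro-switch S2.2 de l\u2019EkrProCom 50 correspondant".encode("utf-8")
report = describe_text(sample)
print(report["repr"])
print(report["digits"])
print(report["non_ascii"])
```

If the digits show up here but not in the final output, the extraction itself is fine and the loss happens in a later step.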

sylwiaoz commented 5 years ago

Here's a docx with the sentence. Never mind the fact that the sentence is split in two in my example. It's due to a sentence recognition function. When I run textract on the docx containing only the original sentence, I still get a result without numbers: ... micro-switch S. de l’EkrProCom correspondant.

test.docx

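import os
import textract

# remove_noise is the poster's own helper; its code is not shown in this thread

def extract_text_self(path):

    basename = os.path.basename(path)
    fileName, fileExtension = os.path.splitext(path)

    # text extraction
    T = []
    # presumably meant to skip temporary files whose name contains '#' or '~';
    # with 'or' the test passes unless both characters are present
    if '#' not in basename and '~' not in basename: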
        if fileExtension == '.docx':
            text =  textract.process(path,method='tesseract')
            app_l = remove_noise(text.decode("utf-8"))
            app = app_l.replace("\xa0", "").replace("--", "").replace("… ", " ").replace(" WA  ", "").replace(".             ", ". ").replace(".        .", ".").replace(".         .", ".").replace(".           .", ".").replace(" .", ".").replace(".   ..", ".").replace(" ..", "").replace(".    .   .", ". ")
            T.append(app)
        if fileExtension == '.pdf':
            text =  textract.process(path,method='pdftotext')
            app_l = remove_noise(text.decode("utf-8"))
            app = app_l.replace("\xa0", "").replace("                                      ", "").replace("   ", " ").replace(". D \x0c", ".").replace(".             ", ". ").replace("                                      ", ".").replace(".         .", ".").replace(".           .", ".").replace(" .", ".").replace("             ", " ").replace(" ..", "").replace("                                      ", ". ").replace(".. ", ". ").replace(". ", " ")
            T.append(app)
    return T

jpweytjens commented 5 years ago

Thank you for all the extra information. Even with the test file, I can't reproduce the issue on Windows or Linux.

Could you try running textract without the remove_noise function? I can't see where textract or the underlying dependencies would be causing this issue.

import textract
text = textract.process("test.docx").decode("utf8")
print(text)

(Also, you provide method="tesseract" for the docx extraction, but this gets ignored for docx files as the method is only available for pdf files.)

sylwiaoz commented 5 years ago

Hi Johannes,

Thank you for your help so far.

Running textract alone gives the expected result:

Si le moteur se déplace dans le mauvais sens, il faut inverser le micro-switch S2.2 de l’EkrProCom 50 correspondant.

sylwiaoz commented 5 years ago

The thing I do not understand, though, is that if I run textract using the function below, I get a string without numbers...

text=extract_text_self('test.docx')
text
['Si le moteur se déplace dans le mauvais sens, il faut inverser le micro-switch S. de l’EkrProCom correspondant.']

jpweytjens commented 5 years ago

Are you sure the output is coming from extract_text_self as shown in your screenshot? Your output is a single-item list ['Si le moteur se déplace dans le mauvais sens, il faut inverser le micro-switch S. de l’EkrProCom correspondant.'], while the function in the screenshot defines the list T but returns text, which is a string.

If the simple test code in my previous comment works, then, from what I can tell, the problem lies with one of the functions you use after textract has processed the docx file.
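A quick way to find the culprit is to apply the post-processing steps one at a time and check after each one whether the digits are still there. The sketch below is only an illustration: `first_step_losing_digits` is a hypothetical helper, and `fake_remove_noise` is an invented stand-in, since the real remove_noise is not shown in this thread.

```python
import re
from typing import Callable, List, Optional, Tuple

Step = Tuple[str, Callable[[str], str]]


def digits_in(text: str) -> List[str]:
    """Return the digit characters of text, in order."""
    return [ch for ch in text if ch.isdigit()]


def first_step_losing_digits(text: str, steps: List[Step]) -> Optional[str]:
    """Apply steps in order; return the name of the first one that changes the digits."""
    before = digits_in(text)
    for name, step in steps:
        text = step(text)
        if digits_in(text) != before:
            return name
    return None


def fake_remove_noise(text: str) -> str:
    # Invented stand-in (assumption): strips every run of digits
    return re.sub(r"\d+", "", text)


steps = [
    ("strip nbsp", lambda s: s.replace("\xa0", "")),
    ("fake_remove_noise", fake_remove_noise),
    ("tidy dots", lambda s: s.replace(" .", ".")),
]

source = "il faut inverser le micro-switch S2.2 de l\u2019EkrProCom 50 correspondant."
culprit = first_step_losing_digits(source, steps)
print(culprit)
```

Replacing the entries in `steps` with the real functions and `.replace(...)` calls from your pipeline should point at the step that drops the numbers, or return `None` if none of them does.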

sylwiaoz commented 5 years ago

Yes the output is coming from extract_text_self. I just forgot to remove T when I removed the lines related to remove_noise.

Thank you for your help again. As the issue does not seem to be related to textract after all, I am closing it.

Cheers Sylwia

deanmalmgren / textract

numbers ignored #305