CenterForOpenScience / pydocx

An extendable docx file format parser and converter

Easy API to extract the pure text from doc files #252

Closed Freakwill closed 5 years ago

Freakwill commented 5 years ago

Is there a cheap method/API to extract the pure text from doc files? Something like read_text(doc).

IuryAlves commented 5 years ago

AFAIK the simplest way to do what you want is to use the base exporter:

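from pydocx.export.base import PyDocXExporter

# A .docx file is a zip archive, so open it in binary mode.
with open('test.docx', 'rb') as fp:
    # export() yields string fragments; join them into one string.
    text = ''.join(PyDocXExporter(fp).export())
    print(text)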

cc @Freakwill

Freakwill commented 5 years ago

That snippet does not keep the newline character '\n'. If I want to keep it, what should I do?

IuryAlves commented 5 years ago

Then you need to create your own exporter:

from pydocx.export.base import PyDocXExporter

class RawExporter(PyDocXExporter):
    """Exporter that joins the pieces of each paragraph and break with newlines."""

    def apply_newlines(self, nodes):
        # Guard against an empty or missing result from the parent method.
        if nodes:
            return '\n'.join(nodes)
        return ''

    def export_paragraph(self, paragraph):
        nodes = super(RawExporter, self).export_paragraph(paragraph)
        return self.apply_newlines(nodes)

    def export_break(self, br):
        nodes = super(RawExporter, self).export_break(br)
        return self.apply_newlines(nodes)

# A .docx file is a zip archive, so open it in binary mode.
with open('test.docx', 'rb') as fp:
    output = ''.join(RawExporter(fp).export())
    print(output)
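
For comparison, if all that is needed is a read_text(doc)-style helper, here is a minimal sketch that does not use pydocx at all: it reads the document XML straight from the .docx zip archive with the standard library. The names read_text and make_demo_docx are hypothetical, not part of pydocx. It assumes the main part lives at the conventional path word/document.xml, and it ignores headers, footnotes, and text boxes nested inside paragraphs.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML namespace used by the elements in word/document.xml
W_NS = '{http://schemas.openxmlformats.org/wordprocessingml/2006/main}'


def read_text(docx_file):
    """Return the plain text of a .docx file, one line per paragraph.

    `docx_file` may be a path or a binary file-like object.
    Hypothetical helper; assumes the main part is word/document.xml.
    """
    with zipfile.ZipFile(docx_file) as archive:
        root = ET.fromstring(archive.read('word/document.xml'))

    lines = []
    for paragraph in root.iter(W_NS + 'p'):
        parts = []
        # Only look at direct children of runs, so tab-stop definitions
        # in the paragraph properties are not mistaken for tab characters.
        for run in paragraph.iter(W_NS + 'r'):
            for node in run:
                if node.tag == W_NS + 't':
                    parts.append(node.text or '')
                elif node.tag in (W_NS + 'br', W_NS + 'cr'):
                    parts.append('\n')
                elif node.tag == W_NS + 'tab':
                    parts.append('\t')
        lines.append(''.join(parts))
    return '\n'.join(lines)


def make_demo_docx():
    """Build a tiny in-memory archive holding only word/document.xml (demo only)."""
    xml = (
        '<w:document xmlns:w='
        '"http://schemas.openxmlformats.org/wordprocessingml/2006/main">'
        '<w:body>'
        '<w:p><w:r><w:t>first</w:t></w:r></w:p>'
        '<w:p><w:r><w:t>second</w:t><w:br/><w:t>line</w:t></w:r></w:p>'
        '</w:body></w:document>'
    )
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, 'w') as archive:
        archive.writestr('word/document.xml', xml)
    buffer.seek(0)
    return buffer
```

Calling read_text('test.docx') on a real file should give one line per paragraph, with explicit line breaks inside a paragraph kept as '\n'. The pydocx exporter above is still the better fit when the document structure (lists, tables, styles) matters.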