Closed Freakwill closed 5 years ago
AFAIK the simplest way to do what you want is to use the base exporter:

    from pydocx.export.base import PyDocXExporter

    # A .docx file is a zip archive, so open it in binary mode.
    with open('test.docx', 'rb') as fp:
        text = ''.join(result for result in PyDocXExporter(fp).export())
        print(text)
cc @Freakwill
It does not preserve the newline character '\n'. What should I do if I want to keep it?
Then you need to create your own exporter:

    from pydocx.export.base import PyDocXExporter

    class RawExporter(PyDocXExporter):
        def apply_newlines(self, nodes):
            # Join the exported pieces with newlines.
            if nodes:
                return '\n'.join(nodes)
            return ''

        def export_paragraph(self, paragraph):
            nodes = super(RawExporter, self).export_paragraph(paragraph)
            return self.apply_newlines(nodes)

        def export_break(self, br):
            nodes = super(RawExporter, self).export_break(br)
            return self.apply_newlines(nodes)

    # A .docx file is a zip archive, so open it in binary mode.
    with open('test.docx', 'rb') as fp:
        output = ''.join(result for result in RawExporter(fp).export())
        print(output)
Is there a cheap method/API to extract the plain text from doc files? Something like read_text(doc)?
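For comparison, a helper shaped like the read_text(doc) asked about can be sketched with the standard library alone. This is not part of PyDocX: the name read_text and its behaviour are assumptions made for illustration. It relies only on the fact that a .docx file is a zip archive whose main body lives in word/document.xml, and it joins paragraphs with '\n'. It is a rough sketch that ignores headers, footers, footnotes and tracked deletions, and it may repeat text if paragraphs are nested (for example inside text boxes).

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML namespace used by word/document.xml.
W_NS = '{http://schemas.openxmlformats.org/wordprocessingml/2006/main}'


def read_text(docx):
    """Hypothetical helper: return the plain text of a .docx file.

    `docx` may be a path or a binary file-like object.
    Paragraphs are separated by '\n'; explicit line breaks and tabs
    inside a paragraph are kept as '\n' and '\t'.
    """
    with zipfile.ZipFile(docx) as archive:
        root = ET.fromstring(archive.read('word/document.xml'))

    paragraphs = []
    for paragraph in root.iter(W_NS + 'p'):
        parts = []
        for node in paragraph.iter():
            if node.tag == W_NS + 't':
                # Text run content.
                parts.append(node.text or '')
            elif node.tag in (W_NS + 'br', W_NS + 'cr'):
                # Explicit line break inside a paragraph.
                parts.append('\n')
            elif node.tag == W_NS + 'tab':
                parts.append('\t')
        paragraphs.append(''.join(parts))
    return '\n'.join(paragraphs)
```

With this in place, print(read_text('test.docx')) would print the document body with one line per paragraph. If you need PyDocX's handling of lists, tables or styles, the custom exporter above is still the better fit.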