CenterForOpenScience / pydocx

An extendable docx file format parser and converter
Other
183 stars 55 forks

Export content only #240

Open tritium21 opened 7 years ago

tritium21 commented 7 years ago

It would be extremely helpful to me if it were possible to export only the content of <body>, with no <html> or <head> tags. I intend to pass the document on to further processing and will provide those parts myself.
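As a stopgap while the exporter always emits a full page, the body markup can be cut out of the generated HTML in a post-processing step. The following is only a minimal sketch: `extract_body` is a hypothetical helper, not part of pydocx, and it assumes the output contains at most one well-formed body element.

```python
import re

# Matches the first <body ...> ... </body> pair, ignoring case and
# spanning line breaks. Assumes a single, well-formed body element.
_BODY_RE = re.compile(r"<body\b[^>]*>(.*?)</body\s*>", re.IGNORECASE | re.DOTALL)


def extract_body(html):
    """Return the markup inside <body>, or the input unchanged if none is found."""
    match = _BODY_RE.search(html)
    return match.group(1).strip() if match else html
```

It could be applied to the string returned by `PyDocX.to_html(...)`, for example `extract_body(PyDocX.to_html('test.docx'))`.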

bitscompagnie commented 6 years ago

Hello @tritium21,

You could use the pandoc tool to achieve what you are looking for. Once installed, you can convert a document to plain text with the following command in the terminal or command prompt: pandoc test.docx -f docx -t plain -s -o test.txt

Hope the above helps you.
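If the conversion needs to run from a script rather than by hand, the same pandoc call can be wrapped in Python. This is a rough sketch, assuming pandoc is installed and on the PATH; `build_pandoc_command` and `convert` are hypothetical helper names. Passing `to_format="html"` without `-s` should give an HTML fragment, since pandoc only adds the surrounding header and footer in standalone mode.

```python
import subprocess


def build_pandoc_command(src, dst, to_format="plain", standalone=False):
    """Build the argument list for converting a .docx file with pandoc."""
    cmd = ["pandoc", src, "-f", "docx", "-t", to_format, "-o", dst]
    if standalone:
        # -s asks pandoc for a complete document instead of a fragment.
        cmd.append("-s")
    return cmd


def convert(src, dst, **kwargs):
    """Run pandoc; raises CalledProcessError if pandoc exits with an error."""
    subprocess.run(build_pandoc_command(src, dst, **kwargs), check=True)
```

For example, `convert("test.docx", "test.txt", standalone=True)` runs the same command as above.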

jlward commented 6 years ago

It would not be difficult to create a custom parser that strips out all the tags. It's something we've wanted to include anyway, so if you end up using that approach, PRs are welcome.
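Until such a parser exists, one alternative is to strip the tags from the HTML output after the fact. The sketch below uses only the standard library and is not the custom exporter described above; `strip_tags` is a hypothetical helper, and it keeps all text nodes, including any that sit inside style or script elements.

```python
from html.parser import HTMLParser


class _TextCollector(HTMLParser):
    """Collects text nodes and ignores every tag."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def strip_tags(html):
    """Return the text content of an HTML string with all tags removed."""
    collector = _TextCollector()
    collector.feed(html)
    collector.close()  # flush any text still buffered by the parser
    return "".join(collector.chunks)
```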

IuryAlves commented 5 years ago

I have done something similar:

from pydocx.export.base import PyDocXExporter

class RawExporter(PyDocXExporter):

    def apply_newlines(self, nodes):
        # Join the exported pieces with newlines; a missing or empty
        # result becomes an empty string.
        if nodes:
            return '\n'.join(nodes)
        return ''

    def export_paragraph(self, paragraph):
        nodes = super(RawExporter, self).export_paragraph(paragraph)
        return self.apply_newlines(nodes)

    def export_break(self, br):
        nodes = super(RawExporter, self).export_break(br)
        return self.apply_newlines(nodes)

# A .docx file is a zip archive, so it must be opened in binary mode.
with open('test.docx', 'rb') as fp:
    output = ''.join(RawExporter(fp).export())
    print(output)

@tritium21