ShayHill / docx2python

Extract docx headers, footers, (formatted) text, footnotes, endnotes, properties, and images.
https://docx2python.readthedocs.io/en/latest/
MIT License

Is there any way to extract the table into markdown format? #57

ChanghaoLau closed this issue 3 months ago

ChanghaoLau commented 6 months ago

I want to extract the tables in a .docx file into markdown format while maintaining each table's position in the document. That means I can't use python-docx's document.paragraphs and document.tables to handle paragraphs and tables separately (this would destroy the positional relationship between them).

docx2python is very easy to use. I would like to know whether docx2python can save tables in markdown format, or whether it can separate tables, images and paragraphs in output.body. Thank you!

ShayHill commented 6 months ago

I am going to leave this issue open for a bit and think about how this might be seamlessly accomplished. Until then, here's a script that will identify tables for you.

https://github.com/ShayHill/transpose_docx_tables

ShayHill commented 3 months ago

As of docx2python v3.0.0, tables are guaranteed to be n×m (n rows by m columns) and are straightforward to identify. See the details near the top of the README file. I've also left an example of exporting tables as markdown in the tests folder; it's referenced in the README.
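
For anyone who wants a starting point before opening the tests folder, here is a minimal sketch of the conversion step. It is not the example from the repository and it does not call docx2python at all. It assumes a table arrives as a list of rows, each row a list of cells, and each cell a list of paragraph strings, and that the first row is the header. The name `table_to_markdown` is hypothetical.

```python
def table_to_markdown(table: list[list[list[str]]]) -> str:
    """Render one table (rows > cells > paragraph strings) as a Markdown pipe table.

    Assumption: the input is rectangular (n rows by m columns) and the
    first row serves as the header row.
    """

    def render_cell(paragraphs: list[str]) -> str:
        # Drop empty paragraphs, keep the rest on one line, and escape
        # pipes so they are not read as column separators.
        parts = [p.strip().replace("\n", " ") for p in paragraphs if p.strip()]
        return "<br>".join(parts).replace("|", "\\|")

    rows = [[render_cell(cell) for cell in row] for row in table]
    if not rows:
        return ""

    header, *body = rows
    lines = [
        "| " + " | ".join(header) + " |",
        "| " + " | ".join("---" for _ in header) + " |",
    ]
    lines += ["| " + " | ".join(row) + " |" for row in body]
    return "\n".join(lines)


# Hypothetical sample data standing in for one extracted table.
example = [
    [["Name"], ["Qty"]],
    [["apple"], ["3"]],
    [["pear", "(ripe)"], ["1|2"]],
]
markdown = table_to_markdown(example)
print(markdown)
```

To keep each table's position in the document, one option is to walk the extracted body in order and write out either plain paragraphs or the result of this function for each item, depending on whether the item is a table. How tables are told apart from ordinary paragraphs is described in the README and is deliberately left out of this sketch.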