bokuweb / docx-rs

A .docx file writer with Rust/WebAssembly.
https://bokuweb.github.io/docx-rs/
MIT License

How do I detect tables and their contents? #651

Closed: Mrodent closed this issue 12 months ago

Mrodent commented 1 year ago

This is about reading .docx files rather than writing them.

I have some lines like this (based on this page):

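// Read the .docx and turn docx-rs's JSON output into a serde_json::Value.
let data: Value = serde_json::from_str(&read_docx(&read_to_vec(file_name)?)?.json())?;
if let Some(children) = data["document"]["children"].as_array() {
    // Add up the word count that read_children reports for each top-level node.
    for node in children {
        n_words_in_docx += read_children(node);
    }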
    ...

The idea is that you get an array of "nodes". Nodes can themselves have child nodes, so the function read_children calls itself recursively. Despite that, no text inside tables in the Word document is picked up. That's my main question, but I'm also unsure how headers, footers, footnotes, comments, text boxes and watermarks are handled. I want to sweep up all the text in the file if possible.

NB I'm a Rust uber-newb, but I have now cloned your repo, and am currently taking a look at reader/mod.rs fn read_docx and also reader/read_zip.rs fn read_zip... is it possible that one of these fails to parse (document.xml?) as it should?

With my document (consisting of just one table), I get just two nodes. Neither produces any text, using the method in the above page. So then I looked at the json produced:

println!("read_children...\n{}", serde_json::to_string_pretty(node).unwrap());
read_children...
{
  "data": {
    "grid": [
      4928,
      4926
    ],
    "hasNumbering": false,
    "property": {
      "borders": {
        "bottom": null,
        "insideH": null,
        "insideV": null,
        "left": null,
        "right": null,
        "top": null
      },
      "justification": "left",
      "style": "TableGrid",
      "width": {
        "width": 0,
        "widthType": "auto"
      }
    },
    "rows": [
      {
        "data": {
          "cells": [
            {
              "data": {
                "children": [
                  {
                    "data": {
                      "children": [
                        {
                          "data": {
                            "children": [
                              {
                                "data": {
                                  "preserveSpace": true,
                                  "text": "CHAPTER I. "
                                },
                                "type": "text"
                              }
                            ],
                            "runProperty": {}
                          },
                          "type": "run"
                        }
                      ],
                      "hasNumbering": false,
                      "id": "00000001",
                      "property": {
                        "indent": {
                          "end": null,
                          "firstLineChars": null,
                          "hangingChars": null,
                          "specialIndent": {
                            "type": "firstLine",
                            "val": 0
                          },
                          "start": 0,
                          "startChars": null
                        },
                        "runProperty": {},
                        "tabs": []
                      }
                    },
                    "type": "paragraph"
                  }
                ],
                "hasNumbering": false,
                "property": {
                  "borders": null,
                  "gridSpan": null,
                  "shading": null,
                  "textDirection": null,
                  "verticalAlign": null,
                  "verticalMerge": null,
                  "width": {
                    "width": 4928,
                    "widthType": "dxa"
                  }
                }
              },
              "type": "tableCell"
            },
            {
              "data": {
                "children": [
                  {
...

The "text": "CHAPTER I. " entry in the above is not identified as text. Is it possible that the parsing of such a file is not recursively exploring the "rows", "cells" and "children" keys? I'm way out of my depth now.

Mrodent commented 12 months ago

Closing this because I think I've now worked out that the problem lies with the method read_children in the linked page (which has nothing to do with your project). It seems you have to explore the elements not only of node["data"]["children"], but also of node["data"]["rows"] and node["data"]["cells"], and possibly other keys which that third-party page hasn't identified.
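
For anyone landing here later, here is a rough sketch of that idea. Rather than listing "children", "rows" and "cells" by hand, it descends into every array and object and picks up each node whose "type" is "text". To keep it compilable without external crates it uses a small stand-in enum (Json) instead of serde_json::Value; Json, obj, collect_text, count_words and sample_table are all hypothetical names for illustration, not part of docx-rs. The sample mirrors only the keys visible in the JSON dump above.

```rust
// Minimal stand-in for serde_json::Value (an assumption for this sketch),
// modelling only the variants the walk needs.
#[derive(Debug, Clone, PartialEq)]
enum Json {
    Null,
    Str(String),
    Arr(Vec<Json>),
    Obj(Vec<(String, Json)>),
}

impl Json {
    // Look up a key in an object; None for non-objects or missing keys.
    fn get(&self, key: &str) -> Option<&Json> {
        match self {
            Json::Obj(fields) => fields
                .iter()
                .find(|(k, _)| k.as_str() == key)
                .map(|(_, v)| v),
            _ => None,
        }
    }

    // Borrow the string payload, if this value is a string.
    fn as_str(&self) -> Option<&str> {
        match self {
            Json::Str(s) => Some(s.as_str()),
            _ => None,
        }
    }
}

// Hypothetical helper: build an object from (key, value) pairs.
fn obj(fields: Vec<(&str, Json)>) -> Json {
    Json::Obj(fields.into_iter().map(|(k, v)| (k.to_string(), v)).collect())
}

// Collect the text of every {"type": "text"} node, descending into every
// array and object value, so "rows" and "cells" are visited like "children".
fn collect_text(node: &Json, out: &mut Vec<String>) {
    match node {
        Json::Obj(fields) => {
            if node.get("type").and_then(Json::as_str) == Some("text") {
                let text = node
                    .get("data")
                    .and_then(|d| d.get("text"))
                    .and_then(Json::as_str);
                if let Some(t) = text {
                    out.push(t.to_string());
                }
            }
            for (_, value) in fields {
                collect_text(value, out);
            }
        }
        Json::Arr(items) => {
            for item in items {
                collect_text(item, out);
            }
        }
        // Null and plain strings have nothing to descend into.
        _ => {}
    }
}

// Count whitespace-separated words across all collected text nodes.
fn count_words(node: &Json) -> usize {
    let mut texts = Vec::new();
    collect_text(node, &mut texts);
    texts.iter().map(|t| t.split_whitespace().count()).sum()
}

// A cut-down version of the table node from the dump above:
// rows -> cells -> tableCell -> paragraph -> run -> text.
fn sample_table() -> Json {
    let text = obj(vec![
        ("data", obj(vec![("text", Json::Str("CHAPTER I. ".to_string()))])),
        ("type", Json::Str("text".to_string())),
    ]);
    let run = obj(vec![
        ("data", obj(vec![("children", Json::Arr(vec![text]))])),
        ("type", Json::Str("run".to_string())),
    ]);
    let paragraph = obj(vec![
        ("data", obj(vec![("children", Json::Arr(vec![run]))])),
        ("type", Json::Str("paragraph".to_string())),
    ]);
    let cell = obj(vec![
        ("data", obj(vec![
            ("children", Json::Arr(vec![paragraph])),
            ("borders", Json::Null),
        ])),
        ("type", Json::Str("tableCell".to_string())),
    ]);
    let row = obj(vec![("data", obj(vec![("cells", Json::Arr(vec![cell]))]))]);
    obj(vec![("data", obj(vec![("rows", Json::Arr(vec![row]))]))])
}
```

With serde_json the same walk should map onto matching Value::Object, Value::Array and Value::String. Note this only covers what sits under the node you pass in; whether headers, footers, footnotes or comments appear there is something I haven't checked.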