Refactor the document types and add hierarchy

IMPORTANT: Still to be refined

Currently, we have two main document-types,

These types originate from legacy work related to CCS. They correspond to the types used in the CCS of Deep Search, which is primarily focused on converting PDF documents.

As we intend to tackle different types of documents, we need more generic and capable document definitions. In general, we want to be able to reflect the following,

Hierarchy of the document: understand which item is a child of another document item (e.g. list-items of a list, paragraphs of a section, etc.)
Ability to separate the main-content from meta-data
Treat paginated/layout aware documents as a derived type from a generic document.

As such, we propose to have two different types of documents,

SimpleDocument: This document type will lay out the basic features of the most minimal document and implement hierarchy as well as all the export functions.
LayoutDocument: Inherits from SimpleDocument and adds all the layout information on top of it. The LayoutDocument is intended to be used for PDF, PowerPoint and image (PNG/JPEG/TIFF) formats.

Definitions

The SimpleDocument has the following top-level fields,

description: Describes the document and contains information either not found in the document or extracted and curated.
file-info: Provides all the file related information
s3_data: Provides all the information to retrieve binary data from the object store
body: contains the body of the document (texts/tables/figures)
meta: contains everything that is not part of the body (page-headers/footers, water-marks, etc.)
texts: Texts contain everything that can be fundamentally represented as a string. This includes headers, paragraphs, equations, code, etc.
tables: Tables contain everything that has a 2D grid structure.
figures: Figures contain everything that has a binary representation.
key_values: Represents everything with

Every text element has,

orig: original text
text: normalized text (e.g. hidden spaces removed, minus signs normalized, etc.)
dloc: document location, defined as <document-hash>#<json-path>
hash: hash of dloc into uint64
type: label such as header, text, code, equation, caption, ...
parent: uint64 hash of the parent. Every text element can only have 1 parent
children: array of uint64 hash
prov: array of page-elements (None if not Layout)
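
The text-element fields above can be sketched in Python as follows. This is only an illustration: the proposal says that `hash` is a "hash of dloc into uint64" but does not name a hash function, so the truncated SHA-256 used here, as well as the helper names `make_dloc` and `dloc_to_uint64`, are assumptions.

```python
import hashlib
from dataclasses import dataclass, field
from typing import List, Optional


def make_dloc(document_hash: str, json_path: str) -> str:
    # dloc is defined as <document-hash>#<json-path>
    return f"{document_hash}#{json_path}"


def dloc_to_uint64(dloc: str) -> int:
    # Assumption: SHA-256 truncated to its first 8 bytes (fits in a uint64).
    digest = hashlib.sha256(dloc.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], byteorder="big")


@dataclass
class TextElement:
    orig: str                       # original text
    text: str                       # normalized text
    dloc: str                       # <document-hash>#<json-path>
    type: str                       # label: header, text, code, ...
    parent: Optional[int] = None    # uint64 hash of the single parent
    children: List[int] = field(default_factory=list)
    prov: Optional[list] = None     # None if the document has no layout

    @property
    def hash(self) -> int:
        # Derived from dloc, so it never goes out of sync with it.
        return dloc_to_uint64(self.dloc)


# Link a paragraph to its section header through the hash fields.
section = TextElement("1 Intro", "1 Intro", make_dloc("xyz", "/texts/0"), "header")
para = TextElement("Some text", "Some text", make_dloc("xyz", "/texts/1"), "text",
                   parent=section.hash)
section.children.append(para.hash)
```

Deriving `hash` from `dloc` instead of storing it separately is one possible design choice; a stored field would match the listing above more literally.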

Every table element has,

data: 2D array of text-elements
dloc: document location, defined as <document-hash>#<json-path>
hash: hash of dloc into uint64
type: label of "table"
parent: uint64 hash of the parent. Every table element can only have 1 parent
children: array of uint64 hash
prov: array of page-elements (None if not Layout)
caption: array of uint64 hash of text-elements
footnotes: array of uint64 hash of text-elements
mentions: array of uint64 hash of text-elements

The LayoutDocument has the following top-level fields,

page_elements: contains the elements of each page/image with page-number and bounding box. It also contains a reference to the body/meta item, with a char/coordinate span (text/table) if the page-element splits a text/table across columns or pages.
page_dimensions: contains the width and height of each page
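
The relation between the two proposed types can be sketched with Python dataclasses. This is a minimal sketch under stated assumptions, not the intended implementation: only a subset of the top-level fields is shown, elements are plain dictionaries, `file_info` is spelled with an underscore, and the method name `export_to_text` is hypothetical.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List


@dataclass
class SimpleDocument:
    """Most minimal document: content arrays plus export functions."""
    description: Dict[str, Any] = field(default_factory=dict)
    file_info: Dict[str, Any] = field(default_factory=dict)
    body: List[Any] = field(default_factory=list)
    meta: List[Any] = field(default_factory=list)
    texts: List[Dict[str, Any]] = field(default_factory=list)
    tables: List[Dict[str, Any]] = field(default_factory=list)
    figures: List[Dict[str, Any]] = field(default_factory=list)

    def export_to_text(self) -> str:
        # Hypothetical export: one line per text element.
        return "\n".join(t["text"] for t in self.texts)


@dataclass
class LayoutDocument(SimpleDocument):
    """Derived type that only adds the layout-specific fields."""
    page_elements: List[Dict[str, Any]] = field(default_factory=list)
    page_dimensions: List[Dict[str, Any]] = field(default_factory=list)


doc = LayoutDocument(
    texts=[{"text": "Title"}, {"text": "First paragraph."}],
    page_dimensions=[{"page": 1, "width": 768.23, "height": 583.15}],
)
exported = doc.export_to_text()  # export is inherited from SimpleDocument
```

Because the export functions live on the base type, a layout-aware document can be passed anywhere a generic document is expected.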

Work Items

[ ] Remove mentions of CCS
[ ] Add special fields for hierarchy
[ ] Add functionality to easily create a new document and export
[ ] Refactor DocumentTokens enum class
[ ] Remove name and type from text elements

This is a draft proposal (written in YAML) to address several of the above inputs.

Some additional considerations embedded:

Whatever is not the body goes into furniture (see Wikipedia)
pages is an extra dictionary with information to reference into the others (only on layout documents)
- All top-level keys should repeat in pages
resources like images can be attached through referencing external file URIs
No explicit page_elements, that information is encoded inside the text elements prov
No field names with dashes!
Bounding box information must always be expressed with l, t, r, b and coord_origin fields inside, not as a tuple.
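
To illustrate why an explicit `coord_origin` field is useful, here is a small Python sketch of such a bounding box. It is an assumption-laden example: the class name `BoundingBox`, the method `to_top_left`, and the origin value `"TOPLEFT"` are hypothetical (the draft below only shows `"BOTTOMLEFT"`).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BoundingBox:
    l: float  # left
    t: float  # top
    r: float  # right
    b: float  # bottom
    coord_origin: str = "TOPLEFT"  # assumed default; draft shows "BOTTOMLEFT"

    def to_top_left(self, page_height: float) -> "BoundingBox":
        # Flip the vertical axis only when the origin is at the bottom left.
        if self.coord_origin == "TOPLEFT":
            return self
        return BoundingBox(
            l=self.l,
            t=page_height - self.t,
            r=self.r,
            b=page_height - self.b,
            coord_origin="TOPLEFT",
        )


# Values taken from the caption bbox and the page height in the draft.
box = BoundingBox(l=323.0, t=354.3, r=376.0, b=334.4, coord_origin="BOTTOMLEFT")
top_left_box = box.to_top_left(page_height=583.15)
```

With named fields, a consumer never has to guess the tuple order or the axis direction, which is the ambiguity the rule above is meant to remove.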

--- 
## Empty document
description: {} # DescriptionType - TBD
file_info: # FileInfoType - TBD
  document_hash: "xyz"
furniture: [] # instead of "meta". Typesetter's term for Headers, footers, framing, navigation elements, all other non-body text
body: [] # All elements in other arrays, by-reference only
texts: [] # All elements that have a text-string representation, with actual data
tables: [] # All tables...
figures: [] # All figures...
key_value_items: [] # All KV-items

---
## Document with content + layout info
description: { } # DescriptionType - TBD
file_info: # FileInfoType - TBD
  document_hash: e6fc0db2ee6e7165e93c8286ec52e0d19dfa239c2bddcfe96e64dae3de6190b5
furniture: # Headers, footers, framing, navigation elements, all other non-body text
  - $ref: "/texts/0"

body: # All elements in other arrays, by-reference only
  - $ref: "/texts/1"
  - $ref: "/figures/0"
  - $ref: "/texts/2"
  - $ref: "/texts/3"
  - $ref: "/tables/0"

texts: # All elements that have a text-string representation, with actual data
  - orig: "arXiv:2206.01062v1 [cs.CV] 2 Jun 2022"
    text: "arXiv:2206.01062v1 [cs.CV] 2 Jun 2022"
    dloc: "e6fc0db2ee6e7165e93c8286ec52e0d19dfa239c2bddcfe96e64dae3de6190b5#/texts/0"
    hash: 132103230
    label: "page_header"
    parent: null
    children: [ ]
    prov:
      - page_no: 1
        bbox:
          l: 21.3
          t: 52.3
          b: 476.2
          r: 35.2
        charspan: [ 1,423 ] # 2-tuple, references to "orig"
  - orig: "DocLayNet: A Large Human-Annotated Dataset for\nDocument-Layout Analysis"
    text: "DocLayNet: A Large Human-Annotated Dataset for Document-Layout Analysis"
    dloc: "e6fc0db2ee6e7165e93c8286ec52e0d19dfa239c2bddcfe96e64dae3de6190b5#/texts/1"
    hash: 2349732 # uint64 hash of dloc
    label: "title"
    parent: null
    children: [ ]
    prov: # must exist, can be empty
      - page_no: 1
        bbox:
          l: 65.0
          t: 30.1
          b: 53.4
          r: 623.2
        charspan: [ 1,423 ] # 2-tuple, references to "orig"
  - orig: "OPERATION (cont.)" # nested inside the figure
    text: "OPERATION (cont.)"
    dloc: "e6fc0db2ee6e7165e93c8286ec52e0d19dfa239c2bddcfe96e64dae3de6190b5#/texts/2"
    hash: 6978483
    label: "section_header"
    parent:
      $ref: "/figures/0"
    children: [ ]
    prov:
      - page_no: 1
        bbox:
          l: 323.0
          t: 354.3
          b: 334.4
          r: 376.0
        charspan: [ 0,734 ]
  - orig: "Figure 1: Four examples of complex page layouts across dif-\nferent document categories" # nested inside the figure
    text: "Figure 1: Four examples of complex page layouts across different document categories"
    dloc: "e6fc0db2ee6e7165e93c8286ec52e0d19dfa239c2bddcfe96e64dae3de6190b5#/texts/3"
    hash: 6978483
    label: "caption"
    parent:
      $ref: "/figures/0"
    children: [ ]
    prov:
      - page_no: 1
        bbox:
          l: 323.0
          t: 354.3
          b: 334.4
          r: 376.0
          coord_origin: "BOTTOMLEFT"
        charspan: [ 1,423 ] # 2-tuple, references to "orig"

tables: # All tables...
  - dloc: "e6fc0db2ee6e7165e93c8286ec52e0d19dfa239c2bddcfe96e64dae3de6190b5#/tables/0"
    hash: 98574
    label: "table"
    parent: null
    children: [ ]
    caption:
      $ref: "/texts/3"
    references:
      - $ref: "/texts/??"
    footnotes:
      - $ref: "/texts/??"
    image:
      format: png
      dpi: 72
      size:
        width: 231
        height: 351
      uri: "file:///e6fc0db2ee6e7165e93c8286ec52e0d19dfa239c2bddcfe96e64dae3de6190b5/tables/0.png"
      #alternatives: base64 encoded string
    data: # TableData Type
      grid: [ [ ] ] # list-of-list of TableCell type
      otsl: "<fcel><ecel>..." # OTSL token string
      html: "" # ??
    prov:
      - page_no: 1
        bbox:
          l: 323.0
          t: 354.3
          b: 334.4
          r: 376.0
          coord_origin: "BOTTOMLEFT"
        charspan: [ 1,423 ] # 2-tuple, references to "orig"

figures: # All figures...
  - dloc: "e6fc0db2ee6e7165e93c8286ec52e0d19dfa239c2bddcfe96e64dae3de6190b5#/figures/0"
    hash: 7782482
    label: "figure"
    parent: null
    caption:
      $ref: "/texts/3"
    references:
      - $ref: "/texts/??"
    footnotes:
      - $ref: "/texts/??"

    data: # FigureData Type
      classification: "illustration"
      confidence: 0.78
      description: "...."
      # content structure?
    image:
      format: png
      dpi: 72
      size:
        width: 231
        height: 351
      uri: "file:///e6fc0db2ee6e7165e93c8286ec52e0d19dfa239c2bddcfe96e64dae3de6190b5/figures/0.png"
      #alternatives: base64 encoded string
    children:
      - $ref: "/texts/2"
    prov:
      - page_no: 1
        bbox:
          l: 456.3
          t: 145.8
          b: 623.4
          r: 702.5
        charspan: [ 0,288 ]

key_value_items: [ ] # All KV-items

# We should consider this for pages
pages: # Optional, for layout documents
  1:
    hash: "5b0916ed3ead46e69efcddb2c932afd91d0e25ce6828c39e5617e6ee2bd0cf6e"
    size:
      width: 768.23
      height: 583.15
    image:
      format: png
      dpi: 144
      size:
        width: 1536
        height: 1166
      uri: "file:///e6fc0db2ee6e7165e93c8286ec52e0d19dfa239c2bddcfe96e64dae3de6190b5/pages/1.png"
      #alternatives: base64 encoded string
    num_elements: 23
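
Since body and furniture hold their items by reference only, a consumer has to resolve each `$ref` against the top-level arrays. A minimal Python sketch of that lookup is shown below. It assumes the references are simple slash-separated paths as in the draft above; the function name `resolve_ref` is hypothetical, and the document is reduced to a plain dictionary with only the fields needed for the example.

```python
from typing import Any, Dict


def resolve_ref(doc: Dict[str, Any], ref: str) -> Any:
    # Walk a path such as "/texts/1": dictionary keys for mappings,
    # integer indices for arrays.
    node: Any = doc
    for part in ref.lstrip("/").split("/"):
        node = node[int(part)] if isinstance(node, list) else node[part]
    return node


# Reduced version of the draft document above.
doc = {
    "furniture": [{"$ref": "/texts/0"}],
    "body": [{"$ref": "/texts/1"}],
    "texts": [
        {"text": "arXiv:2206.01062v1 [cs.CV] 2 Jun 2022", "label": "page_header"},
        {"text": "DocLayNet: A Large Human-Annotated Dataset for "
                 "Document-Layout Analysis", "label": "title"},
    ],
}

# Reading order of the main content, with references replaced by the items.
body_items = [resolve_ref(doc, item["$ref"]) for item in doc["body"]]
furniture_items = [resolve_ref(doc, item["$ref"]) for item in doc["furniture"]]
```

Keeping the actual data in `texts`, `tables` and `figures` and only references in `body` and `furniture` means every element is stored once, while the reading order and the body/furniture split stay independent of the element type.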