DS4SD / docling-core

A python library to define and validate data types in Docling.
MIT License

Refactor the document types and add hierarchy #19

Closed PeterStaar-IBM closed 2 weeks ago

PeterStaar-IBM commented 1 month ago

Refactor the document types and add hierarchy

IMPORTANT: Still to be refined

Currently, we have three main document types,

  1. MinimalDocument
  2. CCSDocument
  3. ExportedCCSDocument

These types originate from legacy work related to CCS. They refer to the types used in the CCS of Deep Search, which is primarily focused on converting PDF documents.

As we intend to tackle different types of documents, we need to have more generic and capable document definitions. In general, we want to support the ability to reflect,

  1. Hierarchy of the document: understand which item is a child of another document item (eg list-items of a list, paragraphs of a section, etc)
  2. Ability to separate the main-content from meta-data
  3. Treat paginated/layout aware documents as a derived type from a generic document.

As such, we propose to have two different types of documents,

  1. SimpleDocument: This document type will lay out the basic features of the most minimal document and implement hierarchy as well as all the export functions.
  2. LayoutDocument: Inherits from SimpleDocument and adds all the layout information to it. The LayoutDocument is intended to be used for PDF, PowerPoint and image (PNG/JPEG/TIFF) formats.
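
The relationship between the two proposed types could be sketched roughly as follows. This is only an illustration using plain dataclasses: the field subset, the `export_to_text` method name and the shape of `page_dimensions` are assumptions made for the example, not the actual docling-core definitions.

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass
class SimpleDocument:
    """Hypothetical minimal document: content and hierarchy, no layout."""

    body: list[dict[str, Any]] = field(default_factory=list)
    texts: list[dict[str, Any]] = field(default_factory=list)

    def export_to_text(self) -> str:
        # Export lives on the base type, so it never needs layout data.
        return "\n".join(t["text"] for t in self.texts)


@dataclass
class LayoutDocument(SimpleDocument):
    """Hypothetical derived type: same content, plus page-level layout."""

    # Assumed shape: page number -> (width, height)
    page_dimensions: dict[int, tuple[float, float]] = field(default_factory=dict)


doc = LayoutDocument(
    texts=[{"text": "Title"}, {"text": "First paragraph."}],
    page_dimensions={1: (768.23, 583.15)},
)
print(doc.export_to_text())
```

Because LayoutDocument only adds fields, any code written against SimpleDocument (including the export functions) keeps working on layout-aware documents.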

Definitions

The SimpleDocument has the following top-level fields,

  1. description: Describes the document and contains information either not found in the document or extracted and curated.
  2. file-info: Provides all the file related information
  3. s3_data: Provides all the information to retrieve binary data from object store
  4. body: containing the body of the document (texts/tables/figures)
  5. meta: containing other stuff (page-headers/footers, water-marks, etc)
  6. texts: Texts contain everything that can be fundamentally represented as a string. This includes headers, paragraphs, equations, code, etc.
  7. tables: Tables contain everything that has a 2D grid structure.
  8. figures: Figures contain everything that has a binary representation.
  9. key_values: Represents everything with

Every text element has,

  1. orig: original text
  2. text: renormalized text (eg removed hidden spaces, normalised minus signs, etc)
  3. dloc: document location, defined as <document-hash>#<json-path>
  4. hash: hash of dloc into uint64
  5. type: label such as header, text, code, equation, caption, ...
  6. parent: uint64 hash of the parent. Every text element can only have 1 parent
  7. children: array of uint64 hash
  8. prov: array of page-elements (None if not Layout)
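
As an illustration of the `dloc` and `hash` fields, here is a minimal sketch. The issue only says that the hash maps `dloc` into a uint64; the choice of SHA-256 truncated to 8 bytes, and the helper names `make_dloc` and `dloc_to_uint64`, are assumptions made for this example.

```python
import hashlib


def make_dloc(document_hash: str, json_path: str) -> str:
    """Build a document location of the form <document-hash>#<json-path>."""
    return f"{document_hash}#{json_path}"


def dloc_to_uint64(dloc: str) -> int:
    """Map a dloc string to an unsigned 64-bit integer.

    Assumption: take the first 8 bytes of the SHA-256 digest. The
    proposal does not specify which hash function is used.
    """
    digest = hashlib.sha256(dloc.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], byteorder="big")


document_hash = "e6fc0db2ee6e7165e93c8286ec52e0d19dfa239c2bddcfe96e64dae3de6190b5"
dloc = make_dloc(document_hash, "/texts/1")
element_hash = dloc_to_uint64(dloc)
print(dloc, element_hash)
```

Since the hash is derived from `dloc` alone, it stays stable when the text is renormalized, but changes if the element moves to a different JSON path.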

Every table element has,

  1. data: 2D array of text-elements
  2. dloc: document location, defined as <document-hash>#<json-path>
  3. hash: hash of dloc into uint64
  4. type: label of "table"
  5. parent: uint64 hash of the parent. Every table element can only have 1 parent
  6. children: array of uint64 hash
  7. prov: array of page-elements (None if not Layout)
  8. caption: array of uint64 hash of text-elements
  9. footnotes: array of uint64 hash of text-elements
  10. mentions: array of uint64 hash of text-elements
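
Since parent, children, caption, footnotes and mentions are all stored as hashes, consumers need an index from hash to element. A minimal sketch, assuming plain dicts as elements; the hash values are made up, and treating the caption and footnote as children of the table is an assumption of this example, not something the proposal fixes.

```python
# Illustrative elements; the hash values are invented for this example.
texts = [
    {"hash": 11, "type": "caption", "text": "Table 1: Results",
     "parent": 42, "children": []},
    {"hash": 12, "type": "footnote", "text": "Values in percent.",
     "parent": 42, "children": []},
]
table = {
    "hash": 42, "type": "table", "parent": None, "children": [11, 12],
    "caption": [11], "footnotes": [12], "mentions": [],
}

# One lookup table for all elements, keyed by hash.
index = {el["hash"]: el for el in texts + [table]}


def resolve(hashes, index):
    """Turn a list of hashes into the elements they point to."""
    return [index[h] for h in hashes]


def check_hierarchy(index):
    """Return hashes of elements whose parent does not list them as a child."""
    broken = []
    for h, el in index.items():
        parent = el["parent"]
        if parent is not None and h not in index[parent]["children"]:
            broken.append(h)
    return broken


caption_texts = [el["text"] for el in resolve(table["caption"], index)]
print(caption_texts, check_hierarchy(index))
```

A check like `check_hierarchy` is one way to enforce the "only 1 parent" rule at validation time, because parent and children are stored redundantly and can drift apart.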

The LayoutDocument has the following top-level fields,

  1. page_elements: contains the elements of each page/image with page-number and bounding box. It also contains a reference to the body/meta item with a char/coordinate span (text/table) if the page-element splits text/table across columns or pages.
  2. page_dimensions: contains the width and height of each page
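
To illustrate how page-elements with character spans could describe a text that is split across pages, here is a small sketch. The key names (`page_no`, `bbox`, `ref`, `charspan`), the bounding-box values and the half-open `[start, end)` span convention are assumptions for this example only.

```python
from collections import defaultdict

text = "A paragraph that continues on the next page."
split = text.index("on the")  # position where the page break falls

# Two page-elements referencing the same text item (values are invented).
page_elements = [
    {"page_no": 1, "bbox": (65.0, 700.0, 300.0, 712.0),
     "ref": "/texts/0", "charspan": (0, split)},
    {"page_no": 2, "bbox": (65.0, 40.0, 180.0, 52.0),
     "ref": "/texts/0", "charspan": (split, len(text))},
]
# Assumed shape: page number -> (width, height)
page_dimensions = {1: (612.0, 792.0), 2: (612.0, 792.0)}

# Group the text fragments by the page they appear on.
fragments_by_page = defaultdict(list)
for el in page_elements:
    start, end = el["charspan"]
    fragments_by_page[el["page_no"]].append(text[start:end])

# Concatenating the spans in order recovers the full text item.
reassembled = "".join(
    text[start:end] for start, end in (el["charspan"] for el in page_elements)
)
print(dict(fragments_by_page))
```

Keeping the full string in the text item and only spans in the page-elements means the content is stored once, while the layout view can still show exactly which part sits on which page.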

Work Items

cau-git commented 1 month ago

This is a draft proposal (written in YAML) to address several of the above inputs.

Some additional considerations embedded:

--- 
## Empty document
description: {} # DescriptionType - TBD
file_info: # FileInfoType - TBD
  document_hash: "xyz"
furniture: [] # instead of "meta". Typesetter's term for Headers, footers, framing, navigation elements, all other non-body text
body: [] # All elements in other arrays, by-reference only
texts: [] # All elements that have a text-string representation, with actual data
tables: [] # All tables...
figures: [] # All figures...
key_value_items: [] # All KV-items

---
## Document with content + layout info
description: { } # DescriptionType - TBD
file_info: # FileInfoType - TBD
  document_hash: e6fc0db2ee6e7165e93c8286ec52e0d19dfa239c2bddcfe96e64dae3de6190b5
furniture: # Headers, footers, framing, navigation elements, all other non-body text
  - $ref: "/texts/0"

body: # All elements in other arrays, by-reference only
  - $ref: "/texts/1"
  - $ref: "/figures/0"
  - $ref: "/texts/2"
  - $ref: "/texts/3"
  - $ref: "/tables/0"

texts: # All elements that have a text-string representation, with actual data
  - orig: "arXiv:2206.01062v1 [cs.CV] 2 Jun 2022"
    text: "arXiv:2206.01062v1 [cs.CV] 2 Jun 2022"
    dloc: "e6fc0db2ee6e7165e93c8286ec52e0d19dfa239c2bddcfe96e64dae3de6190b5#/texts/0"
    hash: 132103230
    label: "page_header"
    parent: null
    children: [ ]
    prov:
      - page_no: 1
        bbox:
          l: 21.3
          t: 52.3
          b: 476.2
          r: 35.2
        charspan: [ 1,423 ] # 2-tuple, references to "orig"
  - orig: "DocLayNet: A Large Human-Annotated Dataset for\nDocument-Layout Analysis"
    text: "DocLayNet: A Large Human-Annotated Dataset for Document-Layout Analysis"
    dloc: "e6fc0db2ee6e7165e93c8286ec52e0d19dfa239c2bddcfe96e64dae3de6190b5#/texts/1"
    hash: 2349732 # uint64 hash of dloc
    label: "title"
    parent: null
    children: [ ]
    prov: # must exist, can be empty
      - page_no: 1
        bbox:
          l: 65.0
          t: 30.1
          b: 53.4
          r: 623.2
        charspan: [ 1,423 ] # 2-tuple, references to "orig"
  - orig: "OPERATION (cont.)" # nested inside the figure
    text: "OPERATION (cont.)"
    dloc: "e6fc0db2ee6e7165e93c8286ec52e0d19dfa239c2bddcfe96e64dae3de6190b5#/texts/2"
    hash: 6978483
    label: "section_header"
    parent:
      $ref: "/figures/0"
    children: [ ]
    prov:
      - page_no: 1
        bbox:
          l: 323.0
          t: 354.3
          b: 334.4
          r: 376.0
        charspan: [ 0,734 ]
  - orig: "Figure 1: Four examples of complex page layouts across dif-\nferent document categories" # nested inside the figure
    text: "Figure 1: Four examples of complex page layouts across different document categories"
    dloc: "e6fc0db2ee6e7165e93c8286ec52e0d19dfa239c2bddcfe96e64dae3de6190b5#/texts/3"
    hash: 6978483
    label: "caption"
    parent:
      $ref: "/figures/0"
    children: [ ]
    prov:
      - page_no: 1
        bbox:
          l: 323.0
          t: 354.3
          b: 334.4
          r: 376.0
          coord_origin: "BOTTOMLEFT"
        charspan: [ 1,423 ] # 2-tuple, references to "orig"

tables: # All tables...
  - dloc: "e6fc0db2ee6e7165e93c8286ec52e0d19dfa239c2bddcfe96e64dae3de6190b5#/tables/0"
    hash: 98574
    label: "table"
    parent: null
    children: [ ]
    caption:
      $ref: "/texts/3"
    references:
      - $ref: "/text/??"
    footnotes:
      - $ref: "/text/??"
    image:
      format: png
      dpi: 72
      size:
        width: 231
        height: 351
      uri: "file:///e6fc0db2ee6e7165e93c8286ec52e0d19dfa239c2bddcfe96e64dae3de6190b5/tables/0.png"
      #alternatives: base64 encoded string
    data: # TableData Type
      grid: [ [ ] ] # list-of-list of TableCell type
      otsl: "<fcel><ecel>..." # OTSL token string
      html: "" # ??
    prov:
      - page_no: 1
        bbox:
          l: 323.0
          t: 354.3
          b: 334.4
          r: 376.0
          coord_origin: "BOTTOMLEFT"
        charspan: [ 1,423 ] # 2-tuple, references to "orig"

figures: # All figures...
  - dloc: "e6fc0db2ee6e7165e93c8286ec52e0d19dfa239c2bddcfe96e64dae3de6190b5#/figures/0"
    hash: 7782482
    label: "figure"
    parent: null
    caption:
      $ref: "/texts/3"
    references:
      - $ref: "/text/??"
    footnotes:
      - $ref: "/text/??"

    data: # FigureData Type
      classification: "illustration"
      confidence: 0.78
      description: "...."
      # content structure?
    image:
      format: png
      dpi: 72
      size:
        width: 231
        height: 351
      uri: "file:///e6fc0db2ee6e7165e93c8286ec52e0d19dfa239c2bddcfe96e64dae3de6190b5/figures/0.png"
      #alternatives: base64 encoded string
    children:
      - $ref: "/texts/2"
    prov:
      - page_no: 1
        bbox:
          l: 456.3
          t: 145.8
          b: 623.4
          r: 702.5
        charspan: [ 0,288 ]

key_value_items: [ ] # All KV-items

# We should consider this for pages
pages: # Optional, for layout documents
  1:
    hash: "5b0916ed3ead46e69efcddb2c932afd91d0e25ce6828c39e5617e6ee2bd0cf6e"
    size:
      width: 768.23
      height: 583.15
    image:
      format: png
      dpi: 144
      size:
        width: 1536
        height: 1166
      uri: "file:///e6fc0db2ee6e7165e93c8286ec52e0d19dfa239c2bddcfe96e64dae3de6190b5/pages/1.png"
      #alternatives: base64 encoded string
    num_elements: 23
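
For illustration, the by-reference `body` in the draft above could be resolved against the element arrays roughly as follows. This is a minimal sketch over plain dicts and lists; `resolve_ref` is a hypothetical helper, not part of docling-core, and the sample document is a trimmed-down version of the YAML.

```python
from typing import Any


def resolve_ref(doc: dict[str, Any], ref: str) -> Any:
    """Follow a "/texts/1"-style path through nested dicts and lists."""
    node: Any = doc
    for part in ref.strip("/").split("/"):
        # List segments are integer indices, dict segments are keys.
        node = node[int(part)] if isinstance(node, list) else node[part]
    return node


doc = {
    "body": [{"$ref": "/texts/1"}, {"$ref": "/tables/0"}],
    "texts": [
        {"label": "page_header",
         "text": "arXiv:2206.01062v1 [cs.CV] 2 Jun 2022"},
        {"label": "title",
         "text": "DocLayNet: A Large Human-Annotated Dataset for "
                 "Document-Layout Analysis"},
    ],
    "tables": [{"label": "table"}],
}

# Walk the body in reading order and look up each referenced element.
body_items = [resolve_ref(doc, item["$ref"]) for item in doc["body"]]
labels = [item["label"] for item in body_items]
print(labels)
```

With this scheme a reference only resolves if its first segment matches a top-level array name exactly, so a singular path such as `/figure/0` would raise a `KeyError` instead of silently pointing nowhere.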

cau-git commented 1 month ago

Draft PR with implementation of above proposal here: https://github.com/DS4SD/docling-core/pull/21/files

PeterStaar-IBM commented 2 weeks ago

done in v2 with DoclingDocument