Closed PeterStaar-IBM closed 2 weeks ago
This is a draft proposal (written in YAML) to address several of the above inputs.
Some additional considerations embedded:
- Everything that is not part of `body` goes into `furniture` (see Wikipedia).
- `pages` is an extra dictionary with information to reference into the others (only on layout documents).
- There is no `page_elements` list under `pages`; that information is encoded inside the text elements' `prov`.
- `bbox` has `l, t, r, b` and `coord_origin` fields inside, not as a tuple.

---
## Empty document
description: {} # DescriptionType - TBD
file_info: # FileInfoType - TBD
  document_hash: "xyz"
furniture: [] # instead of "meta". Typesetter's term for Headers, footers, framing, navigation elements, all other non-body text
body: [] # All elements in other arrays, by-reference only
texts: [] # All elements that have a text-string representation, with actual data
tables: [] # All tables...
figures: [] # All figures...
key_value_items: [] # All KV-items
---
## Document with content + layout info
description: { } # DescriptionType - TBD
file_info: # FileInfoType - TBD
  document_hash: e6fc0db2ee6e7165e93c8286ec52e0d19dfa239c2bddcfe96e64dae3de6190b5
furniture: # Headers, footers, framing, navigation elements, all other non-body text
  - $ref: "/texts/0"
body: # All elements in other arrays, by-reference only
  - $ref: "/texts/1"
  - $ref: "/figures/0"
  - $ref: "/texts/2"
  - $ref: "/texts/3"
  - $ref: "/tables/0"
texts: # All elements that have a text-string representation, with actual data
  - orig: "arXiv:2206.01062v1 [cs.CV] 2 Jun 2022"
    text: "arXiv:2206.01062v1 [cs.CV] 2 Jun 2022"
    dloc: "e6fc0db2ee6e7165e93c8286ec52e0d19dfa239c2bddcfe96e64dae3de6190b5#/texts/0"
    hash: 132103230
    label: "page_header"
    parent: null
    children: [ ]
    prov:
      - page_no: 1
        bbox:
          l: 21.3
          t: 52.3
          b: 476.2
          r: 35.2
        charspan: [ 1,423 ] # 2-tuple, references to "orig"
  - orig: "DocLayNet: A Large Human-Annotated Dataset for\nDocument-Layout Analysis"
    text: "DocLayNet: A Large Human-Annotated Dataset for Document-Layout Analysis"
    dloc: "e6fc0db2ee6e7165e93c8286ec52e0d19dfa239c2bddcfe96e64dae3de6190b5#/texts/1"
    hash: 2349732 # uint64 hash of dloc
    label: "title"
    parent: null
    children: [ ]
    prov: # must exist, can be empty
      - page_no: 1
        bbox:
          l: 65.0
          t: 30.1
          b: 53.4
          r: 623.2
        charspan: [ 1,423 ] # 2-tuple, references to "orig"
  - orig: "OPERATION (cont.)" # nested inside the figure
    text: "OPERATION (cont.)"
    dloc: "e6fc0db2ee6e7165e93c8286ec52e0d19dfa239c2bddcfe96e64dae3de6190b5#/texts/2"
    hash: 6978483
    label: "section_header"
    parent:
      $ref: "/figures/0"
    children: [ ]
    prov:
      - page_no: 1
        bbox:
          l: 323.0
          t: 354.3
          b: 334.4
          r: 376.0
        charspan: [ 0,734 ]
  - orig: "Figure 1: Four examples of complex page layouts across dif-\nferent document categories" # nested inside the figure
    text: "Figure 1: Four examples of complex page layouts across different document categories"
    dloc: "e6fc0db2ee6e7165e93c8286ec52e0d19dfa239c2bddcfe96e64dae3de6190b5#/texts/3"
    hash: 6978483
    label: "caption"
    parent:
      $ref: "/figures/0"
    children: [ ]
    prov:
      - page_no: 1
        bbox:
          l: 323.0
          t: 354.3
          b: 334.4
          r: 376.0
          coord_origin: "BOTTOMLEFT"
        charspan: [ 1,423 ] # 2-tuple, references to "orig"
tables: # All tables...
  - dloc: "e6fc0db2ee6e7165e93c8286ec52e0d19dfa239c2bddcfe96e64dae3de6190b5#/tables/0"
    hash: 98574
    label: "table"
    parent: null
    children: [ ]
    caption:
      $ref: "/texts/3"
    references:
      - $ref: "/texts/??"
    footnotes:
      - $ref: "/texts/??"
    image:
      format: png
      dpi: 72
      size:
        width: 231
        height: 351
      uri: "file:///e6fc0db2ee6e7165e93c8286ec52e0d19dfa239c2bddcfe96e64dae3de6190b5/tables/0.png"
      #alternatives: base64 encoded string
    data: # TableData Type
      grid: [ [ ] ] # list-of-list of TableCell type
      otsl: "<fcel><ecel>..." # OTSL token string
      html: "" # ??
    prov:
      - page_no: 1
        bbox:
          l: 323.0
          t: 354.3
          b: 334.4
          r: 376.0
          coord_origin: "BOTTOMLEFT"
        charspan: [ 1,423 ] # 2-tuple, references to "orig"
figures: # All figures...
  - dloc: "e6fc0db2ee6e7165e93c8286ec52e0d19dfa239c2bddcfe96e64dae3de6190b5#/figures/0"
    hash: 7782482
    label: "figure"
    parent: null
    caption:
      $ref: "/texts/2"
    references:
      - $ref: "/texts/??"
    footnotes:
      - $ref: "/texts/??"
    data: # FigureData Type
      classification: "illustration"
      confidence: 0.78
      description: "...."
      # content structure?
    image:
      format: png
      dpi: 72
      size:
        width: 231
        height: 351
      uri: "file:///e6fc0db2ee6e7165e93c8286ec52e0d19dfa239c2bddcfe96e64dae3de6190b5/figures/0.png"
      #alternatives: base64 encoded string
    children:
      - $ref: "/texts/2"
    prov:
      - page_no: 1
        bbox:
          l: 456.3
          t: 145.8
          b: 623.4
          r: 702.5
        charspan: [ 0,288 ]
key_value_items: [ ] # All KV-items
# We should consider this for pages
pages: # Optional, for layout documents
  1:
    hash: "5b0916ed3ead46e69efcddb2c932afd91d0e25ce6828c39e5617e6ee2bd0cf6e"
    size:
      width: 768.23
      height: 583.15
    image:
      format: png
      dpi: 144
      size:
        width: 1536
        height: 1166
      uri: "file:///e6fc0db2ee6e7165e93c8286ec52e0d19dfa239c2bddcfe96e64dae3de6190b5/pages/1.png"
      #alternatives: base64 encoded string
    num_elements: 23
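
Because each `bbox` in the draft may carry its own `coord_origin` and `pages` records the page size, a consumer can normalize boxes to one origin. The following is a minimal sketch, not the docling-core implementation: the function name `to_top_left`, the plain-dict representation, and the choice to treat a missing `coord_origin` as `TOPLEFT` are assumptions made for illustration.

```python
def to_top_left(bbox: dict, page_height: float) -> dict:
    """Return a copy of bbox expressed with a top-left coordinate origin.

    Assumption: a missing coord_origin means the box is already TOPLEFT.
    """
    origin = bbox.get("coord_origin", "TOPLEFT")
    if origin == "TOPLEFT":
        return {**bbox, "coord_origin": "TOPLEFT"}
    if origin != "BOTTOMLEFT":
        raise ValueError(f"unsupported coord_origin: {origin!r}")
    # Flip the vertical axis: distances measured from the bottom edge
    # become distances from the top edge; horizontal values are unchanged.
    return {
        "l": bbox["l"],
        "t": page_height - bbox["t"],
        "r": bbox["r"],
        "b": page_height - bbox["b"],
        "coord_origin": "TOPLEFT",
    }


# Values copied from the caption element and from page 1 of the draft.
caption_bbox = {"l": 323.0, "t": 354.3, "b": 334.4, "r": 376.0,
                "coord_origin": "BOTTOMLEFT"}
converted = to_top_left(caption_bbox, page_height=583.15)
```

After conversion, `t` is smaller than `b`, which is what a top-left origin implies for a box whose top edge sits above its bottom edge.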
Draft PR with implementation of above proposal here: https://github.com/DS4SD/docling-core/pull/21/files
done in v2 with DoclingDocument
Refactor the document types and add hierarchy
IMPORTANT: Still to be refined
Currently, we have two main document-types,
These types originate from legacy work related to CCS in Deep Search, which is primarily focused on converting PDF documents.
As we intend to tackle different types of documents, we need to have more generic and capable document definitions. In general, we want to support the ability to reflect,
As such, we propose to have two different types of documents, `SimpleDocument` and `LayoutDocument`.
Definitions
The SimpleDocument has the following top-level fields,
Every text element has,
<document-hash>#<json-path>
Every table element has,
<document-hash>#<json-path>
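
The `<document-hash>#<json-path>` locator used by text and table elements can be split into its two parts, and the path part can then be followed through the document. This is a hedged sketch over plain dicts and lists: the helper names `split_dloc` and `resolve_path` are hypothetical, and the example document is cut down to two text elements.

```python
from typing import Any


def split_dloc(dloc: str) -> tuple[str, str]:
    """Split '<document-hash>#<json-path>' into (document_hash, json_path)."""
    document_hash, sep, json_path = dloc.partition("#")
    if not sep or not document_hash or not json_path.startswith("/"):
        raise ValueError(f"not a valid dloc: {dloc!r}")
    return document_hash, json_path


def resolve_path(document: dict, json_path: str) -> Any:
    """Walk a path such as '/texts/1' through nested dicts and lists."""
    node: Any = document
    for part in json_path.strip("/").split("/"):
        # List positions are written as decimal indices, dict keys as names.
        node = node[int(part)] if isinstance(node, list) else node[part]
    return node


# A reduced document holding only the fields needed for the lookup.
doc = {
    "texts": [
        {"text": "arXiv:2206.01062v1 [cs.CV] 2 Jun 2022", "label": "page_header"},
        {"text": "DocLayNet: A Large Human-Annotated Dataset for "
                 "Document-Layout Analysis", "label": "title"},
    ]
}
doc_hash, path = split_dloc(
    "e6fc0db2ee6e7165e93c8286ec52e0d19dfa239c2bddcfe96e64dae3de6190b5#/texts/1"
)
title = resolve_path(doc, path)
```

Since the `$ref` values in `body` and `furniture` use the same path syntax, the same `resolve_path` helper could follow them as well.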
The LayoutDocument has the following top-level fields,
Work Items
`name` and `type` from text elements