hardchor opened 1 week ago
@hardchor yes, we've thought of doing that. Unfortunately, detecting whether headers are present and how long they are is really something that would need a model of its own to do reliably.
I'm changing this to an enhancement since the current behavior is the expected behavior. We'll track this and see how it fits into the roadmap.
@hardchor re: the `TableChunk` bit, I think you'll find that whenever a `Table` (Python) element is large enough to need splitting, it ends up as two or more `TableChunk` objects. However, when serialized to JSON, both `Table` and `TableChunk` elements get `"type": "Table"`, so the two Python element-types look the same in JSON form.
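A minimal sketch of why the two element-types collapse in JSON form — these are hypothetical stand-in classes, not unstructured's actual implementation, but they show the pattern: a subclass that inherits its parent's serialization emits the same `"type"` tag, so the distinction is lost once you round-trip through JSON.

```python
import json

class Table:
    """Stand-in for a whole-table element."""
    def __init__(self, text: str):
        self.text = text

    def to_dict(self) -> dict:
        # Both classes serialize with the same "type" tag.
        return {"type": "Table", "text": self.text}

class TableChunk(Table):
    """Stand-in for a split-off piece of a table.

    Inherits to_dict unchanged, so its JSON "type" is also "Table".
    """

elements = [Table("whole table"), TableChunk("one split piece")]
serialized = [json.dumps(e.to_dict()) for e in elements]

# In Python the types differ; in JSON they are indistinguishable.
types_in_json = [json.loads(s)["type"] for s in serialized]
print(types_in_json)
```

So to tell the two apart you would have to check the Python object's class before serialization, not the JSON output.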
**Describe the bug**
When chunking text with tables in it (using the `by_title` strategy), tables are split into chunks row by row (if `max_characters` is set sufficiently low). That's great (and aligns with best practices, where each row should ideally be in its own chunk). However, the chunk then loses all context for the data in that table row. Since that context can typically be found in the table header (i.e. usually the first row), I am currently going through all rows manually and prepending the table header (I can provide code if needed, but it's not the prettiest solution since I essentially have to parse the `text_as_html` output and then stitch it back together).

P.S.: I also couldn't get it to produce `TableChunk` elements, but maybe that's not intended behaviour in this case?

**To Reproduce**
Run ingestion of any document with a table in it and chunk it using the `by_title` strategy with a sufficiently small `max_characters` size.

**Expected behavior**
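For reference, the workaround described above can be sketched roughly as follows. This is an illustrative, stdlib-only version (the function name and the sample HTML are made up, not from the library): it takes a table's HTML, as found in the `text_as_html` metadata field, pulls out the header row, and re-emits each body row as its own small table with the header prepended for context.

```python
import re

def split_table_with_header(table_html: str) -> list[str]:
    """Split an HTML table into per-row tables that each repeat the header.

    Assumes the first <tr> is the header row; uses a simple regex rather
    than a full HTML parser, so it only handles straightforward tables.
    """
    rows = re.findall(r"<tr>.*?</tr>", table_html, flags=re.DOTALL)
    header, body = rows[0], rows[1:]
    return [f"<table>{header}{row}</table>" for row in body]

# Hypothetical example input resembling a text_as_html value.
html = (
    "<table>"
    "<tr><th>City</th><th>Population</th></tr>"
    "<tr><td>Berlin</td><td>3.7M</td></tr>"
    "<tr><td>Hamburg</td><td>1.9M</td></tr>"
    "</table>"
)
chunks = split_table_with_header(html)
for chunk in chunks:
    print(chunk)
```

Each emitted chunk is then self-describing, at the cost of repeating the header in every chunk (which also counts against `max_characters`).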