IBM / data-prep-kit

Open source project for data preparation for LLM application builders
https://ibm.github.io/data-prep-kit/
Apache License 2.0

[Bug] Chunking is missing some text from bullet section #579

Closed by sujee 1 week ago

sujee commented 1 week ago

Component

Other

What happened + What you expected to happen

The example PDF file earth.pdf is transformed using pdf2parquet. This step parses the document correctly.

[Screenshot: pdf2parquet output]

However, the chunking step drops some text from the last section.

Last segment

Basic facts about Earth:
• Distance from the Sun: Average of 149.6 million kilometers (93 million miles)
• Rotation Period: 24 hours (one day)
• Moons: One moon, called Luna or simply “the Moon”.

Chunking output

Earth
· Distance from the Sun: Average of 149.6 million kilometers (93 million miles)
· Rotation Period: 24 hours (one day)

Note that the last bullet item (Moons) is missing from the chunking output.

[Screenshot: chunking output]

Reproduction script

Code to reproduce: https://github.com/sujee/data-prep-kit-examples/blob/main/dpk-intro/dpk_intro_1_python.ipynb
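
For a quick check outside the notebook, the comparison described above can be sketched in plain Python. This is only an illustration, not part of data-prep-kit: `normalize` and `missing_items` are hypothetical helper names, and the inputs are the text pasted in this issue rather than real transform output. It assumes a bullet counts as present if its text, ignoring bullet glyphs and line wraps, appears somewhere in the chunks.

```python
import re


def normalize(text: str) -> str:
    """Drop bullet glyphs and collapse whitespace so that '•' vs '·'
    and PDF line wraps do not affect the comparison."""
    text = re.sub(r"[•·]", " ", text)
    return re.sub(r"\s+", " ", text).strip()


def missing_items(source_items: list[str], chunks: list[str]) -> list[str]:
    """Return the source items whose normalized text appears in no chunk.
    Hypothetical helper, not a data-prep-kit API."""
    joined = normalize(" ".join(chunks))
    return [item for item in source_items if normalize(item) not in joined]


# Bullet items from the last segment of earth.pdf, as pasted in this issue.
source_items = [
    "• Distance from the Sun: Average of 149.6 million kilometers (93 million\nmiles)",
    "• Rotation Period: 24 hours (one day)",
    "• Moons: One moon, called Luna or simply “the Moon”.",
]

# Chunking output for that segment, as pasted in this issue.
chunks = [
    "Earth\n"
    "· Distance from the Sun: Average of 149.6 million kilometers (93 million miles)\n"
    "· Rotation Period: 24 hours (one day)",
]

# Prints a list containing only the "Moons" bullet.
print(missing_items(source_items, chunks))
```

With real data, `source_items` would come from the pdf2parquet output and `chunks` from the chunking output; how to read those is left out here because it depends on the release in use.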

Anything else

Could this be an issue with the bulleted segment, or with the last item on the page?

OS

Ubuntu

Python

3.11.x

sujee commented 1 week ago

This is not happening in the dev3 release. Closing for now.