However, the chunking step drops some text from the last section.
Last segment
Basic facts about Earth:
• Distance from the Sun: Average of 149.6 million kilometers (93 million miles)
• Rotation Period: 24 hours (one day)
• Moons: One moon, called Luna or simply “the Moon”.
Chunking output
Earth
· Distance from the Sun: Average of 149.6 million kilometers (93 million miles)
· Rotation Period: 24 hours (one day)
Search before asking
Component
Other
What happened + What you expected to happen
The example PDF file earth.pdf is transformed using pdf2parquet. This step parses the document correctly.
pdf2parquet output
However, the chunking step drops some text from the last section.
Last segment
Chunking output
Note that the last bullet item ("Moons") is missing from the chunking output.
Reproduction script
Code to reproduce: https://github.com/sujee/data-prep-kit-examples/blob/main/dpk-intro/dpk_intro_1_python.ipynb
Anything else
Could this be an issue with bulleted segments, or with the last item on a page?
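To help narrow this down, here is a minimal sketch of a check that lists which bullet items from the parsed segment do not appear in the chunking output. It does not call any data-prep-kit API; the helper names (`normalize_item`, `missing_items`) are hypothetical, and the two inputs are pasted by hand from the segment and chunk text shown in this issue.

```python
import re


def normalize_item(item: str) -> str:
    """Drop leading bullet glyphs and collapse whitespace (hypothetical helper)."""
    item = re.sub(r"^[\s•·\-*]+", "", item)
    return re.sub(r"\s+", " ", item).strip()


def missing_items(source_items: list[str], chunk_text: str) -> list[str]:
    """Return the normalized source items that do not occur in the chunk text."""
    haystack = re.sub(r"\s+", " ", chunk_text)
    return [
        normalized
        for normalized in (normalize_item(item) for item in source_items)
        if normalized not in haystack
    ]


# Bullet items from the last segment of the pdf2parquet output (copied from this issue).
source_items = [
    "• Distance from the Sun: Average of 149.6 million kilometers (93 million miles)",
    "• Rotation Period: 24 hours (one day)",
    "• Moons: One moon, called Luna or simply “the Moon”.",
]

# Text of the corresponding chunk (copied from this issue).
chunk_text = """Earth
· Distance from the Sun: Average of 149.6 million kilometers (93 million miles)
· Rotation Period: 24 hours (one day)"""

missing = missing_items(source_items, chunk_text)
for item in missing:
    print("missing from chunk:", item)
```

Running the same check on a document whose bulleted list is not the last element on the page would help tell the two hypotheses apart.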
OS
Ubuntu
Python
3.11.x
Are you willing to submit a PR?