IBM / data-prep-kit

Open source project for data preparation for LLM application builders
https://ibm.github.io/data-prep-kit/
Apache License 2.0

update doc_chunk md results #799

Closed by dolfim-ibm 1 week ago

dolfim-ibm commented 1 week ago

Why are these changes needed?

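A new LlamaIndex release changed the results of the markdown chunker: the chunk text now keeps the markdown heading marker (`## `) in front of section titles, so both the chunk contents and the hash-like values in the last column differ from the previous expected results.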

Old results

0   2206.01062.pdf          9  ...  DocLayNet: A Large Human-Annotated Dataset for...  875e11c6859ca5c8805142c3340fd38cd8dd1e017da27f...
1   2206.01062.pdf          9  ...  ABSTRACT\n\nAccurate document layout analysis ...  ffbfc0bb3667eea222723ffb7595782f91a28e3095de27...
2   2206.01062.pdf          9  ...  CCS CONCEPTS\n\n· Information systems→Document...  ac71c10c2dbeec68a56bf3f1bbd884b991b5b84d8807d7...

New results

0   2206.01062.pdf          9  ...  ## DocLayNet: A Large Human-Annotated Dataset ...  56fc3e076ae76c3eff1085f3ce63357dddaeaf22d61425...
1   2206.01062.pdf          9  ...  ## ABSTRACT\n\nAccurate document layout analys...  c694e7087fe810432af8d2ced118f666918b777b9c5edb...
2   2206.01062.pdf          9  ...  ## CCS CONCEPTS\n\n· Information systems→Docum...  60a1d90fb66f0e5774b6f4bf0d37326514a7e8ee830533...
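
To illustrate why every value in the last column changes, here is a minimal sketch. It assumes the last column is a SHA-256 digest of the UTF-8 chunk text, which is a guess based on the length and format of the values, not something confirmed here. The helper names `strip_heading_marker` and `sha256_hex` are hypothetical and are not part of data-prep-kit or LlamaIndex, and the sample strings are shortened stand-ins for the truncated chunks shown above, so their digests will not match the listed values.

```python
import hashlib
import re

# Hypothetical helpers for illustration only.
# Matches a leading ATX heading marker such as "## " at the start of the text.
HEADING_MARKER = re.compile(r"^#{1,6}[ \t]+")


def strip_heading_marker(chunk_text: str) -> str:
    """Remove a leading markdown heading marker (e.g. '## ') from the chunk."""
    return HEADING_MARKER.sub("", chunk_text, count=1)


def sha256_hex(chunk_text: str) -> str:
    """Assumed hashing scheme: SHA-256 over the UTF-8 encoded chunk text."""
    return hashlib.sha256(chunk_text.encode("utf-8")).hexdigest()


# Shortened stand-ins for the old and new chunk contents.
old_chunk = "ABSTRACT\n\nAccurate document layout analysis"
new_chunk = "## ABSTRACT\n\nAccurate document layout analysis"

# The added "## " prefix is enough to produce a different digest.
print(sha256_hex(old_chunk) == sha256_hex(new_chunk))  # False

# Stripping the marker recovers the old text, so the heading prefix
# is the only difference between the two stand-in chunks.
print(strip_heading_marker(new_chunk) == old_chunk)  # True
```

A check like this can help confirm that a diff in expected test output comes only from the heading prefix and not from a change in how the text itself is split.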