This PR bumps unstructured-inference to 0.8.0, which introduces vectorized data structures for layout elements and text regions.
This PR also cleans up a few places in CI that had repeated definitions of env variables or were missing the installation of testing dependencies in the cache.
A few document ingest results are changed:
two places for biomed-api (actually processed locally on the runner) change because of very small differences in the numerical results of the bounding box areas: one results in a duplicated page number/header, and the other results in the deduplication of a word in a sentence that starts on a new line. (Yes, the two cases go in opposite directions.)
the layout parser paper now outputs the code lines with the page number inside the code box as list items
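
To illustrate why a tiny numerical change in box areas can flip a deduplication decision in either direction, here is a minimal sketch. It is not the logic used by unstructured or unstructured-inference: the helper names (`area`, `overlap_ratio`, `is_duplicate`), the overlap definition, and the 0.5 threshold are all assumptions made up for this example.

```python
# Hypothetical illustration only: names, overlap definition, and threshold
# are assumptions, not the actual unstructured / unstructured-inference code.

def area(box):
    """Area of an (x1, y1, x2, y2) box; zero if the box is degenerate."""
    x1, y1, x2, y2 = box
    return max(0.0, x2 - x1) * max(0.0, y2 - y1)


def overlap_ratio(a, b):
    """Intersection area divided by the area of the smaller box."""
    intersection = (max(a[0], b[0]), max(a[1], b[1]),
                    min(a[2], b[2]), min(a[3], b[3]))
    smaller = min(area(a), area(b))
    return area(intersection) / smaller if smaller > 0 else 0.0


def is_duplicate(a, b, threshold=0.5):
    """Treat two regions as duplicates when their overlap exceeds the threshold."""
    return overlap_ratio(a, b) > threshold


reference = (0.0, 0.0, 10.0, 10.0)
# Two boxes that differ by 0.002 in x land on opposite sides of the threshold.
slightly_more_overlap = (4.999, 0.0, 14.999, 10.0)
slightly_less_overlap = (5.001, 0.0, 15.001, 10.0)

print(is_duplicate(reference, slightly_more_overlap))  # True: one region is dropped
print(is_duplicate(reference, slightly_less_overlap))  # False: both regions are kept
```

When a pair of regions sits this close to a threshold, any change in how areas are computed can either remove a region that used to be kept or keep one that used to be removed, which is consistent with the two biomed-api changes going in opposite directions.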