Unstructured-IO / unstructured

Open source libraries and APIs to build custom preprocessing pipelines for labeling, training, or production machine learning pipelines.
https://www.unstructured.io/
Apache License 2.0
9.21k stars 764 forks source link

feat: round numbers to reduce undeterministic behavior #3740

Closed badGarnet closed 1 month ago

badGarnet commented 1 month ago

This PR rounds the floating point number associated with coordinates in pdfminer_processing.py. This helps to eliminate machine precision caused randomness in bounding box overlap detection. Currently the rounding is set to the nearest machine precision for np.float32 using np.finfo(float), which yields resolution = 1e-15.

future work

We should reduce the rounding to only 6 digits after floating point since the data type float32 has a resolution of only 1e-6. However it would break tests. A followup is required to tune the threshold values in pdfminer_processing.py so that it works with 1e-6 resolution.