This PR rounds the floating-point coordinates in pdfminer_processing.py. Rounding removes the randomness that machine-precision noise introduces into bounding box overlap detection. The rounding precision currently comes from np.finfo(float), which describes 64-bit floats (np.float64) rather than np.float32 and yields resolution = 1e-15.
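As a rough illustration of the idea, the sketch below looks up the decimal precision NumPy reports for each dtype and rounds values to it. The helper name `round_coords` is hypothetical and is not taken from pdfminer_processing.py; only `np.finfo` and `np.round` are real NumPy calls.

```python
import numpy as np

# np.finfo(float) describes Python's built-in float, i.e. np.float64.
f64 = np.finfo(float)
f32 = np.finfo(np.float32)

# precision is the number of reliable decimal digits; resolution is 10**-precision.
print("float64:", f64.precision, f64.resolution)
print("float32:", f32.precision, f32.resolution)


def round_coords(coords, dtype=float):
    """Hypothetical helper: round coordinates to the decimal precision of dtype."""
    decimals = np.finfo(dtype).precision
    return np.round(np.asarray(coords, dtype=np.float64), decimals)


# Two values that should be equal but differ by floating-point noise.
a = 0.1 + 0.2
b = 0.3
print(a == b)                                        # False
print(round_coords([a])[0] == round_coords([b])[0])  # True
```

After rounding, comparisons such as "does this box edge touch that one" no longer depend on noise in the last bits of the value.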
Future work
We should reduce the rounding to 6 digits after the decimal point, since float32 has a resolution of only 1e-6. However, doing so currently breaks tests. A follow-up is required to tune the threshold values in pdfminer_processing.py so that they work with 1e-6 resolution.
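A small sketch of why 15 digits is too fine for float32 data, assuming the coordinates pass through np.float32 at some point before being compared (an assumption for illustration, not something this PR verifies):

```python
import numpy as np

# A value stored as float32 and then widened to float64 keeps float32's error.
x = float(np.float32(0.1))  # slightly larger than 0.1

# Rounding to float64 precision (15 decimals) preserves the float32 noise.
at_15 = np.round(x, np.finfo(float).precision)

# Rounding to float32 precision (6 decimals) removes it.
at_6 = np.round(x, np.finfo(np.float32).precision)

print(at_15 == 0.1)  # False
print(at_6 == 0.1)   # True
```

Note that the number of decimals is an absolute quantity, while float32 precision is relative: near a coordinate of 612 the spacing between float32 values is about 6e-5, so some noise can survive even at 6 decimals. That may be part of why the thresholds need tuning.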