cilynx / pantomath

Pantomath knows about things
GNU General Public License v3.0

Date parser isn't catching 1/1/1970 type dates #33

Closed cilynx closed 1 year ago

cilynx commented 1 year ago

This seems to happen only on multi-page documents. Looking at the logs, the magic words/phrases parser may be running on only the first page.

cilynx commented 1 year ago

Document.pages() is returning the wrong number of pages.

cilynx commented 1 year ago

Document.data[] only has one page no matter how long the document is.

cilynx commented 1 year ago

Document.data[] is built by running pytesseract.image_to_data() on Document.processed, which is an opened PIL.Image. As far as I can tell, image_to_data only processes the first frame of a PIL.Image. It looks like we need to either iterate over the frames or give pytesseract the TIFF file directly instead of going through PIL.
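
A minimal sketch of the "iterate over the frames" option, not the fix that actually landed in pantomath. The helper names `ocr_all_frames` and `_tesseract_data` are hypothetical, and the sketch assumes Document.processed behaves like a normal multi-frame PIL.Image (exposing `n_frames`, `seek()` and `copy()`). The OCR call is passed in as a parameter so the loop can be exercised without Tesseract installed; by default it falls back to pytesseract.image_to_data().

```python
from typing import Any, Callable, List, Optional


def _tesseract_data(frame: Any) -> Any:
    """Default OCR step: word-level data for one frame, as a dict."""
    # Imported lazily so the frame loop can be tested without Tesseract.
    import pytesseract

    return pytesseract.image_to_data(frame, output_type=pytesseract.Output.DICT)


def ocr_all_frames(
    image: Any, ocr: Optional[Callable[[Any], Any]] = None
) -> List[Any]:
    """Run OCR on every frame of a multi-frame image, one result per page."""
    run_ocr = ocr if ocr is not None else _tesseract_data
    results = []
    # Single-frame images may not define n_frames, so fall back to 1.
    for index in range(getattr(image, "n_frames", 1)):
        image.seek(index)  # move the image to the requested frame
        # Copy the frame so the result does not depend on the next seek().
        results.append(run_ocr(image.copy()))
    return results
```

With something like this, Document.data[] could be built as `ocr_all_frames(Document.processed)`, giving one entry per page, so Document.pages() would be able to report the real page count. Whether that matches how Document.data[] is consumed elsewhere is an assumption.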