With docassemble's built-in OCR, it looked like it was taking more than 3 hours to OCR a 700-page PDF that I uploaded as an attachment. On closer inspection, though, it seemed to be processing the same pages over and over again.
We need some concrete answers on how long things should take. As an option, we could always just turn off OCR for documents over 300 pages.
We also need to figure out where to make users wait, since I assume they will want to know that their document has been filed (I think right now we are just silently using the non-OCR'd version, which probably isn't a good idea).
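A minimal sketch of the page-count cutoff idea, assuming the page count is already known by the time we decide whether to OCR. The names `MAX_OCR_PAGES` and `should_ocr` are hypothetical, not part of docassemble or AssemblyLine, and 300 is just the number floated above, not a measured limit.

```python
# Hypothetical threshold taken from the discussion above; not a benchmarked value.
MAX_OCR_PAGES = 300


def should_ocr(page_count: int, max_pages: int = MAX_OCR_PAGES) -> bool:
    """Return True when a document is small enough that OCR should run.

    Documents with more than `max_pages` pages are skipped entirely.
    """
    if page_count < 0:
        raise ValueError("page_count must be non-negative")
    return page_count <= max_pages
```

With this, the 700-page upload would skip OCR (`should_ocr(700)` is false), while anything at or under the cutoff would still be processed.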
Upstream in AssemblyLine, we've added OCRmyPDF, which skips pages that already have text on them. This lets people upload very large PDFs, as long as most of the pages already have text (they can upload pictures, but it has to be one at a time, and hopefully it isn't hundreds of pictures).
Things do still take a long time sometimes, but it's okay overall, and OCR isn't retriggered when you go to the next page (unlike as_pdf).
Users now wait before e-filing the forms, which wasn't happening before.
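For reference, the page-skipping behavior mentioned above corresponds to OCRmyPDF's `--skip-text` command-line flag. This is a rough sketch of calling the CLI directly, not how AssemblyLine actually invokes it; the helper names `build_ocr_command` and `run_ocr` are made up for illustration, and it assumes the `ocrmypdf` executable is on the PATH.

```python
import subprocess
from typing import List, Optional


def build_ocr_command(input_pdf: str, output_pdf: str) -> List[str]:
    """Build an ocrmypdf command line that leaves pages with existing text alone."""
    # --skip-text: do not OCR pages that already contain text.
    return ["ocrmypdf", "--skip-text", input_pdf, output_pdf]


def run_ocr(input_pdf: str, output_pdf: str,
            timeout_seconds: Optional[float] = None) -> bool:
    """Run OCR and report success; assumes ocrmypdf is installed.

    Raises subprocess.TimeoutExpired if the timeout is exceeded, so the
    caller can decide whether to fall back to the original file.
    """
    result = subprocess.run(
        build_ocr_command(input_pdf, output_pdf),
        timeout=timeout_seconds,
        check=False,
    )
    return result.returncode == 0
```

Passing a timeout is one possible way to keep a bad upload from running for hours; what to do on timeout (file the non-OCR'd version, or tell the user) is still the open question from above.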
Nothing much else to do here. I'll close this with #61, but in general, we should keep an eye out for other ways to improve OCR and PDF performance.