Update to v3 API to gain PDF OCR functionality

jeffpaul commented 3 years ago

Is your enhancement related to a problem? Please describe.

Following on from the work in #111, where we're using the OCR API from Computer Vision v2.1, we should look at updating to the Read API (in either Computer Vision v3.1 or v3.2, the latter currently in preview) to gain access to OCR functionality for PDF files. Here are additional details on the Read API, most notably that the free tier would only cover the first two pages of a PDF file.

The OCR API currently has the following input requirements:

- Supported image formats: JPEG, PNG, GIF, BMP.
- Image file size must be less than 4 MB.
- Image dimensions must be between 50 x 50 and 4200 x 4200 pixels, and the image cannot be larger than 10 megapixels.

The Read API currently has the following input requirements:

- Supported file formats: JPEG, PNG, BMP, PDF, and TIFF.
- Note that MPO (Multi Picture Objects) embedded JPEG files are not supported.
- For multi-page PDF and TIFF documents:
  - For the free tier, only the first 2 pages are processed.
  - For the paid tier, up to 2,000 pages are processed.
- File size must be less than 50 MB (4 MB for the free tier).
- The image/document page dimensions must be at least 50 x 50 pixels and at most 10000 x 10000 pixels.
- The PDF file dimensions must be at most 17 x 17 inches, corresponding to Legal or A3 paper sizes and smaller.
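
To make the difference between the two sets of limits concrete, here is a rough Python sketch of a pre-flight check that decides whether a file could be sent to either API. It is only an illustration, not ClassifAI code: the names `Limits`, `OCR_LIMITS`, `read_limits`, `is_supported`, and `pages_processed` are hypothetical, the `jpg` and `tif` extensions are assumed aliases, 1 MB is assumed to be 1024 * 1024 bytes, and pixel or inch dimensions are not checked.

```python
from dataclasses import dataclass

MB = 1024 * 1024  # assumption: the documented limits use binary megabytes


@dataclass(frozen=True)
class Limits:
    """Subset of the documented input requirements (format and size only)."""
    formats: frozenset
    max_bytes: int


# OCR API (Computer Vision v2.1): images only, under 4 MB.
OCR_LIMITS = Limits(frozenset({"jpeg", "jpg", "png", "gif", "bmp"}), 4 * MB)


def read_limits(free_tier: bool) -> Limits:
    """Read API: adds PDF and TIFF, drops GIF; size cap depends on the tier."""
    formats = frozenset({"jpeg", "jpg", "png", "bmp", "pdf", "tiff", "tif"})
    return Limits(formats, (4 if free_tier else 50) * MB)


def is_supported(filename: str, size_bytes: int, limits: Limits) -> bool:
    """Check the file extension and size against one set of limits."""
    ext = filename.rsplit(".", 1)[-1].lower() if "." in filename else ""
    return ext in limits.formats and size_bytes < limits.max_bytes


def pages_processed(page_count: int, free_tier: bool) -> int:
    """How many pages of a multi-page PDF/TIFF would actually be read."""
    return min(page_count, 2 if free_tier else 2000)
```

For example, `is_supported("scan.pdf", 1 * MB, OCR_LIMITS)` is false while the same call with `read_limits(True)` is true, and `pages_processed(10, free_tier=True)` gives 2, which is the free-tier limitation called out above.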

jeffpaul commented 3 years ago

Note that the Computer Vision API has officially been bumped to v3.2, which is now generally available: https://azure.microsoft.com/en-us/updates/cognitive-services-new-computer-vision-api-v32-now-generally-available/
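
For anyone picking this up, the main practical change is that Read is asynchronous, unlike the v2.1 OCR call: you submit the file, get back an `Operation-Location` URL, then poll it for the result. Below is a minimal Python sketch of that flow as I understand it from the v3.2 REST docs. Treat it as a sketch only: the function names (`submit_read`, `fetch_result`, `extract_text`) are hypothetical, the polling interval and retry count are arbitrary, and the response field names should be double-checked against the API reference before porting any of this to the plugin.

```python
import json
import time
import urllib.request


def submit_read(endpoint: str, key: str, file_url: str) -> str:
    """POST a publicly reachable file URL to Read; return the URL to poll."""
    req = urllib.request.Request(
        f"{endpoint.rstrip('/')}/vision/v3.2/read/analyze",
        data=json.dumps({"url": file_url}).encode("utf-8"),
        headers={
            "Ocp-Apim-Subscription-Key": key,
            "Content-Type": "application/json",
        },
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        # The service answers 202 Accepted with the operation URL in a header.
        return resp.headers["Operation-Location"]


def fetch_result(operation_url: str, key: str,
                 interval: float = 1.0, max_polls: int = 30) -> dict:
    """Poll the operation URL until it reports succeeded or failed."""
    req = urllib.request.Request(
        operation_url, headers={"Ocp-Apim-Subscription-Key": key}
    )
    for _ in range(max_polls):
        with urllib.request.urlopen(req) as resp:
            body = json.load(resp)
        if body.get("status") in ("succeeded", "failed"):
            return body
        time.sleep(interval)
    raise TimeoutError("Read operation did not finish in time")


def extract_text(result: dict) -> str:
    """Join every recognized line across all processed pages."""
    pages = result.get("analyzeResult", {}).get("readResults", [])
    return "\n".join(
        line["text"] for page in pages for line in page.get("lines", [])
    )
```

Usage would look roughly like `extract_text(fetch_result(submit_read(endpoint, key, pdf_url), key))`, where `endpoint`, `key`, and `pdf_url` come from the site's own configuration. On the free tier the result would only contain the first two pages of a PDF.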

jeffpaul commented 3 years ago

We'll now want to consider how this interacts with the Gutenberg PDF inline support: https://wptavern.com/gutenberg-10-5-embeds-pdfs-adds-verse-block-color-options-and-introduces-new-patterns

10up / classifai

Update to v3 API to gain PDF OCR functionality #265