Sicos1977 / TesseractOCR

A .NET library to work with Google's Tesseract

Ability to read MultiPageTiffs from memory #23

Closed washcycle closed 1 year ago

washcycle commented 2 years ago

I'm looking to use this Leptonica function to read multi-page TIFFs from memory:

/*!
 * \brief   pixaReadMemMultipageTiff()
 *
 * \param[in]    data    const; multiple pages; tiff-encoded
 * \param[in]    size    size of cdata
 * \return  pixa, or NULL on error
 *
 * <pre>
 * Notes:
 *      (1) This is an O(n) read-from-memory version of pixaReadMultipageTiff().
 * </pre>
 */
PIXA *
pixaReadMemMultipageTiff(const l_uint8  *data,
                         size_t          size)
{

https://github.com/DanBloomberg/leptonica/blob/master/src/tiffio.c

Would we only need to update the Interop class to add this?
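
For illustration, a binding at the interop level might look roughly like the sketch below. It uses plain `DllImport` rather than this library's own interop layer, so treat it as an assumption about the shape of the change, not the actual implementation. The class name `LeptonicaSketch`, the helper `CountPages`, and the native library name `leptonica-1.82.0` are hypothetical; the three native functions are Leptonica's own.

```csharp
using System;
using System.Runtime.InteropServices;

// Hypothetical class; requires <AllowUnsafeBlocks>true</AllowUnsafeBlocks>.
internal static unsafe class LeptonicaSketch
{
    // Assumption: adjust to the name of the Leptonica binary you ship.
    private const string LeptonicaDll = "leptonica-1.82.0";

    // PIXA *pixaReadMemMultipageTiff(const l_uint8 *data, size_t size)
    [DllImport(LeptonicaDll, CallingConvention = CallingConvention.Cdecl)]
    internal static extern IntPtr pixaReadMemMultipageTiff(byte* data, UIntPtr size);

    // l_int32 pixaGetCount(PIXA *pixa)
    [DllImport(LeptonicaDll, CallingConvention = CallingConvention.Cdecl)]
    internal static extern int pixaGetCount(IntPtr pixa);

    // void pixaDestroy(PIXA **ppixa)
    [DllImport(LeptonicaDll, CallingConvention = CallingConvention.Cdecl)]
    internal static extern void pixaDestroy(ref IntPtr pixa);

    // Returns the number of pages in a TIFF that is held in memory.
    internal static int CountPages(byte[] bytes)
    {
        if (bytes == null) throw new ArgumentNullException(nameof(bytes));

        IntPtr pixa;

        // Pin the managed buffer only for the duration of the native call.
        fixed (byte* ptr = bytes)
        {
            pixa = pixaReadMemMultipageTiff(ptr, new UIntPtr((uint)bytes.Length));
        }

        // Leptonica returns NULL when the data cannot be decoded.
        if (pixa == IntPtr.Zero)
            throw new InvalidOperationException("Leptonica could not read the TIFF data");

        try
        {
            return pixaGetCount(pixa);
        }
        finally
        {
            // Frees the PIXA and the PIX objects it owns.
            pixaDestroy(ref pixa);
        }
    }
}
```

`size_t` is mapped to `UIntPtr` so the declaration matches the native width on both 32-bit and 64-bit builds.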

Sicos1977 commented 1 year ago

I dropped support for multi-page tiff images in favor of making this library much easier to use. Just use another tool to split the tiff into separate files first and then feed them to TesseractOCR.

washcycle commented 1 year ago

Fair point.

Excellent idea though, I never even considered that as an option.

Regards, Matt

On Sat, Nov 26, 2022 at 10:09 AM Kees @.***> wrote:

I dropped support for multi-page tiff images in favor of making this library much easier to use. Just use another tool to split the tiff into separate files first and then feed them to TesseractOCR.


goldsam commented 1 year ago

Can you please reconsider this? Splitting first introduces considerable overhead.

Sicos1977 commented 1 year ago

I can add the method mentioned in the first post of this issue; after that you have to feed the pix objects to the OCR engine yourself. But I'm not going to change the OCR classes, because I dropped support for multi-page tiffs to make this library much easier to use.

goldsam commented 1 year ago

That would be great. Having the ability to load a multi-page tiff and iterate through the images is all I need. I can feed the individual images into Tesseract myself.

goldsam commented 1 year ago

Thank you for reconsidering this!😊

Sicos1977 commented 1 year ago

I'll try to make some time to implement it this weekend.

Sicos1977 commented 1 year ago

Is it possible to supply me with a multi-page tiff?

washcycle commented 1 year ago

Found this one:

https://www.nightprogrammer.org/wp-uploads/2013/02/multipage_tiff_example.tif

goldsam commented 1 year ago

Let me know if I can help with the implementation.

Sicos1977 commented 1 year ago

Help is always welcome; at the moment time is my issue. I'll try to implement the new feature next week. I first need to finish some other work.

goldsam commented 1 year ago

What kind of API did you have in mind?

Sicos1977 commented 1 year ago

Just using Leptonica to split the multi-page tiff into separate PIX (image) objects and feeding them into the Tesseract engine one by one.

Sicos1977 commented 1 year ago

Sorry for the long delay, but I added this method to the Array class:

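        /// <summary>
        ///     Loads a multi-page tiff from the in-memory <paramref name="bytes"/>
        /// </summary>
        /// <param name="bytes">The raw, tiff-encoded bytes of the image</param>
        /// <returns>An <see cref="Array"/> wrapping the loaded pages</returns>
        /// <exception cref="IOException">Raised when Leptonica returns no handle for the data</exception>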
        public static Array LoadMultiPageTiffFromMemory(byte[] bytes)
        {
            IntPtr pixaHandle;

            fixed (byte* ptr = bytes)
            {
                pixaHandle = LeptonicaApi.Native.pixaReadMemMultipageTiff(ptr, bytes.Length);
            }

            if (pixaHandle == IntPtr.Zero) throw new IOException("Failed to load multi page image from memory");

            return new Array(pixaHandle);
        }

You can use it like this to read a multi-page tiff image from memory:

        [TestMethod]
        public void CanParseMultiPageTifFromMemory()
        {
            using var engine = CreateEngine();
            var bytes = File.ReadAllBytes(TestFilePath("./processing/multi-page.tif"));
            using var pixA = TesseractOCR.Pix.Array.LoadMultiPageTiffFromMemory(bytes);
            var i = 1;

            foreach (var pix in pixA)
            {
                using (var page = engine.Process(pix))
                {
                    var text = page.Text.Trim();

                    var expectedText = $"Page {i}";
                    // MSTest expects (expected, actual) so failure messages read correctly
                    Assert.AreEqual(expectedText, text);
                }

                i++;
            }
        }