washcycle closed this issue 1 year ago
I dropped support for multi-page TIFF images in favor of making this library much easier to use. Just use another tool to split the TIFF into separate files first and then feed them to TesseractOCR.
Fair point.
Excellent idea though, I never even considered that as an option.
Regards, Matt
On Sat, Nov 26, 2022 at 10:09 AM Kees @.***> wrote:
I dropped support for multi-page tiff images in favor of making this library much easier to use. Just use another tool to split the tiff into separate files first and then feed them to TesseractOCR
Can you please reconsider this? Splitting first introduces considerable overhead.
I can add the method mentioned in the first post of this issue; after that, you have to feed the pix object to the OCR engine yourself. But I'm not going to change the OCR classes, because I dropped support for multi-page TIFFs so that this library would be much easier to use.
That would be great. Having the ability to load a multi-page TIFF and iterate through the images is all I need. I can feed the individual images into Tesseract myself.
Thank you for reconsidering this!😊
I'll try to make some time to implement it this weekend.
Is it possible to supply me with a multi-page tiff?
Let me know if I can help with the implementation.
Help is always welcome; at the moment, time is my issue. I'll try to implement the new feature in the next week. I first need to finish some other work.
What kind of API did you have in mind?
Just using Leptonica to split the multi-page TIFF into separate PIX (image) objects and feed them into the Tesseract engine one by one.
Sorry for the very long delay, but I added this method to the Array class:
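/// <summary>
///     Loads the multi-page tiff from the memory <paramref name="bytes"/>
/// </summary>
/// <param name="bytes">The raw bytes of the multi-page tiff</param>
/// <returns>An <see cref="Array"/> wrapping the loaded pages</returns>
public static Array LoadMultiPageTiffFromMemory(byte[] bytes)
{
    IntPtr pixaHandle;

    // The fixed statement pins the buffer for the native call;
    // it needs an unsafe context (unsafe class or method).
    fixed (byte* ptr = bytes)
    {
        pixaHandle = LeptonicaApi.Native.pixaReadMemMultipageTiff(ptr, bytes.Length);
    }

    // Leptonica returns a null pointer when the data cannot be read
    if (pixaHandle == IntPtr.Zero)
        throw new IOException("Failed to load multi page image from memory");

    return new Array(pixaHandle);
}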
You can use it like this to read a multi-page TIFF image from memory:
[TestMethod]
public void CanParseMultiPageTifFromMemory()
{
    using var engine = CreateEngine();
    var bytes = File.ReadAllBytes(TestFilePath("./processing/multi-page.tif"));
    using var pixA = TesseractOCR.Pix.Array.LoadMultiPageTiffFromMemory(bytes);

    // Pages in the test file are numbered starting at 1
    var i = 1;

    foreach (var pix in pixA)
    {
        using (var page = engine.Process(pix))
        {
            var text = page.Text.Trim();
            var expectedText = $"Page {i}";
            // MSTest expects the expected value first, then the actual value
            Assert.AreEqual(expectedText, text);
        }

        i++;
    }
}
I'm looking to use this Leptonica function to read multi-page TIFFs from memory:
https://github.com/DanBloomberg/leptonica/blob/master/src/tiffio.c
Would we only need to update the Interop class to add this?
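For readers wondering what the interop side of this could look like, here is a minimal sketch of a binding for Leptonica's `pixaReadMemMultipageTiff`, whose C prototype in tiffio.c is `PIXA *pixaReadMemMultipageTiff(const l_uint8 *data, size_t size)`. It is not the library's actual Interop class: the class name `LeptonicaInteropSketch`, the helper `ReadMultiPageTiff`, and the native library name `"leptonica"` are all assumptions made for illustration, and the real binding may load the library and type the size parameter differently.

```csharp
using System;
using System.IO;
using System.Runtime.InteropServices;

internal static class LeptonicaInteropSketch
{
    // Assumption: the native library name. The real binding may resolve it differently.
    private const string LeptonicaLibrary = "leptonica";

    // Mirrors the C prototype:
    //   PIXA *pixaReadMemMultipageTiff(const l_uint8 *data, size_t size);
    [DllImport(LeptonicaLibrary, CallingConvention = CallingConvention.Cdecl)]
    private static extern IntPtr pixaReadMemMultipageTiff(IntPtr data, UIntPtr size);

    // Hypothetical helper: pins the managed buffer for the duration of the
    // native call and returns the raw PIXA handle.
    public static IntPtr ReadMultiPageTiff(byte[] bytes)
    {
        if (bytes == null) throw new ArgumentNullException(nameof(bytes));

        // Pinning stops the GC from moving the array while native code reads it
        var pinned = GCHandle.Alloc(bytes, GCHandleType.Pinned);
        try
        {
            var handle = pixaReadMemMultipageTiff(
                pinned.AddrOfPinnedObject(),
                new UIntPtr((uint)bytes.Length));

            // Leptonica returns a null pointer when the data cannot be read
            if (handle == IntPtr.Zero)
                throw new IOException("Failed to load multi-page TIFF from memory");

            return handle;
        }
        finally
        {
            pinned.Free();
        }
    }
}
```

Pinning with `GCHandle` avoids the need for an unsafe context; a `fixed` statement would work equally well. The returned handle is owned by the caller and would eventually have to be released with Leptonica's `pixaDestroy`, which a wrapper class would normally do in its dispose logic.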