Something wrong in this code

lauglam commented 2 years ago

Something wrong in this code.

I can't catch this exception, maybe because it comes from a RuntimeDllImport call:

https://github.com/Sicos1977/TesseractOCR/blob/3abe128b3434f1d4675948dac6bdcc5d88d8a4ed/TesseractOCR/Layout/EnumeratorBase.cs#L220

Sicos1977 commented 2 years ago

You can't catch it with a normal try/catch?

lauglam commented 2 years ago

> You can't catch it with a normal try/catch?

Yes, it can't be caught.

lauglam commented 2 years ago

I'm unable to catch it; the program crashes right here.

lauglam commented 2 years ago

This is my test image: test_img.zip

Sicos1977 commented 2 years ago

At the moment I'm trying to make a new NuGet package with Tesseract 5.1 in it. I'll let you know when it is done so you can try that one.

Sicos1977 commented 2 years ago

I just updated the code on GitHub to Tesseract 5.1, try to clone it and see if this version solves your problem.

Sicos1977 commented 2 years ago

https://github.com/Sicos1977/TesseractOCR/commit/6419b56ceb4e7e9c1102d9fa7aace662582a4852 and 0d21da2606285c2a61aad5143a4274b1c1ee6a81

Sicos1977 commented 2 years ago

I just released a new NuGet package with Tesseract updated to version 5.1.

lauglam commented 2 years ago

Unfortunately, the error still exists.

lauglam commented 2 years ago

This is the program I use for testing

ConsoleApp1.zip

Sicos1977 commented 2 years ago

Can you start Tesseract.exe without any problems?

Sicos1977 commented 2 years ago

Never mind, I think this is your problem: you are disposing the page object and thus destroying the reference to the Blocks object.

Sicos1977 commented 2 years ago

This works without any problems:

    static void Main(string[] args)
    {
        var result = new StringBuilder();
        using var engine = new TesseractOCR.Engine(@".\", Language.English, EngineMode.Default);
        using var pix = TesseractOCR.Pix.Image.LoadFromFile(@".\test_img.png");
        using var page = engine.Process(pix);
        foreach (var block in page.Layout)
        {
            result.AppendLine($"Block confidence: {block.Confidence}");
            if (block.BoundingBox != null)
            {
                var boundingBox = block.BoundingBox.Value;
                result.AppendLine($"Block bounding box X1 '{boundingBox.X1}', Y1 '{boundingBox.Y1}', X2 " +
                                  $"'{boundingBox.X2}', Y2 '{boundingBox.Y2}', width '{boundingBox.Width}', height '{boundingBox.Height}'");
            }
            result.AppendLine($"Block text: {block.Text}");
        }

        Console.WriteLine(result.ToString());
    }

Do not dispose the object before you are done using it, because that destroys all the references into Tesseract51.dll, which gives you the error!
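To make the pitfall concrete, here is a minimal sketch that does not use Tesseract at all. `FakePage` and `DisposeDemo` are hypothetical stand-ins, not part of TesseractOCR; the fake only throws a managed `ObjectDisposedException`, whereas the real failure presumably happens in native code, which would explain why it could not be caught.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical stand-in for a page object that wraps a native handle.
public sealed class FakePage : IDisposable
{
    private bool _disposed;

    // Lazily yields block texts; every step needs the page to still be alive.
    public IEnumerable<string> Layout
    {
        get
        {
            foreach (var text in new[] { "block 1", "block 2" })
            {
                if (_disposed) throw new ObjectDisposedException(nameof(FakePage));
                yield return text;
            }
        }
    }

    public void Dispose() => _disposed = true;
}

public static class DisposeDemo
{
    // Wrong: the using scope ends on return, before the caller enumerates anything.
    public static IEnumerable<string> BlocksDisposedTooEarly()
    {
        using var page = new FakePage();
        return page.Layout;
    }

    // OK: in an iterator method the using scope stays open until enumeration ends.
    public static IEnumerable<string> BlocksWhileAlive()
    {
        using var page = new FakePage();
        foreach (var text in page.Layout)
            yield return text;
    }

    public static void Main()
    {
        try
        {
            Console.WriteLine(string.Join(", ", BlocksDisposedTooEarly()));
        }
        catch (ObjectDisposedException)
        {
            Console.WriteLine("page was disposed before its blocks were read");
        }

        Console.WriteLine(string.Join(", ", BlocksWhileAlive()));
    }
}
```

The second method only stays safe while the caller reads each item during enumeration; anything kept around after the loop would again point at a disposed page.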

Sicos1977 commented 2 years ago

I like the manga strip drawing style :-)

Sicos1977 commented 2 years ago

If your goal is just to get the text from the page, then you can also use page.Text.

lauglam commented 2 years ago

It's my fault, thank you very much for correcting me.

Sicos1977 commented 2 years ago

No problem, the same happened to me when I started using Tesseract :-) ... it is good to make mistakes... you learn from them.

lauglam commented 2 years ago

Thank you for the reminder. Forgive my bad English, and thanks again.

Sicos1977 commented 2 years ago

Your English is fine. I'm not a native English speaker either (I'm from the Netherlands), so I guess real English people would have something to say about mine too :-)

lauglam commented 2 years ago

I wrote something like this and it works fine, thanks again.

public static IEnumerable<Block> GetBlocks(string path, Language language = Language.English)
{
    // engine, pix and page are not disposed here because the returned blocks still read from them.
    // ReSharper disable once StringLiteralTypo
    var engine = new Engine(@".\trained_data", language, EngineMode.Default);
    var pix = TesseractOCR.Pix.Image.LoadFromFile(path);
    var page = engine.Process(pix);

    return page.Layout;
}

public static IEnumerable<Paragraph> GetParagraphs(string path, Language language = Language.English)
{
    var blocks = GetBlocks(path, language);
    return from block in blocks from paragraph in block.Paragraphs select paragraph;
}

public static IEnumerable<TextLine> GetTextLines(string path, Language language = Language.English)
{
    var paragraphs = GetParagraphs(path, language);
    return from paragraph in paragraphs from textLine in paragraph.TextLines select textLine;
}

public static IEnumerable<Word> GetWords(string path, Language language = Language.English)
{
    var textLines = GetTextLines(path, language);
    return from textLine in textLines from word in textLine.Words select word;
}

public static IEnumerable<Symbol> GetSymbols(string path, Language language = Language.English)
{
    var words = GetWords(path, language);
    return from word in words from symbol in word.Symbols select symbol;
}

Sicos1977 / TesseractOCR

Something wrong in this code #4