Sicos1977 / TesseractOCR

A .NET library to work with Google's Tesseract

Word.FontAttributes is null although its point size is calculated #27

Closed: vsolominov closed this issue 1 year ago

vsolominov commented 1 year ago

In my app I need to know the point size of a recognized word. In the current Tesseract version (5.2), Word.FontAttributes is always null. This happens because the property is created from a pointer that is not assigned in ltrresultiterator.cpp when DISABLED_LEGACY_ENGINE is defined. The point size, however, is still calculated and returned as an out parameter.

Is there any way to get the calculated point size?

Sicos1977 commented 1 year ago

Is this option available through the API that I implemented? If so, I can expose it to C# for you. If it is not in there, I can't. I also don't know whether we can compile it back into the Tesseract DLL.

Sicos1977 commented 1 year ago

I added a property called FontPointSize on the Words class. This does what you need. The other font information is no longer supported, so I removed it.

Just get the latest NuGet package.

See this issue for more information: https://github.com/tesseract-ocr/tesseract/issues/1074

vsolominov commented 1 year ago

Thanks a lot!

But I would not be so categorical about FontAttributes, since this property is not always null. For example, if the solution uses an EngineMode of TesseractOnly or TesseractAndLstm (that is, legacy mode), the font parameters are initialized and FontAttributes will not be empty. Font information can be very useful for custom text rendering. It might be better to wrap the font information like this:

public class FontProperties
{
    public int PointSize { get; }
    public FontAttributes? FontAttributes { get; }

    public FontProperties(int pointSize, FontAttributes? fontAttributes = null)
    {
        this.PointSize = pointSize;
        this.FontAttributes = fontAttributes;
    }
}

public FontProperties FontProperties
{
    get
    {
        var nameHandle =
            TessApi.Native.ResultIteratorWordFontAttributes(
                IteratorHandleRef,
                out var isBold, out var isItalic, out var isUnderlined,
                out var isMonospace, out var isSerif, out var isSmallCaps,
                out var pointSize, out var fontId);

        FontAttributes? fontAttributes = null;

        // The name handle is null in certain error conditions or when the
        // legacy engine is disabled; pointSize is still set in that case.
        if (nameHandle != IntPtr.Zero)
        {
            var fontName = MarshalHelper.PtrToString(nameHandle, Encoding.UTF8);
            var fontInfo = new FontInfo(fontName, fontId, isItalic, isBold, isMonospace, isSerif);
            fontAttributes = new FontAttributes(fontInfo, isUnderlined, isSmallCaps);
        }

        return new FontProperties(pointSize, fontAttributes);
    }
}
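
For illustration, here is a minimal sketch of how calling code might consume such a property when choosing a font for custom text rendering. The record types below are simplified, hypothetical stand-ins rather than the library's actual classes, and FontDescriber, Describe and the "Helvetica" fallback are names invented for this example. The only point it makes is that PointSize can always be used, while FontAttributes needs a null check.

```csharp
#nullable enable
using System;

// Hypothetical, simplified stand-ins for the types discussed above.
// The real FontInfo/FontAttributes classes carry more fields.
public sealed record FontInfo(string Name, bool IsItalic, bool IsBold);
public sealed record FontAttributes(FontInfo FontInfo, bool IsUnderlined, bool IsSmallCaps);
public sealed record FontProperties(int PointSize, FontAttributes? FontAttributes = null);

public static class FontDescriber
{
    // Builds a font description for rendering. When FontAttributes is null
    // (for example, when the legacy engine is not in use), only the point
    // size is known, so an assumed fallback font name is used instead.
    public static string Describe(FontProperties props, string fallbackFont = "Helvetica")
    {
        var attrs = props.FontAttributes;
        if (attrs is null)
        {
            return $"{fallbackFont} {props.PointSize}pt";
        }

        // Map the bold/italic flags to a style suffix.
        var style = (attrs.FontInfo.IsBold, attrs.FontInfo.IsItalic) switch
        {
            (true, true) => " Bold Italic",
            (true, false) => " Bold",
            (false, true) => " Italic",
            _ => string.Empty
        };

        return $"{attrs.FontInfo.Name}{style} {props.PointSize}pt";
    }
}
```

With these stand-ins, FontDescriber.Describe(new FontProperties(12)) gives "Helvetica 12pt", and a bold, non-italic "Times" word at size 10 gives "Times Bold 10pt".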

Sicos1977 commented 1 year ago

You are right, I missed that one.

Sicos1977 commented 1 year ago

I liked your solution and implemented it. See the latest NuGet package.

Just curious: what are you using Tesseract OCR for?

vsolominov commented 1 year ago

Wow, cool, thanks!

I'm using Tesseract to recognize PDF files without a text layer in order to create searchable PDFs. For a variety of reasons (image preprocessing, preserving the quality of the original PDF, and others) I can't use the PDF rendering tool that Tesseract provides.