Sicos1977 / TesseractOCR

A .NET library to work with Google's Tesseract

How to set tessedit_create_hocr 1 and tessedit_pageseg_mode 4 in TesseractOCR #19

Closed pjoshi90 closed 2 years ago

pjoshi90 commented 2 years ago

I want to set the hOCR settings below with TesseractOCR: tessedit_create_hocr 1 and tessedit_pageseg_mode 4.

Attached is the hocr config file that can be placed inside the tessdata folder: hocr.txt

Sicos1977 commented 2 years ago

You can set all of these modes through code; you don't have to place a config file inside the tessdata folder.

First, create the engine with this constructor:

        /// <summary>
        ///     Creates a new instance of <see cref="Engine" /> with the specified <paramref name="engineMode" /> and
        ///     <paramref name="configFiles" />.
        /// </summary>
        /// <remarks>
        ///     <para>
        ///         The <paramref name="dataPath" /> parameter should point to the directory that contains the 'tessdata'
        ///         folder. For example, if your tesseract language data is installed in <c>C:\Tesseract\tessdata</c>, the
        ///         value of <paramref name="dataPath" /> should be <c>C:\Tesseract</c>. Note that tesseract will use the
        ///         value of the <c>TESSDATA_PREFIX</c> environment variable if defined, effectively ignoring the value of
        ///         the <paramref name="dataPath" /> parameter.
        ///     </para>
        /// </remarks>
        /// <param name="dataPath">
        ///     The path to the parent directory that contains the 'tessdata' directory, ignored if the
        ///     <c>TESSDATA_PREFIX</c> environment variable is defined.
        /// </param>
        /// <param name="language">The <see cref="Language"/> to load</param>
        /// <param name="engineMode">The <see cref="EngineMode" /> value to use when initializing the tesseract engine</param>
        /// <param name="configFiles">
        ///     An optional sequence of tesseract configuration files to load, encoded as UTF-8 without BOM and
        ///     with Unix end-of-line characters. You can use an advanced text editor such as Notepad++ to accomplish this.
        /// </param>
        /// <param name="initialOptions"></param>
        /// <param name="setOnlyNonDebugVariables"></param>
        /// <param name="logger">When set then logging is written to this <see cref="ILogger"/> interface</param>
        public Engine(string dataPath, Language language, EngineMode engineMode = EngineMode.Default, IEnumerable<string> configFiles = null, IDictionary<string, object> initialOptions = null, bool setOnlyNonDebugVariables = false, ILogger logger = null)
        {
            if (logger != null)
                Logger.LoggerInterface = logger;

            DefaultPageSegMode = PageSegMode.Auto;
            _handle = new HandleRef(this, TessApi.Native.BaseApiCreate());

            Initialize(dataPath, new List<Language> {language}, engineMode, configFiles, initialOptions, setOnlyNonDebugVariables, logger);
        }

After that, you can set the page segmentation mode by passing pageSegMode to Process:

        /// <summary>
        ///     Processes the specific image.
        /// </summary>
        /// <remarks>
        ///     You can only have one result iterator open at any one time.
        /// </remarks>
        /// <param name="image">The image to process.</param>
        /// <param name="inputName">Sets the input file's name, only needed for training or loading a uzn file.</param>
        /// <param name="pageSegMode">The page layout analysis method to use.</param>
        public Page Process(Pix.Image image, string inputName, PageSegMode? pageSegMode = null)
        {
            return Process(image, inputName, new Rect(0, 0, image.Width, image.Height), pageSegMode);
        }
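
Putting the two steps together, a minimal sketch might look like the following. It is not taken from the library's documentation and rests on several assumptions: that entries in initialOptions are applied as Tesseract variables when the engine is initialized, that Language.English and PageSegMode.SingleColumn (Tesseract's page segmentation mode 4) exist in TesseractOCR.Enums, and that Pix.Image.LoadFromFile and Page.Text are available. The data path and the image file name are hypothetical.

```csharp
using System;
using System.Collections.Generic;
using TesseractOCR;
using TesseractOCR.Enums;

internal static class Program
{
    private static void Main()
    {
        // Hypothetical location: the directory described by the dataPath doc comment above.
        const string dataPath = @"C:\Tesseract";

        // Assumption: each entry is handed to Tesseract as a variable at initialization,
        // which would make it equivalent to a "tessedit_create_hocr 1" line in a config file.
        var initialOptions = new Dictionary<string, object>
        {
            { "tessedit_create_hocr", "1" }
        };

        // Constructor signature as quoted above; no config files are passed (null).
        using var engine = new Engine(dataPath, Language.English, EngineMode.Default, null, initialOptions);

        // Hypothetical input file; Pix.Image.LoadFromFile is assumed to exist.
        using var image = TesseractOCR.Pix.Image.LoadFromFile("page.png");

        // tessedit_pageseg_mode 4 corresponds to "single column of text" in Tesseract;
        // PageSegMode.SingleColumn is assumed to be the matching enum member.
        using var page = engine.Process(image, "page.png", PageSegMode.SingleColumn);

        // Page.Text is assumed to return the recognized plain text.
        Console.WriteLine(page.Text);
    }
}
```

Passing the page segmentation mode per call to Process, rather than as a variable, keeps the engine reusable for images with different layouts.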