KevM / tikaondotnet

Use the Java Tika text extraction library on the .NET platform
http://kevm.github.io/tikaondotnet/
Apache License 2.0

Configuring Tesseract OCR for TikaOnDotNet #62

Open LeeBear35 opened 7 years ago

LeeBear35 commented 7 years ago

The hope here is to get TikaOnDotNet fully configured to access Tesseract OCR for text extraction from images. With Tika .93 support for Tesseract was added, and we are now in the midst of validating the latest release, Tika 1.13.1. A big set of validations centers on Tika's ability to handle certain types of PDF files. It should be noted that the handling of TIFF images in PDFBox has changed because of licensing issues that are not in compliance with the Apache license.

So here is hoping that if we cannot read it one way, we might be able to read it using another.

The first step has been to extend Kevin's TextExtractor so that metadata can be passed in to assist the parsing. That set of extensions is here:

public static class TikaOnDotNetExtensions
{
    private static TikaConfig config = TikaConfig.getDefaultConfig();
    public static TextExtractionResult Extract(this TextExtractor te, byte[] data, string filePath, string ContentType)
    {
      TextExtractionResult result = te.Extract
        (
          metadata =>
          {
            metadata.add("resourceName", System.IO.Path.GetFileName(filePath));
            metadata.add("FilePath", filePath);
            try
            {
              if (!ContentType.Equals("application/octet-stream", StringComparison.CurrentCultureIgnoreCase))
              {
                metadata.add("Content-Type", ContentType);
              }
              else
              {
                Detector detector = config.getDetector();
                using (org.apache.tika.io.TikaInputStream inputStream = org.apache.tika.io.TikaInputStream.@get(data, metadata))
                {
                  MediaType foundType = detector.detect(inputStream, metadata);
                  if (!foundType.toString().Equals("application/octet-stream", StringComparison.CurrentCultureIgnoreCase))
                  {
                    metadata.add("Content-Type", foundType.toString());
                  }
                }
              }
            }
            catch (Exception)
            {
              // Rethrow with "throw;" so the original stack trace is preserved.
              throw;
            }

            return TikaInputStream.get(data, metadata);
          }
        );

      return result;
    }

    public static TextExtractionResult Extract(this TextExtractor te, byte[] data, string filePath)
    {
      return te.Extract(data, filePath, "application/octet-stream");
    }
}

The next step has been to dump the configuration to confirm how Tika is configured and what changes might need to be made. The dump routine was added to the class above:

    public static string TikaConfigDump()
    {
      StringBuilder retVal = new StringBuilder();

      retVal.AppendFormat("{0}\t{1}\n\n", "Version", (new org.apache.tika.Tika(config)).toString());

      retVal.AppendLine("\nDetectors");

      CompositeDetector configDetector = (CompositeDetector)config.getDetector();
      var detectors = configDetector.getDetectors().toArray();
      foreach (Detector detector in detectors)
      {
        retVal.AppendFormat("\t{0}\n", ((java.lang.Object)detector).getClass().getName());

        // List the children of a nested composite, not the top-level list again.
        var composite = detector as CompositeDetector;
        if (composite != null)
        {
          var subDetectors = composite.getDetectors().toArray();
          foreach (Detector subDetector in subDetectors)
          {
            retVal.AppendFormat("\t\t{0}\n", ((java.lang.Object)subDetector).getClass().getName());
          }
        }
      }

      retVal.AppendLine("\nParsers");

      CompositeParser configParser = (CompositeParser)config.getParser();
      var parsers = configParser.getAllComponentParsers().toArray();
      foreach (Parser parser in parsers)
      {
        retVal.AppendFormat("\t{0}\n", ((java.lang.Object)parser).getClass().getName());

        var parserTypes = parser.getSupportedTypes(new ParseContext()).toArray();
        foreach (MediaType mediaType in parserTypes)
        {
          retVal.AppendFormat("\t\t{0}\n", mediaType.toString());
        }
      }

      org.apache.tika.language.translate.Translator translator = config.getTranslator();
      if (translator.isAvailable())
      {
        retVal.AppendFormat("Translator {0}\n", ((java.lang.Object)translator).getClass().getName());
      }

      return retVal.ToString();
    }

On my system using the default configuration provided by Kevin you can see the setup below:

Version Apache Tika 1.13

Detectors
  org.apache.tika.parser.microsoft.POIFSContainerDetector
  org.apache.tika.parser.pkg.ZipContainerDetector
  org.gagravarr.tika.OggDetector
  org.apache.tika.mime.MimeTypes

Parsers
  org.apache.tika.parser.asm.ClassParser
    application/java-vm
  org.apache.tika.parser.audio.AudioParser
    audio/x-wav
    audio/basic
    audio/x-aiff
  org.apache.tika.parser.audio.MidiParser
    application/x-midi
    audio/midi
  org.apache.tika.parser.chm.ChmParser
    application/vnd.ms-htmlhelp
    application/x-chm
    application/chm
  org.apache.tika.parser.code.SourceCodeParser
    text/x-c++src
    text/x-groovy
    text/x-java-source
  org.apache.tika.parser.crypto.Pkcs7Parser
    application/pkcs7-signature
    application/pkcs7-mime
  org.apache.tika.parser.dif.DIFParser
    application/dif+xml
  org.apache.tika.parser.dwg.DWGParser
    image/vnd.dwg
  org.apache.tika.parser.epub.EpubParser
    application/x-ibooks+zip
    application/epub+zip
  org.apache.tika.parser.executable.ExecutableParser
    application/x-msdownload
    application/x-sharedlib
    application/x-elf
    application/x-object
    application/x-executable
    application/x-coredump
  org.apache.tika.parser.external.CompositeExternalParser
  org.apache.tika.parser.feed.FeedParser
    application/atom+xml
    application/rss+xml
  org.apache.tika.parser.font.AdobeFontMetricParser
    application/x-font-adobe-metric
  org.apache.tika.parser.font.TrueTypeParser
    application/x-font-ttf
  org.apache.tika.parser.gdal.GDALParser
    application/x-gsc
    image/x-ozi
    application/x-pds
    image/eir
    application/x-usgs-dem
    application/aaigrid
    application/x-bag
    application/elas
    application/x-rs2
    application/x-tsx
    application/x-lcp
    image/geotiff
    application/x-mbtiles
    application/x-cappi
    application/x-netcdf
    application/x-gsag
    application/x-epsilon
    application/x-ace2
    application/jaxa-pal-sar
    image/x-pcraster
    application/x-msgn
    image/arg
    application/x-hdf
    image/x-mff
    application/x-kro
    image/x-hdf5-image
    image/x-dimap
    image/x-srp
    image/big-gif
    application/x-envi
    application/x-cosar
    application/x-ntv2
    image/bmp
    application/x-doq2
    application/x-bt
    application/x-kml
    application/x-gmt
    application/x-rst
    application/vrt
    application/pcisdk
    application/x-ctg
    application/x-e00-grid
    application/x-rik
    image/ida
    image/x-mff2
    application/sdts-raster
    application/x-snodas
    image/jp2
    image/sar-ceos
    application/terragen
    application/x-wcs
    application/leveller
    application/x-ingr
    application/x-gtx
    image/sgi
    application/x-pnm
    image/raster
    application/fits
    application/x-r
    image/gif
    application/x-envi-hdr
    application/x-http
    application/x-rmf
    application/x-ecrg-toc
    application/aig
    application/x-rpf-toc
    image/adrg
    application/x-srtmhgt
    application/x-generic-bin
    application/jdem
    image/x-airsar
    application/x-webp
    application/x-ngs-geoid
    application/x-pcidsk
    image/x-fujibas
    application/x-wms
    application/x-map
    image/ceos
    application/xpm
    application/x-zmap
    image/envisat
    application/x-ers
    application/x-doq1
    application/x-isis2
    application/x-nwt-grd
    application/x-ppi
    image/ilwis
    application/x-isis3
    application/x-nwt-grc
    application/x-blx
    application/gff
    application/x-ndf
    image/jpeg
    application/x-geo-pdf
    application/x-l1b
    image/fit
    application/x-gsbg
    application/x-sdat
    application/x-ctable2
    application/x-grib
    application/x-coasp
    application/x-dipex
    application/grass-ascii-grid
    image/fits
    application/x-til
    application/x-dods
    image/png
    application/x-gxf
    application/x-gs7bg
    application/x-cpg
    application/x-lan
    application/x-xyz
    image/bsb
    application/x-p-aux
    application/dted
    application/x-rasterlite
    image/nitf
    image/hfa
    application/x-fast
    application/x-los-las
  org.apache.tika.parser.geo.topic.GeoParser
    application/geotopic
  org.apache.tika.parser.geoinfo.GeographicInformationParser
    text/iso19139+xml
  org.apache.tika.parser.grib.GribParser
    application/x-grib2
  org.apache.tika.parser.hdf.HDFParser
    application/x-hdf
  org.apache.tika.parser.html.HtmlParser
    text/html
    application/vnd.wap.xhtml+xml
    application/x-asp
    application/xhtml+xml
  org.apache.tika.parser.image.BPGParser
    image/bpg
    image/x-bpg
  org.apache.tika.parser.image.ICNSParser
    image/icns
  org.apache.tika.parser.image.ImageParser
    image/png
    image/vnd.wap.wbmp
    image/bmp
    image/x-xcf
    image/gif
    image/x-icon
    image/x-ms-bmp
  org.apache.tika.parser.image.PSDParser
    image/vnd.adobe.photoshop
  org.apache.tika.parser.image.TiffParser
    image/tiff
  org.apache.tika.parser.image.WebPParser
    image/webp
  org.apache.tika.parser.iptc.IptcAnpaParser
    text/vnd.iptc.anpa
  org.apache.tika.parser.isatab.ISArchiveParser
    application/x-isatab
  org.apache.tika.parser.iwork.IWorkPackageParser
    application/vnd.apple.keynote
    application/vnd.apple.iwork
    application/vnd.apple.numbers
    application/vnd.apple.pages
  org.apache.tika.parser.jdbc.SQLite3Parser
  org.apache.tika.parser.journal.JournalParser
    application/pdf
  org.apache.tika.parser.jpeg.JpegParser
    image/jpeg
  org.apache.tika.parser.mail.RFC822Parser
    message/rfc822
  org.apache.tika.parser.mat.MatParser
    application/x-matlab-data
  org.apache.tika.parser.mbox.MboxParser
    application/mbox
  org.apache.tika.parser.mbox.OutlookPSTParser
    application/vnd.ms-outlook-pst
  org.apache.tika.parser.microsoft.JackcessParser
    application/x-msaccess
  org.apache.tika.parser.microsoft.OfficeParser
    application/x-tika-msoffice-embedded; format=ole10_native
    application/msword
    application/vnd.visio
    application/vnd.ms-project
    application/x-tika-msworks-spreadsheet
    application/x-mspublisher
    application/vnd.ms-powerpoint
    application/x-tika-msoffice
    application/sldworks
    application/x-tika-ooxml-protected
    application/vnd.ms-excel
    application/vnd.ms-outlook
  org.apache.tika.parser.microsoft.OldExcelParser
    application/vnd.ms-excel.workspace.3
    application/vnd.ms-excel.workspace.4
    application/vnd.ms-excel.sheet.2
    application/vnd.ms-excel.sheet.3
    application/vnd.ms-excel.sheet.4
  org.apache.tika.parser.microsoft.TNEFParser
    application/vnd.ms-tnef
    application/x-tnef
    application/ms-tnef
  org.apache.tika.parser.microsoft.ooxml.OOXMLParser
    application/vnd.ms-word.document.macroenabled.12
    application/vnd.ms-excel.addin.macroenabled.12
    application/x-tika-ooxml
    application/vnd.openxmlformats-officedocument.wordprocessingml.template
    application/vnd.ms-powerpoint.addin.macroenabled.12
    application/vnd.openxmlformats-officedocument.spreadsheetml.template
    application/vnd.openxmlformats-officedocument.wordprocessingml.document
    application/vnd.openxmlformats-officedocument.presentationml.template
    application/vnd.ms-powerpoint.slideshow.macroenabled.12
    application/vnd.openxmlformats-officedocument.presentationml.presentation
    application/vnd.ms-powerpoint.presentation.macroenabled.12
    application/vnd.openxmlformats-officedocument.spreadsheetml.sheet
    application/vnd.openxmlformats-officedocument.presentationml.slideshow
    application/vnd.ms-excel.template.macroenabled.12
    application/vnd.ms-excel.sheet.macroenabled.12
    application/vnd.ms-word.template.macroenabled.12
  org.apache.tika.parser.mp3.Mp3Parser
    audio/mpeg
  org.apache.tika.parser.mp4.MP4Parser
    video/x-m4v
    application/mp4
    video/3gpp
    video/3gpp2
    video/quicktime
    audio/mp4
    video/mp4
  org.apache.tika.parser.netcdf.NetCDFParser
    application/x-netcdf
  org.apache.tika.parser.ocr.TesseractOCRParser
  org.apache.tika.parser.odf.OpenDocumentParser
    application/x-vnd.oasis.opendocument.presentation
    application/vnd.oasis.opendocument.chart
    application/x-vnd.oasis.opendocument.text-web
    application/x-vnd.oasis.opendocument.image
    application/vnd.oasis.opendocument.graphics-template
    application/vnd.oasis.opendocument.text-web
    application/x-vnd.oasis.opendocument.spreadsheet-template
    application/vnd.oasis.opendocument.spreadsheet-template
    application/vnd.sun.xml.writer
    application/x-vnd.oasis.opendocument.graphics-template
    application/vnd.oasis.opendocument.graphics
    application/vnd.oasis.opendocument.spreadsheet
    application/x-vnd.oasis.opendocument.chart
    application/x-vnd.oasis.opendocument.spreadsheet
    application/vnd.oasis.opendocument.image
    application/x-vnd.oasis.opendocument.text
    application/x-vnd.oasis.opendocument.text-template
    application/vnd.oasis.opendocument.formula-template
    application/x-vnd.oasis.opendocument.formula
    application/vnd.oasis.opendocument.image-template
    application/x-vnd.oasis.opendocument.image-template
    application/x-vnd.oasis.opendocument.presentation-template
    application/vnd.oasis.opendocument.presentation-template
    application/vnd.oasis.opendocument.text
    application/vnd.oasis.opendocument.text-template
    application/vnd.oasis.opendocument.chart-template
    application/x-vnd.oasis.opendocument.chart-template
    application/x-vnd.oasis.opendocument.formula-template
    application/x-vnd.oasis.opendocument.text-master
    application/vnd.oasis.opendocument.presentation
    application/x-vnd.oasis.opendocument.graphics
    application/vnd.oasis.opendocument.formula
    application/vnd.oasis.opendocument.text-master
  org.apache.tika.parser.pdf.PDFParser
    application/pdf
  org.apache.tika.parser.pkg.CompressorParser
    application/zlib
    application/x-gzip
    application/x-bzip2
    application/x-compress
    application/x-java-pack200
    application/gzip
    application/x-bzip
    application/x-xz
  org.apache.tika.parser.pkg.PackageParser
    application/x-tar
    application/java-archive
    application/x-archive
    application/zip
    application/x-cpio
    application/x-tika-unix-dump
    application/x-7z-compressed
  org.apache.tika.parser.pkg.RarParser
    application/x-rar-compressed
  org.apache.tika.parser.pot.PooledTimeSeriesParser
  org.apache.tika.parser.rtf.RTFParser
    application/rtf
  org.apache.tika.parser.txt.TXTParser
    text/plain
  org.apache.tika.parser.video.FLVParser
    video/x-flv
  org.apache.tika.parser.xml.DcXMLParser
    application/xml
    image/svg+xml
  org.apache.tika.parser.xml.FictionBookParser
    application/x-fictionbook+xml
  org.gagravarr.tika.FlacParser
    audio/x-oggflac
    audio/x-flac
  org.gagravarr.tika.OggParser
    audio/ogg
    application/kate
    application/ogg
    video/daala
    video/x-ogguvs
    video/x-ogm
    audio/x-oggpcm
    video/ogg
    video/x-dirac
    video/x-oggrgb
    video/x-oggyuv
  org.gagravarr.tika.OpusParser
    audio/opus
    audio/ogg; codecs=opus
  org.gagravarr.tika.SpeexParser
    audio/ogg; codecs=speex
    audio/speex
  org.gagravarr.tika.TheoraParser
    video/theora
  org.gagravarr.tika.VorbisParser
    audio/vorbis

The next set of steps will be configuring and testing Tesseract prior to integrating it in Tika.

KevM commented 7 years ago

Thanks for creating this issue and looking into exposing this potentially useful feature of Tika and Tesseract.

LeeBear35 commented 7 years ago

After installing Tesseract I used pbrush to create a test image containing Hello World and saved it to bmp, gif, jpg, png and tif.

As a baseline I ran these files through Tesseract to make sure that Hello World was the text extracted from each file. The GIF file failed because the drive I was running Tesseract on did not have a TMP directory at the root. Tesseract should be using the system temporary directory; this is a bug in the current release.
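A baseline check like this can also be scripted. The following is only a sketch, not what was actually run: it assumes the installed Tesseract accepts `stdout` as the output base on the command line, and the class name `OcrBaseline`, its method names, and the `HelloWorld.*` file names are all made up for illustration.

```csharp
using System;
using System.Diagnostics;
using System.IO;

public static class OcrBaseline
{
    // Builds the arguments for "tesseract <image> stdout", quoting the path
    // so that file names containing spaces survive.
    public static string BuildArguments(string imagePath)
    {
        return "\"" + imagePath + "\" stdout";
    }

    // Runs tesseract.exe on one image and returns the trimmed text it prints.
    public static string Recognize(string tesseractExe, string imagePath)
    {
        var startInfo = new ProcessStartInfo(tesseractExe, BuildArguments(imagePath))
        {
            UseShellExecute = false,        // required for stream redirection
            RedirectStandardOutput = true,
            CreateNoWindow = true
        };

        using (var process = Process.Start(startInfo))
        {
            string text = process.StandardOutput.ReadToEnd();
            process.WaitForExit();
            return text.Trim();
        }
    }

    // Checks one test image per format (hypothetical file names) against the expected text.
    public static void CheckFolder(string tesseractExe, string folder, string expected)
    {
        foreach (string extension in new[] { "bmp", "gif", "jpg", "png", "tif" })
        {
            string image = Path.Combine(folder, "HelloWorld." + extension);
            string actual = Recognize(tesseractExe, image);
            Console.WriteLine("{0}: {1}", extension, actual == expected ? "OK" : "got '" + actual + "'");
        }
    }
}
```

It could be called as `OcrBaseline.CheckFolder(@"E:\Tesseract\tesseract.exe", @"E:\TestImages", "Hello World");`, where the image folder is a placeholder.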

After a couple of false starts I finally was able to get it working correctly. Here are the steps:

  1. Create a TesseractOCRConfig object.
  2. Call setTesseractPath on that object, passing in the installation path for Tesseract.
  3. On the ParseContext object, call set, passing in typeof(TesseractOCRConfig) and the config object.

Parsing then starts using the Tesseract parser.
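In isolation, the three steps come down to something like the sketch below. It uses only the Tika calls that appear in the full class further down; the wrapper class `OcrContextSketch` and its `Create` method are hypothetical names, and the code assumes the IKVM-compiled Tika assemblies that TikaOnDotNet ships are referenced.

```csharp
using org.apache.tika.parser;
using org.apache.tika.parser.ocr;

public static class OcrContextSketch
{
    // Builds a ParseContext that carries the Tesseract settings.
    // tesseractDir is the folder that contains tesseract.exe.
    public static ParseContext Create(string tesseractDir)
    {
        // Step 1: create the OCR configuration object.
        var ocrConfig = new TesseractOCRConfig();

        // Step 2: point it at the Tesseract installation.
        ocrConfig.setTesseractPath(tesseractDir);

        // Step 3: register the configuration on the context, keyed by its type.
        var context = new ParseContext();
        context.set(typeof(TesseractOCRConfig), ocrConfig);
        return context;
    }
}
```

The returned context is then passed as the last argument to the parser's parse call.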

I refactored Kevin's TextExtractor so that it can be called using:

TikaOnDotNet.TextExtractionOCR.TextExtractor textExtractor = new TikaOnDotNet.TextExtractionOCR.TextExtractor();
textExtractor.TesseractPath = @"E:\Tesseract";
TextExtractionResult Actual = textExtractor.Extract(buffer, testFile, mimeType);

Here is the entire class:

using System;
using System.Linq;
using java.io;
using javax.xml.transform;
using javax.xml.transform.sax;
using javax.xml.transform.stream;
using org.apache.tika.io;
using org.apache.tika.metadata;
using org.apache.tika.parser;
using Exception = System.Exception;
using TikaOnDotNet.TextExtraction;
using org.apache.tika.config;
using org.apache.tika.detect;
using org.apache.tika.mime;
using org.apache.tika.parser.ocr;

namespace TikaOnDotNet.TextExtractionOCR
{
  public interface ITextExtractor
  {
    /// <summary>
    /// Extract text from a given filepath.
    /// </summary>
    /// <param name="filePath">File path to be extracted.</param>
    TextExtractionResult Extract(string filePath);

    /// <summary>
    /// Extract text from a byte[]. This is a good way to get data from arbitrary sources.
    /// </summary>
    /// <param name="data">A byte array of data which will have its text extracted.</param>
    TextExtractionResult Extract(byte[] data);

    /// <summary>
    /// Extract text from a byte[]. This is a good way to get data from arbitrary sources.
    /// </summary>
    /// <param name="data">A byte array of data which will have its text extracted.</param>
    /// <param name="filePath">A string containing the file name to help the detector determine the proper parser</param>
    /// <param name="ContentType">A string that has the mime type to help the detector determine the correct parser to use</param>
    TextExtractionResult Extract(byte[] data, string filePath, string ContentType);

    /// <summary>
    /// Extract text from a URI. Time to create your very own web spider.
    /// </summary>
    /// <param name="uri">URL which will have its text extracted.</param>
    TextExtractionResult Extract(Uri uri);

    /// <summary>
    /// Under the hood we are using Tika which is a Java project. Tika wants an java.io.InputStream. The other overloads eventually call this Extract giving this method a Func.
    /// </summary>
    /// <param name="streamFactory">A Func which takes a Metadata object and returns an InputStream.</param>
    /// <returns></returns>
    TextExtractionResult Extract(Func<Metadata, InputStream> streamFactory);
  }

  public class TextExtractor : ITextExtractor
  {
    private static TikaConfig config = TikaConfig.getDefaultConfig();
    private TesseractOCRConfig tesseractOCRConfig;
    private static string tesseractPath = string.Empty;
    public string TesseractPath 
    { 
      get { return tesseractPath; } 
      set 
      { 
        tesseractPath = value;
        tesseractOCRConfig = new TesseractOCRConfig();
        //todo: validate directory and tesseract.exe at location
        tesseractOCRConfig.setTesseractPath(tesseractPath);
      } 
    }
    public bool IsOCRPathEnabled 
    { 
      get { return tesseractOCRConfig != null; } 
      set
      {
        if (value)
        {
          tesseractOCRConfig = new TesseractOCRConfig();
          tesseractOCRConfig.setTesseractPath(tesseractPath);
        }
        else
        {
          tesseractOCRConfig = null;
        }
      }
    }
    public TextExtractionResult Extract(string filePath)
    {
      try
      {
        var inputStream = new FileInputStream(filePath);
        return Extract(metadata =>
        {
          var result = TikaInputStream.get(inputStream);
          metadata.add("FilePath", filePath);
          return result;
        });
      }
      catch (Exception ex)
      {
        throw new TextExtractionException("Extraction of text from the file '{0}' failed.".ToFormat(filePath), ex);
      }
    }
    public TextExtractionResult Extract(byte[] data)
    {
      // Pass octet-stream so the detector runs instead of recording an empty Content-Type.
      return Extract(data, string.Empty, MimeTypes.OCTET_STREAM);
    }

    public TextExtractionResult Extract(byte[] data, string filePath, string ContentType)
    {
      TextExtractionResult result = Extract
        (
          metadata =>
          {
            metadata.add(org.apache.tika.metadata.TikaMetadataKeys.__Fields.RESOURCE_NAME_KEY, System.IO.Path.GetFileName(filePath));
            metadata.add(org.apache.tika.metadata.TikaMimeKeys.__Fields.TIKA_MIME_FILE, filePath);
            try
            {
              if (!ContentType.Equals(org.apache.tika.mime.MimeTypes.OCTET_STREAM, StringComparison.CurrentCultureIgnoreCase))
              {
                metadata.add(org.apache.tika.metadata.HttpHeaders.__Fields.CONTENT_TYPE, ContentType);
              }
              else
              {
                Detector detector = config.getDetector();
                using (org.apache.tika.io.TikaInputStream inputStream = org.apache.tika.io.TikaInputStream.@get(data, metadata))
                {
                  MediaType foundType = detector.detect(inputStream, metadata);
                  if (!foundType.toString().Equals(org.apache.tika.mime.MimeTypes.OCTET_STREAM, StringComparison.CurrentCultureIgnoreCase))
                  {
                    metadata.add(org.apache.tika.metadata.HttpHeaders.__Fields.CONTENT_TYPE, foundType.toString());
                  }
                }
              }
            }
            catch (Exception)
            {
              // Rethrow with "throw;" so the original stack trace is preserved.
              throw;
            }

            return TikaInputStream.get(data, metadata);
          }
        );

      return result;
    }

    public TextExtractionResult Extract(Uri uri)
    {
      var jUri = new java.net.URI(uri.ToString());
      return Extract(metadata =>
      {
        var result = TikaInputStream.get(jUri, metadata);
        metadata.add("Uri", uri.ToString());
        return result;
      });
    }

    public TextExtractionResult Extract(Func<Metadata, InputStream> streamFactory)
    {
      try
      {
        var parser = new AutoDetectParser();
        var metadata = new Metadata();
        var outputWriter = new StringWriter();
        var parseContext = new ParseContext();

        if (IsOCRPathEnabled)
        {
          parseContext.set(typeof(TesseractOCRConfig), tesseractOCRConfig);
        }

        //use the base class type for the key or parts of Tika won't find a usable parser
        parseContext.set(typeof(Parser), parser);

        using (var inputStream = streamFactory(metadata))
        {
          try
          {
            parser.parse(inputStream, getTransformerHandler(outputWriter), metadata, parseContext);
          }
          finally
          {
            inputStream.close();
          }
        }

        return AssembleExtractionResult(outputWriter.ToString(), metadata);
      }
      catch (Exception ex)
      {
        throw new TextExtractionException("Extraction failed.", ex);
      }
    }

    private static TextExtractionResult AssembleExtractionResult(string text, Metadata metadata)
    {
      var metaDataResult = metadata.names()
        .ToDictionary(name => name, name => string.Join(", ", metadata.getValues(name)));

      var contentType = metaDataResult["Content-Type"];

      return new TextExtractionResult
      {
        Text = text,
        ContentType = contentType,
        Metadata = metaDataResult
      };
    }

    private static TransformerHandler getTransformerHandler(Writer output)
    {
      var factory = (SAXTransformerFactory)TransformerFactory.newInstance();
      var transformerHandler = factory.newTransformerHandler();

      transformerHandler.getTransformer().setOutputProperty(OutputKeys.METHOD, "text");
      transformerHandler.getTransformer().setOutputProperty(OutputKeys.INDENT, "yes");

      transformerHandler.setResult(new StreamResult(output));
      return transformerHandler;
    }
    public static string TikaConfigDump()
    {
      System.Text.StringBuilder retVal = new System.Text.StringBuilder();

      retVal.AppendFormat("{0}\t{1}\n", "Version", (new org.apache.tika.Tika(config)).toString());

      retVal.AppendLine("\nDetectors");

      CompositeDetector configDetector = (CompositeDetector)config.getDetector();
      var detectors = configDetector.getDetectors().toArray();
      foreach (Detector detector in detectors)
      {
        retVal.AppendFormat("\t{0}\n", ((java.lang.Object)detector).getClass().getName());

        // List the children of a nested composite, not the top-level list again.
        var composite = detector as CompositeDetector;
        if (composite != null)
        {
          var subDetectors = composite.getDetectors().toArray();
          foreach (Detector subDetector in subDetectors)
          {
            retVal.AppendFormat("\t\t{0}\n", ((java.lang.Object)subDetector).getClass().getName());
          }
        }
      }

      retVal.AppendLine("\nParsers");

      CompositeParser configParser = (CompositeParser)config.getParser();
      var parsers = configParser.getAllComponentParsers().toArray();
      foreach (Parser parser in parsers)
      {
        retVal.AppendFormat("\t{0}\n", ((java.lang.Object)parser).getClass().getName());

        var parserTypes = parser.getSupportedTypes(new ParseContext()).toArray();
        foreach (MediaType mediaType in parserTypes)
        {
          retVal.AppendFormat("\t\t{0}\n", mediaType.toString());
        }
      }

      org.apache.tika.language.translate.Translator translator = config.getTranslator();
      if (translator.isAvailable())
      {
        retVal.AppendFormat("Translator {0}\n", ((java.lang.Object)translator).getClass().getName());
      }

      retVal.AppendFormat("\nFallback Parser: {0}\n", configParser.getFallback());

      return retVal.ToString();
    }
  }
}

KevM commented 7 years ago

Would you like to submit a PR with this and I can work with you to get this capability into the text extractor?

LeeBear35 commented 7 years ago

I would be happy to.

KevM commented 7 years ago

I'd like to discuss this feature addition a bit. @Sicos1997 was nice enough to roll this feature into PR #72 creating a separate ITextExtractor implementation which works with Tesseract to OCR images and optionally PDFs.

Unfortunately it looks like the Tika integration with Tesseract requires an executable (not a library) to be installed. Here are the [Windows instructions](https://github.com/tesseract-ocr/tesseract/wiki#windows).

An unofficial installer for windows for Tesseract 3.05-dev is available from Tesseract at UB Mannheim. This includes the training tools. An installer for the old version 3.02 is available for Windows from our download page. This includes the English training data. If you want to use another language, download the appropriate training data, unpack it using 7-zip, and copy the .traineddata file into the 'tessdata' directory, probably C:\Program Files\Tesseract OCR\tessdata.

I see a few problems just getting Tesseract installed:

None of this is turn key. So how do we test it? Here is a possible plan...

Add a Tesseract TextExtractor

Testing Concerns

I am not sure how to have our AppVeyor CI test the Tesseract integration. There is no Chocolatey package.

The biggest hurdle I see to having support for this feature is:

The main reason I don't want to move forward is I don't want to manually test this feature. So, until we can automate it I don't want to add it. If someone who is using Tika + Tesseract now via .Net were to step up and help out with the automation I would be happy to work with you on it.

Another option

If someone really wants this feature but is not willing to do the automation required, we could start a new NuGet package and let someone own the manual testing it would require. I am also happy to facilitate that direction. That said, it seems like a Chocolatey package is an equivalent route.

LeeBear35 commented 7 years ago

Kevin, I was able to get Tesseract working with TikaOnDotNet. It really is not hard at all. The Windows installer basically extracts the files to a folder, and then you just have to tell Tika where Tesseract is installed. For testing, you can open Paint Brush, type in Hello World, save it in the various image formats, and then run the files through to ensure basic Tesseract OCR is working. Here are the call that I make and the class extension that I implemented to set the path. Note: the harder problem I had was turning Tesseract off, because it sets environment variables and they are used even when the application has not set the path. The quickest way to turn off Tesseract OCR was to rename the Tesseract folder so Tika could not find it. Hope this helps.

  // Load path for tesseract from PRGX.MT.Tika.dll.config
  TikaOnDotNet.TextExtractionOCR.TextExtractor.EnableAppSettingsTesseractPath(TesseractInstallPath);

Class extension to the Tika on Dot Net code

public class TextExtractor : ITextExtractor
{
  private static TikaConfig config = TikaConfig.getDefaultConfig();
  private static TesseractOCRConfig tesseractOCRConfig = null;
  private static string tesseractPath = string.Empty;

  // Validates and stores the path, then builds (or clears) the OCR configuration.
  public static void EnableAppSettingsTesseractPath(string TesseractInstallPath)
  {
    TesseractPath = TesseractInstallPath;
    if (string.IsNullOrWhiteSpace(tesseractPath))
    {
      tesseractOCRConfig = null;
      tesseractPath = string.Empty;
    }
    else
    {
      tesseractOCRConfig = new TesseractOCRConfig();
      tesseractOCRConfig.setTesseractPath(tesseractPath);
      tesseractOCRConfig.setTimeout(240);
    }
  }

  public static string TesseractPath
  {
    get { return tesseractPath; }
    set
    {
      if (!string.IsNullOrEmpty(value))
      {
        if (!Directory.Exists(value))
        {
          throw new DirectoryNotFoundException(string.Format("Tesseract Directory not found: {0}", value));
        }
        if (!System.IO.File.Exists(Path.Combine(value, "tesseract.exe")))
        {
          // Report the path being validated (value), not the previously stored one.
          throw new System.IO.FileNotFoundException(string.Format("Could not find tesseract.exe at {0}", value));
        }
        tesseractPath = value;
      }
      else
      {
        tesseractPath = string.Empty;
        tesseractOCRConfig = null;
      }
    }
  }

  public static bool IsOCRPathEnabled
  {
    get { return tesseractOCRConfig != null; }
    set
    {
      if (value)
      {
        tesseractOCRConfig = new TesseractOCRConfig();
        tesseractOCRConfig.setTesseractPath(tesseractPath);
      }
      else
      {
        tesseractOCRConfig = null;
      }
    }
  }
}

KevM commented 7 years ago

Thanks, it is useful to see how you got it working.

delagoutte-wanao commented 5 years ago

Hello, I tried LeeBear35's code and it only works with Tesseract 3.05, not with version 4. Has anyone been able to make TikaOnDotNet work with Tesseract 4? Do you think the problem could be the version of Tika that is deployed with TikaOnDotNet?