KevM / tikaondotnet

Use the Java Tika text extraction library on the .NET platform
http://kevm.github.io/tikaondotnet/
Apache License 2.0

Sample code for extracting text from files #90

Closed waleedraza786 closed 7 years ago

waleedraza786 commented 7 years ago

Hello, I am a beginner with TikaOnDotNet. Could you please provide simple example code for extracting text from images and MS Office files? I successfully extracted text from a PDF file, but the text inside the images embedded in that PDF was not extracted. When I extract from a single image by converting it to a byte array and passing the array to the Extract() method, all the metadata is extracted, but not the text inside the image. When I tried MS Office files, I got an exception saying that text cannot be extracted from the file. This is how I extracted:

    // Following code is for an image
    Image img = Image.FromFile(@"E:\solr-6.4.1\upload\abc.jpeg");
    byte[] arr;
    using (MemoryStream ms = new MemoryStream())
    {
        img.Save(ms, ImageFormat.Jpeg);
        arr = ms.ToArray();
    }
    var extractionResult = new TextExtractor().Extract(arr);
    Response.Write(extractionResult);

    // Following code is for PDF and other files
    var extractionResult = new TextExtractor().Extract("path of a particular file");
    Response.Write(extractionResult);

KevM commented 7 years ago

I believe that to get text from images you need to use Tesseract. Check out issue #62.
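For files that do not need OCR (PDF text layers, MS Office documents), a minimal sketch of plain text extraction might look like the following. It assumes the TikaOnDotNet.TextExtractor NuGet package is installed and that the result object exposes `Text`, `ContentType`, and `Metadata` as recalled from the project README; verify those names against your installed version. The `ExtractTextExample` class and the command-line path argument are hypothetical and used only for illustration.

```csharp
using System;
using TikaOnDotNet.TextExtraction;

class ExtractTextExample
{
    static void Main(string[] args)
    {
        // Hypothetical usage: pass the path of the file to extract as the only argument.
        if (args.Length != 1)
        {
            Console.Error.WriteLine("usage: ExtractTextExample <file>");
            return;
        }

        var extractor = new TextExtractor();
        try
        {
            // Passing the path lets Tika detect the file type itself,
            // so no manual conversion to a byte array is needed.
            var result = extractor.Extract(args[0]);

            // Assumed property names (Text, ContentType, Metadata); check your version.
            Console.WriteLine("Content type: " + result.ContentType);
            Console.WriteLine(result.Text);

            foreach (var entry in result.Metadata)
            {
                Console.WriteLine(entry.Key + " = " + entry.Value);
            }
        }
        catch (Exception ex)
        {
            // Extraction failures surface as exceptions; the inner exception
            // usually carries the underlying Tika error.
            Console.Error.WriteLine("Extraction failed: " + ex.Message);
            if (ex.InnerException != null)
            {
                Console.Error.WriteLine("Cause: " + ex.InnerException.Message);
            }
        }
    }
}
```

Printing the inner exception can help narrow down why a particular Office file fails. This sketch does not perform OCR, so text that exists only as pixels in an image will still come back empty.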

KevM commented 7 years ago

I am a little confused by this PR. Not sure how my automation branch got linked to it. Do check out #62 if you want to do OCR on text in images.