KevM / tikaondotnet

Use the Java Tika text extraction library on the .NET platform
http://kevm.github.io/tikaondotnet/
Apache License 2.0

Intercept extraction results #116

Open cas4 opened 6 years ago

cas4 commented 6 years ago

I'm revisiting a project that used an old 2010 version of TikaOnDotNet (see http://clarify.dovetailsoftware.com/kmiller/2010/07/02/using-the-tika-java-library-in-your-net-application-with-ikvm/). I'd like to migrate the code to the latest NuGet packages, but a lot has changed. I originally modified the code to exclude the header and footer text of a Word document: the TextExtractor class instantiated a custom BodyContentHandler class, which in turn inherits from org.apache.tika.sax.BodyContentHandler (see Stack Overflow Question).
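
For illustration, a handler of that kind might look roughly like the sketch below. It is not the original code. To stay self-contained it extends the JDK's `org.xml.sax.helpers.DefaultHandler` rather than Tika's `BodyContentHandler`, and the class name `HeaderFooterSkippingHandler` is hypothetical. It also assumes that the parser reports Word headers and footers as `div` elements with `class="header"` and `class="footer"`, which should be checked against the XHTML that Tika actually emits for the documents in question.

```java
import java.io.StringReader;
import java.util.Set;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler: collects character data, but drops everything
// inside <div class="header"> and <div class="footer"> (assumed markup).
class HeaderFooterSkippingHandler extends DefaultHandler {
    private static final Set<String> SKIPPED_CLASSES = Set.of("header", "footer");

    private final StringBuilder text = new StringBuilder();
    // Greater than zero while we are inside a skipped element; counts nesting.
    private int skipDepth = 0;

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        if (skipDepth > 0) {
            skipDepth++; // nested element inside a skipped region
            return;
        }
        // localName is empty when the parser is not namespace-aware.
        String name = localName.isEmpty() ? qName : localName;
        String cls = atts.getValue("class");
        if ("div".equals(name) && cls != null && SKIPPED_CLASSES.contains(cls)) {
            skipDepth = 1; // entering a header or footer
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if (skipDepth > 0) {
            skipDepth--;
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        if (skipDepth == 0) {
            text.append(ch, start, length);
        }
    }

    public String getText() {
        return text.toString().trim();
    }

    // Convenience helper for trying the handler on an XHTML string.
    static String extract(String xhtml) throws Exception {
        HeaderFooterSkippingHandler handler = new HeaderFooterSkippingHandler();
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xhtml)), handler);
        return handler.getText();
    }

    public static void main(String[] args) throws Exception {
        String xhtml = "<html><body>"
                + "<div class=\"header\">Page header</div>"
                + "<p>Body text</p>"
                + "<div class=\"footer\">Page footer</div>"
                + "</body></html>";
        System.out.println(extract(xhtml)); // prints "Body text"
    }
}
```

The same filtering logic could presumably live in a subclass of `org.apache.tika.sax.BodyContentHandler` that calls the `super` methods only while `skipDepth` is zero. Whether the current TikaOnDotNet `TextExtractor` lets a caller pass in such a handler is exactly the open question here.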

Is it possible to use a non-default content handler with the updated TikaOnDotNet code?

Thanks in advance.