KevM / tikaondotnet

Use the Java Tika text extraction library on the .NET platform
http://kevm.github.io/tikaondotnet/
Apache License 2.0
195 stars 73 forks

System.NullReferenceException at parser.parse in StreamTextExtractor.cs #150

Open johnwnowlin opened 2 years ago

johnwnowlin commented 2 years ago

Tika crashes on a PDF (it contains confidential information, so sorry, I can't post it). The crash happens at line 30 of StreamTextExtractor.cs while attempting to extract text from the PDF.

var textExtractor = new TextExtractor();
var extraction = textExtractor.Extract(@"filename");

Exception details:

System.NullReferenceException
  HResult=0x80004003
  Message=Object reference not set to an instance of an object.
  Source=TikaOnDotNet
  StackTrace:
   at org.apache.jempbox.impl.XMLUtil.getStringValue(Element node)

Oddly, even though this code is in a try/finally block, it throws an exception. If I could catch the exception, we could just ignore this file and keep going.

using (var inputStream = streamFactory(metadata))
{
    try
    {
        parser.parse(inputStream, handler, metadata, parseContext);
    }
    finally
    {
        inputStream.close();
    }
}

I can open the file in Adobe. I have also saved it as a new PDF, which fails as well.

Is it possible to catch this error so the code can keep going?
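For context on the question above: a try/finally block only guarantees cleanup and does not stop the exception from propagating, so any catch has to be written by the caller. Below is a minimal sketch of that idea, not a confirmed fix for this issue. `SafeExtraction` and `ExtractAll` are hypothetical names invented for illustration, and the extractor is passed in as a delegate so the sketch does not depend on any particular TikaOnDotNet signature.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper (not part of TikaOnDotNet): runs an extraction
// delegate over many files and records failures instead of aborting.
public static class SafeExtraction
{
    public static Dictionary<string, string> ExtractAll(
        IEnumerable<string> paths,
        Func<string, string> extract,
        List<string> failed)
    {
        var results = new Dictionary<string, string>();
        foreach (var path in paths)
        {
            try
            {
                // Any exception thrown by the delegate is caught below.
                results[path] = extract(path);
            }
            catch (Exception ex)
            {
                // Remember the file that failed and move on to the next one.
                failed.Add(path);
                Console.Error.WriteLine(
                    $"Skipping {path}: {ex.GetType().Name}: {ex.Message}");
            }
        }
        return results;
    }
}
```

Assuming the extractor from the snippet earlier in this comment exposes the extracted text as a `Text` property on its result, a call might look like `SafeExtraction.ExtractAll(files, p => new TextExtractor().Extract(p).Text, failed)`. Whether the exception in this report can actually be caught this way is not established in this thread.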

johnwnowlin commented 2 years ago

The file causing the error came from a Konica copier and appears to be a TIFF wrapped in a PDF. I suspect this error is related to issues #145 and #142, but only because Tika needs to extract information from a TIFF. I do not see how to add the optional dependencies to the .NET build to check whether that is the problem. Does anybody know how that is done?

KevM commented 2 years ago

It would be really nice to get an example that crashes so we could try to correct this issue in future releases.