KevM / tikaondotnet

Use the Java Tika text extraction library on the .NET platform
http://kevm.github.io/tikaondotnet/
Apache License 2.0

Extract only retrieves name of PDF? #140

Open clemenslinders opened 4 years ago

clemenslinders commented 4 years ago

Hello,

I am using C# in Visual Studio, and I extract text with: SearchableText = new TextExtractor().Extract(MyActivePdf).Text;

The variable SearchableText is usually filled with the text content of the PDF.

However, for some PDFs the variable SearchableText contains only a lot of spaces, the name of the PDF, and then a lot of spaces again. When I open the same PDF in a browser I can select any part of the text, so Extract should be able to extract this text for me.

My guess is that a PDF whose text is selectable in a browser (and therefore extractable) contains multiple data streams: some hold the text placed at specific locations in the PDF, and perhaps one holds the file name.

Many programs can create PDFs, and possibly the program that created these PDFs mixes something up, so that TikaOnDotNet hooks into the wrong stream.

Is there a way to connect to the correct stream of data so that I can also retrieve the text in this case?

PS: I do not get an error. I only get the name of the PDF surrounded by a lot of spaces, not the text.
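
Because no exception is thrown, the affected files have to be recognised from the result itself. A minimal sketch of such a check is below; ExtractionCheck and HasNoBodyText are hypothetical names for illustration and are not part of TikaOnDotNet. It assumes the extracted text, once trimmed, is either empty or exactly the file name, as described above.

```csharp
using System;
using System.IO;

// Hypothetical helper for illustration; not part of TikaOnDotNet.
public static class ExtractionCheck
{
    // Returns true when the extracted text, ignoring surrounding whitespace,
    // is empty or is nothing more than the PDF's file name
    // (with or without its extension).
    public static bool HasNoBodyText(string extractedText, string pdfPath)
    {
        string trimmed = (extractedText ?? string.Empty).Trim();
        if (trimmed.Length == 0)
        {
            return true;
        }

        string fileName = Path.GetFileName(pdfPath);
        string baseName = Path.GetFileNameWithoutExtension(pdfPath);

        return string.Equals(trimmed, fileName, StringComparison.OrdinalIgnoreCase)
            || string.Equals(trimmed, baseName, StringComparison.OrdinalIgnoreCase);
    }
}
```

Assuming MyActivePdf holds the path of the PDF, calling ExtractionCheck.HasNoBodyText(SearchableText, MyActivePdf) right after the Extract call would flag the problem files so they can be logged or handled separately.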

Kind regards,

Clemens Linders