I am using Visual Studio C# and when I use:
SearchableText = new TextExtractor().Extract(MyActivePdf).Text;
The variable SearchableText is usually filled with the text content of the PDF.
However, for some PDFs SearchableText contains only a lot of spaces, the name of the PDF, and then a lot of spaces again. When I open the same PDF in a browser I can select any part of the text, so Extract should be able to extract this text for me.
My guess is that a PDF whose text is selectable in a browser (and therefore extractable) has multiple data streams: some contain text placed at specific locations in the PDF, and perhaps one contains the file name.
There are many programs that create PDFs, and possibly the one that created these PDFs mixes something up so that TikaOnDotNet hooks into the wrong stream.
Is there a way to connect to the correct stream of data so that I can also retrieve the text in this case?
PS: I do not get an error; I simply get the name of the PDF surrounded by a lot of spaces instead of the text.
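A minimal sketch of how the symptom described above could be detected in code, so that affected PDFs can be listed and compared. It assumes that MyActivePdf holds the path of the PDF file; ExtractionCheck and LooksLikeFileNameOnly are hypothetical names used for illustration and are not part of TikaOnDotNet.

```csharp
using System;
using System.IO;

static class ExtractionCheck
{
    // Hypothetical helper: returns true when the extracted text, after
    // trimming surrounding whitespace, is empty or is nothing but the
    // PDF's file name (with or without its extension).
    public static bool LooksLikeFileNameOnly(string extractedText, string pdfPath)
    {
        string trimmed = (extractedText ?? string.Empty).Trim();
        if (trimmed.Length == 0)
        {
            return true;
        }

        string fileName = Path.GetFileName(pdfPath);
        string baseName = Path.GetFileNameWithoutExtension(pdfPath);

        return trimmed.Equals(fileName, StringComparison.OrdinalIgnoreCase)
            || trimmed.Equals(baseName, StringComparison.OrdinalIgnoreCase);
    }
}
```

It could be called right after the extraction, for example as ExtractionCheck.LooksLikeFileNameOnly(SearchableText, MyActivePdf), to flag the files for which no real text came back.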
Kind regards,
Clemens Linders