Closed abhi004 closed 7 years ago
I confirmed this and am working on a fix. Not sure why the native Tika mechanism is working weirdly.
It is easy to do this yourself. Just download the bytes of the web page and call the Extract(byte[]) overload. On Wed, Mar 29, 2017 at 11:45 PM abhi004 notifications@github.com wrote:
Hi, I am writing an HTML-formatted email to an Azure blob and trying to read the blob content by passing the URI, but the TikaOnDotNet extractor returns nothing.
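For anyone following along, the suggestion above (download the bytes, then call the Extract(byte[]) overload) might look roughly like the sketch below. The URL is a hypothetical placeholder, and using HttpClient for the download is my assumption rather than anything the library requires. It only takes Tika's own URI handling out of the picture; it does not guarantee that the extracted text will be complete.

```csharp
using System;
using System.Net.Http;
using TikaOnDotNet.TextExtraction;

class Program
{
    static void Main()
    {
        // Hypothetical URL: replace with the page or blob URI you want to read.
        var uri = new Uri("https://example.com/page.html");

        byte[] bytes;
        using (var client = new HttpClient())
        {
            // Download the raw bytes ourselves instead of handing the URI to Tika.
            bytes = client.GetByteArrayAsync(uri).GetAwaiter().GetResult();
        }

        // Pass the downloaded bytes to the byte[] overload.
        var extractor = new TextExtractor();
        var result = extractor.Extract(bytes);

        Console.WriteLine(result.ContentType);
        Console.WriteLine(result.Text);
    }
}
```

If the text comes back empty here as well, the download step is probably not the cause.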
Thanks Kev, I have tried that as well, reading the blob into a stream first, but it shows the same exception:
byte[] bytes;
using (var blobStream = blob.OpenRead())
using (var memstream = new MemoryStream()) {
    blobStream.CopyTo(memstream); // buffer the blob contents in memory
    bytes = memstream.ToArray();
}
var textExtractor = new TextExtractor();
var contents = textExtractor.Extract(bytes);
Another issue: when you use the NuGet package to import TikaOnDotNet in Azure Functions, it complains about an assembly mismatch. I posted the exception in an earlier thread.
Are you trying to extract text from the Azure blob, or to store the results of a web URL (HTML) in a blob?
I would double check that the bytes you are putting into Tika are what you are expecting. More details on what you are attempting would be helpful.
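One quick way to do that check is to print the length and the first few hundred characters of the byte array before handing it to the extractor. This is a minimal sketch: `ByteCheck.Preview` is a hypothetical helper, and decoding as UTF-8 is an assumption about how the email was written to the blob. The sample bytes in `Main` stand in for the real blob contents.

```csharp
using System;
using System.Text;

static class ByteCheck
{
    // Hypothetical helper: decodes the payload as UTF-8 (an assumption)
    // and returns at most maxChars characters for eyeballing.
    public static string Preview(byte[] bytes, int maxChars = 200)
    {
        var text = Encoding.UTF8.GetString(bytes);
        return text.Length <= maxChars ? text : text.Substring(0, maxChars);
    }
}

class Program
{
    static void Main()
    {
        // Stand-in for the bytes read from the blob.
        var bytes = Encoding.UTF8.GetBytes("<html><body><p>Hello</p></body></html>");

        Console.WriteLine($"Length: {bytes.Length}");
        Console.WriteLine(ByteCheck.Preview(bytes));
    }
}
```

If the preview is empty or does not look like the HTML you wrote, the problem is upstream of Tika.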
Ah, I see the same output when I manually download the page. It seems that our ContentHandler events are not getting all the contents of the page.
Duplicate of #84. Moving the conversation about this problem there.