KevM / tikaondotnet

Use the Java Tika text extraction library on the .NET platform
http://kevm.github.io/tikaondotnet/
Apache License 2.0
197 stars 72 forks

Unable to get the Content of HTML email component stored in Blob #92

Closed abhi004 closed 7 years ago

abhi004 commented 7 years ago

Hi, I am writing an HTML-formatted email to an Azure blob and trying to read the blob content by passing its URI, but the TikaOnDotNet extractor returns nothing.

KevM commented 7 years ago

I confirmed this and am working on a fix. Not sure why the native Tika mechanism is working weirdly.

It is easy to do this yourself. Just download the bytes of the web page and call the Extract(byte[]) overload.
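
The workaround described above could look roughly like the sketch below. It is an illustration rather than code from the project: the class and method names (BlobTextExample, ExtractFromUriAsync) are made up for this example, and it assumes the URI is readable without extra authentication (a private blob would need a SAS token or the storage SDK instead). TextExtractor and its Extract(byte[]) overload come from the TikaOnDotNet.TextExtraction package.

```csharp
using System;
using System.Net.Http;
using System.Threading.Tasks;
using TikaOnDotNet.TextExtraction;

public static class BlobTextExample
{
    // Hypothetical helper: downloads the raw bytes at `uri` and passes them
    // to Tika directly, instead of letting Tika fetch the URI itself.
    public static async Task<string> ExtractFromUriAsync(Uri uri)
    {
        using (var http = new HttpClient())
        {
            // Assumes the resource is reachable anonymously over HTTP(S).
            byte[] bytes = await http.GetByteArrayAsync(uri);

            // Extract(byte[]) returns a result whose Text property holds the extracted text.
            var result = new TextExtractor().Extract(bytes);
            return result.Text;
        }
    }
}
```

Calling `await BlobTextExample.ExtractFromUriAsync(new Uri(blobUrl))` (with `blobUrl` being whatever address the content lives at) would then return the extracted text, or an empty string if Tika produces none.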

abhi004 commented 7 years ago

Thanks Kev, I have tried that as well, reading the blob into a stream first, but it shows the same exception:

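    byte[] bytes;
    using (var blobStream = blob.OpenRead())
    using (var memstream = new MemoryStream())
    {
        // Copy the blob contents into memory before handing them to Tika
        blobStream.CopyTo(memstream);
        bytes = memstream.ToArray();
    }
    var textExtractor = new TextExtractor();
    var contents = textExtractor.Extract(bytes);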

Another issue: when you use the NuGet package to import TikaOnDotNet in Azure Functions, it complains about an assembly mismatch. I posted the exception in an earlier thread.

KevM commented 7 years ago

Are you trying to extract text from the Azure blob, or to store the results of a web URL (HTML) in a blob?

I would double check that the bytes you are putting into Tika are what you are expecting. More details on what you are attempting would be helpful.
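
One quick way to run the check suggested above is to print the size and the first part of the payload before calling Extract. The sketch below is only an illustration: BytePreview and Preview are hypothetical names, and it assumes the content is UTF-8 text, which is a guess for an HTML email.

```csharp
using System;
using System.Text;

public static class BytePreview
{
    // Decodes at most `maxBytes` leading bytes as UTF-8 so the payload can be
    // inspected by eye. A cut in the middle of a multi-byte character will
    // show up as a replacement character at the end of the preview.
    public static string Preview(byte[] bytes, int maxBytes = 200)
    {
        if (bytes == null || bytes.Length == 0) return "(empty)";
        int count = Math.Min(bytes.Length, maxBytes);
        return Encoding.UTF8.GetString(bytes, 0, count);
    }
}
```

For example, `Console.WriteLine($"{bytes.Length} bytes: {BytePreview.Preview(bytes)}");` shows at a glance whether the array is empty or holds something other than the expected HTML, such as an error page.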

KevM commented 7 years ago

Ah, I see the same output when I manually download the page. It seems that our ContentHandler events are not getting all the contents of the page.

KevM commented 7 years ago

Duplicate of #84. Moving the conversation about this problem there.