Feeding Tika with a stream?

KevM / tikaondotnet

Use the Java Tika text extraction library on the .NET platform

http://kevm.github.io/tikaondotnet/

Apache License 2.0

Feeding Tika with a stream? #65

bjorn-ali-goransson closed this issue 7 years ago

bjorn-ali-goransson commented 7 years ago

Hello again.

Is there much to gain by using this method, instead of feeding Tika a simple byte array?

public TextExtractionResult Extract(Func<Metadata, InputStream> streamFactory);

If so, how do we use it?

Right now, my code looks like this:

    public List<string> ExtractText(Stream inputStream)
    {
        using (var memoryStream = new MemoryStream())
        {
            inputStream.CopyTo(memoryStream);

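            // ToArray() copies only the bytes actually written; GetBuffer() returns the
            // whole internal buffer, including any unused trailing capacity.
            var result = Tika.Extract(memoryStream.ToArray());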

            var str = result.Text
                .Replace("\r", string.Empty)
                .Replace("§  ", string.Empty)
                .Split(new string[] { "\n\n\n\n" }, StringSplitOptions.RemoveEmptyEntries)
                .Select(t => t.Replace("\n", " ").Replace("    ", " ").Replace("   ", " ").Replace("  ", " ").Trim())
                .ToList();

            return str;
        }
    }

KevM commented 7 years ago

@bjorn-ali-goransson are you trying to extract from Tika directly to an output stream to avoid memory usage? The reason I've used memory in the past is that we were turning around and feeding the text to Lucene. I could see a use case where you stream it directly up to Solr or to a file on disk.

bjorn-ali-goransson commented 7 years ago

For example, when using Azure Blobs, all you get is a stream, so storing it in an array first just to feed it into Tika would be taking the long way around.
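
For the stream-only case described above, a minimal sketch of calling the `Extract(Func<Metadata, InputStream> streamFactory)` overload might look like the following. It assumes the `ikvm.io.InputStreamWrapper` class from the IKVM runtime is available to adapt a .NET `Stream` to a `java.io.InputStream`, and that the extractor type is `TikaOnDotNet.TextExtraction.TextExtractor`. The `StreamExtraction` class and `ExtractFromStream` method are hypothetical names used only for illustration.

```csharp
using System.IO;
using TikaOnDotNet.TextExtraction;

public static class StreamExtraction
{
    // Hypothetical helper: hands Tika the stream directly instead of
    // copying it into a byte array first.
    public static TextExtractionResult ExtractFromStream(TextExtractor extractor, Stream input)
    {
        // Assumption: ikvm.io.InputStreamWrapper adapts a System.IO.Stream
        // to java.io.InputStream. The metadata argument is ignored here.
        return extractor.Extract(metadata => new ikvm.io.InputStreamWrapper(input));
    }
}
```

With something like this, a blob stream could be passed straight through, for example `StreamExtraction.ExtractFromStream(new TextExtractor(), blobStream)`, where `blobStream` is whatever stream the storage client returns. The caller should probably keep the stream open until `Extract` returns.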

KevM commented 7 years ago

Oh good. Thanks for the use case. I think #74 might solve your concern. Please do take a look. I could use feedback to see if this addresses your concern.

KevM commented 7 years ago

All done and merged.