bjorn-ali-goransson closed 7 years ago
@bjorn-ali-goransson are you trying to extract from Tika directly to an output stream to avoid memory usage? The reason I've used memory in the past is that we were turning around and feeding the text to Lucene. I could see a usage where you stream it directly up to Solr or to a file on disk.
For example, when using Azure Blobs, all you get is a stream, so storing it in an array first just to feed it into Tika would be taking the long way around.
Oh good. Thanks for the use case. I think #74 might solve your concern. Please do take a look. I could use feedback to see if this addresses your concern.
All done and merged.
Hello again.
Is there much to gain by using this method, instead of feeding Tika a simple byte array?
public TextExtractionResult Extract(Func<Metadata, InputStream> streamFactory);
If so, how do we use it?
Right now, my code looks like this: