Open pirhoo opened 1 year ago
The result of extraction is stored in the TikaDocumentSource class:
public class TikaDocumentSource {
    public final Metadata metadata;
    public final byte[] content;

    public TikaDocumentSource(final Metadata metadata, final byte[] content) {
        this.metadata = metadata;
        this.content = content;
    }
}
It is used in the SourceExtractor class.
The content should not be stored as a byte array but as an InputStream (or a subclass).
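A minimal sketch of what a stream-based holder could look like. Everything here is hypothetical: the class name StreamingDocumentSource, the Supplier-based constructor, and the Map used as a stand-in for Tika's Metadata (so the sketch has no dependencies) are assumptions, not the project's actual code.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.util.Map;
import java.util.function.Supplier;

// Hypothetical variant of TikaDocumentSource: it holds a way to open the
// content rather than the content itself.
final class StreamingDocumentSource {
    // Stand-in for Tika's Metadata, used only to keep the sketch self-contained.
    final Map<String, String> metadata;
    private final Supplier<InputStream> contentSupplier;

    StreamingDocumentSource(Map<String, String> metadata,
                            Supplier<InputStream> contentSupplier) {
        this.metadata = metadata;
        this.contentSupplier = contentSupplier;
    }

    // Each call asks the supplier for a stream; the caller must close it.
    InputStream openContent() {
        return contentSupplier.get();
    }

    // Convenience for callers that still need the whole content as a byte array.
    byte[] readAllContent() {
        try (InputStream in = openContent()) {
            return in.readAllBytes();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Taking a supplier rather than a single InputStream lets the content be read more than once, provided the supplier opens a fresh stream on each call.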
There is no solution for the disk issue for the moment.
When an embedded file content is requested by a user, instead of extracting content in memory, we will generate a file on disk. It will create a cache that will be used by datashare to stream the content of embedded files.
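The idea can be sketched roughly as follows, assuming a cache directory is available. The ArtifactCache class, its open method, and the use of a document id as the file name are all hypothetical and only illustrate the extract-once, stream-afterwards behaviour; they are not datashare's implementation.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.function.Supplier;

// Hypothetical disk cache for embedded file content.
final class ArtifactCache {
    private final Path artifactDir;

    ArtifactCache(Path artifactDir) {
        this.artifactDir = artifactDir;
    }

    // Returns a stream over the cached file, extracting to disk first on a miss.
    // `id` is assumed to be a safe file name (for example a document hash).
    InputStream open(String id, Supplier<InputStream> extractor) {
        Path cached = artifactDir.resolve(id);
        try {
            if (!Files.exists(cached)) {
                Files.createDirectories(artifactDir);
                // Write to a temporary file first, so that a failed extraction
                // does not leave a partial artifact under the final name.
                Path tmp = Files.createTempFile(artifactDir, "artifact-", ".tmp");
                try (InputStream in = extractor.get()) {
                    Files.copy(in, tmp, StandardCopyOption.REPLACE_EXISTING);
                }
                Files.move(tmp, cached, StandardCopyOption.REPLACE_EXISTING);
            }
            // The content is streamed from disk instead of being held in memory.
            return Files.newInputStream(cached);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

With this shape, the extractor only runs the first time a given id is requested; later requests read the file already on disk.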
We will add:
an artifact-dir option, where the cache will be stored
raw and document metadata with the following structure:
For example:
If the artifactDir option is not provided, datashare will use memory as before.
Part of #1397