ICIJ / datashare

A self-hosted search engine for documents.
https://datashare.icij.org
GNU Affero General Public License v3.0

bug: ensure tika extracts embedded doc with a stream #1165

Open pirhoo opened 1 year ago

pirhoo commented 1 year ago

When a user requests the content of an embedded file, datashare will write it to a file on disk instead of extracting it in memory. These files form a cache that datashare will use to stream the content of embedded files.

We will add:

    /artifact-dir/<id_first_two_chars>/<id_following_two_chars>/<doc_id>/raw <--- the file binary content
    /artifact-dir/<id_first_two_chars>/<id_following_two_chars>/<doc_id>/raw.json <--- the file metadata
    # and maybe later it could be extended with
    /artifact-dir/<id_first_two_chars>/<id_following_two_chars>/<doc_id>/text/1.png <-- first page text
    /artifact-dir/<id_first_two_chars>/<id_following_two_chars>/<doc_id>/text/2.png <-- second page text
    /artifact-dir/<id_first_two_chars>/<id_following_two_chars>/<doc_id>/thumbnail/1.png <-- page 1 thumbnail
    /artifact-dir/<id_first_two_chars>/<id_following_two_chars>/<doc_id>/thumbnail/2.png <-- page 2 thumbnail

For example:

    /configured/path/to/artifacts/12/34/1234f5fc76b8e243c8b0ae42cbee55afd3b0c0ffe67d31a5a8f2a9b13f2998e8/raw
    /configured/path/to/artifacts/12/34/1234f5fc76b8e243c8b0ae42cbee55afd3b0c0ffe67d31a5a8f2a9b13f2998e8/raw.json
    # ...

If the option artifactDir is not provided, datashare will use memory as before.
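
The layout above could be computed along these lines. This is only a sketch: `ArtifactPaths` and its method names are hypothetical and not part of datashare, and an empty `Optional` stands in for the "use memory as before" case when no artifactDir is configured.

```java
import java.nio.file.Path;
import java.util.Optional;

// Hypothetical helper: class and method names are illustrative, not datashare's API.
final class ArtifactPaths {
    private ArtifactPaths() {}

    // <artifactDir>/<id first two chars>/<id following two chars>/<doc_id>
    static Path documentDir(Path artifactDir, String docId) {
        if (docId == null || docId.length() < 4) {
            throw new IllegalArgumentException("document id must have at least 4 characters");
        }
        return artifactDir
                .resolve(docId.substring(0, 2))
                .resolve(docId.substring(2, 4))
                .resolve(docId);
    }

    // Location of the cached binary content, or empty when no artifactDir
    // is configured (the caller then keeps extracting in memory).
    static Optional<Path> rawPath(Path artifactDir, String docId) {
        if (artifactDir == null) {
            return Optional.empty();
        }
        return Optional.of(documentDir(artifactDir, docId).resolve("raw"));
    }

    // Location of the cached metadata, next to the binary content.
    static Optional<Path> rawMetadataPath(Path artifactDir, String docId) {
        if (artifactDir == null) {
            return Optional.empty();
        }
        return Optional.of(documentDir(artifactDir, docId).resolve("raw.json"));
    }
}
```

Splitting on the first four characters of the id keeps any single directory from accumulating too many entries.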

part of #1397

bamthomas commented 1 year ago

The result of the extraction is stored in the TikaDocumentSource class:

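// Assuming Metadata is Apache Tika's metadata class.
import org.apache.tika.metadata.Metadata;

// Holds the result of extracting an embedded document.
public class TikaDocumentSource {
    public final Metadata metadata;
    // The whole embedded file is held in memory as a byte array.
    public final byte[] content;

    public TikaDocumentSource(final Metadata metadata, final byte[] content) {
        this.metadata = metadata;
        this.content = content;
    }
}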

It is used in the SourceExtractor class.

The content should not be stored as an array of bytes but as an InputStream (or a subclass).
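
One way this could look is sketched below. It is not the actual change: `StreamingDocumentSource` is a hypothetical name, and the type parameter `M` stands in for Tika's Metadata so that the example has no dependencies. The extracted stream is copied to a file on disk, and readers get a fresh InputStream on that file each time.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

// Hypothetical stream-backed counterpart of TikaDocumentSource.
// M is a placeholder for the metadata type (Tika's Metadata in the real class).
final class StreamingDocumentSource<M> {
    final M metadata;
    private final Path cachedContent;

    StreamingDocumentSource(M metadata, Path cachedContent) {
        this.metadata = metadata;
        this.cachedContent = cachedContent;
    }

    // Opens a new stream on the cached file; the caller must close it.
    InputStream content() throws IOException {
        return Files.newInputStream(cachedContent);
    }

    // Copies the extracted stream to 'target' without loading it fully in memory.
    // A temporary file is written first so that readers never see a partial 'raw' file.
    static <M> StreamingDocumentSource<M> cache(M metadata, InputStream extracted, Path target)
            throws IOException {
        Path dir = target.toAbsolutePath().getParent();
        Files.createDirectories(dir);
        Path tmp = Files.createTempFile(dir, "raw", ".tmp");
        try {
            Files.copy(extracted, tmp, StandardCopyOption.REPLACE_EXISTING);
            Files.move(tmp, target, StandardCopyOption.REPLACE_EXISTING);
        } finally {
            // No-op after a successful move; cleans up if the copy failed.
            Files.deleteIfExists(tmp);
        }
        return new StreamingDocumentSource<>(metadata, target);
    }
}
```

Exposing a method that opens a new stream, rather than a single InputStream field, lets the same cached document be served to several requests.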

mvanzalu commented 1 year ago

No solution for the disk issue for the moment.