bottomless-archive-project / library-of-alexandria

Library of Alexandria (LoA in short) is a project that aims to collect and archive documents from the internet.

Indexing fails on strange documents #19

Closed laxika closed 5 years ago

laxika commented 5 years ago

Indexing fails on encrypted (password-protected) documents, and on PDFs that Tika is unable to parse.

These cases shouldn't produce an exception with a full stack trace in the logs!

org.elasticsearch.ElasticsearchStatusException: Elasticsearch exception [type=exception, reason=java.lang.IllegalArgumentException: ElasticsearchParseException[Error parsing document in field [content]]; nested: EncryptedDocumentException[Unable to process: document is encrypted]; nested: InvalidPasswordException[Cannot decrypt PDF, the password is incorrect];]
    at org.elasticsearch.rest.BytesRestResponse.errorFromXContent(BytesRestResponse.java:177) ~[elasticsearch-7.0.1.jar:7.0.1]
    at org.elasticsearch.client.RestHighLevelClient.parseEntity(RestHighLevelClient.java:1706) ~[elasticsearch-rest-high-level-client-7.0.1.jar:7.0.1]
    at org.elasticsearch.client.RestHighLevelClient.parseResponseException(RestHighLevelClient.java:1683) ~[elasticsearch-rest-high-level-client-7.0.1.jar:7.0.1]
    at org.elasticsearch.client.RestHighLevelClient$1.onFailure(RestHighLevelClient.java:1600) ~[elasticsearch-rest-high-level-client-7.0.1.jar:7.0.1]
    at org.elasticsearch.client.RestClient$FailureTrackingResponseListener.onDefinitiveFailure(RestClient.java:580) ~[elasticsearch-rest-client-7.0.1.jar:7.0.1]
    at org.elasticsearch.client.RestClient$1.completed(RestClient.java:317) ~[elasticsearch-rest-client-7.0.1.jar:7.0.1]
    at org.elasticsearch.client.RestClient$1.completed(RestClient.java:301) ~[elasticsearch-rest-client-7.0.1.jar:7.0.1]
    at org.apache.http.concurrent.BasicFuture.completed(BasicFuture.java:122) ~[httpcore-4.4.11.jar:4.4.11]
    at org.apache.http.impl.nio.client.DefaultClientExchangeHandlerImpl.responseCompleted(DefaultClientExchangeHandlerImpl.java:181) ~[httpasyncclient-4.1.4.jar:4.1.4]
    at org.apache.http.nio.protocol.HttpAsyncRequestExecutor.processResponse(HttpAsyncRequestExecutor.java:448) ~[httpcore-nio-4.4.11.jar:4.4.11]
    at org.apache.http.nio.protocol.HttpAsyncRequestExecutor.inputReady(HttpAsyncRequestExecutor.java:338) ~[httpcore-nio-4.4.11.jar:4.4.11]
    at org.apache.http.impl.nio.DefaultNHttpClientConnection.consumeInput(DefaultNHttpClientConnection.java:265) ~[httpcore-nio-4.4.11.jar:4.4.11]
    at org.apache.http.impl.nio.client.InternalIODispatch.onInputReady(InternalIODispatch.java:81) ~[httpasyncclient-4.1.4.jar:4.1.4]
    at org.apache.http.impl.nio.client.InternalIODispatch.onInputReady(InternalIODispatch.java:39) ~[httpasyncclient-4.1.4.jar:4.1.4]
    at org.apache.http.impl.nio.reactor.AbstractIODispatch.inputReady(AbstractIODispatch.java:114) ~[httpcore-nio-4.4.11.jar:4.4.11]
    at org.apache.http.impl.nio.reactor.BaseIOReactor.readable(BaseIOReactor.java:162) ~[httpcore-nio-4.4.11.jar:4.4.11]
    at org.apache.http.impl.nio.reactor.AbstractIOReactor.processEvent(AbstractIOReactor.java:337) ~[httpcore-nio-4.4.11.jar:4.4.11]
    at org.apache.http.impl.nio.reactor.AbstractIOReactor.processEvents(AbstractIOReactor.java:315) ~[httpcore-nio-4.4.11.jar:4.4.11]
    at org.apache.http.impl.nio.reactor.AbstractIOReactor.execute(AbstractIOReactor.java:276) ~[httpcore-nio-4.4.11.jar:4.4.11]
    at org.apache.http.impl.nio.reactor.BaseIOReactor.execute(BaseIOReactor.java:104) ~[httpcore-nio-4.4.11.jar:4.4.11]
    at org.apache.http.impl.nio.reactor.AbstractMultiworkerIOReactor$Worker.run(AbstractMultiworkerIOReactor.java:591) ~[httpcore-nio-4.4.11.jar:4.4.11]
    at java.base/java.lang.Thread.run(Thread.java:834) ~[na:na]
    Suppressed: org.elasticsearch.client.ResponseException: method [PUT], host [http://localhost:9200], URI [/vault_documents/_doc/bdff0a14924d7a02d77870ca07254456a5cac113774937b76d21c33771758266?pipeline=attachment&timeout=1m], status line [HTTP/1.1 500 Internal Server Error]
{"error":{"root_cause":[{"type":"exception","reason":"java.lang.IllegalArgumentException: ElasticsearchParseException[Error parsing document in field [content]]; nested: EncryptedDocumentException[Unable to process: document is encrypted]; nested: InvalidPasswordException[Cannot decrypt PDF, the password is incorrect];","header":{"processor_type":"attachment"}}],"type":"exception","reason":"java.lang.IllegalArgumentException: ElasticsearchParseException[Error parsing document in field [content]]; nested: EncryptedDocumentException[Unable to process: document is encrypted]; nested: InvalidPasswordException[Cannot decrypt PDF, the password is incorrect];","caused_by":{"type":"illegal_argument_exception","reason":"ElasticsearchParseException[Error parsing document in field [content]]; nested: EncryptedDocumentException[Unable to process: document is encrypted]; nested: InvalidPasswordException[Cannot decrypt PDF, the password is incorrect];","caused_by":{"type":"parse_exception","reason":"Error parsing document in field [content]","caused_by":{"type":"encrypted_document_exception","reason":"Unable to process: document is encrypted","caused_by":{"type":"invalid_password_exception","reason":"Cannot decrypt PDF, the password is incorrect"}}}},"header":{"processor_type":"attachment"}},"status":500}
        at org.elasticsearch.client.RestClient.convertResponse(RestClient.java:260) ~[elasticsearch-rest-client-7.0.1.jar:7.0.1]
        at org.elasticsearch.client.RestClient.access$900(RestClient.java:95) ~[elasticsearch-rest-client-7.0.1.jar:7.0.1]
        at org.elasticsearch.client.RestClient$1.completed(RestClient.java:305) ~[elasticsearch-rest-client-7.0.1.jar:7.0.1]
        ... 16 common frames omitted
Caused by: org.elasticsearch.ElasticsearchException: Elasticsearch exception [type=illegal_argument_exception, reason=ElasticsearchParseException[Error parsing document in field [content]]; nested: EncryptedDocumentException[Unable to process: document is encrypted]; nested: InvalidPasswordException[Cannot decrypt PDF, the password is incorrect];]
    at org.elasticsearch.ElasticsearchException.innerFromXContent(ElasticsearchException.java:491) ~[elasticsearch-7.0.1.jar:7.0.1]
    at org.elasticsearch.ElasticsearchException.fromXContent(ElasticsearchException.java:402) ~[elasticsearch-7.0.1.jar:7.0.1]
    at org.elasticsearch.ElasticsearchException.innerFromXContent(ElasticsearchException.java:432) ~[elasticsearch-7.0.1.jar:7.0.1]
    at org.elasticsearch.ElasticsearchException.failureFromXContent(ElasticsearchException.java:598) ~[elasticsearch-7.0.1.jar:7.0.1]
    at org.elasticsearch.rest.BytesRestResponse.errorFromXContent(BytesRestResponse.java:169) ~[elasticsearch-7.0.1.jar:7.0.1]
    ... 21 common frames omitted
Caused by: org.elasticsearch.ElasticsearchException: Elasticsearch exception [type=parse_exception, reason=Error parsing document in field [content]]
    at org.elasticsearch.ElasticsearchException.innerFromXContent(ElasticsearchException.java:491) ~[elasticsearch-7.0.1.jar:7.0.1]
    at org.elasticsearch.ElasticsearchException.fromXContent(ElasticsearchException.java:402) ~[elasticsearch-7.0.1.jar:7.0.1]
    at org.elasticsearch.ElasticsearchException.innerFromXContent(ElasticsearchException.java:432) ~[elasticsearch-7.0.1.jar:7.0.1]
    ... 25 common frames omitted
Caused by: org.elasticsearch.ElasticsearchException: Elasticsearch exception [type=encrypted_document_exception, reason=Unable to process: document is encrypted]
    at org.elasticsearch.ElasticsearchException.innerFromXContent(ElasticsearchException.java:491) ~[elasticsearch-7.0.1.jar:7.0.1]
    at org.elasticsearch.ElasticsearchException.fromXContent(ElasticsearchException.java:402) ~[elasticsearch-7.0.1.jar:7.0.1]
    at org.elasticsearch.ElasticsearchException.innerFromXContent(ElasticsearchException.java:432) ~[elasticsearch-7.0.1.jar:7.0.1]
    ... 27 common frames omitted
Caused by: org.elasticsearch.ElasticsearchException: Elasticsearch exception [type=invalid_password_exception, reason=Cannot decrypt PDF, the password is incorrect]
    at org.elasticsearch.ElasticsearchException.innerFromXContent(ElasticsearchException.java:491) ~[elasticsearch-7.0.1.jar:7.0.1]
    at org.elasticsearch.ElasticsearchException.fromXContent(ElasticsearchException.java:402) ~[elasticsearch-7.0.1.jar:7.0.1]
    at org.elasticsearch.ElasticsearchException.innerFromXContent(ElasticsearchException.java:432) ~[elasticsearch-7.0.1.jar:7.0.1]
    ... 29 common frames omitted

org.elasticsearch.ElasticsearchStatusException: Elasticsearch exception [type=exception, reason=java.lang.IllegalArgumentException: ElasticsearchParseException[Error parsing document in field [content]]; nested: TikaException[Unexpected RuntimeException from org.apache.tika.parser.pdf.PDFParser@57277609]; nested: IllegalArgumentException[root cannot be null];]
    at org.elasticsearch.rest.BytesRestResponse.errorFromXContent(BytesRestResponse.java:177) ~[elasticsearch-7.0.1.jar:7.0.1]
    at org.elasticsearch.client.RestHighLevelClient.parseEntity(RestHighLevelClient.java:1706) ~[elasticsearch-rest-high-level-client-7.0.1.jar:7.0.1]
    at org.elasticsearch.client.RestHighLevelClient.parseResponseException(RestHighLevelClient.java:1683) ~[elasticsearch-rest-high-level-client-7.0.1.jar:7.0.1]
    at org.elasticsearch.client.RestHighLevelClient$1.onFailure(RestHighLevelClient.java:1600) ~[elasticsearch-rest-high-level-client-7.0.1.jar:7.0.1]
    at org.elasticsearch.client.RestClient$FailureTrackingResponseListener.onDefinitiveFailure(RestClient.java:580) ~[elasticsearch-rest-client-7.0.1.jar:7.0.1]
    at org.elasticsearch.client.RestClient$1.completed(RestClient.java:317) ~[elasticsearch-rest-client-7.0.1.jar:7.0.1]
    at org.elasticsearch.client.RestClient$1.completed(RestClient.java:301) ~[elasticsearch-rest-client-7.0.1.jar:7.0.1]
    at org.apache.http.concurrent.BasicFuture.completed(BasicFuture.java:122) ~[httpcore-4.4.11.jar:4.4.11]
    at org.apache.http.impl.nio.client.DefaultClientExchangeHandlerImpl.responseCompleted(DefaultClientExchangeHandlerImpl.java:181) ~[httpasyncclient-4.1.4.jar:4.1.4]
    at org.apache.http.nio.protocol.HttpAsyncRequestExecutor.processResponse(HttpAsyncRequestExecutor.java:448) ~[httpcore-nio-4.4.11.jar:4.4.11]
    at org.apache.http.nio.protocol.HttpAsyncRequestExecutor.inputReady(HttpAsyncRequestExecutor.java:338) ~[httpcore-nio-4.4.11.jar:4.4.11]
    at org.apache.http.impl.nio.DefaultNHttpClientConnection.consumeInput(DefaultNHttpClientConnection.java:265) ~[httpcore-nio-4.4.11.jar:4.4.11]
    at org.apache.http.impl.nio.client.InternalIODispatch.onInputReady(InternalIODispatch.java:81) ~[httpasyncclient-4.1.4.jar:4.1.4]
    at org.apache.http.impl.nio.client.InternalIODispatch.onInputReady(InternalIODispatch.java:39) ~[httpasyncclient-4.1.4.jar:4.1.4]
    at org.apache.http.impl.nio.reactor.AbstractIODispatch.inputReady(AbstractIODispatch.java:114) ~[httpcore-nio-4.4.11.jar:4.4.11]
    at org.apache.http.impl.nio.reactor.BaseIOReactor.readable(BaseIOReactor.java:162) ~[httpcore-nio-4.4.11.jar:4.4.11]
    at org.apache.http.impl.nio.reactor.AbstractIOReactor.processEvent(AbstractIOReactor.java:337) ~[httpcore-nio-4.4.11.jar:4.4.11]
    at org.apache.http.impl.nio.reactor.AbstractIOReactor.processEvents(AbstractIOReactor.java:315) ~[httpcore-nio-4.4.11.jar:4.4.11]
    at org.apache.http.impl.nio.reactor.AbstractIOReactor.execute(AbstractIOReactor.java:276) ~[httpcore-nio-4.4.11.jar:4.4.11]
    at org.apache.http.impl.nio.reactor.BaseIOReactor.execute(BaseIOReactor.java:104) ~[httpcore-nio-4.4.11.jar:4.4.11]
    at org.apache.http.impl.nio.reactor.AbstractMultiworkerIOReactor$Worker.run(AbstractMultiworkerIOReactor.java:591) ~[httpcore-nio-4.4.11.jar:4.4.11]
    at java.base/java.lang.Thread.run(Thread.java:834) ~[na:na]
    Suppressed: org.elasticsearch.client.ResponseException: method [PUT], host [http://localhost:9200], URI [/vault_documents/_doc/6875b5fbfc6d2f33244389b3c030b7f42685199d2b819c974a4a2c471c9cfc2e?pipeline=attachment&timeout=1m], status line [HTTP/1.1 500 Internal Server Error]
{"error":{"root_cause":[{"type":"exception","reason":"java.lang.IllegalArgumentException: ElasticsearchParseException[Error parsing document in field [content]]; nested: TikaException[Unexpected RuntimeException from org.apache.tika.parser.pdf.PDFParser@57277609]; nested: IllegalArgumentException[root cannot be null];","header":{"processor_type":"attachment"}}],"type":"exception","reason":"java.lang.IllegalArgumentException: ElasticsearchParseException[Error parsing document in field [content]]; nested: TikaException[Unexpected RuntimeException from org.apache.tika.parser.pdf.PDFParser@57277609]; nested: IllegalArgumentException[root cannot be null];","caused_by":{"type":"illegal_argument_exception","reason":"ElasticsearchParseException[Error parsing document in field [content]]; nested: TikaException[Unexpected RuntimeException from org.apache.tika.parser.pdf.PDFParser@57277609]; nested: IllegalArgumentException[root cannot be null];","caused_by":{"type":"parse_exception","reason":"Error parsing document in field [content]","caused_by":{"type":"tika_exception","reason":"Unexpected RuntimeException from org.apache.tika.parser.pdf.PDFParser@57277609","caused_by":{"type":"illegal_argument_exception","reason":"root cannot be null"}}}},"header":{"processor_type":"attachment"}},"status":500}
        at org.elasticsearch.client.RestClient.convertResponse(RestClient.java:260) ~[elasticsearch-rest-client-7.0.1.jar:7.0.1]
        at org.elasticsearch.client.RestClient.access$900(RestClient.java:95) ~[elasticsearch-rest-client-7.0.1.jar:7.0.1]
        at org.elasticsearch.client.RestClient$1.completed(RestClient.java:305) ~[elasticsearch-rest-client-7.0.1.jar:7.0.1]
        ... 16 common frames omitted
Caused by: org.elasticsearch.ElasticsearchException: Elasticsearch exception [type=illegal_argument_exception, reason=ElasticsearchParseException[Error parsing document in field [content]]; nested: TikaException[Unexpected RuntimeException from org.apache.tika.parser.pdf.PDFParser@57277609]; nested: IllegalArgumentException[root cannot be null];]
    at org.elasticsearch.ElasticsearchException.innerFromXContent(ElasticsearchException.java:491) ~[elasticsearch-7.0.1.jar:7.0.1]
    at org.elasticsearch.ElasticsearchException.fromXContent(ElasticsearchException.java:402) ~[elasticsearch-7.0.1.jar:7.0.1]
    at org.elasticsearch.ElasticsearchException.innerFromXContent(ElasticsearchException.java:432) ~[elasticsearch-7.0.1.jar:7.0.1]
    at org.elasticsearch.ElasticsearchException.failureFromXContent(ElasticsearchException.java:598) ~[elasticsearch-7.0.1.jar:7.0.1]
    at org.elasticsearch.rest.BytesRestResponse.errorFromXContent(BytesRestResponse.java:169) ~[elasticsearch-7.0.1.jar:7.0.1]
    ... 21 common frames omitted
Caused by: org.elasticsearch.ElasticsearchException: Elasticsearch exception [type=parse_exception, reason=Error parsing document in field [content]]
    at org.elasticsearch.ElasticsearchException.innerFromXContent(ElasticsearchException.java:491) ~[elasticsearch-7.0.1.jar:7.0.1]
    at org.elasticsearch.ElasticsearchException.fromXContent(ElasticsearchException.java:402) ~[elasticsearch-7.0.1.jar:7.0.1]
    at org.elasticsearch.ElasticsearchException.innerFromXContent(ElasticsearchException.java:432) ~[elasticsearch-7.0.1.jar:7.0.1]
    ... 25 common frames omitted
Caused by: org.elasticsearch.ElasticsearchException: Elasticsearch exception [type=tika_exception, reason=Unexpected RuntimeException from org.apache.tika.parser.pdf.PDFParser@57277609]
    at org.elasticsearch.ElasticsearchException.innerFromXContent(ElasticsearchException.java:491) ~[elasticsearch-7.0.1.jar:7.0.1]
    at org.elasticsearch.ElasticsearchException.fromXContent(ElasticsearchException.java:402) ~[elasticsearch-7.0.1.jar:7.0.1]
    at org.elasticsearch.ElasticsearchException.innerFromXContent(ElasticsearchException.java:432) ~[elasticsearch-7.0.1.jar:7.0.1]
    ... 27 common frames omitted
Caused by: org.elasticsearch.ElasticsearchException: Elasticsearch exception [type=illegal_argument_exception, reason=root cannot be null]
    at org.elasticsearch.ElasticsearchException.innerFromXContent(ElasticsearchException.java:491) ~[elasticsearch-7.0.1.jar:7.0.1]
    at org.elasticsearch.ElasticsearchException.fromXContent(ElasticsearchException.java:402) ~[elasticsearch-7.0.1.jar:7.0.1]
    at org.elasticsearch.ElasticsearchException.innerFromXContent(ElasticsearchException.java:432) ~[elasticsearch-7.0.1.jar:7.0.1]
    ... 29 common frames omitted

laxika commented 5 years ago

0f3aa188bd1db4837062cdf468bf456d9e8db14dabf82bc46ee0ca07efd63915.pdf

laxika commented 5 years ago

We can't really do anything about this, other than use less verbose logging.
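As a rough illustration of what less verbose logging could look like, the sketch below reduces a failure to a single line by walking the exception's cause chain and keeping only the innermost message. It is not the project's actual code: the class `IndexingFailureSummary`, its methods, and the document id used in the usage note are hypothetical names introduced for this example, and it assumes the indexing call surfaces a `Throwable` with a cause chain like the ones in the traces above. It uses only the Java standard library.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper (not part of the project): turns a nested exception
// into a one-line summary instead of a full stack trace.
final class IndexingFailureSummary {

    // Upper bound on how many causes are followed, so a malformed
    // (cyclic) cause chain cannot loop forever.
    private static final int MAX_DEPTH = 32;

    private IndexingFailureSummary() {
    }

    // Collects the message of every exception in the cause chain, outermost first.
    static List<String> causeMessages(Throwable failure) {
        List<String> messages = new ArrayList<>();
        Throwable current = failure;
        while (current != null && messages.size() < MAX_DEPTH) {
            messages.add(String.valueOf(current.getMessage()));
            current = current.getCause();
        }
        return messages;
    }

    // Returns the message of the innermost cause that was reached.
    static String rootCauseMessage(Throwable failure) {
        List<String> messages = causeMessages(failure);
        return messages.isEmpty() ? "unknown" : messages.get(messages.size() - 1);
    }

    // Builds the single line that would be logged instead of the stack trace.
    static String summarize(String documentId, Throwable failure) {
        return "Failed to index document " + documentId + ": " + rootCauseMessage(failure);
    }
}
```

For a failure whose innermost cause carries the message "Cannot decrypt PDF, the password is incorrect", `IndexingFailureSummary.summarize("example-document-id", failure)` yields "Failed to index document example-document-id: Cannot decrypt PDF, the password is incorrect". The full trace could still be logged at debug level if it is needed for troubleshooting.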