Closed tarekziade closed 1 year ago
Error triggered by make e2e
2023/06/19 17:22:40 [error] 166#166: *3 failed to run body_filter_by_lua*: /app/lua/tika-response.lua:16: Expected value but found invalid token at character 1
stack traceback:
>...[C]: in function 'decode'
>.../app/lua/tika-response.lua:16: in main chunk while sending to client, client: 172.17.0.1, server: localhost, request: "PUT /extract_local_file_text/?local_file_path=sample.pdf HTTP/1.1", upstream: "http://127.0.0.1:9998/rmeta/text", host: "localhost:8090"
Looking at https://github.com/elastic/data-extraction-service/blob/main/nginx/lua/tika-response.lua#L16 I think it makes the assumption there's a JSON body, even empty.
INFO [qtp478139423-14] 18:03:32,444 org.apache.tika.server.core.TikaLoggingFilter Request URI: http://localhost:9998/rmeta/text
Jun 19, 2023 6:03:32 PM org.apache.cxf.jaxrs.impl.WebApplicationExceptionMapper toResponse
WARNING: javax.ws.rs.ClientErrorException: HTTP 405 Method Not Allowed
>...at org.apache.cxf.jaxrs.utils.SpecExceptions.toHttpException(SpecExceptions.java:117)
>...at org.apache.cxf.jaxrs.utils.ExceptionUtils.toHttpException(ExceptionUtils.java:168)
>...at org.apache.cxf.jaxrs.utils.JAXRSUtils.findTargetMethod(JAXRSUtils.java:673)
>...at org.apache.cxf.jaxrs.interceptor.JAXRSInInterceptor.processRequest(JAXRSInInterceptor.java:182)
>...at org.apache.cxf.jaxrs.interceptor.JAXRSInInterceptor.handleMessage(JAXRSInInterceptor.java:79)
>...at org.apache.cxf.phase.PhaseInterceptorChain.doIntercept(PhaseInterceptorChain.java:307)
>...at org.apache.cxf.transport.ChainInitiationObserver.onMessage(ChainInitiationObserver.java:121)
>...at org.apache.cxf.transport.http.AbstractHTTPDestination.invoke(AbstractHTTPDestination.java:265)
>...at org.apache.cxf.transport.http_jetty.JettyHTTPDestination.doService(JettyHTTPDestination.java:247)
>...at org.apache.cxf.transport.http_jetty.JettyHTTPHandler.handle(JettyHTTPHandler.java:79)
>...at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:127)
>...at org.eclipse.jetty.server.handler.ScopedHandler.nextHandle(ScopedHandler.java:235)
>...at org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1440)
>...at org.eclipse.jetty.server.handler.ScopedHandler.nextScope(ScopedHandler.java:190)
>...at org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1355)
>...at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:141)
>...at org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:191)
>...at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:127)
>...at org.eclipse.jetty.server.Server.handle(Server.java:516)
>...at org.eclipse.jetty.server.HttpChannel.lambda$handle$1(HttpChannel.java:487)
>...at org.eclipse.jetty.server.HttpChannel.dispatch(HttpChannel.java:732)
>...at org.eclipse.jetty.server.HttpChannel.handle(HttpChannel.java:479)
>...at org.eclipse.jetty.server.HttpConnection.onFillable(HttpConnection.java:277)
>...at org.eclipse.jetty.io.AbstractConnection$ReadCallback.succeeded(AbstractConnection.java:311)
>...at org.eclipse.jetty.io.FillInterest.fillable(FillInterest.java:105)
>...at org.eclipse.jetty.io.ChannelEndPoint$1.run(ChannelEndPoint.java:104)
>...at org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.runTask(EatWhatYouKill.java:338)
>...at org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.doProduce(EatWhatYouKill.java:315)
>...at org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.tryProduce(EatWhatYouKill.java:173)
>...at org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.produce(EatWhatYouKill.java:137)
>...at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:883)
>...at org.eclipse.jetty.util.thread.QueuedThreadPool$Runner.run(QueuedThreadPool.java:1034)
>...at java.lang.Thread.run(Thread.java:750)
I guess this call is wrong: https://github.com/elastic/connectors-python/blob/main/connectors/utils.py#L732 (we want PUT)
It's cleaner now, but I am not sure how to surface the error in Tika itself. It does a 500 but I am struggling to understand how to surface it in log4j
172.17.0.1 - - [19/Jun/2023:18:20:32 +0000] "GET / HTTP/1.1" 200 52 "-" "Python/3.11 aiohttp/3.8.4"
172.17.0.1 - - [19/Jun/2023:18:20:32 +0000] "PUT /extract_local_file_text/ HTTP/1.1" 500 101 "-" "Python/3.11 aiohttp/3.8.4"
2023/06/19 18:23:31 [] 166#166: *3 [lua] tika-response.lua:22: Unexpected response format from Tika server. while sending to client, client: 172.17.0.1, server: localhost, request: "PUT /extract_local_file_text/ HTTP/1.1", upstream: "http://127.0.0.1:9998/rmeta/text", host: "localhost:8090"
By going single process, I found some the problem:
We need to find a way to log the forked processes in Tika to catch those
DEBUG [main] 18:32:27,304 org.apache.tika.parser.external.ExternalParser exception trying to run ffmpeg
java.io.IOException: Cannot run program "ffmpeg": error=2, No such file or directory
>...at java.lang.ProcessBuilder.start(ProcessBuilder.java:1048) ~[?:1.8.0_372]
>...at java.lang.Runtime.exec(Runtime.java:593) ~[?:1.8.0_372]
>...at java.lang.Runtime.exec(Runtime.java:458) ~[?:1.8.0_372]
>...at org.apache.tika.parser.external.ExternalParser.check(ExternalParser.java:161) [tika-server-standard-2.8.0.jar:2.8.0]
>...at org.apache.tika.parser.external.ExternalParsersConfigReader.readCheckTagAndCheck(ExternalParsersConfigReader.java:203) [tika-server-standard-2.8.0.jar:2.8.0]
>...at org.apache.tika.parser.external.ExternalParsersConfigReader.readParser(ExternalParsersConfigReader.java:110) [tika-server-standard-2.8.0.jar:2.8.0]
>...at org.apache.tika.parser.external.ExternalParsersConfigReader.read(ExternalParsersConfigReader.java:80) [tika-server-standard-2.8.0.jar:2.8.0]
>...at org.apache.tika.parser.external.ExternalParsersConfigReader.read(ExternalParsersConfigReader.java:67) [tika-server-standard-2.8.0.jar:2.8.0]
>...at org.apache.tika.parser.external.ExternalParsersConfigReader.read(ExternalParsersConfigReader.java:60) [tika-server-standard-2.8.0.jar:2.8.0]
>...at org.apache.tika.parser.external.ExternalParsersFactory.create(ExternalParsersFactory.java:67) [tika-server-standard-2.8.0.jar:2.8.0]
>...at org.apache.tika.parser.external.ExternalParsersFactory.create(ExternalParsersFactory.java:60) [tika-server-standard-2.8.0.jar:2.8.0]
>...at org.apache.tika.parser.external.ExternalParsersFactory.create(ExternalParsersFactory.java:49) [tika-server-standard-2.8.0.jar:2.8.0]
>...at org.apache.tika.parser.external.ExternalParsersFactory.create(ExternalParsersFactory.java:44) [tika-server-standard-2.8.0.jar:2.8.0]
>...at org.apache.tika.parser.external.CompositeExternalParser.<init>(CompositeExternalParser.java:42) [tika-server-standard-2.8.0.jar:2.8.0]
>...at org.apache.tika.parser.external.CompositeExternalParser.<init>(CompositeExternalParser.java:37) [tika-server-standard-2.8.0.jar:2.8.0]
>...at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method) ~[?:1.8.0_372]
>...at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62) [?:1.8.0_372]
>...at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45) [?:1.8.0_372]
>...at java.lang.reflect.Constructor.newInstance(Constructor.java:423) [?:1.8.0_372]
>...at java.lang.Class.newInstance(Class.java:442) [?:1.8.0_372]
>...at org.apache.tika.utils.ServiceLoaderUtils.newInstance(ServiceLoaderUtils.java:80) [tika-server-standard-2.8.0.jar:2.8.0]
>...at org.apache.tika.config.ServiceLoader.loadStaticServiceProviders(ServiceLoader.java:358) [tika-server-standard-2.8.0.jar:2.8.0]
>...at org.apache.tika.parser.DefaultParser.getDefaultParsers(DefaultParser.java:105) [tika-server-standard-2.8.0.jar:2.8.0]
>...at org.apache.tika.parser.DefaultParser.<init>(DefaultParser.java:52) [tika-server-standard-2.8.0.jar:2.8.0]
>...at org.apache.tika.parser.DefaultParser.<init>(DefaultParser.java:66) [tika-server-standard-2.8.0.jar:2.8.0]
>...at org.apache.tika.config.TikaConfig.getDefaultParser(TikaConfig.java:301) [tika-server-standard-2.8.0.jar:2.8.0]
>...at org.apache.tika.config.TikaConfig.<init>(TikaConfig.java:243) [tika-server-standard-2.8.0.jar:2.8.0]
>...at org.apache.tika.config.TikaConfig.getDefaultConfig(TikaConfig.java:358) [tika-server-standard-2.8.0.jar:2.8.0]
>...at org.apache.tika.Tika.<init>(Tika.java:119) [tika-server-standard-2.8.0.jar:2.8.0]
>...at org.apache.tika.server.core.TikaServerProcess.main(TikaServerProcess.java:126) [tika-server-standard-2.8.0.jar:2.8.0]
>...at org.apache.tika.server.core.TikaServerCli.noFork(TikaServerCli.java:186) [tika-server-standard-2.8.0.jar:2.8.0]
>...at org.apache.tika.server.core.TikaServerCli.execute(TikaServerCli.java:90) [tika-server-standard-2.8.0.jar:2.8.0]
>...at org.apache.tika.server.core.TikaServerCli.main(TikaServerCli.java:80) [tika-server-standard-2.8.0.jar:2.8.0]
Caused by: java.io.IOException: error=2, No such file or directory
>...at java.lang.UNIXProcess.forkAndExec(Native Method) ~[?:1.8.0_372]
>...at java.lang.UNIXProcess.<init>(UNIXProcess.java:247) ~[?:1.8.0_372]
>...at java.lang.ProcessImpl.start(ProcessImpl.java:134) ~[?:1.8.0_372]
>...at java.lang.ProcessBuilder.start(ProcessBuilder.java:1029) ~[?:1.8.0_372]
>...... 32 more
DEBUG [main] 18:32:27,327 org.apache.tika.parser.external.ExternalParser exception trying to run exiftool
java.io.IOException: Cannot run program "exiftool": error=2, No such file or directory
>...at java.lang.ProcessBuilder.start(ProcessBuilder.java:1048) ~[?:1.8.0_372]
>...at java.lang.Runtime.exec(Runtime.java:593) ~[?:1.8.0_372]
>...at java.lang.Runtime.exec(Runtime.java:458) ~[?:1.8.0_372]
>...at org.apache.tika.parser.external.ExternalParser.check(ExternalParser.java:161) [tika-server-standard-2.8.0.jar:2.8.0]
>...at org.apache.tika.parser.external.ExternalParsersConfigReader.readCheckTagAndCheck(ExternalParsersConfigReader.java:203) [tika-server-standard-2.8.0.jar:2.8.0]
>...at org.apache.tika.parser.external.ExternalParsersConfigReader.readParser(ExternalParsersConfigReader.java:110) [tika-server-standard-2.8.0.jar:2.8.0]
Now we're correctly hitting the tika service, but looks like with the current Nginx rewrite, we don't set fetchKey
and the body with local file is still in the request sent to Tika
DEBUG [qtp34073107-40] 18:51:09,763 org.eclipse.jetty.server.HttpChannel onContent HttpChannelOverHttp@31f2a874{s=HttpChannelState@3e05ea62{s=HANDLING rs=BLOCKING os=OPEN is=IDLE awp=false se=false i=true al=0},r=1,c=false/false,a=HANDLING,uri=//localhost:9998/rmeta/text,age=10} Content@62401446{HeapByteBufferR@65f5b8e1[p=236,l=274,c=8192,r=38]={PUT /rmet...fsf\r\n\r\n<<<{"local_file_path": "/tmp/sample.pdf"}>>>\x00\x00\x00\x00\x00\x00\x00\x00\x00...\x00\x00\x00\x00\x00\x00\x00}}
I get back the result now, but Tika decides to use the text parser instead of the PDF. there must be some tweaking to do in the tika config to enable pdfbox
This PR is getting quite large, so I want to merge it now and add any further e2e testing capabilities in follow-up PRs.
~Draft until I get it to pass the new CI step.~
Getting CI to work will involve a lot more work than I expected. For now, I've just added "run make e2e
" as a PR requirement.
Do you want to integrate these tests into buildkite? Or should we do that separately?
@seanstory I want to very much, but when I attempted to do this it became a yak shave. I think it would be better as a separate PR.
I'm opting for a follow-up to fix the body response, because this is big and I want to merge
Adding some e2e tests support