elastic / data-extraction-service

Other
12 stars 0 forks source link

minimal e2e test coverage #12

Closed tarekziade closed 1 year ago

tarekziade commented 1 year ago

Adding some e2e tests support

tarekziade commented 1 year ago

Error triggered by make e2e

2023/06/19 17:22:40 [error] 166#166: *3 failed to run body_filter_by_lua*: /app/lua/tika-response.lua:16: Expected value but found invalid token at character 1
stack traceback:
>...[C]: in function 'decode'
>.../app/lua/tika-response.lua:16: in main chunk while sending to client, client: 172.17.0.1, server: localhost, request: "PUT /extract_local_file_text/?local_file_path=sample.pdf HTTP/1.1", upstream: "http://127.0.0.1:9998/rmeta/text", host: "localhost:8090"

Looking at https://github.com/elastic/data-extraction-service/blob/main/nginx/lua/tika-response.lua#L16 I think it makes the assumption there's a JSON body, even empty.

tarekziade commented 1 year ago
INFO  [qtp478139423-14] 18:03:32,444 org.apache.tika.server.core.TikaLoggingFilter Request URI: http://localhost:9998/rmeta/text
Jun 19, 2023 6:03:32 PM org.apache.cxf.jaxrs.impl.WebApplicationExceptionMapper toResponse
WARNING: javax.ws.rs.ClientErrorException: HTTP 405 Method Not Allowed
>...at org.apache.cxf.jaxrs.utils.SpecExceptions.toHttpException(SpecExceptions.java:117)
>...at org.apache.cxf.jaxrs.utils.ExceptionUtils.toHttpException(ExceptionUtils.java:168)
>...at org.apache.cxf.jaxrs.utils.JAXRSUtils.findTargetMethod(JAXRSUtils.java:673)
>...at org.apache.cxf.jaxrs.interceptor.JAXRSInInterceptor.processRequest(JAXRSInInterceptor.java:182)
>...at org.apache.cxf.jaxrs.interceptor.JAXRSInInterceptor.handleMessage(JAXRSInInterceptor.java:79)
>...at org.apache.cxf.phase.PhaseInterceptorChain.doIntercept(PhaseInterceptorChain.java:307)
>...at org.apache.cxf.transport.ChainInitiationObserver.onMessage(ChainInitiationObserver.java:121)
>...at org.apache.cxf.transport.http.AbstractHTTPDestination.invoke(AbstractHTTPDestination.java:265)
>...at org.apache.cxf.transport.http_jetty.JettyHTTPDestination.doService(JettyHTTPDestination.java:247)
>...at org.apache.cxf.transport.http_jetty.JettyHTTPHandler.handle(JettyHTTPHandler.java:79)
>...at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:127)
>...at org.eclipse.jetty.server.handler.ScopedHandler.nextHandle(ScopedHandler.java:235)
>...at org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1440)
>...at org.eclipse.jetty.server.handler.ScopedHandler.nextScope(ScopedHandler.java:190)
>...at org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1355)
>...at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:141)
>...at org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:191)
>...at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:127)
>...at org.eclipse.jetty.server.Server.handle(Server.java:516)
>...at org.eclipse.jetty.server.HttpChannel.lambda$handle$1(HttpChannel.java:487)
>...at org.eclipse.jetty.server.HttpChannel.dispatch(HttpChannel.java:732)
>...at org.eclipse.jetty.server.HttpChannel.handle(HttpChannel.java:479)
>...at org.eclipse.jetty.server.HttpConnection.onFillable(HttpConnection.java:277)
>...at org.eclipse.jetty.io.AbstractConnection$ReadCallback.succeeded(AbstractConnection.java:311)
>...at org.eclipse.jetty.io.FillInterest.fillable(FillInterest.java:105)
>...at org.eclipse.jetty.io.ChannelEndPoint$1.run(ChannelEndPoint.java:104)
>...at org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.runTask(EatWhatYouKill.java:338)
>...at org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.doProduce(EatWhatYouKill.java:315)
>...at org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.tryProduce(EatWhatYouKill.java:173)
>...at org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.produce(EatWhatYouKill.java:137)
>...at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:883)
>...at org.eclipse.jetty.util.thread.QueuedThreadPool$Runner.run(QueuedThreadPool.java:1034)
>...at java.lang.Thread.run(Thread.java:750)
tarekziade commented 1 year ago

I guess this call is wrong: https://github.com/elastic/connectors-python/blob/main/connectors/utils.py#L732 (we want PUT)

tarekziade commented 1 year ago

It's cleaner now, but I am not sure how to surface the error in Tika itself. It does a 500 but I am struggling to understand how to surface it in log4j

172.17.0.1 - - [19/Jun/2023:18:20:32 +0000] "GET / HTTP/1.1" 200 52 "-" "Python/3.11 aiohttp/3.8.4"
172.17.0.1 - - [19/Jun/2023:18:20:32 +0000] "PUT /extract_local_file_text/ HTTP/1.1" 500 101 "-" "Python/3.11 aiohttp/3.8.4"
2023/06/19 18:23:31 [] 166#166: *3 [lua] tika-response.lua:22: Unexpected response format from Tika server. while sending to client, client: 172.17.0.1, server: localhost, request: "PUT /extract_local_file_text/ HTTP/1.1", upstream: "http://127.0.0.1:9998/rmeta/text", host: "localhost:8090"
tarekziade commented 1 year ago

By going single process, I found some the problem:

We need to find a way to log the forked processes in Tika to catch those

DEBUG [main] 18:32:27,304 org.apache.tika.parser.external.ExternalParser exception trying to run  ffmpeg
java.io.IOException: Cannot run program "ffmpeg": error=2, No such file or directory
>...at java.lang.ProcessBuilder.start(ProcessBuilder.java:1048) ~[?:1.8.0_372]
>...at java.lang.Runtime.exec(Runtime.java:593) ~[?:1.8.0_372]
>...at java.lang.Runtime.exec(Runtime.java:458) ~[?:1.8.0_372]
>...at org.apache.tika.parser.external.ExternalParser.check(ExternalParser.java:161) [tika-server-standard-2.8.0.jar:2.8.0]
>...at org.apache.tika.parser.external.ExternalParsersConfigReader.readCheckTagAndCheck(ExternalParsersConfigReader.java:203) [tika-server-standard-2.8.0.jar:2.8.0]
>...at org.apache.tika.parser.external.ExternalParsersConfigReader.readParser(ExternalParsersConfigReader.java:110) [tika-server-standard-2.8.0.jar:2.8.0]
>...at org.apache.tika.parser.external.ExternalParsersConfigReader.read(ExternalParsersConfigReader.java:80) [tika-server-standard-2.8.0.jar:2.8.0]
>...at org.apache.tika.parser.external.ExternalParsersConfigReader.read(ExternalParsersConfigReader.java:67) [tika-server-standard-2.8.0.jar:2.8.0]
>...at org.apache.tika.parser.external.ExternalParsersConfigReader.read(ExternalParsersConfigReader.java:60) [tika-server-standard-2.8.0.jar:2.8.0]
>...at org.apache.tika.parser.external.ExternalParsersFactory.create(ExternalParsersFactory.java:67) [tika-server-standard-2.8.0.jar:2.8.0]
>...at org.apache.tika.parser.external.ExternalParsersFactory.create(ExternalParsersFactory.java:60) [tika-server-standard-2.8.0.jar:2.8.0]
>...at org.apache.tika.parser.external.ExternalParsersFactory.create(ExternalParsersFactory.java:49) [tika-server-standard-2.8.0.jar:2.8.0]
>...at org.apache.tika.parser.external.ExternalParsersFactory.create(ExternalParsersFactory.java:44) [tika-server-standard-2.8.0.jar:2.8.0]
>...at org.apache.tika.parser.external.CompositeExternalParser.<init>(CompositeExternalParser.java:42) [tika-server-standard-2.8.0.jar:2.8.0]
>...at org.apache.tika.parser.external.CompositeExternalParser.<init>(CompositeExternalParser.java:37) [tika-server-standard-2.8.0.jar:2.8.0]
>...at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method) ~[?:1.8.0_372]
>...at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62) [?:1.8.0_372]
>...at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45) [?:1.8.0_372]
>...at java.lang.reflect.Constructor.newInstance(Constructor.java:423) [?:1.8.0_372]
>...at java.lang.Class.newInstance(Class.java:442) [?:1.8.0_372]
>...at org.apache.tika.utils.ServiceLoaderUtils.newInstance(ServiceLoaderUtils.java:80) [tika-server-standard-2.8.0.jar:2.8.0]
>...at org.apache.tika.config.ServiceLoader.loadStaticServiceProviders(ServiceLoader.java:358) [tika-server-standard-2.8.0.jar:2.8.0]
>...at org.apache.tika.parser.DefaultParser.getDefaultParsers(DefaultParser.java:105) [tika-server-standard-2.8.0.jar:2.8.0]
>...at org.apache.tika.parser.DefaultParser.<init>(DefaultParser.java:52) [tika-server-standard-2.8.0.jar:2.8.0]
>...at org.apache.tika.parser.DefaultParser.<init>(DefaultParser.java:66) [tika-server-standard-2.8.0.jar:2.8.0]
>...at org.apache.tika.config.TikaConfig.getDefaultParser(TikaConfig.java:301) [tika-server-standard-2.8.0.jar:2.8.0]
>...at org.apache.tika.config.TikaConfig.<init>(TikaConfig.java:243) [tika-server-standard-2.8.0.jar:2.8.0]
>...at org.apache.tika.config.TikaConfig.getDefaultConfig(TikaConfig.java:358) [tika-server-standard-2.8.0.jar:2.8.0]
>...at org.apache.tika.Tika.<init>(Tika.java:119) [tika-server-standard-2.8.0.jar:2.8.0]
>...at org.apache.tika.server.core.TikaServerProcess.main(TikaServerProcess.java:126) [tika-server-standard-2.8.0.jar:2.8.0]
>...at org.apache.tika.server.core.TikaServerCli.noFork(TikaServerCli.java:186) [tika-server-standard-2.8.0.jar:2.8.0]
>...at org.apache.tika.server.core.TikaServerCli.execute(TikaServerCli.java:90) [tika-server-standard-2.8.0.jar:2.8.0]
>...at org.apache.tika.server.core.TikaServerCli.main(TikaServerCli.java:80) [tika-server-standard-2.8.0.jar:2.8.0]
Caused by: java.io.IOException: error=2, No such file or directory
>...at java.lang.UNIXProcess.forkAndExec(Native Method) ~[?:1.8.0_372]
>...at java.lang.UNIXProcess.<init>(UNIXProcess.java:247) ~[?:1.8.0_372]
>...at java.lang.ProcessImpl.start(ProcessImpl.java:134) ~[?:1.8.0_372]
>...at java.lang.ProcessBuilder.start(ProcessBuilder.java:1029) ~[?:1.8.0_372]
>...... 32 more

DEBUG [main] 18:32:27,327 org.apache.tika.parser.external.ExternalParser exception trying to run  exiftool
java.io.IOException: Cannot run program "exiftool": error=2, No such file or directory
>...at java.lang.ProcessBuilder.start(ProcessBuilder.java:1048) ~[?:1.8.0_372]
>...at java.lang.Runtime.exec(Runtime.java:593) ~[?:1.8.0_372]
>...at java.lang.Runtime.exec(Runtime.java:458) ~[?:1.8.0_372]
>...at org.apache.tika.parser.external.ExternalParser.check(ExternalParser.java:161) [tika-server-standard-2.8.0.jar:2.8.0]
>...at org.apache.tika.parser.external.ExternalParsersConfigReader.readCheckTagAndCheck(ExternalParsersConfigReader.java:203) [tika-server-standard-2.8.0.jar:2.8.0]
>...at org.apache.tika.parser.external.ExternalParsersConfigReader.readParser(ExternalParsersConfigReader.java:110) [tika-server-standard-2.8.0.jar:2.8.0]
tarekziade commented 1 year ago

Now we're correctly hitting the tika service, but looks like with the current Nginx rewrite, we don't set fetchKey and the body with local file is still in the request sent to Tika

DEBUG [qtp34073107-40] 18:51:09,763 org.eclipse.jetty.server.HttpChannel onContent HttpChannelOverHttp@31f2a874{s=HttpChannelState@3e05ea62{s=HANDLING rs=BLOCKING os=OPEN is=IDLE awp=false se=false i=true al=0},r=1,c=false/false,a=HANDLING,uri=//localhost:9998/rmeta/text,age=10} Content@62401446{HeapByteBufferR@65f5b8e1[p=236,l=274,c=8192,r=38]={PUT /rmet...fsf\r\n\r\n<<<{"local_file_path": "/tmp/sample.pdf"}>>>\x00\x00\x00\x00\x00\x00\x00\x00\x00...\x00\x00\x00\x00\x00\x00\x00}}
tarekziade commented 1 year ago

I get back the result now, but Tika decides to use the text parser instead of the PDF. there must be some tweaking to do in the tika config to enable pdfbox

navarone-feekery commented 1 year ago

This PR is getting quite large, so I want to merge it now and add any further e2e testing capabilities in follow-up PRs.

navarone-feekery commented 1 year ago

~Draft until I get it to pass the new CI step.~ Getting CI to work will involve a lot more work than I expected. For now, I've just added "run make e2e" as a PR requirement.

navarone-feekery commented 1 year ago

Do you want to integrate these tests into buildkite? Or should we do that separately?

@seanstory I want to very much, but when I attempted to do this it became a yak shave. I think it would be better as a separate PR.

navarone-feekery commented 1 year ago

I'm opting for a follow-up to fix the body response, because this is big and I want to merge