ChronixDB / chronix.server

The Chronix Server implementation that is based on Apache Solr.
Apache License 2.0
263 stars 29 forks source link

NullPointerException in CF CQL Parsing for parallel requests #151

Closed alexnavis closed 4 years ago

alexnavis commented 4 years ago

Hi Florian,

We see NPE issue in our deployment on few http requests. So I set-up a test(JMH) with 10 threads to simulate parallel queries. It seems the CQL parsing has some problem in concurrency from the initial look.

2 NPE Issues:

I build the chronix.server project with additional logs for the AnalysisHandler (with Handling analysis params). I think the problem is in the CQL cql = new CQL(TYPES, FUNCTIONS), may be some context is not thread-safe.

1) NPE

2019-08-18 05:19:15.023 INFO  (qtp1888442711-51) [   x:chronix] d.q.c.s.q.a.AnalysisHandler Handling analysis request {q=-end:[*+TO+1565684399999]+AND+-start:[1565699099999+TO+*]++AND+mdefId:(10020)&cf=metric{avg}&fl=mdefId,data,type,action,end,id,start,_version_,name&start=0&rows=2000&wt=javabin&version=2&query_start_long=1565684400000&query_end_long=1565699100000}

2019-08-18 05:19:15.023 INFO  (qtp1888442711-51) [   x:chronix] d.q.c.s.q.a.AnalysisHandler **Handling analysis params thread(51)** - q=-end:[*+TO+1565684399999]+AND+-start:[1565699099999+TO+*]++AND+mdefId:(10020)&cf=metric{avg}&fl=mdefId,data,type,action,end,id,start,_version_,name&start=0&rows=2000&wt=javabin&version=2&query_start_long=1565684400000&query_end_long=1565699100000

2019-08-18 05:19:15.025 ERROR (qtp1888442711-51) [   x:chronix] o.a.s.h.RequestHandlerBase java.lang.NullPointerException
    at org.antlr.v4.runtime.Parser.exitRule(Parser.java:643)
    at de.qaware.chronix.cql.antlr.CQLCFParser.cqlcf(CQLCFParser.java:151)
    at de.qaware.chronix.cql.CQL.parseCF(CQL.java:80)
    at de.qaware.chronix.solr.query.analysis.AnalysisHandler.handleRequestBody(AnalysisHandler.java:232)
    at de.qaware.chronix.solr.query.ChronixQueryHandler.handleRequestBody(ChronixQueryHandler.java:112)
    at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:199)
    at org.apache.solr.core.SolrCore.execute(SolrCore.java:2566)
    at org.apache.solr.servlet.HttpSolrCall.execute(HttpSolrCall.java:756)
    at org.apache.solr.servlet.HttpSolrCall.call(HttpSolrCall.java:542)
    at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:397)
    at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:343)
    at org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1610)
    at org.eclipse.jetty.servlets.CrossOriginFilter.handle(CrossOriginFilter.java:311)
    at org.eclipse.jetty.servlets.CrossOriginFilter.doFilter(CrossOriginFilter.java:265)
    at org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1602)
    at org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:540)
    at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:146)
    at org.eclipse.jetty.server.handler.gzip.GzipHandler.handle(GzipHandler.java:753)
    at org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:548)

2) NPE

2019-08-18 05:19:15.029 INFO  (qtp1888442711-48) [   x:chronix] d.q.c.s.q.a.AnalysisHandler Handling analysis request {q=-end:[*+TO+1565684399999]+AND+-start:[1565699099999+TO+*]++AND+mdefId:(10020)&cf=metric{avg}&fl=mdefId,data,type,action,end,id,start,_version_,name&start=0&rows=2000&wt=javabin&version=2&query_start_long=1565684400000&query_end_long=1565699100000}

2019-08-18 05:19:15.029 INFO  (qtp1888442711-48) [   x:chronix] d.q.c.s.q.a.AnalysisHandler **Handling analysis params thread(48)** - q=-end:[*+TO+1565684399999]+AND+-start:[1565699099999+TO+*]++AND+mdefId:(10020)&cf=metric{avg}&fl=mdefId,data,type,action,end,id,start,_version_,name&start=0&rows=2000&wt=javabin&version=2&query_start_long=1565684400000&query_end_long=1565699100000

2019-08-18 05:19:15.030 ERROR (qtp1888442711-48) [   x:chronix] o.a.s.h.RequestHandlerBase java.lang.NullPointerException
    at org.antlr.v4.runtime.Lexer.nextToken(Lexer.java:172)
    at org.antlr.v4.runtime.UnbufferedTokenStream.fill(UnbufferedTokenStream.java:203)
    at org.antlr.v4.runtime.UnbufferedTokenStream.<init>(UnbufferedTokenStream.java:99)
    at org.antlr.v4.runtime.UnbufferedTokenStream.<init>(UnbufferedTokenStream.java:92)
    at de.qaware.chronix.cql.CQL.init(CQL.java:95)
    at de.qaware.chronix.cql.CQL.parseCF(CQL.java:78)
    at de.qaware.chronix.solr.query.analysis.AnalysisHandler.handleRequestBody(AnalysisHandler.java:232)
    at de.qaware.chronix.solr.query.ChronixQueryHandler.handleRequestBody(ChronixQueryHandler.java:112)
    at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:199)
    at org.apache.solr.core.SolrCore.execute(SolrCore.java:2566)
    at org.apache.solr.servlet.HttpSolrCall.execute(HttpSolrCall.java:756)
    at org.apache.solr.servlet.HttpSolrCall.call(HttpSolrCall.java:542)
    at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:397)
    at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:343)
    at org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1610)
    at org.eclipse.jetty.servlets.CrossOriginFilter.handle(CrossOriginFilter.java:311)
    at org.eclipse.jetty.servlets.CrossOriginFilter.doFilter(CrossOriginFilter.java:265)
    at org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1602)
    at org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:540)

Data: No data in Chronix db. Code: (run with 10 threads (jmh or any setup)).

String q = "start:2019-08-13T08:20:00.000Z AND end:2019-08-13T12:25:00.000Z AND mdefId:(10020)"; ChronixClient<MetricTimeSeries, SolrClient, SolrQuery> chronixClient = getChronixClient(200); HashMap<String, String> map = new HashMap<String, String>() {{ put("cf", "metric{avg}"); put("fl", "+data"); }}; SolrQuery solrQuery = new SolrQuery(q); for (Map.Entry<String, String> e : params.entrySet()) { solrQuery.setParam(e.getKey(), e.getValue()); } solrQuery.setRows(rows); List records = chronixClient.stream(solr, solrQuery).collect(Collectors.toList());

Please help in this regard.

Thanks, Alex

FlorianLautenschlager commented 4 years ago

Hi @alexnavis,

thank you for reporting this issue. Today i do not have the time to dig into it. I will give you an update by tomorrow.

Best Regards Florian

alexnavis commented 4 years ago

It would be great. Thanks @FlorianLautenschlager.

alexnavis commented 4 years ago

I tried a few things to fix it, but it didn't work out as I'm pretty much new to antlr. :)

Some of the observations might be wrong, hope it might be helpful for you. 1) CQLCFParser and CQLCFLexer has a static PredictionContextCache _sharedContextCache, which seems to be not thread-safe. 2) Tried updating the latest ANTLR 4.7.2 and uncommented the code in CQL (line no: 90).

One more test which kind of look like it solved the issue, moving cql instance variable(AnalysisHandler) to local method scope(handleRequestBody) seemed like solving the problem, but it come very slow. private final CQL cql = new CQL(TYPES, FUNCTIONS);

Moved Thanks, Alex

shridharV commented 4 years ago

Hi @alexnavis,

thank you for reporting this issue. Today i do not have the time to dig into it. I will give you an update by tomorrow.

Best Regards Florian

Thanks Florian,

FlorianLautenschlager commented 4 years ago

Hi @alexnavis , as you already pointed out, the CQL class is not thread-safe. After some JMH benchmarks I decided to create a new instance of the CQL class for every request to avoid implementing synchronization features. I will release a fixed version soon.

alexnavis commented 4 years ago

Hi @FlorianLautenschlager ,

Thanks a lot for the help. I changed the fix in a different way to avoid unwanted instance creation, please do have a look PR: https://github.com/ChronixDB/chronix.server/pull/152

Also exitRule NPE was because of another exception(token exception in lexer) which was hidden by the NPE. Added logs for the same. If you feel it is not required, i can revert back the log changes.

Thanks, Alex

FlorianLautenschlager commented 4 years ago

Hi @alexnavis,

If you feel it is not required, i can revert back the log changes.

There's that one thing I don't like: You modified the generated code of Antlr. From a performance perspective it is quite the same if we create an instance for every request or reuse an object and create the instances within the called method. As soon as you have reverted your changes I will merge it into the master an release it.

Best Regards Florian

alexnavis commented 4 years ago

Hi @FlorianLautenschlager ,

I missed that it is a generated code and I agree with you. I reverted the logs and static removal for PredictionContextCache.

Rest of the changes are to avoid to unwanted creation of stream for each request with CF and code is also cleaner. Hope you are fine with rest of the changes.

Thanks, Alex

FlorianLautenschlager commented 4 years ago

Hi @alexnavis,

thanks for your effort =) I've taken some but not all of your suggestions. It should be clearer now. Tests were successful and hence I will release a fixed version 0.6.1.

alexnavis commented 4 years ago

Hi @FlorianLautenschlager ,

Sure thanks for the build. But one thing, if you have decided to merge your changes you could have informed me to close my PR. It could have saved my time to avoid revert those log commits, because I was thinking you are fine with the other fixes that is cleaner in not creating unnecessary lexer/parser(new CQL) instances for chronix queries with no CF's involved.

Thanks for the help anyways.

Thanks, Alex

shridharV commented 4 years ago

Hi @FlorianLautenschlager, https://github.com/FlorianLautenschlager @alexnavis https://github.com/alexnavis

Thanks for solving this issue.

Regards Shridhar

On Sat, Aug 24, 2019 at 10:52 PM Alex Anto Navis L notifications@github.com wrote:

Hi @FlorianLautenschlager https://github.com/FlorianLautenschlager ,

Sure thanks for the build. But one thing, if you have decided to merge your changes you could have informed me to close my PR. It could have saved my time to avoid revert those log commits, because I was thinking you are fine with the other fixes that is cleaner in not creating unnecessary lexer/parser(new CQL) instances for chronix queries with no CF's involved.

Thanks for the help anyways.

Thanks, Alex

Thanks, Alex

— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/ChronixDB/chronix.server/issues/151?email_source=notifications&email_token=AFB32ORA6C5KDJH6EYJALY3QGIM3PA5CNFSM4IMROMEKYY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGOD5CM3QQ#issuecomment-524602818, or mute the thread https://github.com/notifications/unsubscribe-auth/AFB32OQMJH42KYHGG6K3VITQGIM3PANCNFSM4IMROMEA .