Apache Jena
https://jena.apache.org/
Apache License 2.0

Poor performance when parsing huge literal in query (e.g. 100MB) #1324

Open SimonBin opened 2 years ago

SimonBin commented 2 years ago

The cause seems to be https://github.com/javacc/javacc/issues/72

We encountered this issue when a SPARQL SERVICE clause was sending a large-ish Geometry literal of USA to Fuseki. It stalls forever trying to parse the query.

Ideally, the buffer would expand exponentially; there is also an alternative PR linked in the javacc issue. Currently, the parsing buffer is apparently grown in steps of 2 KiB.
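For illustration, here is a minimal sketch (hypothetical code, not Jena or JavaCC) of a character buffer that grows geometrically, so the total amount of recopying stays proportional to the final size instead of quadratic in it:

```java
// Sketch (not Jena/JavaCC code): a char buffer that grows geometrically,
// so total recopying stays proportional to the final size.
final class GrowingCharBuffer {
    private char[] buf = new char[2048];
    private int len = 0;

    void append(char c) {
        if (len == buf.length) {
            // Grow by 1.5x (like java.util.ArrayList) instead of a fixed +2048.
            buf = java.util.Arrays.copyOf(buf, buf.length + (buf.length >> 1));
        }
        buf[len++] = c;
    }

    int length()   { return len; }
    int capacity() { return buf.length; }
}
```

With a fixed +2048 step, appending N chars costs O(N²/2048) copying; with the 1.5x step it is O(N).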

jstack:

```
"qtp771666241-136" #136 prio=5 os_prio=0 cpu=13385,35ms elapsed=5538,68s tid=0x00007fa188007800 nid=0x15730d runnable [0x00007fa1341f9000]
   java.lang.Thread.State: RUNNABLE
    at org.apache.jena.sparql.lang.arq.SimpleCharStream.ExpandBuff(SimpleCharStream.java:42)
    at org.apache.jena.sparql.lang.arq.SimpleCharStream.FillBuff(SimpleCharStream.java:103)
    at org.apache.jena.sparql.lang.arq.SimpleCharStream.readChar(SimpleCharStream.java:197)
    at org.apache.jena.sparql.lang.arq.ARQParserTokenManager.jjMoveNfa_0(ARQParserTokenManager.java:4369)
    at org.apache.jena.sparql.lang.arq.ARQParserTokenManager.jjMoveStringLiteralDfa0_0(ARQParserTokenManager.java:211)
    at org.apache.jena.sparql.lang.arq.ARQParserTokenManager.getNextToken(ARQParserTokenManager.java:4793)
    at org.apache.jena.sparql.lang.arq.ARQParser.jj_ntk_f(ARQParser.java:8162)
    at org.apache.jena.sparql.lang.arq.ARQParser.PathElt(ARQParser.java:3603)
    at org.apache.jena.sparql.lang.arq.ARQParser.PathEltOrInverse(ARQParser.java:3635)
    at org.apache.jena.sparql.lang.arq.ARQParser.PathSequence(ARQParser.java:3565)
    at org.apache.jena.sparql.lang.arq.ARQParser.PathAlternative(ARQParser.java:3544)
    at org.apache.jena.sparql.lang.arq.ARQParser.Path(ARQParser.java:3538)
    at org.apache.jena.sparql.lang.arq.ARQParser.VerbPath(ARQParser.java:3493)
    at org.apache.jena.sparql.lang.arq.ARQParser.PropertyListPathNotEmpty(ARQParser.java:3418)
    at org.apache.jena.sparql.lang.arq.ARQParser.TriplesSameSubjectPath(ARQParser.java:3365)
    at org.apache.jena.sparql.lang.arq.ARQParser.TriplesBlock(ARQParser.java:2512)
    at org.apache.jena.sparql.lang.arq.ARQParser.GroupGraphPatternSub(ARQParser.java:2425)
    at org.apache.jena.sparql.lang.arq.ARQParser.GroupGraphPattern(ARQParser.java:2387)
    at org.apache.jena.sparql.lang.arq.ARQParser.WhereClause(ARQParser.java:858)
    at org.apache.jena.sparql.lang.arq.ARQParser.SelectQuery(ARQParser.java:137)
    at org.apache.jena.sparql.lang.arq.ARQParser.Query(ARQParser.java:31)
    at org.apache.jena.sparql.lang.arq.ARQParser.QueryUnit(ARQParser.java:22)
    at org.apache.jena.sparql.lang.ParserARQ$1.exec(ParserARQ.java:48)
    at org.apache.jena.sparql.lang.ParserARQ.perform(ParserARQ.java:95)
    at org.apache.jena.sparql.lang.ParserARQ.parse$(ParserARQ.java:52)
    at org.apache.jena.sparql.lang.SPARQLParser.parse(SPARQLParser.java:33)
    at org.apache.jena.query.QueryFactory.parse(QueryFactory.java:144)
    at org.apache.jena.query.QueryFactory.create(QueryFactory.java:83)
    at org.apache.jena.fuseki.servlets.SPARQLQueryProcessor.execute(SPARQLQueryProcessor.java:251)
    at org.apache.jena.fuseki.servlets.SPARQLQueryProcessor.executeBody(SPARQLQueryProcessor.java:234)
    at org.apache.jena.fuseki.servlets.SPARQLQueryProcessor.execute(SPARQLQueryProcessor.java:213)
    at org.apache.jena.fuseki.servlets.ActionService.executeLifecycle(ActionService.java:58)
    at org.apache.jena.fuseki.servlets.SPARQLQueryProcessor.execPost(SPARQLQueryProcessor.java:83)
    at org.apache.jena.fuseki.servlets.ActionProcessor.process(ActionProcessor.java:34)
    at org.apache.jena.fuseki.servlets.ActionBase.process(ActionBase.java:55)
    at org.apache.jena.fuseki.servlets.ActionExecLib.execActionSub(ActionExecLib.java:125)
    at org.apache.jena.fuseki.servlets.ActionExecLib.execAction(ActionExecLib.java:99)
    at org.apache.jena.fuseki.server.Dispatcher.dispatchAction(Dispatcher.java:164)
    at org.apache.jena.fuseki.server.Dispatcher.process(Dispatcher.java:156)
    at org.apache.jena.fuseki.server.Dispatcher.dispatch(Dispatcher.java:83)
    at org.apache.jena.fuseki.servlets.FusekiFilter.doFilter(FusekiFilter.java:48)
    at org.eclipse.jetty.servlet.FilterHolder.doFilter(FilterHolder.java:202)
    at org.eclipse.jetty.servlet.ServletHandler$Chain.doFilter(ServletHandler.java:1600)
    at org.apache.shiro.web.servlet.ProxiedFilterChain.doFilter(ProxiedFilterChain.java:61)
    at org.apache.shiro.web.servlet.AdviceFilter.executeChain(AdviceFilter.java:108)
    at org.apache.shiro.web.servlet.AdviceFilter.doFilterInternal(AdviceFilter.java:137)
    at org.apache.shiro.web.servlet.OncePerRequestFilter.doFilter(OncePerRequestFilter.java:125)
    at org.apache.shiro.web.servlet.ProxiedFilterChain.doFilter(ProxiedFilterChain.java:66)
    at org.apache.shiro.web.servlet.AdviceFilter.executeChain(AdviceFilter.java:108)
    at org.apache.shiro.web.servlet.AdviceFilter.doFilterInternal(AdviceFilter.java:137)
    at org.apache.shiro.web.servlet.OncePerRequestFilter.doFilter(OncePerRequestFilter.java:125)
    at org.apache.shiro.web.servlet.ProxiedFilterChain.doFilter(ProxiedFilterChain.java:66)
    at org.apache.shiro.web.servlet.AbstractShiroFilter.executeChain(AbstractShiroFilter.java:450)
    at org.apache.shiro.web.servlet.AbstractShiroFilter$1.call(AbstractShiroFilter.java:365)
    at org.apache.shiro.subject.support.SubjectCallable.doCall(SubjectCallable.java:90)
    at org.apache.shiro.subject.support.SubjectCallable.call(SubjectCallable.java:83)
    at org.apache.shiro.subject.support.DelegatingSubject.execute(DelegatingSubject.java:387)
    at org.apache.shiro.web.servlet.AbstractShiroFilter.doFilterInternal(AbstractShiroFilter.java:362)
    at org.apache.shiro.web.servlet.OncePerRequestFilter.doFilter(OncePerRequestFilter.java:125)
    at org.eclipse.jetty.servlet.FilterHolder.doFilter(FilterHolder.java:202)
    at org.eclipse.jetty.servlet.ServletHandler$Chain.doFilter(ServletHandler.java:1600)
    at org.apache.jena.fuseki.servlets.CrossOriginFilter.handle(CrossOriginFilter.java:284)
    at org.apache.jena.fuseki.servlets.CrossOriginFilter.doFilter(CrossOriginFilter.java:247)
    at org.eclipse.jetty.servlet.FilterHolder.doFilter(FilterHolder.java:210)
    at org.eclipse.jetty.servlet.ServletHandler$Chain.doFilter(ServletHandler.java:1600)
    at org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:506)
    at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:131)
    at org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:578)
    at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:122)
    at org.eclipse.jetty.server.handler.ScopedHandler.nextHandle(ScopedHandler.java:223)
    at org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:1571)
    at org.eclipse.jetty.server.handler.ScopedHandler.nextHandle(ScopedHandler.java:221)
    at org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1378)
    at org.eclipse.jetty.server.handler.ScopedHandler.nextScope(ScopedHandler.java:176)
    at org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:463)
    at org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:1544)
    at org.eclipse.jetty.server.handler.ScopedHandler.nextScope(ScopedHandler.java:174)
    at org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1300)
    at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:129)
    at org.eclipse.jetty.server.handler.gzip.GzipHandler.handle(GzipHandler.java:717)
    at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:122)
    at org.eclipse.jetty.server.Server.handle(Server.java:562)
    at org.eclipse.jetty.server.HttpChannel.lambda$handle$0(HttpChannel.java:505)
    at org.eclipse.jetty.server.HttpChannel$$Lambda$636/0x000000084084d040.dispatch(Unknown Source)
    at org.eclipse.jetty.server.HttpChannel.dispatch(HttpChannel.java:762)
    at org.eclipse.jetty.server.HttpChannel.handle(HttpChannel.java:497)
    at org.eclipse.jetty.server.HttpConnection.onFillable(HttpConnection.java:282)
    at org.eclipse.jetty.io.AbstractConnection$ReadCallback.succeeded(AbstractConnection.java:319)
    at org.eclipse.jetty.io.FillInterest.fillable(FillInterest.java:100)
    at org.eclipse.jetty.io.SelectableChannelEndPoint$1.run(SelectableChannelEndPoint.java:53)
    at org.eclipse.jetty.util.thread.strategy.AdaptiveExecutionStrategy.runTask(AdaptiveExecutionStrategy.java:412)
    at org.eclipse.jetty.util.thread.strategy.AdaptiveExecutionStrategy.consumeTask(AdaptiveExecutionStrategy.java:381)
    at org.eclipse.jetty.util.thread.strategy.AdaptiveExecutionStrategy.tryProduce(AdaptiveExecutionStrategy.java:268)
    at org.eclipse.jetty.util.thread.strategy.AdaptiveExecutionStrategy.lambda$new$0(AdaptiveExecutionStrategy.java:138)
    at org.eclipse.jetty.util.thread.strategy.AdaptiveExecutionStrategy$$Lambda$624/0x0000000840830c40.run(Unknown Source)
    at org.eclipse.jetty.util.thread.ReservedThreadExecutor$ReservedThread.run(ReservedThreadExecutor.java:407)
    at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:894)
    at org.eclipse.jetty.util.thread.QueuedThreadPool$Runner.run(QueuedThreadPool.java:1038)
    at java.lang.Thread.run(java.base@11.0.15/Thread.java:829)
```

The query is something as simple as:

```
{ ?c ^<http://www.opengis.net/ont/geosparql#sfContains> "<?xml version=\"1.0\" encoding=\"UTF-8\"?><gml:MultiSurface xmlns:gml=\"http://www.opengis.net/ont/gml\" gml:id=\"g2015_2014_0.104.wkb_geometry\" srsDimension=\"2\" srsName=\"urn:ogc:def:crs:EPSG::3857\"><gml:surfaceMember><gml:Polygon gml:id=\"g2015_2014_0.104.wkb_geometry.1\"><gml:exterior><gml:LinearRing><gml:posList>HUGE POS LIST</gml:posList></gml:LinearRing></gml:exterior></gml:Polygon></gml:surfaceMember></gml:MultiSurface>"^^<http://www.opengis.net/ont/geosparql#gmlLiteral> }
```

Automatically injected from a SERVICE clause.

afs commented 2 years ago

Reading the JavaCC issue, it seems the query size is not the problem.

It is the presence of large tokens over 2048 bytes.

@SimonBin Is that correct?

SimpleCharStream and JavaCharStream are not Jena code. They are generated by JavaCC. They are committed to the codebase so people building Jena do not need to install JavaCC themselves.

SimonBin commented 2 years ago

Yes I believe this is correct

afs commented 2 years ago

I don't see a PR on the javacc issue that is suitable.

There is an interesting suggestion about lexical states. ARQ only parses from strings, not streams, and only from data already converted from UTF-8. Access to the input would enable slicing literals directly out of the string.

Rather than disrupt the existing processing, it could be done with a new token e.g. X"....".

USER_CHAR_STREAM is also an option.

There is some investigation to do such as updating for Javacc 7.0 (the Jena codebase files were produced from JavaCC 6.0). #1328.

FYI: The different parsers use different techniques to handle Unicode; this shows up in some tests about surrogate pairs.

SimonBin commented 2 years ago

I tried to regenerate the grammars with javacc/javacc#85 applied, as that sounds promising (you can check the code here: https://github.com/SANSA-Stack/jena/commit/95c41d29c73167fa19da5ad0f501131f54ae3d58), but there seems to be a bug in javacc/javacc#85 which makes the parsing fail with ContentIllegalInProlog.

afs commented 2 years ago

javacc/javacc#85 has not been integrated into javacc releases.

Jena running its own fork of javacc is highly undesirable - it's technical debt. Users must be able to build Jena themselves. While we ship the generated code in a Jena release so users don't have to run javacc (e.g. only old versions are available in Ubuntu package repos), they should still be able to.

I'm not even sure javacc/javacc#85 is the right solution - it has access overhead. Maybe that's why ArrayList grows by 1.5x each step.

The case of 100Mb literals in a query is not mainstream :smile: . (Maybe search by SHA512?)

1328 (JavaCC 7.0 upgrade) is in Jena/main.

(Background: the main RIOT parsers do not use JavaCC.)

ContentIllegalInProlog is an XML error. Maybe it should not have the ?xml part. RDF XML Literals do not have the <?xml>. If you change your example to an rdf:XMLLiteral, there is a warning.

SimonBin commented 2 years ago

nb JavaCC 7 does not influence this performance

LorenzBuehmann commented 2 years ago

> The case of 100Mb literals in a query is not mainstream :smile: . (Maybe search by SHA512?)

True. But to give some background (and why SHA512 is not a workaround here): we're currently trying to make use of GeoSPARQL in our projects. We gathered boundaries/borders of administrative regions from an external dataset provided as GML (which is in fact XML) - for the USA the polygon is ~100MB. We tried to work with this initially, spotted the limitation in Jena, and in the meantime simplified the polygon directly via a SPARQL query function call (with a custom Jena function - unfortunately, the GeoSPARQL standard lacks quite a lot of things that, in my opinion, people would be happy to have; the good thing is we can extend Jena for our work).

afs commented 2 years ago

Doesn't quite explain why there is a 100Mb literal in a query, but no matter.

SPARQL parsing is central so any changes need to be done carefully, and be proven and mature. Like javacc, I'm thinking about unforeseen consequences.

The parser to use is (surprise!) controlled by a registry, SPARQLParserRegistry, so extension code can swap the parser for an experimental one.

Also --

The javacc issue suggests a different approach which is also more efficient - lexical states and lexical actions. The string for the token image can be created directly without going through the javacc buffering.

LorenzBuehmann commented 2 years ago

Fair. But even if we cannot fix this - also given that it's more of a corner case, with literals at MB scale and beyond - we wanted to at least report this behavior; it could be a good FAQ entry or the like. Luckily, in our case we own both Fuseki instances, A (with the polygons) and B (with the points). Thus, we simply switched the direction: gathering the polygon via a SERVICE request from A and doing the point-in-polygon check in B, instead of passing the polygon from A to B to do the point-in-polygon check via the SERVICE request.

afs commented 2 years ago

Yes, good to report.

It might be more of an issue in INSERT DATA although then there is usually the option of POSTing RDF.

Fuseki has "Fuseki modules" so adding a modified parser does not require a complete rebuild.

Or in this case the plain Jena initialization that can modify the server.

The lexical actions approach looks interesting.

new-javacc commented 2 years ago

Do you have an actual test case and the grammar somewhere I can look? I'm curious.

new-javacc commented 2 years ago

> javacc/javacc#85 has not been integrated into javacc releases.
>
> Jena running its own fork of javacc is highly undesirable - it's technical debt. Users must be able to build Jena themselves. While we ship the generated code in a Jena release so users don't have to run javacc (e.g. only old versions are available in Ubuntu package repos), they should still be able to.

Not sure why you needed to fork. Let me know if there is something I can do to help with that. If the issue is with CharStream, just set USER_CHAR_STREAM=true and write your own stream class that does better buffer management. But like I said, most often it's grammar inefficiencies that manifest this way. And use ideas like lexical states to handle huge tokens more elegantly.

> I'm not even sure javacc/javacc#85 is the right solution - it has access overhead. Maybe that's why ArrayList grows by 1.5x each step.

Not sure what this is. I don't think we merged it into main; I will check.

> The case of 100Mb literals in a query is not mainstream :smile: . (Maybe search by SHA512?)

It's not unheard of. But if you expect this to happen even occasionally (as opposed to rarely), you might want to redo your rule for literals. The negation operator does generate a lot of overhead in terms of state etc.

> 1328 (JavaCC 7.0 upgrade) is in Jena/main.
>
> (Background: the main RIOT parsers do not use JavaCC.)
>
> ContentIllegalInProlog is an XML error. Maybe it should not have the ?xml part. RDF XML Literals do not have the <?xml>. If you change your example to an rdf:XMLLiteral, there is a warning.

afs commented 2 years ago

The issue we experience is buffer management. Linear growth in steps of a few KBytes up to 100Mb means a lot of recopying. If it grew at, say, x1.5 (like a Java ArrayList), the effect would be much less. (The same issue can arise with ArrayList, but it is much less pronounced.)

The grammar is the grammar in the SPARQL specification - it really "is the grammar", because the grammar HTML in the spec was produced from this JavaCC grammar!

The negation is only 3 chars ahead maximum.

@SimonBin - is this triple-quoted literals or single-quoted?

In both cases, if the XML uses " for attributes, then ' quoting may make a difference, but I suspect the buffer expansion is going to dominate.

new-javacc commented 2 years ago

The main issue is this is a corner case and doesn't make sense to penalize the normal cases. So you have two options

  1. Edit the generated file and make it the way you want - javacc doesn't overwrite existing files. It's a "feature" if you will but probably looks like a hack to the modern developers :)
  2. Use the USER_CHARSTREAM option and implement your own charstream - based on the default one.

But like I originally said, it's best and more performant to use lexical states. For example if it's triple quoted and you don't allow triple quotes in the literal, you can use lexical states and do:

```
MORE : { "'''" : QUOTED_CONTENT }

<QUOTED_CONTENT> TOKEN : { <STRING_LITERAL: "'''"> : DEFAULT }

<QUOTED_CONTENT> MORE : { <~[]> }
```

That just keeps building your literal without actually affecting the charstream. The performance difference could be huge here - like 10x even - as it will start eating the chars one by one until it sees a ''', so no buffering in the stream is required as there is no backtracking! And the image of the literal is built using the standard StringBuilder, which is hopefully a smarter implementation. So I suggest doing that option.

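In plain Java terms, the lexical-state idea amounts to scanning ahead for the closing delimiter and accumulating the token image in a single StringBuilder, with no intermediate fixed-step parser buffer. A hypothetical analogue (not generated JavaCC code):

```java
// Hypothetical plain-Java analogue of the lexical-state idea: once the opening
// delimiter is seen, consume chars one by one into a StringBuilder until the
// closing delimiter, bypassing any fixed-step parser buffer.
final class DelimitedScanner {
    static String readDelimited(String input, int start, String open, String close) {
        if (!input.startsWith(open, start))
            throw new IllegalArgumentException("no opening delimiter at " + start);
        StringBuilder sb = new StringBuilder(); // grows geometrically internally
        int i = start + open.length();
        while (!input.startsWith(close, i)) {
            if (i >= input.length())
                throw new IllegalStateException("unterminated literal");
            sb.append(input.charAt(i++));
        }
        return sb.toString();
    }
}
```

Each input character is touched once, so a 100MB literal costs O(n) with StringBuilder's geometric growth, rather than O(n²/2048) with a fixed 2 KiB expansion step.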
SimonBin commented 2 years ago

In this case it's a single double-quoted wktLiteral or gmlLiteral, automatically injected by Jena into a SERVICE clause as outlined in my initial report,

and the bug is JavaCC not defaulting to exponential buffer expansion

This deficiency basically makes it possible to Denial-of-Service a Fuseki server with just a simple SPARQL query.

new-javacc commented 2 years ago

Still the same idea works. Just change it to single quote! And if you have escapes, add those also as a MORE rule. See examples in the repo.

afs commented 2 years ago

@SimonBin The contents of the literal (XML use of " or ') is up to the application.

afs commented 2 years ago

@new-javacc In https://github.com/javacc/javacc/pull/85 the approach is a level of indirection, but with the advantage that there is no content copy.

The current reallocation strategy in javacc is:

```
char[] newbuffer = new char[bufsize + 2048];
```

Following the example of ArrayList.newCapacity:

```
int newSize = bufsize == 2048 ? bufsize + 2048 : (bufsize + (bufsize >> 1));
char[] newbuffer = new char[newSize];
```

which does not have the indirection, but does have a single contiguous buffer (the backup concern on javacc/javacc#85) and the (current) cost of a copy on reallocation. It preserves the current usage experience, but grows faster beyond 6144 bytes (4096+2048) and does less copying at large scale.

new-javacc commented 2 years ago

Sure you can use that as a user charstream impl.

new-javacc commented 2 years ago

> @SimonBin The contents of the literal (XML use of " or ') is up to the application.

Still fine. You need a rule to figure out when to stop the literal. The idea is the same for any token that's delimited by markers

SimonBin commented 2 years ago

@new-javacc why don't you fix this bug in javacc?

new-javacc commented 2 years ago

Because it's not a bug :) This is a corner case, as most tokens average 3-4 chars long, so the buffer should never actually need expansion. And this elegant way of doing demarked literals is more efficient.

afs commented 2 years ago

@new-javacc Thank you for the suggestion of rewriting the grammar, but it does not address the original report, which is the buffer reallocation - that is https://github.com/javacc/javacc/pull/85.

For a 1MByte buffer, the current JavaCC strategy does 267,911,168 bytes of copying (it copies the buffer on every 2K growth).

Changing to a growth factor of 1.5, there are 2,385,340 bytes of copying. It behaves the same as current JavaCC up to 4096.
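Those figures can be reproduced approximately with a small simulation (a sketch, not Jena/JavaCC code; the exact totals depend on the initial buffer size):

```java
// Approximate simulation of the total characters recopied while growing a
// buffer from 4096 up to at least `target`, under the two strategies discussed.
final class GrowthCost {
    static long copied(int target, boolean geometric) {
        long size = 4096, total = 0;
        while (size < target) {
            total += size; // ExpandBuff copies the whole old buffer each time
            size = geometric
                 ? size + (size >> 1) // grow by 1.5x, ArrayList-style
                 : size + 2048;       // current JavaCC fixed 2 KiB step
        }
        return total;
    }
}
```

For a 1M target, the fixed-step strategy recopies on the order of 268 million characters, versus a few million for the 1.5x strategy.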

As this would benefit more than just this project, I've added the possibility to https://github.com/javacc/javacc/pull/85.

@SimonBin - could you please get some VisualVM/YourKit performance hotspot figures to show the time the code spends in buffer allocation?

SimonBin commented 2 years ago

Here is an excerpt from VisualVM where you can see that basically all the time is spent in ExpandBuff - 1 minute just to parse a mere 10MB.


estonia.jfr

If I try this with Australia's border, which is only 3.5 times as big, the time just to parse increases to 14 minutes, running at 120% CPU the whole time.

We could never finish parsing the 100MB literal:


afs commented 2 years ago

Thank you - a grammar change may help but ExpandBuff reading more characters is dominant.

new-javacc commented 2 years ago

Just to clarify - the grammar change eliminates the ExpandBuff call!! I will revisit the charstream fix if I can see tests showing it doesn't impact more common situations.

afs commented 2 years ago

A variant which copes with the escapes may fix the problem - for this project only.

We can add it (the main.jj grammar is in fact two grammars controlled by cpp) - as long as the SPARQL 1.1 grammar isn't touched.

new-javacc commented 2 years ago

Can you point the rule and the grammar file so I can test/validate it myself.

afs commented 2 years ago

https://github.com/apache/jena/blob/3210f8b6096b5e13bf4e1b71803c262dea1703c8/jena-arq/Grammar/main.jj#L2713

There is a lot of history here! It also has to align with the RDF data syntax Turtle.

SPARQL has 4 string forms: 2 single-line (using either " or ') and 2 triple-quoted, multiline (using either """ or ''').

The ARQ grammar is selected by the ifdef's for ARQ.

```
| < STRING_LITERAL_LONG1:
     <QUOTE_3S>
      ( ("'" | "''")? (~["'","\\"] | <ECHAR> | <UCHAR> ))*
     <QUOTE_3S> >
```

where <ECHAR: "\\" ( "t"|"b"|"n"|"r"|"f"|"\\"|"\""|"'") > and UCHAR is the \u and \U hex escapes (handled by JavaCC only in the SPARQL 1.1 form, not the ARQ one).

```
| < #UCHAR:      <UCHAR4> | <UCHAR8> >
| < #UCHAR4:     "\\" "u" <HEX> <HEX> <HEX> <HEX> >
| < #UCHAR8:     "\\" "U" <HEX> <HEX> <HEX> <HEX> <HEX> <HEX> <HEX> <HEX> >
```

new-javacc commented 2 years ago

So yeah, the MORE pattern still keeps the whole thing in memory :( so I normally just use SKIP and collect the image myself into a buffer, like the attached grammar, which works well with the default and is independent of the CharStream logic/buffering. Some of these idioms were developed in 1996 for Java 1.0 with 32MB RAM machines and mostly desktop apps lol, so yeah, time for updating them.

Anyway, the chunking charstream is not well tested for correctness or performance, so until that happens maybe you can simply take that and use it as a USER_CHAR_STREAM (unless you use my SKIP version).

```
TOKEN_MGR_DECLS :
{
  static StringBuilder sb = new StringBuilder();
}

SKIP :
{
   < STRING_LITERAL_BEGIN: "'''" > : LIT_BODY
}

<LIT_BODY> TOKEN :
{
     < STRING_LITERAL_LONG1: "'''" > { matchedToken.image = sb.toString(); } : DEFAULT
}

<LIT_BODY> SKIP :
{
  < ~[] > { sb.append(image); }
}
```

new-javacc commented 2 years ago

Also, if your parser can receive the whole input as a string, you can just use a SimpleCharStream with the buffer size set to the length of the string itself, and instantiate it with a StringReader. Like:

```
SimpleCharStream simpleCharStream =
    new SimpleCharStream(new StringReader(input), 1, 1, input.length());
```

This makes sure it will never call ExpandBuff!

So if the parser is sitting in a service, make it STATIC=false and use one parser per request with this kind of constructor, so you don't need to worry about ExpandBuff or memory management in the parser.