Added performance tests in r388
On my machine (a dual-processor 64-bit Ubuntu box with 8GB RAM), Gson was able to serialize a collection of 1.4 million objects. The deserializer threw a stack overflow error in the parser for collections beyond 87,000 objects.
Original comment by inder123
on 3 Mar 2009 at 10:23
The parser uses a production to match a collection, and that is implemented by JavaCC using recursion. If someone proposes a more efficient production, we will be happy to incorporate it.
Original comment by inder123
on 3 Mar 2009 at 10:39
One of the stated goals of GSON is to "Support arbitrarily complex objects (with deep inheritance hierarchies and extensive use of generic types)". I guess you should change it to "arbitrarily complex objects (except collections with large numbers of items)".
I use a byte array to store the bytes of a file. I can serialize this array with GSON just fine, but somehow this object is too complex to be parsed? A byte array? Really?
Original comment by kenot...@gmail.com
on 13 Jul 2009 at 8:33
Can you give some more information on how large the byte array is?
Original comment by inder123
on 13 Jul 2009 at 8:38
It's a megabyte-ish file, so about a million items. The point, though, is that it shouldn't matter how many items are in it; it's perhaps the most basic data type in all of Java, and if GSON can't parse it, then perhaps GSON shouldn't be generating it in the first place.
Original comment by kenot...@gmail.com
on 13 Jul 2009 at 8:48
Well, that is not how parsers work, unfortunately. They have to match a production, which in this case is for matching a String (I presume this is how you are storing byte arrays). A String may seem like a simple thing, but it actually requires a fair amount of escaping and unescaping, character by character.
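As a rough illustration, the character-by-character work looks something like the sketch below. This is hypothetical code, not Gson's internals; the method name and escape coverage are illustrative only.

```java
// Hypothetical sketch of JSON string unescaping -- not Gson's actual code.
// Every character must be inspected, and each escape rebuilt one by one.
class UnescapeSketch {
    static String unescape(String s) {
        StringBuilder sb = new StringBuilder(s.length());
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c != '\\') { sb.append(c); continue; }
            char next = s.charAt(++i);
            switch (next) {
                case 'n':  sb.append('\n'); break;
                case 't':  sb.append('\t'); break;
                case '"':  sb.append('"');  break;
                case '\\': sb.append('\\'); break;
                case 'u':  // four hex digits follow
                    sb.append((char) Integer.parseInt(s.substring(i + 1, i + 5), 16));
                    i += 4;
                    break;
                default:   sb.append(next);
            }
        }
        return sb.toString();
    }
}
```

The cost is linear in the string length, which is why a single-token match scales so much better than a per-character recursive production.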
In an earlier version of Gson, we used a recursive production to match a String as an array of characters. This resulted in a StackOverflowError for 100KB strings. In a later version, I revised it to a production that does a single token match. Last I tested, Gson could handle strings of over 20MB in size.
Can you give us more details (maybe post a code snippet) on how the bytes are stored? Are you using a byte[] or are you using a String?
Original comment by inder123
on 13 Jul 2009 at 9:03
Wrote performance tests for byte array serialization and deserialization in r430. Gson failed at serializing arrays beyond 8MB, but for deserialization it failed for arrays as small as 32KB. So it seems we have a real performance issue here that needs fixing.
Original comment by inder123
on 13 Jul 2009 at 10:29
Ok, I revised my tests to determine the limit somewhat more precisely. Gson failed at deserialization on byte arrays larger than 80k.
The primary reason is that Gson matches an Array as a production of Array elements. The Elements themselves are matched with a recursive production (and therein lies the problem):

private JsonArray JsonArray() :
{ JsonArray array = new JsonArray(); }
{
  "[" [ Elements(array) ] "]"
  {
    array.reverse();
    return array;
  }
}

private void Elements(JsonArray array) :
{ JsonElement element; }
{
  element=JsonValue() [ "," Elements(array) ]
  { array.add(element); }
}
Any suggestions from anyone on how to improve these productions?
Original comment by inder123
on 13 Jul 2009 at 10:45
Your production appears to approach N^2 in memory usage, since you pass the entire parsed array into the recursive call. The top-level recursion gets a 0-item JsonArray, the first inner recursion gets a 1-item array, and so on, up to N items. But those recursive calls will stay on the stack until the whole list is parsed, so you end up with 1+2+3+...+N items on the stack.
What if you built the JsonArray iteratively in the top-level rule instead, so that the single 'result' array could be updated without passing it into another rule? I haven't played with parser generators since ANTLR back in my compiler class, so I'm not well-versed enough in JavaCC semantics to write up the rule, but I'm imagining a grammar like this (in EBNF):

JsonArray : '[' [ JsonArrayElement ( ',' JsonArrayElement )* ] ']'

I'm not sure how that translates to JavaCC, or how you'd get the next element back, but the point is to not pass a potentially huge data structure around in the call parameters and to build it up iteratively instead.
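The iterative idea can be sketched in plain Java (a hypothetical token-queue model, not JavaCC output): each value is appended to a single result list as soon as it is read, so stack depth stays constant no matter how large the array is.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.List;

// Hypothetical plain-Java model of the iterative approach. Values go
// straight into one result list; no recursion, no reversal needed.
class IterativeSketch {
    static List<String> parseArray(ArrayDeque<String> tokens) {
        List<String> array = new ArrayList<>();
        if (!"[".equals(tokens.poll())) throw new IllegalStateException("expected [");
        while (!"]".equals(tokens.peek())) {
            array.add(tokens.poll());                     // append immediately, in order
            if (",".equals(tokens.peek())) tokens.poll(); // consume separator if present
        }
        tokens.poll(); // consume "]"
        return array;
    }
}
```

Note this loose version treats the comma as optional between values; enforcing it strictly is a separate refinement.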
Original comment by kenot...@gmail.com
on 14 Jul 2009 at 5:54
If you want to keep your current rules, you could perhaps instead make a single globally-scoped JsonArray. So long as you only parse a single array at a time, this may work better with your current rule: globally-scoped variables are not passed on the stack like parameters are, so you should be able to scan arrays all the way up to the size of the JVM's memory.
Original comment by kenot...@gmail.com
on 15 Jul 2009 at 2:12
The limitation is much smaller on devices with limited resources, such as Android phones. I'm running into this problem trying to deserialize a string that's about 20k on Android 1.5.
Can anyone think of a constructive way to work around this issue for now? Would writing a custom deserializer help, or is the problem more low-level than that?
Original comment by mbur...@gmail.com
on 5 Aug 2009 at 5:11
Unfortunately the problem is more low-level, since our JavaCC parser uses recursive productions. The parser gets invoked way before the deserializers are run.
We will try to address this in our next release, but it would greatly help if others can also provide alternate JavaCC productions for JSON arrays.
Original comment by inder123
on 5 Aug 2009 at 8:46
Recursive rules are fine when your input is guaranteed to be small, but the way your JsonArray rule is written, you pass the whole array into each recursive call. That's just plain unnecessary, and it quickly consumes all available memory for even moderately-sized arrays. Instead, rephrase your grammar as left-associative rather than right-recursive: read values from the left end of the input and immediately add them to the output array, then consume the comma and iterate. Something like this:
private JsonArray JsonArray() :
{ JsonArray array = new JsonArray(); JsonElement elem = null; }
{
  "[" ( elem=JsonValue() { array.add(elem); } [","] )* "]"
  { return array; }
}
There's no need for a second Elements() rule here at all, and since the elements are read in the proper order to begin with, there's no need to reverse the array either.
But also note that this is a rather lax grammar, since it doesn't validate that the array is proper JSON. The comma is optional only so that the last value is matched, but because it's optional you could have bogus JSON like '[ 12 34, 56 ]' return the incorrect array (12, 34, 56). If that matters to you, I'll leave the solution to you.
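One hypothetical way to tighten that up, sketched in plain Java over a pre-split token queue (this is not Gson's actual fix), is to read the first value and then insist on a comma before every subsequent value, so input like '[ 12 34, 56 ]' is rejected instead of silently parsed:

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.List;

// Hypothetical strict variant: after the first value, a comma is
// mandatory before each further value.
class StrictArraySketch {
    static List<String> parse(ArrayDeque<String> tokens) {
        List<String> array = new ArrayList<>();
        if (!"[".equals(tokens.poll())) throw new IllegalStateException("expected [");
        if (!"]".equals(tokens.peek())) {
            array.add(tokens.poll());                 // first value
            while (!"]".equals(tokens.peek())) {
                if (!",".equals(tokens.poll()))       // comma is mandatory here
                    throw new IllegalStateException("expected , between values");
                array.add(tokens.poll());
            }
        }
        tokens.poll(); // consume "]"
        return array;
    }
}
```

This still runs in constant stack space; the only change is that the separator is checked rather than skipped when present.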
I generated this to a Java class, plugged my replacement JsonArray implementation into the GSON source, then compiled and ran it against 234 KB of JSON text representing a byte array of about 64,000 items, and it successfully returned the array. It was horribly slow, about 3 minutes, but it did return eventually. Not sure why it's so slow; perhaps, since I have done the rule for you here, you can profile it and find out where all the time is being spent. You may have other recursive rules that can be similarly rewritten to reduce your parser's memory footprint significantly.
Other than this problem (a major one IMHO), GSON has been great, definitely one of the better Java <-> JSON converters out there. Keep up the good work!
PS: when I convert my class to use a String to contain the byte array, it serializes and deserializes without problems, and extremely quickly too (no 3-minute wait). I see you defer this parsing to StringUnmarshaller. I suspect your parser is doing a lot of backtracking, spinning its wheels trying to find a matching production, but a profiler will tell you for sure.
Original comment by kenot...@gmail.com
on 14 Aug 2009 at 8:38
Fixed in r438.
Thanks for the tip, kenotron. I was able to build on your productions to come up with something that enforces the proper JSON rules.
Original comment by inder123
on 22 Aug 2009 at 1:04
Lovely! Preliminary testing on r438 seems to work fine for me.
Original comment by mbur...@gmail.com
on 22 Aug 2009 at 2:10
Original issue reported on code.google.com by
nit...@googlemail.com
on 22 Jan 2009 at 9:40