7even / endless-ships

Online reference to the Endless Sky game
http://endless-sky.7vn.io/
GNU General Public License v3.0
19 stars 10 forks source link

boot build fails with OutOfMemoryError: Java heap space #19

Closed akosiaris closed 4 days ago

akosiaris commented 6 months ago

Hi,

Thanks for this, it's a pretty useful tool for me, when trying to figure out which ship I should fly/outfit I should use.

I have a problem, I am trying to build this using the instructions in the README and failing. It's not entirely consistent but the error is always the same, namely lack of Java Heap Space. Pasting below the last 3 runs, edited for brevity

1st try

$ docker build .
Sending build context to Docker daemon  785.9kB
Step 1/13 : FROM clojure:boot-alpine AS boot
[snip]
Retrieving bcprov-jdk15on-1.62.jar from https://repo1.maven.org/maven2/ (4451k)
    endless-ships.parser/fn/fn  parser.clj:   51
    endless-ships.parser/parse  parser.clj:   43
       /instaparse.core.Parser    core.clj:  141
                           ...                  
         instaparse.core/parse    core.clj:   47
         instaparse.core/parse    core.clj:   94
          instaparse.gll/parse     gll.clj:  979
            instaparse.gll/run     gll.clj:  449
            instaparse.gll/run     gll.clj:  463
           instaparse.gll/step     gll.clj:  445
 instaparse.gll/push-message/f     gll.clj:  272
instaparse.gll/NodeListener/fn     gll.clj:  501
    instaparse.gll/push-result     gll.clj:  362
   instaparse.gll/push-message     gll.clj:  280
            clojure.core/swap!    core.clj: 2262
                           ...                  
            clojure.core/assoc    core.clj:  191
                           ...                  
java.lang.OutOfMemoryError: Java heap space
clojure.lang.ExceptionInfo: Java heap space
    line: 47
The command '/bin/sh -c boot build' returned a non-zero code: 1

2nd try

$ docker build .
Sending build context to Docker daemon  785.9kB
Step 1/13 : FROM clojure:boot-alpine AS boot
[snip]
Retrieving bcprov-jdk15on-1.62.jar from https://repo1.maven.org/maven2/ (4451k)
        /instaparse.core.Parser  core.clj:  141
                            ...                
          instaparse.core/parse  core.clj:   47
          instaparse.core/parse  core.clj:   94
           instaparse.gll/parse   gll.clj:  979
             instaparse.gll/run   gll.clj:  449
             instaparse.gll/run   gll.clj:  463
            instaparse.gll/step   gll.clj:  445
instaparse.gll/push-listener/fn   gll.clj:  383
          instaparse.gll/-parse   gll.clj:  136
       instaparse.gll/alt-parse   gll.clj:  852
   instaparse.gll/push-listener   gll.clj:  374
        instaparse.gll/node-get   gll.clj:  320
             clojure.core/swap!  core.clj: 2262
                            ...                
             clojure.core/assoc  core.clj:  191
                            ...                
java.lang.OutOfMemoryError: Java heap space
clojure.lang.ExceptionInfo: Java heap space
    line: 47

3rd try

$ docker build .
Sending build context to Docker daemon  785.9kB
Step 1/13 : FROM clojure:boot-alpine AS boot
[snip]
Retrieving bcpkix-jdk15on-1.62.jar from https://repo1.maven.org/maven2/ (850k)
    endless-ships.parser/parse  parser.clj:   43
       /instaparse.core.Parser    core.clj:  141
                           ...                  
         instaparse.core/parse    core.clj:   47
         instaparse.core/parse    core.clj:   94
          instaparse.gll/parse     gll.clj:  979
            instaparse.gll/run     gll.clj:  449
            instaparse.gll/run     gll.clj:  463
           instaparse.gll/step     gll.clj:  445
 instaparse.gll/push-message/f     gll.clj:  272
instaparse.gll/NodeListener/fn     gll.clj:  501
    instaparse.gll/push-result     gll.clj:  362
   instaparse.gll/push-message     gll.clj:  280
            clojure.core/swap!    core.clj: 2262
                           ...                  
            clojure.core/assoc    core.clj:  191
                           ...                  
java.lang.OutOfMemoryError: Java heap space
clojure.lang.ExceptionInfo: Java heap space
    line: 47

The stack trace differs a little bit each time but overall the error remains the same. The machine I try to build on has plenty of RAM (16GB) and the machine never appears under any memory stress (and the OOM killer has never showed up).

I 've already tried setting BOOT_JVM_OPTIONS=-Xmx8g (8 is the last value I tried before giving up and filing this task) but I saw no difference. I am a bit out of my depth here, my clojure knowledge is pretty limited, I 'd appreciate any help. Thanks!

7even commented 6 months ago

I would check Docker memory limits as well, they might be lower than 8gb. docker stats shows both the current memory usage and the limit.

akosiaris commented 6 months ago

I would check Docker memory limits as well, they might be lower than 8gb. docker stats shows both the current memory usage and the limit.

Indeed, they could be, but they are not in this case. In order to be set, I would need to pass -m and I don't. Nor do I have any kind of configuration in /etc/docker/daemon.json that would trigger such a thing (in fact, no such setting exists). But most importantly:

I did already blame my machine and tried it in a cloud provider. Same behavior for what is worth.

I could try and do this outside of docker entirely to see if Docker is the issue here, but I kinda doubt it at this point.

akosiaris commented 6 months ago

As an update, I reproduced this outside of Docker too. Process was pretty simple.

Get the same error and behavior.

akosiaris commented 6 months ago

I 've dug a bit more into what might be causing this. I 've tried randomly checking out various commits in this repo to see if they would build successfully, to no avail.

Then, I tried checking out various tags in the git submodule (the actual game that is). And well, v0.9.12 built just fine. What's utterly confusing to me is that a git bisect later I ended up with the following (all of this is done on the master branch of this repo).

git bisect start master v0.9.12 
Bisecting: 1565 revisions left to test after this (roughly 11 steps)
[79ae2450290a9cb18293e340ef20817895846914] feat(content): The player is contacted by Republic Intel after making contact with the Hai (#7001)
[snip]
16e7a8060a9414a992def0c6c20958ef7ed91848 is the first bad commit
commit 16e7a8060a9414a992def0c6c20958ef7ed91848
Author: Zitchas <32169904+Zitchas@users.noreply.github.com>
Date:   Fri Sep 27 07:00:33 2019 -0600

    Adding the Quantum Tear

Which is this https://github.com/endless-sky/endless-sky/commit/16e7a8060a9414a992def0c6c20958ef7ed91848. First version to be included in is v0.9.13, which is dated Apr 17, 2021.

I don't find it very plausible that for the last almost 3 years this was failing to build, but maybe it's a useful clue.

It isn't possible to cleanly revert this commit as the game data/ folder has changed. It is still doable manually to remove the outfit from the data files and remove it from the Heron ship. But it doesn't make a difference. I 've removed it from master branch for the game git submodule but still get the same error.

akosiaris commented 6 months ago

A more complete stack trace than the ones posted above (I am a clojure/boot novice, bear with me) as found out via boot -vvv build appears to be:

clojure.lang.ExceptionInfo: Java heap space {:line 47}
    at clojure.core$ex_info.invokeStatic(core.clj:4617)
    at clojure.core$ex_info.invoke(core.clj:4617)
    at boot.main$_main$fn__1317.invoke(main.clj:222)
    at boot.main$_main.invokeStatic(main.clj:216)
    at boot.main$_main.invoke(main.clj:123)
    at clojure.lang.Var.invoke(Var.java:394)
    at org.projectodd.shimdandy.impl.ClojureRuntimeShimImpl.invoke(ClojureRuntimeShimImpl.java:154)
    at org.projectodd.shimdandy.impl.ClojureRuntimeShimImpl.invoke(ClojureRuntimeShimImpl.java:145)
    at boot.App.runBoot(App.java:407)
    at boot.App.main(App.java:500)
    at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:77)
    at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.base/java.lang.reflect.Method.invoke(Method.java:568)
    at Boot.main(Boot.java:257)
Caused by: java.lang.OutOfMemoryError: Java heap space
    at clojure.lang.PersistentHashMap.cloneAndSet(PersistentHashMap.java:1169)
    at clojure.lang.PersistentHashMap.access$000(PersistentHashMap.java:28)
    at clojure.lang.PersistentHashMap$ArrayNode.assoc(PersistentHashMap.java:418)
    at clojure.lang.PersistentHashMap$ArrayNode.assoc(PersistentHashMap.java:415)
    at clojure.lang.PersistentHashMap.assoc(PersistentHashMap.java:142)
    at clojure.lang.PersistentHashMap.assoc(PersistentHashMap.java:28)
    at clojure.lang.RT.assoc(RT.java:792)
    at clojure.core$assoc__4371.invokeStatic(core.clj:191)
    at clojure.core$assoc__4371.invoke(core.clj:190)
    at clojure.lang.Atom.swap(Atom.java:65)
    at clojure.core$swap_BANG_.invokeStatic(core.clj:2262)
    at clojure.core$swap_BANG_.invoke(core.clj:2253)
    at instaparse.gll$node_get.invokeStatic(gll.clj:320)
    at instaparse.gll$node_get.invoke(gll.clj:312)
    at instaparse.gll$push_listener.invokeStatic(gll.clj:374)
    at instaparse.gll$push_listener.invoke(gll.clj:367)
    at instaparse.gll$ordered_alt_parse.invokeStatic(gll.clj:867)
    at instaparse.gll$ordered_alt_parse.invoke(gll.clj:860)
    at instaparse.gll$_parse.invokeStatic(gll.clj:149)
    at instaparse.gll$_parse.invoke(gll.clj:130)
    at instaparse.gll$push_listener$fn__4146.invoke(gll.clj:383)
    at instaparse.gll$step.invokeStatic(gll.clj:445)
    at instaparse.gll$step.invoke(gll.clj:439)
    at instaparse.gll$run.invokeStatic(gll.clj:463)
    at instaparse.gll$run.invoke(gll.clj:447)
    at instaparse.gll$run.invokeStatic(gll.clj:449)
    at instaparse.gll$run.invoke(gll.clj:447)
    at instaparse.gll$parse.invokeStatic(gll.clj:979)
    at instaparse.gll$parse.invoke(gll.clj:974)
    at instaparse.core$parse.invokeStatic(core.clj:94)
    at instaparse.core$parse.doInvoke(core.clj:47)
    at clojure.lang.RestFn.invoke(RestFn.java:425)
7even commented 2 weeks ago

Which is this endless-sky/endless-sky@16e7a80. First version to be included in is v0.9.13, which is dated Apr 17, 2021.

The problem now reproduces on my machine, and since I've added some logging, I see that it gets stuck on the remnant outfits.txt file which was changed in the commit you mentioned. From what I see here, there's nothing that should break it, so I may need some time to investigate.

akosiaris commented 1 week ago

Hi,

Thanks for following up.

I 've been doing some more debugging on this issue every now and then. I think it has multiple manifestations, but I also bear some good news. I found the source of one of the current manifestation you are witnessing.

First off, my git bisect above is misleading. I had a mistake in my git bisect script, which ended up identifying multiple issues, not just OutOfMemory ones. It did identify an actual issue, namely that "velocity scale" .2.5 in effect superweapon residue had a typo (note the 2 dots) but that was fixed in endless-sky/endless-sky@2a9ab8e4d387a5. You can completely disregard the findings of my git bisect, I hope you didn't waste too much time on that rabbit hole (I know I did). However, let me point out that this offered me a hint for the current issue.

Tangentially, I test on Linux, my understanding per #20 is that least some tries have happened on Windows. Docker desktop on Windows (and MacOs) starts a Linux VM (there are some differences regarding Hyper-V vs WSL 2 apparently, but ) and runs Docker in it. There some more information at https://docs.docker.com/desktop/settings/windows/#resources, my understanding is that the default is 2GB, which most definitely isn't going to be enough (I 'll explain later why) and will need tweaking.

To the point, using your latest commits to master, I can reproduce the issue you see on my machine. RSS memory usage for the JVM is up to 8400MB RAM (virtual is at 10GB RAM, but virtual isn't really useful here) before croaking with

Execution error (OutOfMemoryError) at instaparse.auto_flatten_seq.AutoFlattenSeq/conj_flat (auto_flatten_seq.clj:145).
Java heap space

Full report at:
/tmp/clojure-8038666787650120666.edn
Error: building at STEP "RUN clojure -J-Xmx8g -X:clj:build": while running runtime: exit status 1

I assume this is the same error you see. I also assume that my earlier efforts to pass -Xmx8g have been unsuccessful, for reasons I don't know but also are probably not important any more, since you ditched boot for tools.deps (thanks for that!)

Chasing down the remnant outfits.txt issue, I was able to build fine v0.10.6. From there one, there is only 1 commit in v0.10.8 that affects that file, namely endless-sky/endless-sky@a92acf949. While that is a big change, the file in question has few changes and there is a suspicious one at https://github.com/endless-sky/endless-sky/commit/a92acf949#diff-07da20eabfd0ddcd0a1f9e79d6f3ecf1d225279d058e1f721340afec9b05138dL408

Note how "hit-force"135 does not have the single space that "hit force" 90 has. And sure enough, locally messing with that single line and adding a single space character before 135, makes the build succeed.

So good news! There's a fix (and it should probably make it upstream anyway)! However, this shows a difference between the in-game parser (I am assuming it doesn't croak due to that space character missing) and this project's parser. I can try to submit a fix to parser.bnf if you want, but do let me know if that would be the wrong way to solve it.

Finally, I left for later to explain why I think a default Docker Desktop on Windows will anyway fail. While debugging this, I 've crafted the following patch to parser.clj. https://gist.github.com/akosiaris/a643e4cdd389fbf0bb81468547ade07e

There is 1 thing that stood out.

parsing: map systems.txt. Size: 696566. Used memory before parse: 1472551.6
Used memory after parse: 3708792.2

Parsing the 681KB map systems.txt file, moved the memory usage from ~1.4GB to > 3.6GB. I did notice this in RSS usage as well, so while my code is crude and quite possibly somewhat wrong (I am not sure if I should be accounting for JVM's heap, native, direct and stack memory types), it's not entirely off. Why the parser would require > 2GB RAM to parse a 681KB file is still a question whose answer eludes me, but it makes me think that a default Windows/MacOS Docker Desktop setup, just isn't gonna cut it.

akosiaris commented 1 week ago

The missing space was fixed in https://github.com/endless-sky/endless-sky/commit/d365d4ac91e62fe6d66d264be42d3f918a8d0887 fwiw

7even commented 4 days ago

Hi.

Thank you for helping to debug the issue.

I also found the problem with that "hit-force"135 without a space inside remnant outfits.txt. It was already fixed in the game repository but unfortunately the v0.10.8 version remains unparseable (and will always be). I'm not sure how to fix that in parser.bnf because it heavily relies on the fact that tokens (things like "hit-force", 135, licenses etc - the smallest bits of data processed by the parser) are always separated by whitespace characters. If you know how to change the parser grammar (parser.bnf) to support these cases, you're welcome to send a PR.

Speaking of the Windows problem: the conversation in #20 moved to the point where the parsing part began to work, and the current problem is to build the frontend app; so it seems the problem is solved by fixing the data files.

Overall it seems like the instaparse library (which handles all the parsing in this project) may go into some infinite loop or recursion when something in the parsed files does not match the parser grammar; it consumes all the memory it can get and finally crashes the JVM with that Java heap space exception. It's just my assumption, I don't really have any other ideas.

I decided to switch the Docker image to using the latest master version of the game for now (until the next release comes out), so currently both Docker and manual installation ways should work (they will use the current version of the game). Can you please check if it works on your machine so we can close the issue?

akosiaris commented 4 days ago

Hi,

Can you please check if it works on your machine so we can close the issue?

It does. As far as this issue goes, I think we can close it. The situation has changed enough to warrant this. Boot is no longer in use, the clj aliases pass a sufficiently large Java heap space parameterization for the build to work and usage by others is reportedly unblocked (per #20)

In my humble opinion, there is also an underlying issue, that is that parsing some files requires Java Heap Space memory that is orders of magnitude larger than the files themselves, which might (or might not be, I just have a gut feeling that it might) be related to the issue you describe of unexpected (for the grammar) input send instaparse into a death spiral ending with Java Heap Space exception. If we ever try to tackle this, it should be in it's own dedicated issue.

My thanks for handling this!