cucumber / gherkin-utils

API for working with Gherkin documents
MIT License

JVM HeapSize Error on files that are too big #50

Closed Jiehong closed 7 months ago

Jiehong commented 8 months ago

👓 What did you see?

When running spotless:check, if a Gherkin file becomes too big, the formatter causes Maven's JVM to run out of heap space and crash with the following exception:

#7 161.2 [ERROR] Step 'gherkin' found problem in 'src/test/java/maestro/demo/create_workflow_definition_1_http_ok_looped.feature':

#7 161.2 Java heap space

#7 161.2 java.lang.OutOfMemoryError: Java heap space
#7 161.2     at java.util.Arrays.copyOf (Arrays.java:3537)
#7 161.2     at java.lang.AbstractStringBuilder.ensureCapacityInternal (AbstractStringBuilder.java:228)
#7 161.2     at java.lang.AbstractStringBuilder.append (AbstractStringBuilder.java:582)
#7 161.2     at java.lang.StringBuilder.append (StringBuilder.java:179)
#7 161.2     at io.cucumber.gherkin.utils.pretty.Result.append (Result.java:10)
#7 161.2     at io.cucumber.gherkin.utils.pretty.PrettyHandlers.handleDocString (PrettyHandlers.java:82)
#7 161.2     at io.cucumber.gherkin.utils.pretty.PrettyHandlers.handleDocString (PrettyHandlers.java:30)
#7 161.2     at io.cucumber.gherkin.utils.WalkGherkinDocument.walkStep (WalkGherkinDocument.java:105)
#7 161.2     at io.cucumber.gherkin.utils.WalkGherkinDocument.walkSteps (WalkGherkinDocument.java:96)
#7 161.2     at io.cucumber.gherkin.utils.WalkGherkinDocument.walkScenario (WalkGherkinDocument.java:134)
#7 161.2     at io.cucumber.gherkin.utils.WalkGherkinDocument.walkFeature (WalkGherkinDocument.java:65)
#7 161.2     at io.cucumber.gherkin.utils.WalkGherkinDocument.walkGherkinDocument (WalkGherkinDocument.java:40)
#7 161.2     at io.cucumber.gherkin.utils.pretty.Pretty.prettyPrint (Pretty.java:18)
#7 161.2     at com.diffplug.spotless.glue.gherkin.GherkinUtilsFormatterFunc.apply (GherkinUtilsFormatterFunc.java:58)
#7 161.2     at com.diffplug.spotless.FormatterFunc.apply (FormatterFunc.java:32)
#7 161.2     at com.diffplug.spotless.FormatterStepImpl$Standard.format (FormatterStepImpl.java:82)
#7 161.2     at com.diffplug.spotless.FormatterStep$Strict.format (FormatterStep.java:88)
#7 161.2     at com.diffplug.spotless.Formatter.compute (Formatter.java:246)
#7 161.2     at com.diffplug.spotless.PaddedCell.check (PaddedCell.java:126)
#7 161.2     at com.diffplug.spotless.PaddedCell.check (PaddedCell.java:98)
#7 161.2     at com.diffplug.spotless.PaddedCell.calculateDirtyState (PaddedCell.java:220)
#7 161.2     at com.diffplug.spotless.PaddedCell.calculateDirtyState (PaddedCell.java:190)
#7 161.2     at com.diffplug.spotless.maven.SpotlessCheckMojo.process (SpotlessCheckMojo.java:54)
#7 161.2     at com.diffplug.spotless.maven.AbstractSpotlessMojo.execute (AbstractSpotlessMojo.java:229)
#7 161.2     at org.apache.maven.plugin.DefaultBuildPluginManager.executeMojo (DefaultBuildPluginManager.java:126)
#7 161.2     at org.apache.maven.lifecycle.internal.MojoExecutor.doExecute2 (MojoExecutor.java:328)
#7 161.2     at org.apache.maven.lifecycle.internal.MojoExecutor.doExecute (MojoExecutor.java:316)
#7 161.2     at org.apache.maven.lifecycle.internal.MojoExecutor.execute (MojoExecutor.java:212)
#7 161.2     at org.apache.maven.lifecycle.internal.MojoExecutor.execute (MojoExecutor.java:174)
#7 161.2     at org.apache.maven.lifecycle.internal.MojoExecutor.access$000 (MojoExecutor.java:75)
#7 161.2     at org.apache.maven.lifecycle.internal.MojoExecutor$1.run (MojoExecutor.java:162)
#7 161.2     at org.apache.maven.plugin.DefaultMojosExecutionStrategy.execute (DefaultMojosExecutionStrategy.java:39)

✅ What did you expect to see?

No crash.

📦 Which tool/library version are you using?

Gherkin 8.0.5, with spotless 2.37.0. Crash does not seem to depend on the spotless version.

🔬 How could we reproduce it?

Using the following maven plugin:

<plugin>
        <groupId>com.diffplug.spotless</groupId>
        <artifactId>spotless-maven-plugin</artifactId>
        <version>2.37.0</version>
        <configuration>
          <lineEndings>UNIX</lineEndings>
          <!-- define a language-specific format -->
          <gherkin>
            <includes>
              <include>src/**/*.feature</include>
            </includes>

            <gherkinUtils>
              <version>8.0.5</version>
            </gherkinUtils>
          </gherkin>
        </configuration>
      </plugin>

Define a big xxx.feature file in your src directory, and run mvn spotless:check.

The feature file should be big enough (around 10k lines). In our case, it's because a test case uses a big JSON input to test a service.

mpkorstanje commented 8 months ago

Thanks for creating the report. Could you provide the exact size of the .feature file in megabytes?

I had a quick look at the code, but I don't think there is much that can be done. The current implementation creates a String using a StringBuilder, which means that we have a few copies of the file in memory. This could be made more efficient by writing to an OutputStream instead. But then Spotless would have to turn that OutputStream into a String anyway to do their comparison.
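To illustrate the trade-off described above, here is a minimal sketch, not the gherkin-utils implementation. The class and method names (AppendSketch, viaStringBuilder, viaWriter) are hypothetical. The first method holds the builder's internal buffer and the copy made by toString() at the same time; the second hands each line to a caller-supplied Writer and keeps no full copy itself.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.io.Writer;

public class AppendSketch {

    // Accumulates everything in memory. While toString() runs, both the
    // builder's internal array and the new String exist at the same time.
    static String viaStringBuilder(Iterable<String> lines) {
        StringBuilder sb = new StringBuilder();
        for (String line : lines) {
            sb.append(line).append('\n');
        }
        return sb.toString();
    }

    // Writes each line straight to the caller's Writer. This method keeps
    // no full copy of the output; what the Writer buffers is up to the caller.
    static void viaWriter(Iterable<String> lines, Writer out) {
        try {
            for (String line : lines) {
                out.write(line);
                out.write('\n');
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

If the caller passes a StringWriter and then calls toString() on it, as a tool that compares Strings would have to, the saving largely disappears, which is the point made in the comment above.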

If you do have a better idea to solve this, please feel free to make a suggestion.

Jiehong commented 8 months ago

File size was 180kB, so not even one megabyte.

Doesn't feel that big to me. In the end, we found a workaround by extracting the 170K JSON into its own JSON file and introducing it as a variable:

    * def myData = read ("file:src/test/java/xxx/big_file.json")

This way gherkin does not need to try to format a "big" file.

Otherwise, I'm not quite sure how to better handle it.

(We tried passing -Xmx 2048m in Maven's jvm.config options, but it didn't help for some reason.)
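One way to check whether a heap setting actually takes effect is to print the maximum heap the JVM reports. The sketch below is a hypothetical helper (HeapCheck is not part of Maven, Spotless, or gherkin-utils). Note that the JVM option is normally written without a space, as in -Xmx2048m.

```java
public class HeapCheck {

    // Maximum heap the running JVM will attempt to use, in MiB.
    static long maxHeapMiB() {
        return Runtime.getRuntime().maxMemory() / (1024 * 1024);
    }

    public static void main(String[] args) {
        System.out.println("Max heap: " + maxHeapMiB() + " MiB");
    }
}
```

Running it with and without the option (for example java -Xmx2048m HeapCheck.java on a JDK that supports single-file source launch) shows whether the reported value changes.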

mpkorstanje commented 8 months ago

Ouch. That doesn't seem big indeed. At this point I'd attach the JVM console and have a look at where the memory goes.

Personally, at present, I don't have the time to dig deeper though. If you or someone else does have the time available it would be most welcome.

Jiehong commented 8 months ago

Dumping on OOM leads to some information (-Xmx64M):

Thread 'mvn-builder-xxx' with ID = 29
    java.lang.OutOfMemoryError.<init>(OutOfMemoryError.java:48)
    jdk.internal.misc.Unsafe.allocateUninitializedArray(Unsafe.java:1380)
    java.lang.StringConcatHelper.newArray(StringConcatHelper.java:511)
    java.lang.StringLatin1.replace(StringLatin1.java:362)
    java.lang.String.replace(String.java:3100)
    io.cucumber.gherkin.utils.pretty.PrettyHandlers.handleDocString(PrettyHandlers.java:71)
    io.cucumber.gherkin.utils.pretty.PrettyHandlers.handleDocString(PrettyHandlers.java:30)
    io.cucumber.gherkin.utils.WalkGherkinDocument.walkStep(WalkGherkinDocument.java:105)
    io.cucumber.gherkin.utils.WalkGherkinDocument.walkSteps(WalkGherkinDocument.java:96)
    io.cucumber.gherkin.utils.WalkGherkinDocument.walkScenario(WalkGherkinDocument.java:134)
    io.cucumber.gherkin.utils.WalkGherkinDocument.walkFeature(WalkGherkinDocument.java:65)
    io.cucumber.gherkin.utils.WalkGherkinDocument.walkGherkinDocument(WalkGherkinDocument.java:40)
    io.cucumber.gherkin.utils.pretty.Pretty.prettyPrint(Pretty.java:18)
    com.diffplug.spotless.glue.gherkin.GherkinUtilsFormatterFunc.apply(GherkinUtilsFormatterFunc.java:58)
    com.diffplug.spotless.FormatterFunc.apply(FormatterFunc.java:32)
    com.diffplug.spotless.FormatterStepImpl$Standard.format(FormatterStepImpl.java:82)
    com.diffplug.spotless.FormatterStep$Strict.format(FormatterStep.java:88)
    com.diffplug.spotless.Formatter.compute(Formatter.java:246)
    com.diffplug.spotless.PaddedCell.check(PaddedCell.java:126)
    com.diffplug.spotless.PaddedCell.check(PaddedCell.java:98)
    com.diffplug.spotless.PaddedCell.calculateDirtyState(PaddedCell.java:220)
    com.diffplug.spotless.PaddedCell.calculateDirtyState(PaddedCell.java:190)
    com.diffplug.spotless.maven.SpotlessCheckMojo.process(SpotlessCheckMojo.java:54)
    com.diffplug.spotless.maven.AbstractSpotlessMojo.execute(AbstractSpotlessMojo.java:229)
    org.apache.maven.plugin.DefaultBuildPluginManager.executeMojo(DefaultBuildPluginManager.java:126)
    org.apache.maven.lifecycle.internal.MojoExecutor.doExecute2(MojoExecutor.java:328)
    org.apache.maven.lifecycle.internal.MojoExecutor.doExecute(MojoExecutor.java:316)
    org.apache.maven.lifecycle.internal.MojoExecutor.execute(MojoExecutor.java:212)
    org.apache.maven.lifecycle.internal.MojoExecutor.execute(MojoExecutor.java:174)
    org.apache.maven.lifecycle.internal.MojoExecutor.access$000(MojoExecutor.java:75)
    org.apache.maven.lifecycle.internal.MojoExecutor$1.run(MojoExecutor.java:162)
    org.apache.maven.plugin.DefaultMojosExecutionStrategy.execute(DefaultMojosExecutionStrategy.java:39)
    org.apache.maven.lifecycle.internal.MojoExecutor.execute(MojoExecutor.java:159)
    org.apache.maven.lifecycle.internal.LifecycleModuleBuilder.buildProject(LifecycleModuleBuilder.java:105)
    org.apache.maven.lifecycle.internal.builder.multithreaded.MultiThreadedBuilder$1.call(MultiThreadedBuilder.java:193)
    org.apache.maven.lifecycle.internal.builder.multithreaded.MultiThreadedBuilder$1.call(MultiThreadedBuilder.java:180)
    java.util.concurrent.FutureTask.run(FutureTask.java:317)
    java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:572)
    java.util.concurrent.FutureTask.run(FutureTask.java:317)
    java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144)
    java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642)
    java.lang.Thread.runWith(Thread.java:1596)
    java.lang.Thread.run(Thread.java:1583)

This seems to make sense, as the big json in the gherkin file is defined as:

Feature:
  Background:
  # not much here

  Scenario:
  * def myData =
    """
{
super long json here
over 9000 lines
}
    """

# rest of the scenario test afterwards, just for a few lines

Jiehong commented 8 months ago

With -Xmx128M, a different one occurs:

Thread 'mvn-builder-xxxx' with ID = 29
    java.lang.OutOfMemoryError.<init>(OutOfMemoryError.java:48)
    java.util.Arrays.copyOf(Arrays.java:3541)
    java.lang.AbstractStringBuilder.ensureCapacityInternal(AbstractStringBuilder.java:242)
    java.lang.AbstractStringBuilder.append(AbstractStringBuilder.java:587)
    java.lang.StringBuilder.append(StringBuilder.java:179)
    io.cucumber.gherkin.utils.pretty.Result.append(Result.java:10)
    io.cucumber.gherkin.utils.pretty.PrettyHandlers.handleDocString(PrettyHandlers.java:82)
    io.cucumber.gherkin.utils.pretty.PrettyHandlers.handleDocString(PrettyHandlers.java:30)
    io.cucumber.gherkin.utils.WalkGherkinDocument.walkStep(WalkGherkinDocument.java:105)
    io.cucumber.gherkin.utils.WalkGherkinDocument.walkSteps(WalkGherkinDocument.java:96)
    io.cucumber.gherkin.utils.WalkGherkinDocument.walkScenario(WalkGherkinDocument.java:134)
    io.cucumber.gherkin.utils.WalkGherkinDocument.walkFeature(WalkGherkinDocument.java:65)
    io.cucumber.gherkin.utils.WalkGherkinDocument.walkGherkinDocument(WalkGherkinDocument.java:40)
    io.cucumber.gherkin.utils.pretty.Pretty.prettyPrint(Pretty.java:18)
    com.diffplug.spotless.glue.gherkin.GherkinUtilsFormatterFunc.apply(GherkinUtilsFormatterFunc.java:58)
    com.diffplug.spotless.FormatterFunc.apply(FormatterFunc.java:32)
    com.diffplug.spotless.FormatterStepImpl$Standard.format(FormatterStepImpl.java:82)
    com.diffplug.spotless.FormatterStep$Strict.format(FormatterStep.java:88)
    com.diffplug.spotless.Formatter.compute(Formatter.java:246)
    com.diffplug.spotless.PaddedCell.check(PaddedCell.java:126)
    com.diffplug.spotless.PaddedCell.check(PaddedCell.java:98)
    com.diffplug.spotless.PaddedCell.calculateDirtyState(PaddedCell.java:220)
    com.diffplug.spotless.PaddedCell.calculateDirtyState(PaddedCell.java:190)
    com.diffplug.spotless.maven.SpotlessCheckMojo.process(SpotlessCheckMojo.java:54)
    com.diffplug.spotless.maven.AbstractSpotlessMojo.execute(AbstractSpotlessMojo.java:229)
    org.apache.maven.plugin.DefaultBuildPluginManager.executeMojo(DefaultBuildPluginManager.java:126)
    org.apache.maven.lifecycle.internal.MojoExecutor.doExecute2(MojoExecutor.java:328)
    org.apache.maven.lifecycle.internal.MojoExecutor.doExecute(MojoExecutor.java:316)
    org.apache.maven.lifecycle.internal.MojoExecutor.execute(MojoExecutor.java:212)
    org.apache.maven.lifecycle.internal.MojoExecutor.execute(MojoExecutor.java:174)
    org.apache.maven.lifecycle.internal.MojoExecutor.access$000(MojoExecutor.java:75)
    org.apache.maven.lifecycle.internal.MojoExecutor$1.run(MojoExecutor.java:162)
    org.apache.maven.plugin.DefaultMojosExecutionStrategy.execute(DefaultMojosExecutionStrategy.java:39)
    org.apache.maven.lifecycle.internal.MojoExecutor.execute(MojoExecutor.java:159)
    org.apache.maven.lifecycle.internal.LifecycleModuleBuilder.buildProject(LifecycleModuleBuilder.java:105)
    org.apache.maven.lifecycle.internal.builder.multithreaded.MultiThreadedBuilder$1.call(MultiThreadedBuilder.java:193)
    org.apache.maven.lifecycle.internal.builder.multithreaded.MultiThreadedBuilder$1.call(MultiThreadedBuilder.java:180)
    java.util.concurrent.FutureTask.run(FutureTask.java:317)
    java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:572)
    java.util.concurrent.FutureTask.run(FutureTask.java:317)
    java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144)
    java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642)
    java.lang.Thread.runWith(Thread.java:1596)
    java.lang.Thread.run(Thread.java:1583)

Jiehong commented 8 months ago

Can't upload a screenshot, but the analysis of the biggest objects shows:

1st case: an 18MB int[] from StringLatin1:382 whose length is 4.5 million

2nd case: a 27MB String from PrettyHandlers.java:82 (whose value is {\"\"\"\"\"\"\"\"... with tons and tons of escaped " characters)

Jiehong commented 8 months ago

Looks like there might be some duplicated escaped double quotes growing very big and creating huge objects in memory (String's immutability might not help in the first case).

Jiehong commented 8 months ago

Hoping this helps.

mpkorstanje commented 8 months ago

Interesting, it looks like this may not be correct:

https://github.com/cucumber/gherkin-utils/blob/ddbe191882fb21aabbcf7e4f28a8bedcfb06d0bc/java/src/main/java/io/cucumber/gherkin/utils/pretty/PrettyHandlers.java#L69-L75

I would expect the replaced string to be equal to the delimiter in both cases.

If you use a smaller json in the doc string, does it even format correctly?

Jiehong commented 8 months ago

That's an interesting question!

Just gave it a try, and got weird results:

This allowed me to create a very simple case where the content of the "docstring" fully disables the formatting for that file (or crashes it if too big) when the docstring contains some JSON.

Here is a way to reproduce:

Feature: my feature

  Background:
    * url superUrl
    # Testing a "docstring"
    * configure thing =
      """
    {
    "key": "value"
    }
      """

Expectations: file reformatted (some empty lines to be removed)

Reality: file considered already formatted.

If you try with this instead:

Feature: my feature

  Background:
    * url superUrl
    # Testing a "docstring"
    * configure thing =
      """
I'm something else
      """

Expectations and reality match: file gets reformatted.

mpkorstanje commented 7 months ago

Cheers. The formatting stuff is relatively easy to fix.

It may also fix the memory issue because the pretty formatter won't be replacing every " with \"\"\", though that replacement would only increase the size by a factor of 6 at most. You could try building #58 from source and adding it as a <dependency> to the spotless plugin.
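The factor of 6 can be seen in a small sketch. This is an illustration based only on the description in this thread, not the actual PrettyHandlers code; the class and method names are hypothetical. One method replaces every single " with the six-character sequence \"\"\", the other only escapes occurrences of the full """ delimiter.

```java
public class DocStringEscapeSketch {

    static final String DELIMITER = "\"\"\"";        // three double quotes
    static final String ESCAPED = "\\\"\\\"\\\"";    // \"\"\" (six characters)

    // Behaviour as described in the thread: each single quote character
    // becomes six characters, so quote-heavy JSON grows considerably.
    static String escapeEveryQuote(String content) {
        return content.replace("\"", ESCAPED);
    }

    // Assumed intent: only a literal delimiter inside the content is escaped.
    static String escapeDelimiterOnly(String content) {
        return content.replace(DELIMITER, ESCAPED);
    }

    public static void main(String[] args) {
        String json = "{\"key\": \"value\"}";
        System.out.println(json.length());
        System.out.println(escapeEveryQuote(json).length());
        System.out.println(escapeDelimiterOnly(json).length());
    }
}
```

With the second variant, ordinary JSON that never contains """ passes through unchanged.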

Jiehong commented 7 months ago

#58 no longer seems to run out of heap space, and now all files with docstring """ are also correctly formatted.

mpkorstanje commented 7 months ago

Should be released soon. You'll have to ping Spotless for dependency updates.