Investigate Java 21 and JRuby compatibility

roaksoax commented 1 year ago

Java 21 is now available and we would like to make it the default for Logstash. However, we first need to investigate whether that is possible, which depends on JRuby supporting it.

Deprecation list: https://docs.oracle.com/en/java/javase/21/docs/api/deprecated-list.html

Dependent tasks:

[x] Fix argument error in JRuby and JDK 21 https://github.com/jruby/jruby/issues/8061

Depending tasks:

[x] Update Derby https://github.com/logstash-plugins/logstash-integration-jdbc/pull/148 fixed by https://github.com/logstash-plugins/logstash-integration-jdbc/pull/155 and https://github.com/logstash-plugins/logstash-integration-jdbc/pull/160

Other Tasks

[x] Adaptations to run on JDK 21 #15719
[ ] Test and verify that all plugins are supported on JDK 21 (with the current JRuby). #16055
[ ] Bundle JDK 21 and fix all TODOs (introduced in #15719) related to getId being deprecated in JDK 19 and replaced by threadId() starting from JDK 21
[x] Check whether performance drops due to the removal of the Preventive GC flag in JDK 21; ES started seeing trouble from JDK 20

andsel commented 9 months ago

As reported in https://github.com/jruby/jruby/issues/8061#issuecomment-1908807511, JDK 21's LinkedHashSet introduces a new method (map) that is not present in JDK 17 and that interferes with JRuby's map method.

andsel commented 9 months ago

As reported in https://github.com/jruby/jruby/issues/8061#issuecomment-1933009986 the fix will be included in JRuby 9.4.6.0. The temporary fix would be to add

java.util.LinkedHashSet.remove_method(:map) rescue nil

in the rspec bootstrap script: https://github.com/elastic/logstash/blob/4e98aa811701cb9940984d2b43d62ee81d46c6b0/lib/bootstrap/rspec.rb#L18
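To illustrate why removing the method helps, here is a minimal sketch in plain Ruby. FakeLinkedHashSet is a hypothetical stand-in for the JRuby proxy class, not the real java.util.LinkedHashSet, and it only mimics the shadowing (in JRuby the symptom was an argument error rather than a wrong return value). It also shows why the trailing `rescue nil` is there: on a runtime where the method does not exist, remove_method raises a NameError that is simply swallowed.

```ruby
# Hypothetical stand-in for the JRuby proxy of java.util.LinkedHashSet.
class FakeLinkedHashSet
  include Enumerable

  def initialize(*items)
    @items = items
  end

  def each(&block)
    @items.each(&block)
  end

  # Mimics a method defined on the class itself that shadows Enumerable#map.
  def map
    :backing_map
  end
end

set = FakeLinkedHashSet.new(1, 2, 3)

# The class-level method wins over Enumerable#map; the block is ignored.
shadowed = set.map { |x| x * 2 }

# Same shape as the workaround: drop the shadowing method from the class.
FakeLinkedHashSet.remove_method(:map) rescue nil

# Method lookup now falls through to Enumerable#map.
restored = set.map { |x| x * 2 }

# A second removal raises NameError, which `rescue nil` turns into nil.
second_attempt = (FakeLinkedHashSet.remove_method(:map) rescue nil)
```

After the removal, `restored` holds the doubled values, while `second_attempt` is nil because there was nothing left to remove.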

andsel commented 7 months ago

Analysis of removal of Preventive GC flag on Logstash

Definition of which problem preventive GC was intended to resolve

JDK 17 introduced the flag G1UsePreventiveGC to resolve a problem in G1 evacuation that occurs when there are many short-lived humongous objects (humongous means an object larger than 1/2 of a region size). As discussed in https://tschatzl.github.io/2021/09/16/jdk17-g1-parallel-gc-changes.html, the problem is that no objects can be copied during the evacuation phase: the number of such objects rises so quickly that no Eden or Survivor regions are left to copy into, so a Full GC (which stops the world) is needed to do in-place compaction. The flag enabled preventive, unscheduled GC cycles to avoid reaching the point where humongous objects saturate the humongous regions, essentially preserving space to copy objects during evacuation and avoiding a Full GC. In JDK 20 the flag was deprecated and defaulted to false; in JDK 21 it has been removed.
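The humongous definition can be made concrete with a little arithmetic. The following is a rough sketch, assuming the region size is known (for example because it was set explicitly with -XX:G1HeapRegionSize) and using the "half of a region" threshold as stated in this thread; the helper names are hypothetical, and the exact behaviour at the boundary is an assumption.

```ruby
MB = 1024 * 1024

# Threshold as described in this thread: half of a G1 region.
def humongous_threshold(region_size)
  region_size / 2
end

# Assumption: an object exactly at the threshold is treated as humongous.
def humongous?(object_size, region_size)
  object_size >= humongous_threshold(region_size)
end

# Number of contiguous regions needed to hold an object (integer ceiling).
def regions_needed(object_size, region_size)
  (object_size + region_size - 1) / region_size
end

# A 4MB payload plus a small (illustrative) 16-byte header, with 4MB regions.
payload = 4 * MB + 16
payload_is_humongous = humongous?(payload, 4 * MB)
payload_regions = regions_needed(payload, 4 * MB)
```

With 4MB regions, such a payload is humongous and spills into a second region, which is why a burst of these allocations consumes free regions so quickly.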

Elasticsearch use case

Elasticsearch data nodes load a lot of 4MB byte[] chunks of data to be passed down to ML nodes (this also happens in other cases, not only the ML case). This generates a lot of humongous allocations (humongous objects are objects with size >= 1/2 of the region size). In general a spike in allocations would cause an OOM error in the JVM, but ES is able to protect against it with a circuit breaker. That is exactly what showed up: a lot of circuit breaker exceptions, with memory staying high instead of being freed and kept lower by the G1 preventive collection phases.
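For readers unfamiliar with the idea, a memory circuit breaker can be sketched as follows. This is a generic, single-threaded illustration and not Elasticsearch's implementation; the class name, method names, and limit are all hypothetical.

```ruby
# Generic sketch of a memory circuit breaker; not Elasticsearch's code.
class MemoryCircuitBreaker
  class TrippedError < StandardError; end

  attr_reader :used_bytes

  def initialize(limit_bytes)
    @limit_bytes = limit_bytes
    @used_bytes = 0
  end

  # Reserve memory before allocating; refuse when the limit would be exceeded.
  def reserve(bytes)
    if @used_bytes + bytes > @limit_bytes
      raise TrippedError, "would use #{@used_bytes + bytes} of #{@limit_bytes} bytes"
    end
    @used_bytes += bytes
  end

  # Give the memory back once the data has been released.
  def release(bytes)
    @used_bytes = [@used_bytes - bytes, 0].max
  end
end
```

The point of the design is that a request fails fast with an exception instead of pushing the JVM into an OOM error. If memory that could be reclaimed stays accounted as used, the breaker trips more often, which matches the symptom described above.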

How ES solved the issue

ES is resolving this by trying to allocate fewer humongous objects.

Logstash use case

Logstash has some peculiarities:

allocation is governed by the environment: clients push data into inputs, or inputs pull it in.
there isn't any explicit circuit breaker to avoid memory exhaustion.
the limiting mechanism is the in-memory queue: if the upstream is going too fast, the queue acts as a bounding mechanism by blocking.
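The bounding behaviour of a blocking queue can be sketched with Ruby's built-in SizedQueue. This is only an illustration of the back-pressure idea, not Logstash's actual queue implementation; the capacity of 2 and the event names are hypothetical.

```ruby
# A bounded queue: producers cannot get ahead of consumers by more than 2 events.
queue = SizedQueue.new(2)

queue.push(:event_1)
queue.push(:event_2)

# A plain push would block here until a consumer pops. The non-blocking form
# raises ThreadError instead, which makes the "queue is full" state visible.
rejected = begin
  queue.push(:event_3, true)
  false
rescue ThreadError
  true
end

# Once a consumer takes an event, the producer can proceed again.
consumed = queue.pop
queue.push(:event_3)
```

While the queue is full, the input side simply cannot allocate new events at a high rate, which is the limiting effect the list above describes.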

Queue full case

If the queue is full and is limiting the input, then at a certain point the allocation rate is not high. Since the references sit in the queue for relatively long periods, those objects likely transition into tenured regions (old generation) and don't benefit from preventive GCs.

So from this perspective, having preventive GCs or not doesn't provide any improvement.

Queue empty and fast consumers

In this case the queue is almost empty and consumers are able to keep up with producers. When the allocation rate is high and the pipeline queues have enough space to keep all the events (big objects >= 2MB) alive, preventive GCs offer limited relief because there isn't any circuit breaker protection: the JVM hosting Logstash is destined to go OOM unless the allocation rate is limited preemptively.

Also in this case, having preventive GCs or not doesn't provide improvements.

Considerations

Given the discussion above, preventive GCs don't play an important role in Logstash memory management.

How I ran some tests

Used the following pipeline, which is pretty fast and keeps the queue mostly empty:

input {
  http {
    response_headers => {"Content-Type" => "application/json"}
    ecs_compatibility => disabled
  }
}
output {
  sink {}
}

Created a file containing a single 4MB line of text. Ran wrk with the following Lua script:

wrk.method = "POST"
-- fail loudly if the sample file is missing, and close it after reading
local f = assert(io.open("input_sample.txt", "r"))
wrk.body   = f:read("*all")
f:close()

wrk --threads 4 --connections 12 -d10m -s wrk_send_file.lua --latency http://localhost:8080
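The 4MB single-line input file used above could be produced with something like the following sketch. The helper name and the filler character are assumptions; the thread does not say how the file was actually created.

```ruby
# Writes one line of `size_bytes` ASCII characters with no trailing newline.
# Returns the number of bytes written.
def write_sample_file(path, size_bytes)
  File.write(path, "a" * size_bytes)
end
```

Calling write_sample_file("input_sample.txt", 4 * 1024 * 1024) from the directory where wrk is started creates the file that the Lua script reads.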

andsel commented 7 months ago

Reopened because it was inadvertently closed by #15719

roaksoax commented 7 months ago

Closing this issue since Logstash will now support JDK 21. The discussion on whether to make it the default continues in a different thread.
