cirrax / puppet-libvirt

puppet module for libvirt
GNU General Public License v3.0

Error 500 on SERVER: java.lang.StackOverflowError #73

Open jadestorm opened 1 year ago

jadestorm commented 1 year ago

Hi folks! Please brace for a lot of data... We've been trying to track down this issue for a couple of weeks now. Some background -- we are aiming to switch from the thias libvirt implementation to this one, especially since it has some nice profile support.

The short version: the first batch of runs is fine, but after a while we regularly get java.lang.StackOverflowError and the catalog build fails. The exact error as seen by the puppet agent is:

Error: Could not retrieve catalog from remote server: Error 500 on SERVER: java.lang.StackOverflowError

Back on the Puppet server itself, we see the entire stack trace. I'm not going to paste the ENTIRE thing because it repeats a lot, but here we go:

java.lang.StackOverflowError: null
 #### Start of repeating block
        at org.jruby.RubyClass.finvoke(RubyClass.java:510)
        at org.jruby.runtime.Helpers.invoke(Helpers.java:644)
        at org.jruby.RubyBasicObject.callMethod(RubyBasicObject.java:386)
        at org.jruby.RubyEnumerator.__each__(RubyEnumerator.java:359)
        at org.jruby.RubyEnumerator.each(RubyEnumerator.java:354)
        at org.jruby.RubyEnumerator$INVOKER$i$each_DBG.call(RubyEnumerator$INVOKER$i$each_DBG.gen)
        at org.jruby.runtime.callsite.CachingCallSite.call(CachingCallSite.java:208)
        at org.jruby.RubyEnumerable.callEach(RubyEnumerable.java:87)
        at org.jruby.RubyEnumerable.sort_by(RubyEnumerable.java:527)
        at org.jruby.RubyEnumerable$INVOKER$s$0$0$sort_by_DBG.call(RubyEnumerable$INVOKER$s$0$0$sort_by_DBG.gen)
        at org.jruby.runtime.callsite.CachingCallSite.call(CachingCallSite.java:208)
        at org.jruby.runtime.callsite.CachingCallSite.callIter(CachingCallSite.java:221)
        at etc.puppetlabs.code.environments.chass_master.modules.libvirt.lib.puppet_x.libvirt.rexml_sorted_attributes.invokeOther2:sort_by(/etc/puppetlabs/code/environments/chass_master/modules/libvirt/lib/puppet_x/libvirt/rexml_sorted_attributes.rb:9)
        at etc.puppetlabs.code.environments.chass_master.modules.libvirt.lib.puppet_x.libvirt.rexml_sorted_attributes.RUBY$method$each_attribute$0(/etc/puppetlabs/code/environments/chass_master/modules/libvirt/lib/puppet_x/libvirt/rexml_sorted_attributes.rb:9)
        at etc.puppetlabs.code.environments.chass_master.modules.libvirt.lib.puppet_x.libvirt.rexml_sorted_attributes.RUBY$method$each_attribute$0$__VARARGS__(/etc/puppetlabs/code/environments/chass_master/modules/libvirt/lib/puppet_x/libvirt/rexml_sorted_attributes.rb:8)
        at org.jruby.internal.runtime.methods.CompiledIRMethod.call(CompiledIRMethod.java:139)
        at org.jruby.internal.runtime.methods.MixedModeIRMethod.call(MixedModeIRMethod.java:112)
        at org.jruby.internal.runtime.methods.AliasMethod.call(AliasMethod.java:133)
        at org.jruby.RubyClass.finvokeWithRefinements(RubyClass.java:522)
 #### End of repeating block
        at org.jruby.RubyClass.finvoke(RubyClass.java:510)
        at org.jruby.runtime.Helpers.invoke(Helpers.java:644)
        at org.jruby.RubyBasicObject.callMethod(RubyBasicObject.java:386)
        at org.jruby.RubyEnumerator.__each__(RubyEnumerator.java:359)
        at org.jruby.RubyEnumerator.each(RubyEnumerator.java:354)
        at org.jruby.RubyEnumerator$INVOKER$i$each_DBG.call(RubyEnumerator$INVOKER$i$each_DBG.gen)
        at org.jruby.runtime.callsite.CachingCallSite.call(CachingCallSite.java:208)
        at org.jruby.RubyEnumerable.callEach(RubyEnumerable.java:87)
        at org.jruby.RubyEnumerable.sort_by(RubyEnumerable.java:527)
        at org.jruby.RubyEnumerable$INVOKER$s$0$0$sort_by_DBG.call(RubyEnumerable$INVOKER$s$0$0$sort_by_DBG.gen)
        at org.jruby.runtime.callsite.CachingCallSite.call(CachingCallSite.java:208)
        at org.jruby.runtime.callsite.CachingCallSite.callIter(CachingCallSite.java:221)
        at etc.puppetlabs.code.environments.chass_master.modules.libvirt.lib.puppet_x.libvirt.rexml_sorted_attributes.invokeOther2:sort_by(/etc/puppetlabs/code/environments/chass_master/modules/libvirt/lib/puppet_x/libvirt/rexml_sorted_attributes.rb:9)
        at etc.puppetlabs.code.environments.chass_master.modules.libvirt.lib.puppet_x.libvirt.rexml_sorted_attributes.RUBY$method$each_attribute$0(/etc/puppetlabs/code/environments/chass_master/modules/libvirt/lib/puppet_x/libvirt/rexml_sorted_attributes.rb:9)
        at etc.puppetlabs.code.environments.chass_master.modules.libvirt.lib.puppet_x.libvirt.rexml_sorted_attributes.RUBY$method$each_attribute$0$__VARARGS__(/etc/puppetlabs/code/environments/chass_master/modules/libvirt/lib/puppet_x/libvirt/rexml_sorted_attributes.rb:8)
        at org.jruby.internal.runtime.methods.CompiledIRMethod.call(CompiledIRMethod.java:139)
        at org.jruby.internal.runtime.methods.MixedModeIRMethod.call(MixedModeIRMethod.java:112)

Clearly a recursion issue of sorts. The block repeated 53 times in this run's trace. Now -- we have a good deal more resources assigned to our Puppet servers than is even recommended, approximately 70 hosts, and fairly large catalogs, but we have not had this issue with any other module.
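For illustration only, here is a minimal Ruby sketch of one way a trace shaped like this can arise: an alias-and-wrap patch that gets applied twice in the same runtime. This is a guess prompted by the AliasMethod and sort_by frames above, not the module's actual code. The class `Attrs`, the helper `apply_sort_patch`, and the alias name `_each_attribute` are all hypothetical stand-ins.

```ruby
# Hypothetical stand-in for a collection that exposes each_attribute.
class Attrs
  def initialize(names)
    @names = names
  end

  def each_attribute(&block)
    @names.each(&block)
  end
end

# Hypothetical alias-and-wrap patch. Calling this once corresponds to
# loading a patch file once; calling it again corresponds to a reload.
def apply_sort_patch
  Attrs.class_eval do
    # Save whatever each_attribute currently is under a second name...
    alias_method :_each_attribute, :each_attribute

    # ...then replace each_attribute with a wrapper that iterates the
    # saved method in sorted order.
    def each_attribute(&block)
      to_enum(:_each_attribute).sort_by { |name| name }.each(&block)
    end
  end
end

# Collect everything each_attribute yields into an array.
def collect(attrs)
  out = []
  attrs.each_attribute { |name| out << name }
  out
end

attrs = Attrs.new(%w[unit arch machine])

# First application: the alias points at the original method, so the
# wrapper yields the names in sorted order.
apply_sort_patch
once = collect(attrs)

# Second application: the alias now points at the wrapper itself, so the
# wrapper ends up calling itself through the alias without end.
apply_sort_patch
twice = begin
  collect(attrs)
rescue SystemStackError => e
  e.class
end

puts once.inspect
puts twice.inspect
```

If something like this were the cause, it would fit the "fine at first, fails later" pattern, since the failure only appears once the patch has been applied a second time in the same runtime. Making the patch idempotent (for example, skipping the alias when `_each_attribute` is already defined, or using `Module#prepend` with `super`) would avoid it. Whether any of this applies to this module is untested.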

We were running Puppet 7 and upgraded to Puppet 8: same issue. We even gave the Puppet server more RAM (heap size) just to see if it would help; it did not. We went from the default Red Hat Java version, 8, to 17. We also updated rexml after the initial failures, which briefly seemed to help but turned out to be a red herring. We've tried quite a few things and have not yet determined what is going on.

One of my coworkers is setting up an environment as close as possible to what we run in production to see if it happens there. So far we've seen something that is either useful or a red herring: he hasn't been able to reproduce the issue with JUST libvirt in the manifest. But he's also not running it over and over again until it eventually fails, nor on a server that a bunch of other hosts are accessing.

This only happens when adding a domain -- if we comment out the domain part, there is no problem.

Here is a paste of the domconf from hiera:

libvirt::profiles::domconf:
  rocky86:
    memory:
      values: 8388608
      attrs:
        unit: 'KiB'
    vcpu:
      values: 4
      attrs:
        placement: 'static'
        current: '2'
    os:
      values:
        type:
          values: hvm
          attrs:
            arch: 'x86_64'
            machine: 'pc-q35-rhel8.6.0'
        loader:
          values: '/usr/share/edk2/ovmf/OVMF_CODE.secboot.fd'
          attrs:
            readonly: 'yes'
            secure: 'yes'
            type: 'pflash'
        boot:
          attrs:
            dev: 'hd'
      features:
        values:
          acpi: {}
          apic: {}
          pae: {}
      clock:
        attrs:
          offset: 'utc'
      on_poweroff: 'destroy'
      on_reboot: 'restart'
      on_crash: 'restart'
    metadata:
      values:
        'libosinfo:libosinfo':
          values:
            'libosinfo:os':
              attrs:
                id: 'http://rockylinux.org/rocky/8.6'
          attrs:
            'xmlns:libosinfo': 'http://libosinfo.org/xmlns/libvirt/domain/1.0'

And create_domains:

libvirt::create_domains:
  testvm:
    ensure: 'present'
    type: 'kvm'
    domain_title: 'testvm'
    description: 'A Test VM'
    boot: 'hd'
    autostart: true
    active: true
    replace: false
    show_diff: true
    dom_profile: rocky86
    domconf:
      memory:
        values: '2048'
        attrs:
          unit: 'MiB'
    disks:
      - type: file
        device: disk
        bus: virtio
        source:
          file: /srv/kvm_images/testvm.img
        driver:
          name: qemu
          type: qcow2
    interfaces:
      - network: ournetwork
        type: virtio

At present I'm all out of ideas. I'm certainly willing to do some coding work and submit a PR, but at the moment I'm not sure where to start. My only thought is that there's something wrong with the sorting mechanism -- but I looked it over and nothing jumps out.

Thanks for any help or suggestions y'all can provide and let me know if there's any other info you need!

trefzer commented 3 months ago

@jadestorm does this issue still persist, or were you able to solve it? What version of puppetserver are you using? If puppetserver 8 is in use, you need to install the REXML gem into the puppetserver (see README). Did you do so, and which version?

ncstate-daniel commented 3 months ago

(this is jadestorm from my work account) Hi @trefzer, we shelved it for the time being while we work on some other priorities, but it's still something we plan on getting back to. Generally we update our puppetserver to the latest 8 series every couple of months, so "effectively" at the latest and greatest.

This is the version of rexml we currently have installed. BTW we install it alongside the puppetserver ruby install.

[root@hss10lm bin]# ./gem list | grep -i rexml
rexml (3.2.5)

A slightly older version is installed at the system level but, to my knowledge, that should not affect anything running under puppetserver:

[root@hss10lm bin]# /usr/bin/gem list | grep -i rexml
rexml (default: 3.2.3.1)

That said, right now I cannot confirm or deny that it is still a problem. We would need to get back to working on it.

Oh, while I'm at it, here are our current puppet bits:

[root@hss10lm bin]# rpm -qa | grep puppet
puppet-agent-8.8.1-1.el8.x86_64
puppetserver-8.6.2-1.el8.noarch
puppetdb-termini-8.7.0-1.el8.noarch
puppet-agent-oauth-0.5.10-1.el8.noarch