Alluxio / alluxio

Alluxio, data orchestration for analytics and machine learning in the cloud
https://www.alluxio.io
Apache License 2.0

Adding the Prometheus JMX javaagent throws 'Address already in use' when the monitor starts #16657

Open humengyu2012 opened 1 year ago

humengyu2012 commented 1 year ago

Alluxio Version: 2.9.0

Describe the bug: I added the Prometheus javaagent in alluxio-env.sh:

ALLUXIO_WORKER_JAVA_OPTS="$ALLUXIO_WORKER_JAVA_OPTS -Xms16G -Xmx32G -XX:+UseG1GC -javaagent:${ALLUXIO_HOME}/extensions/jmx_prometheus_javaagent-0.15.0.jar=13753:${ALLUXIO_HOME}/conf/jmx-prometheus.yaml "

When I start the worker, the following exception is thrown:

Exception in thread "main" java.lang.reflect.InvocationTargetException
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at sun.instrument.InstrumentationImpl.loadClassAndStartAgent(InstrumentationImpl.java:386)
    at sun.instrument.InstrumentationImpl.loadClassAndCallPremain(InstrumentationImpl.java:401)
Caused by: java.net.BindException: Address already in use
    at sun.nio.ch.Net.bind0(Native Method)
    at sun.nio.ch.Net.bind(Net.java:433)
    at sun.nio.ch.Net.bind(Net.java:425)
    at sun.nio.ch.ServerSocketChannelImpl.bind(ServerSocketChannelImpl.java:223)
    at sun.nio.ch.ServerSocketAdaptor.bind(ServerSocketAdaptor.java:74)
    at sun.net.httpserver.ServerImpl.<init>(ServerImpl.java:100)
    at sun.net.httpserver.HttpServerImpl.<init>(HttpServerImpl.java:50)
    at sun.net.httpserver.DefaultHttpServerProvider.createHttpServer(DefaultHttpServerProvider.java:35)
    at com.sun.net.httpserver.HttpServer.create(HttpServer.java:130)
    at io.prometheus.jmx.shaded.io.prometheus.client.exporter.HTTPServer.<init>(HTTPServer.java:179)
    at io.prometheus.jmx.shaded.io.prometheus.jmx.JavaAgent.premain(JavaAgent.java:31)
    ... 6 more

To Reproduce: add the following line to alluxio-env.sh and start the worker:

ALLUXIO_WORKER_JAVA_OPTS="$ALLUXIO_WORKER_JAVA_OPTS -Xms16G -Xmx32G -XX:+UseG1GC -javaagent:${ALLUXIO_HOME}/extensions/jmx_prometheus_javaagent-0.15.0.jar=13753:${ALLUXIO_HOME}/conf/jmx-prometheus.yaml "

Are you planning to fix it? Please indicate if you are already working on a PR.

Additional context: Why does ALLUXIO_WORKER_MONITOR_JAVA_OPTS extend ALLUXIO_WORKER_JAVA_OPTS? I think this is unnecessary: the worker can use a lot of memory (32G), but the monitor needs only 4G or less. If I set -Xms16G -Xmx32G in ALLUXIO_WORKER_JAVA_OPTS, the monitor will also run with -Xms16G -Xmx32G. It also means the monitor loads the same javaagent and tries to bind port 13753 again, which appears to be what triggers the exception above.
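
To make the inheritance problem concrete, here is a minimal shell sketch of the kind of filtering a separate monitor option set could apply. `strip_for_monitor` is a hypothetical helper, not part of Alluxio's scripts, and the `/opt/alluxio` paths are placeholders. Which options the monitor should actually keep is an assumption made for illustration only.

```shell
# Hypothetical helper (not part of Alluxio): drop options that a lightweight
# monitor JVM should not inherit from the worker's option string.
strip_for_monitor() {
  kept=""
  # $1 is intentionally unquoted so the option string splits on whitespace.
  for opt in $1; do
    case "$opt" in
      -javaagent:*|-Xms*|-Xmx*) ;;           # skip the agent and heap sizing
      *) kept="${kept:+$kept }$opt" ;;       # keep everything else
    esac
  done
  printf '%s\n' "$kept"
}

# Placeholder paths; the real value would come from alluxio-env.sh.
worker_opts="-Xms16G -Xmx32G -XX:+UseG1GC -javaagent:/opt/alluxio/extensions/jmx_prometheus_javaagent-0.15.0.jar=13753:/opt/alluxio/conf/jmx-prometheus.yaml"

monitor_opts="$(strip_for_monitor "$worker_opts")"
echo "$monitor_opts"
```

With options filtered this way, the monitor would neither reserve a 16G heap nor start a second HTTP server on port 13753.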

dbw9580 commented 1 year ago

I see you are trying to attach a Java agent profiler via the environment variable. You can do so by specifying the agent in the dedicated environment variable ALLUXIO_WORKER_ATTACH_OPTS; see the documentation here: https://docs.alluxio.io/os/user/stable/en/administration/Troubleshooting.html#debugging-alluxio-processes
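
As a rough sketch of that suggestion, the agent flag from the original report could be moved out of ALLUXIO_WORKER_JAVA_OPTS and into ALLUXIO_WORKER_ATTACH_OPTS in alluxio-env.sh. This assumes, following the comment above, that the attach options are applied only to the worker process itself and not to the monitor. The jar path, port, and YAML path are copied from the report.

```shell
# Sketch for conf/alluxio-env.sh. Assumption: ALLUXIO_WORKER_ATTACH_OPTS is
# applied only to the worker JVM, as the suggestion above implies.

# Heap and GC settings stay in the regular worker options.
ALLUXIO_WORKER_JAVA_OPTS="${ALLUXIO_WORKER_JAVA_OPTS:-} -Xms16G -Xmx32G -XX:+UseG1GC"

# The Prometheus JMX agent moves to the dedicated attach variable.
ALLUXIO_WORKER_ATTACH_OPTS="-javaagent:${ALLUXIO_HOME}/extensions/jmx_prometheus_javaagent-0.15.0.jar=13753:${ALLUXIO_HOME}/conf/jmx-prometheus.yaml"
```

Note that this only addresses the port conflict; the heap flags remain in ALLUXIO_WORKER_JAVA_OPTS, so the monitor memory concern raised in the report is a separate question.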

Also see related discussion here https://github.com/Alluxio/alluxio/issues/15168#issuecomment-1072641897

dbw9580 commented 1 year ago

The monitor env vars were introduced in https://github.com/Alluxio/alluxio/pull/13576 to address a similar issue. The PR was mostly a workaround that removes JDWP-specific options. I am not sure why ALLUXIO_WORKER_MONITOR_JAVA_OPTS must extend ALLUXIO_WORKER_JAVA_OPTS, but from the context of #13576 my guess is that it was mostly meant to stay backward compatible.

Judging from the context, I'd say ALLUXIO_WORKER_MONITOR_JAVA_OPTS should not duplicate ALLUXIO_WORKER_JAVA_OPTS and should instead have its own set of defaults. @humengyu2012 I'd really appreciate it if you could put together a fix PR and test whether a dedicated set of defaults for the monitor is appropriate.

LuQQiu commented 1 year ago

Fixed by https://github.com/Alluxio/alluxio/pull/16688

github-actions[bot] commented 1 year ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed in two weeks if no further activity occurs. Thank you for your contributions.