ganglia / monitor-core

Ganglia Monitoring core
BSD 3-Clause "New" or "Revised" License

Questions about abnormalities in the passive close of gmond #319

Closed geonmo closed 3 years ago

geonmo commented 3 years ago

Hello, everyone.

An anomaly started to occur after we recently added a new data_source to Ganglia.

Some of the other data_sources then fail, and the only fix is to restart the gmond daemon for each affected source.

We looked at the TCP socket information to check for related symptoms.

So far, we've learned:

Considering the information above, it seems that the gmond code does not correctly handle closing the TCP sockets used for requests from gmetad.

Looking at the gmond code, I suspect the problem is in the following lines.

https://github.com/ganglia/monitor-core/blob/master/gmond/gmond.c#L240-L241

Here, if apr_socket_send is called and its return value is never APR_SUCCESS, we suspect that the routine loops forever.

Of course, it is correct to treat the transfer as complete once the full number of bytes has been sent, but there is no guarantee that a send never fails with an error (for example, a return value less than zero).

If this guess is correct, could you fix the code?

If not, please tell me how to analyze the problem further.
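To make the concern concrete, here is a minimal sketch of a send loop that stops on the first real error instead of retrying forever. It is not the gmond code: it uses the plain POSIX `send` call rather than `apr_socket_send`, the helper name `send_all` is hypothetical, and `MSG_NOSIGNAL` assumes Linux or another POSIX.1-2008 system.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <sys/socket.h>
#include <sys/types.h>

/* Hypothetical helper (not taken from gmond): try to write exactly `len`
 * bytes from `buf` to the socket `fd`.
 * Returns 0 when every byte was sent, -1 otherwise. */
int send_all(int fd, const char *buf, size_t len)
{
    size_t sent = 0;

    assert(buf != NULL);

    while (sent < len) {
        /* MSG_NOSIGNAL: report a closed peer as an error (EPIPE)
         * instead of raising SIGPIPE. */
        ssize_t n = send(fd, buf + sent, len - sent, MSG_NOSIGNAL);

        if (n > 0) {
            sent += (size_t)n;   /* partial write: advance and keep going */
        } else if (n < 0 && errno == EINTR) {
            continue;            /* interrupted by a signal: retry */
        } else {
            return -1;           /* any other failure: leave the loop */
        }
    }
    return 0;
}
```

The point of the sketch is the last branch: the loop condition only tracks progress, so every failure path needs its own exit. A loop that keeps retrying until the call succeeds can spin indefinitely once the peer has closed the connection.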

geonmo commented 3 years ago

I learned only later that the old manual gave a timeout value of 1 in the tcp_accept_channel section, and that it was recently changed to -1. With a value of -1, it is reasonable to assume that gmond waits forever when the socket is closed from the gmetad side. We are now running with timeout set back to the earlier value of 1, and no errors have occurred so far.
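For readers who want to see why a timeout matters here, the following is a minimal sketch of a bounded blocking send at the socket level. It is an illustration under assumptions, not how gmond applies the setting: it uses the POSIX `SO_SNDTIMEO` socket option, and the helper names `set_send_timeout_ms` and `send_timed_out` are hypothetical. APR offers `apr_socket_timeout_set` for the same purpose, where a negative timeout means that calls block.

```c
#include <assert.h>
#include <errno.h>
#include <sys/socket.h>
#include <sys/time.h>

/* Hypothetical helper: limit how long a blocking send() on `fd` may wait.
 * `ms` is the limit in milliseconds; 0 means no limit (block forever).
 * Returns 0 on success, -1 on error (as setsockopt does). */
int set_send_timeout_ms(int fd, long ms)
{
    struct timeval tv;

    assert(ms >= 0);
    tv.tv_sec = ms / 1000;
    tv.tv_usec = (ms % 1000) * 1000;
    return setsockopt(fd, SOL_SOCKET, SO_SNDTIMEO, &tv, sizeof tv);
}

/* Hypothetical helper: after send() returned -1, tell a timeout
 * apart from other errors. */
int send_timed_out(void)
{
    return errno == EAGAIN || errno == EWOULDBLOCK;
}
```

With a limit in place, a send to a peer that has stopped reading fails after the timeout and the caller can close the connection. Without one, the same call can stay blocked for as long as the peer keeps the socket open.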

We will close the issue.