OpenTSDB 2.3 occasionally stops logging to opentsdb.log

OpenTSDB 2.3 occasionally stops logging to opentsdb.log #1492

Open HariSekhon opened 5 years ago

HariSekhon commented 5 years ago

Occasionally OpenTSDB 2.3 stops logging to opentsdb.log for an extended period and doesn't resume, even though the process is still up. Under normal operation we see log entries every second from all our client connections and RPC connections to HBase.

This seems like a bug.

Restarting OpenTSDB resumes logging like normal.

HariSekhon commented 5 years ago

This has happened a few more times, each time after I finished several thousand metric exports, imports under a new name (export files rewritten with 'sed'), scan --delete runs and, finally, deletion of the old UIDs. All of those operations completed successfully. It's almost as if the load pushed OpenTSDB onto a different code path where it no longer logs. It certainly didn't die, as the process was still up, and if it were going to crash I would have expected that part way through, not after it had successfully iterated through thousands of metric migrations.

manolama commented 5 years ago

That's a really odd one. Can you capture a jstack of the process when it hangs like that, to see what threads are where? Also, the disk hasn't run out of space, right?

HariSekhon commented 5 years ago

No, I checked and disk space was fine.

We added monitoring to detect this, so the next time it happens I'll run a jstack.
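
A minimal sketch of what such monitoring could look like, assuming staleness is judged by the log file's modification time. This is not the actual monitoring used here: the log path, PID, threshold and output path are hypothetical placeholders, and the only external tool relied on is the JDK's `jstack`.

```python
import os
import subprocess
import time


def log_is_stale(log_path, max_age_seconds, now=None):
    """Return True if log_path was last modified more than max_age_seconds ago."""
    if now is None:
        now = time.time()
    return (now - os.path.getmtime(log_path)) > max_age_seconds


def capture_jstack(pid, out_path):
    """Write a thread dump of the JVM with the given PID to out_path."""
    with open(out_path, "w") as out:
        # "jstack -l" also prints lock information, which helps spot blocked threads.
        subprocess.run(
            ["jstack", "-l", str(pid)],
            stdout=out,
            stderr=subprocess.STDOUT,
            check=True,
        )


if __name__ == "__main__":
    LOG_PATH = "/var/log/opentsdb/opentsdb.log"  # assumed location, adjust as needed
    TSD_PID = 12345  # hypothetical PID of the TSD daemon
    # Five minutes of silence is an arbitrary threshold for a log that
    # normally receives entries every second.
    if log_is_stale(LOG_PATH, max_age_seconds=300):
        capture_jstack(TSD_PID, "/tmp/opentsdb-jstack-%d.txt" % int(time.time()))
```

Running this from cron every few minutes would leave a thread dump on disk from the moment logging stalls, before anyone restarts the daemon.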

HariSekhon commented 5 years ago

It seems that after I generate a lot of activity on OpenTSDB and the script generating that activity finishes, the daemon stops logging for several hours, even though requests that should be logged arrive every second.

After a restart it started logging connections again every second.

manolama commented 5 years ago

And does this happen when you run a CLI command and a TSD daemon is running? If so it could be the same as https://github.com/OpenTSDB/opentsdb/issues/1474 in which case could you try adding the prudent config @koketani and let us know if it helps please?

HariSekhon commented 5 years ago

Last week I rolled out the prudent config to my clusters and ensured they were all restarted with the new logback.xml config (they also log to new paths with the date in the filename), but the issue still recurred.

I run a weekly job to age out metrics under the sandbox. prefix: it first dumps the metric list and then iterates over each metric with a tsdb scan --delete.

It is when this process of running lots of tsdb commands finishes that the main opentsdb daemon which is running on the local node stops logging.
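
For anyone trying to reproduce this, a job of the kind described above could be sketched roughly as follows. This is not the actual script: the start date and aggregator are placeholder assumptions, the metric list is assumed to have been dumped already (for example with `tsdb uid grep metrics`), and the sketch only builds the `tsdb scan --delete` command lines rather than running them.

```python
def build_scan_delete_commands(metrics, prefix="sandbox.",
                               start="2000/01/01-00:00:00", aggregator="sum"):
    """Return one 'tsdb scan --delete' argument list per metric under prefix.

    start and aggregator are illustrative defaults, not values from the
    original job.
    """
    return [
        ["tsdb", "scan", "--delete", start, aggregator, metric]
        for metric in metrics
        if metric.startswith(prefix)
    ]


if __name__ == "__main__":
    # Hypothetical metric names standing in for the dumped metric list.
    dumped = ["sandbox.test.cpu", "prod.cpu.user", "sandbox.test.mem"]
    for cmd in build_scan_delete_commands(dumped):
        # Each of these would start a separate short-lived tsdb JVM on the
        # same node as the TSD daemon, e.g. via subprocess.run(cmd, check=True).
        print(" ".join(cmd))
```

The point of interest is that every iteration launches a separate CLI process on the node where the daemon is also running.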

Switching to the prudent config hasn't resolved the issue.