Scrin / RuuviCollector

Utility to collect measurements from RuuviTags and store them in InfluxDB
MIT License
125 stars 62 forks source link

Exit uncleanly when database connection is lost #68

Closed anon8675309 closed 2 years ago

anon8675309 commented 2 years ago

Describe the feature you'd like to have implemented When the database connection is lost, I'd like the software to exit with a non-zero return code.

Describe alternatives you've considered Instead of exiting, just keep trying to reconnect forever. This was discussed in #8 but the decision was to exit uncleanly and allow the supervisor program (e.g., systemd) to deal with restarting failed services.

The scope of #8 was only if the service could not make the initial database connection (e.g., if the service were started before the database was ready). It does not handle the case where the database connection is established but then later lost.

Do you want to implement this feature? I'd be happy to, but if someone else wanted to that'd be fine too.

Additional info When a database connection currently drops, the error message printed is this one:

Uncaught exception while handling measurements from MAC address "00FF00AABBCC", if this repeats and this is not a Ruuvitag, try blacklisting it

This proposed change would have a more specific error message in the event that the database save was what failed.

anon8675309 commented 2 years ago

I've made this change and am currently testing it on my production server. Once it's sufficiently tested, I'll submit a merge request.

Scrin commented 2 years ago

That looks like a weird error message for lost connections, connection issues shouldn't be thrown all the way to main in normal circumstances, which makes me wonder if there's something else wrong as well. What is the actual exception and stacktrace associated with that error message?

anon8675309 commented 2 years ago

I switched my configuration to not use batched mode. In batch mode, you are correct, the exception will be swallowed by the InfluxDB library and the software will never be able to detect that there was a problem and keep on blindly calling write(point) and never knowing that it's failing. The library seems to try to reconnect a few times, but then gives up.

Non-batch mode

Jul 26 13:26:09 ruuvi1.example.com java[1939]: 2022-07-26 13:26:09.814 INFO  [Config] Creating database connection for storageMethod: influxdb
Jul 26 13:26:11 ruuvi1.example.com java[1939]: 2022-07-26 13:26:11.434 INFO  [Main] Successfully reading data from hcidump
Jul 26 16:51:48 ruuvi1.example.com java[1939]: 2022-07-26 16:51:48.946 WARN  [Main] Uncaught exception while handling measurements from MAC address "F36D288E8A14", if this repeats and this is not a Ruuvitag, try blacklisting it
Jul 26 16:51:48 ruuvi1.example.com java[1939]: org.influxdb.InfluxDBIOException: java.net.SocketTimeoutException: timeout
Jul 26 16:51:48 ruuvi1.example.com java[1939]:         at org.influxdb.impl.InfluxDBImpl.execute(InfluxDBImpl.java:800)
Jul 26 16:51:48 ruuvi1.example.com java[1939]:         at org.influxdb.impl.InfluxDBImpl.write(InfluxDBImpl.java:455)
Jul 26 16:51:48 ruuvi1.example.com java[1939]:         at org.influxdb.impl.InfluxDBImpl.write(InfluxDBImpl.java:430)
Jul 26 16:51:48 ruuvi1.example.com java[1939]:         at org.influxdb.impl.InfluxDBImpl.write(InfluxDBImpl.java:408)
Jul 26 16:51:48 ruuvi1.example.com java[1939]:         at fi.tkgwf.ruuvi.db.InfluxDBConnection.save(InfluxDBConnection.java:56)
Jul 26 16:51:48 ruuvi1.example.com java[1939]:         at java.util.Optional.ifPresent(Optional.java:159)
Jul 26 16:51:48 ruuvi1.example.com java[1939]:         at fi.tkgwf.ruuvi.service.PersistenceService.store(PersistenceService.java:33)
Jul 26 16:51:48 ruuvi1.example.com java[1939]:         at java.util.Optional.ifPresent(Optional.java:159)
Jul 26 16:51:48 ruuvi1.example.com java[1939]:         at fi.tkgwf.ruuvi.Main.run(Main.java:106)
Jul 26 16:51:48 ruuvi1.example.com java[1939]:         at fi.tkgwf.ruuvi.Main.run(Main.java:70)
Jul 26 16:51:48 ruuvi1.example.com java[1939]:         at fi.tkgwf.ruuvi.Main.main(Main.java:30)
Jul 26 16:51:48 ruuvi1.example.com java[1939]: Caused by: java.net.SocketTimeoutException: timeout
Jul 26 16:51:48 ruuvi1.example.com java[1939]:         at okio.Okio$4.newTimeoutException(Okio.java:232)
Jul 26 16:51:48 ruuvi1.example.com java[1939]:         at okio.AsyncTimeout.exit(AsyncTimeout.java:285)
Jul 26 16:51:48 ruuvi1.example.com java[1939]:         at okio.AsyncTimeout$2.read(AsyncTimeout.java:241)
Jul 26 16:51:48 ruuvi1.example.com java[1939]:         at okio.RealBufferedSource.indexOf(RealBufferedSource.java:355)
Jul 26 16:51:48 ruuvi1.example.com java[1939]:         at okio.RealBufferedSource.readUtf8LineStrict(RealBufferedSource.java:227)
Jul 26 16:51:48 ruuvi1.example.com java[1939]:         at okhttp3.internal.http1.Http1Codec.readHeaderLine(Http1Codec.java:215)
Jul 26 16:51:48 ruuvi1.example.com java[1939]:         at okhttp3.internal.http1.Http1Codec.readResponseHeaders(Http1Codec.java:189)
Jul 26 16:51:48 ruuvi1.example.com java[1939]:         at okhttp3.internal.http.CallServerInterceptor.intercept(CallServerInterceptor.java:88)
Jul 26 16:51:48 ruuvi1.example.com java[1939]:         at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:147)
Jul 26 16:51:48 ruuvi1.example.com java[1939]:         at okhttp3.internal.connection.ConnectInterceptor.intercept(ConnectInterceptor.java:45)
Jul 26 16:51:48 ruuvi1.example.com java[1939]:         at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:147)
Jul 26 16:51:48 ruuvi1.example.com java[1939]:         at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:121)
Jul 26 16:51:48 ruuvi1.example.com java[1939]:         at okhttp3.internal.cache.CacheInterceptor.intercept(CacheInterceptor.java:93)
Jul 26 16:51:48 ruuvi1.example.com java[1939]:         at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:147)
Jul 26 16:51:49 ruuvi1.example.com java[1939]:         at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:121)
Jul 26 16:51:49 ruuvi1.example.com java[1939]:         at okhttp3.internal.http.BridgeInterceptor.intercept(BridgeInterceptor.java:93)
Jul 26 16:51:49 ruuvi1.example.com java[1939]:         at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:147)
Jul 26 16:51:49 ruuvi1.example.com java[1939]:         at okhttp3.internal.http.RetryAndFollowUpInterceptor.intercept(RetryAndFollowUpInterceptor.java:126)
Jul 26 16:51:49 ruuvi1.example.com java[1939]:         at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:147)
Jul 26 16:51:49 ruuvi1.example.com java[1939]:         at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:121)
Jul 26 16:51:49 ruuvi1.example.com java[1939]:         at org.influxdb.impl.BasicAuthInterceptor.intercept(BasicAuthInterceptor.java:22)
Jul 26 16:51:49 ruuvi1.example.com java[1939]:         at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:147)
Jul 26 16:51:49 ruuvi1.example.com java[1939]:         at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:121)
Jul 26 16:51:49 ruuvi1.example.com java[1939]:         at org.influxdb.impl.GzipRequestInterceptor.intercept(GzipRequestInterceptor.java:53)
Jul 26 16:51:49 ruuvi1.example.com java[1939]:         at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:147)
Jul 26 16:51:49 ruuvi1.example.com java[1939]:         at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:121)
Jul 26 16:51:49 ruuvi1.example.com java[1939]:         at okhttp3.logging.HttpLoggingInterceptor.intercept(HttpLoggingInterceptor.java:144)
Jul 26 16:51:49 ruuvi1.example.com java[1939]:         at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:147)
Jul 26 16:51:49 ruuvi1.example.com java[1939]:         at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:121)
Jul 26 16:51:49 ruuvi1.example.com java[1939]:         at okhttp3.RealCall.getResponseWithInterceptorChain(RealCall.java:200)
Jul 26 16:51:49 ruuvi1.example.com java[1939]:         at okhttp3.RealCall.execute(RealCall.java:77)
Jul 26 16:51:49 ruuvi1.example.com java[1939]:         at retrofit2.OkHttpCall.execute(OkHttpCall.java:180)
Jul 26 16:51:49 ruuvi1.example.com java[1939]:         at org.influxdb.impl.InfluxDBImpl.execute(InfluxDBImpl.java:788)
Jul 26 16:51:49 ruuvi1.example.com java[1939]:         ... 10 more
Jul 26 16:51:49 ruuvi1.example.com java[1939]: Caused by: java.net.SocketException: Socket closed
Jul 26 16:51:49 ruuvi1.example.com java[1939]:         at java.net.SocketInputStream.read(SocketInputStream.java:203)
Jul 26 16:51:49 ruuvi1.example.com java[1939]:         at java.net.SocketInputStream.read(SocketInputStream.java:141)
Jul 26 16:51:49 ruuvi1.example.com java[1939]:         at sun.security.ssl.InputRecord.readFully(InputRecord.java:465)
Jul 26 16:51:49 ruuvi1.example.com java[1939]:         at sun.security.ssl.InputRecord.read(InputRecord.java:503)
Jul 26 16:51:49 ruuvi1.example.com java[1939]:         at sun.security.ssl.SSLSocketImpl.readRecord(SSLSocketImpl.java:973)
Jul 26 16:51:49 ruuvi1.example.com java[1939]:         at sun.security.ssl.SSLSocketImpl.readDataRecord(SSLSocketImpl.java:930)
Jul 26 16:51:49 ruuvi1.example.com java[1939]:         at sun.security.ssl.AppInputStream.read(AppInputStream.java:105)
Jul 26 16:51:49 ruuvi1.example.com java[1939]:         at okio.Okio$2.read(Okio.java:140)
Jul 26 16:51:49 ruuvi1.example.com java[1939]:         at okio.AsyncTimeout$2.read(AsyncTimeout.java:237)
Jul 26 16:51:49 ruuvi1.example.com java[1939]:         ... 40 more
Jul 26 16:51:49 ruuvi1.example.com java[1939]: 2022-07-26 16:51:48.972 DEBUG [Main] Offending line:   6D 28 8E 8A 14 9B

Batch mode

Jul 23 00:01:22 ruuvi1.example.com java[1286]: Jul 23, 2022 12:01:22 AM org.influxdb.impl.BatchProcessor write
Jul 23 00:01:22 ruuvi1.example.com java[1286]: SEVERE: Batch could not be sent. Data will be lost
Jul 23 00:01:22 ruuvi1.example.com java[1286]: org.influxdb.InfluxDBIOException: java.net.SocketTimeoutException: timeout
Jul 23 00:01:22 ruuvi1.example.com java[1286]:         at org.influxdb.impl.InfluxDBImpl.execute(InfluxDBImpl.java:800)
Jul 23 00:01:22 ruuvi1.example.com java[1286]:         at org.influxdb.impl.InfluxDBImpl.write(InfluxDBImpl.java:455)
Jul 23 00:01:22 ruuvi1.example.com java[1286]:         at org.influxdb.impl.OneShotBatchWriter.write(OneShotBatchWriter.java:22)
Jul 23 00:01:22 ruuvi1.example.com java[1286]:         at org.influxdb.impl.BatchProcessor.write(BatchProcessor.java:321)
Jul 23 00:01:22 ruuvi1.example.com java[1286]:         at org.influxdb.impl.BatchProcessor$1.run(BatchProcessor.java:269)
Jul 23 00:01:22 ruuvi1.example.com java[1286]:         at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
Jul 23 00:01:22 ruuvi1.example.com java[1286]:         at java.util.concurrent.FutureTask.run(FutureTask.java:266)
Jul 23 00:01:22 ruuvi1.example.com java[1286]:         at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
Jul 23 00:01:22 ruuvi1.example.com java[1286]:         at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
Jul 23 00:01:22 ruuvi1.example.com java[1286]:         at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
Jul 23 00:01:22 ruuvi1.example.com java[1286]:         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
Jul 23 00:01:22 ruuvi1.example.com java[1286]:         at java.lang.Thread.run(Thread.java:745)
Jul 23 00:01:22 ruuvi1.example.com java[1286]: Caused by: java.net.SocketTimeoutException: timeout
Scrin commented 2 years ago

non-batch mode would explain it indeed. Perhaps there should be a try-catch in InfluxDBConnection.save() for SocketTimeoutException for a more sensible error message than the global catch in main for when non-batch mode is used

anon8675309 commented 2 years ago

My current implementation is to just add a more specific catch block in main() since that's the function where I want the behaviour to change (exit, rather than continuing to loop). The catch block could be put in InfluxDBConnection, but that would require changing InfluxDBConnection::save() to return success or failure, and PersistenceService::store() as well. Then it'd be up to the caller to check the return code of PersistenceService::store().

I'm not entirely sure how to get return values from java.util.Optional (that's a new one for me), so I'll submit my merge request as is. There's no pride in authorship here. I don't care how it's implemented so long as I'm able to recover from database SNAFUs.

Scrin commented 2 years ago

Done in #69