Scrin / RuuviCollector

Utility to collect measurements from RuuviTags and store them in InfluxDB
MIT License

Crash on Utils.java #17

Closed · fobin closed this 5 years ago

fobin commented 5 years ago

(screenshot of the error attached)

Scrin commented 5 years ago

This happens when hcidump manages to output invalid data, it's mostly harmless (apart from "unnecessary garbage" in the logs) and the collector is able to discard that and resume normal operation automatically.

Though it seems the current "invalid data detection" logic does not properly detect invalid data in cases where the header data is present but the (remote) mac address is not, this needs fixing.
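
For reference, the eventual fix discussed later in this thread amounts to a bounds check before reading the MAC tokens. Below is a minimal sketch, assuming (per the stack trace and the patch further down) that the hcidump line is split on spaces and the reversed MAC occupies tokens 8 through 13; it is not RuuviCollector's actual code.

```java
// Hedged sketch of a defensive getMacFromLine-style parser.
// Assumption: the line is split on spaces and the reversed MAC sits in tokens 8..13,
// as the ArrayIndexOutOfBoundsException and the patch later in this thread suggest.
public static String getMacFromLine(String line) {
    if (line == null || line.trim().isEmpty()) {
        return null;
    }
    String[] terms = line.trim().split(" ");
    if (terms.length <= 13) {
        // Header present but too short to contain a remote MAC: treat as invalid data.
        return null;
    }
    StringBuilder sb = new StringBuilder();
    for (int i = 13; i >= 8; i--) { // MAC bytes arrive in reverse order
        sb.append(terms[i]);
    }
    return sb.toString();
}
```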

fobin commented 5 years ago

At least for me this requires a restart of the collector service. It's weird that this only seems to have started recently, and I haven't updated the collector. The Raspberry Pi itself I have updated.

Sent with GitHawk

Scrin commented 5 years ago

Hmm, that sounds like there's something else wrong then too, unless there's more in the log after the "Offending line: " debug print. When this happens the next time and the collector needs restarting, could you check whether the hcidump and hcitool processes are running (before restarting the collector)? `ps -ef | grep hci` or something; there should be both `hcitool lescan --duplicates --passive` and `hcidump --raw`. I suspect hcitool somehow dies, as the collector does not watch that process after it has been successfully started. If that's the case, then there's another improvement to be made.
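
For context, here is a hedged sketch of the kind of watchdog described above: a periodic check that both child processes are still alive, exiting so the service manager can restart the collector if either one dies. The class name, fields, and restart strategy are hypothetical, not RuuviCollector's actual implementation.

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

// Hypothetical watchdog sketch: poll the two child processes and bail out if either dies.
public class HciProcessWatchdog {

    private final Process hcitool; // e.g. the "hcitool lescan --duplicates --passive" process
    private final Process hcidump; // e.g. the "hcidump --raw" process

    public HciProcessWatchdog(Process hcitool, Process hcidump) {
        this.hcitool = hcitool;
        this.hcidump = hcidump;
    }

    public void start() {
        ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
        scheduler.scheduleAtFixedRate(() -> {
            if (!hcitool.isAlive() || !hcidump.isAlive()) {
                // Simplest possible reaction: exit non-zero and let systemd restart the collector.
                System.err.println("hcitool or hcidump died, exiting so the service manager can restart the collector");
                System.exit(1);
            }
        }, 10, 10, TimeUnit.SECONDS);
    }
}
```

With `ProcessBuilder` handles for the two commands, something like `new HciProcessWatchdog(hcitoolProcess, hcidumpProcess).start()` would cover the failure mode suspected here.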

fobin commented 5 years ago

I'll do that.

Sent with GitHawk

fobin commented 5 years ago

(screenshot attached)

Both seem to be there.

fobin commented 5 years ago

Finally got to a computer and saw the whole error.

Apr 24 18:28:11 raspberrypi java[30483]: 2019-04-24 18:28:11.151 DEBUG [Main] Offending line: > 04 0E 05 01 0F 20 00 80
Apr 24 18:28:11 raspberrypi java[30483]: 2019-04-24 18:28:11.152 WARN [Main] Uncaught exception while handling measurements from MAC address "000003130202", if this repeats and this is not a Ruuvitag, consider blacklisting it
Apr 24 18:28:11 raspberrypi java[30483]: java.lang.ArrayIndexOutOfBoundsException: 5
Apr 24 18:28:11 raspberrypi java[30483]:         at fi.tkgwf.ruuvi.utils.Utils.getMacFromLine(Utils.java:53)
Apr 24 18:28:11 raspberrypi java[30483]:         at fi.tkgwf.ruuvi.Main.run(Main.java:98)
Apr 24 18:28:11 raspberrypi java[30483]:         at fi.tkgwf.ruuvi.Main.main(Main.java:45)
Apr 24 18:28:11 raspberrypi java[30483]: 2019-04-24 18:28:11.153 DEBUG [Main] Offending line: > 04 0E 06 01 12 0C 00 00 00

It's not my Ruuvi tag so I'll try to blacklist it.

Scrin commented 5 years ago

Interesting :thinking: I wonder if the scanning stops for some reason without the hcitool process dying. Could you try to run `hcidump --raw` (as the pi user you use to run the collector) the next time this happens, to see if you get any data output before restarting the collector? And if you do get output (unlikely) while the collector is not working, it would be nice if you could give a sample of the raw data to ensure it actually contains messages from your tags. You can run the dump command while the collector is working to see what kind of data should be present.

fobin commented 5 years ago

Yeah I get output from that. Still crashes.

04 3E 1A 02 01 00 01 00 22 1C 6A B1 5E 0E 02 01 1A 0A FF 4C 00 10 05 01 98 DE D4 20 B1 04 3E 21 02 01 03 01 17 C5 C6 A7 59 DF 15 02 01 06 11 FF 99 04 03 2F 19 16 CB 98 FF EB 00 03 04 14 0B 83 AB 04 3E 27 02 01 00 01 C0 25 C0 E3 D7 E3 1B 02 01 06 12 FF 4C 00 06 6D 00 C8 D8 92 86 52 DD 07 00 F8 00 01 02 04 08 45 76 65 A0 04 3E 2B 02 01 00 01 E0 98 57 33 31 F6 1F 02 01 06 16 FF 4C 00 06 31 00 50 38 3A D8 17 F8 07 00 59 01 02 02 24 EF 7F 19 04 08 45 76 65 AA 04 3E 1A 02 01 00 01 6C 90 2A 32 E0 46 0E 02 01 1A 0A FF 4C 00 10 05 1A 1C 24 55 65 9D 04 3E 1A 02 01 00 01 34 58 B0 4A D6 59 0E 02 01 06 0A FF 4C 00 10 05 01 18 FC 4D D2 BF 04 3E 1B 02 01 03 01 C8 BE 61 E3 A1 1B 0F 02 01 1B 0B FF 4C 00 09 06 03 8A 0A 00 01 18 B5 04 3E 1A 02 01 00 01 10 B1 F8 DF 06 5C 0E 02 01 1A 0A FF 4C 00 10 05 07 1C F3 24 2E AE 04 3E 1A 02 01 00 00 53 68 E1 C0 CB 40 0E 02 01 1A 0A FF 4C 00 10 05 01 14 F4 25 53 A4 04 3E 2A 02 01 00 01 CA 43 BF 29 82 C7 1E 02 01 06 12 FF 4C 00 06 6D 00 BC CF A2 EB EB 98 05 00 3A 03 01 02 07 09 41 36 30 20 43 65 BA 04 3E 2A 02 01 00 01 33 09 AE D4 66 C5 1E 02 01 06 12 FF 4C 00 06 6D 00 48 6A 0F 86 EA 4E 05 00 38 00 01 02 07 09 41 36 30 20 43 65 AD 04 3E 27 02 01 00 01 C0 25 C0 E3 D7 E3 1B 02 01 06 12 FF 4C 00 06 6D 00 C8 D8 92 86 52 DD 07 00 F8 00 01 02 04 08 45 76 65 A2 04 3E 21 02 01 03 01 07 59 81 8A 5F C8 15 02 01 06 11 FF 99 04 03 36 17 2B CB 9B FF F5 FF F8 04 0F 0B B3 AD 04 3E 2B 02 01 00 01 E0 98 57 33 31 F6 1F 02 01 06 16 FF 4C 00 06 31 00 50 38 3A D8 17 F8 07 00 59 01 02 02 24 EF 7F 19 04 08 45 76 65 AA 04 3E 1A 02 01 00 01 34 58 B0 4A D6 59 0E 02 01 06 0A FF 4C 00 10 05 01 18 FC 4D D2 C1 04 3E 21 02 01 03 01 C0 1F 18 67 76 D6 15 02 01 06 11 FF 99 04 03 3A 16 18 CB 58 FF E2 00 13 03 F2 0B 9B AA 04 3E 1A 02 01 00 01 00 22 1C 6A B1 5E 0E 02 01 1A 0A FF 4C 00 10 05 01 98 DE D4 20 A1 04 3E 1B 02 01 03 01 C8 BE 61 E3 A1 1B 0F 02 01 1B 0B FF 4C 00 09 06 03 8A 0A 00 01 18 A4 04 3E 2A 02 01 00 01 CA 43 BF 29 82 C7 1E 02 01 06 12 FF 4C 00 06 6D 00 BC CF A2 EB EB 98 05 00 3A 03 01 02 07 09 41 36 30 20 43 65 C4

fobin commented 5 years ago

It reported a new Bluetooth address again in this:

Uncaught exception while handling measurements from MAC address "E3D7E3C025C0", if this repeats and this is not a Ruuvitag, consider blacklisting it

Scrin commented 5 years ago

I've created a new release, v0.2.4, which includes a number of other changes, fixes, and improved logging. Could you try that version to see if the issue still persists? And whether the error message(s) are any different.

fobin commented 5 years ago

Installed the new version. I'll report back here.

fobin commented 5 years ago

Now I got this:

May 02 09:29:07 raspberrypi java[24231]: May 02, 2019 9:29:02 AM org.influxdb.impl.BatchProcessor write
May 02 09:29:07 raspberrypi java[24231]: SEVERE: Batch could not be sent. Data will be lost
May 02 09:29:07 raspberrypi java[24231]: org.influxdb.InfluxDBIOException: java.net.SocketTimeoutException: timeout
May 02 09:29:07 raspberrypi java[24231]:         at org.influxdb.impl.InfluxDBImpl.execute(InfluxDBImpl.java:800)
May 02 09:29:07 raspberrypi java[24231]:         at org.influxdb.impl.InfluxDBImpl.write(InfluxDBImpl.java:455)
May 02 09:29:07 raspberrypi java[24231]:         at org.influxdb.impl.OneShotBatchWriter.write(OneShotBatchWriter.java:22)
May 02 09:29:07 raspberrypi java[24231]:         at org.influxdb.impl.BatchProcessor.write(BatchProcessor.java:321)
May 02 09:29:07 raspberrypi java[24231]:         at org.influxdb.impl.BatchProcessor$1.run(BatchProcessor.java:269)
May 02 09:29:07 raspberrypi java[24231]:         at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
May 02 09:29:07 raspberrypi java[24231]:         at java.util.concurrent.FutureTask.run(FutureTask.java:266)
May 02 09:29:07 raspberrypi java[24231]:         at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
May 02 09:29:07 raspberrypi java[24231]:         at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
May 02 09:29:07 raspberrypi java[24231]:         at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
May 02 09:29:07 raspberrypi java[24231]:         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
May 02 09:29:07 raspberrypi java[24231]:         at java.lang.Thread.run(Thread.java:745)
May 02 09:29:07 raspberrypi java[24231]: Caused by: java.net.SocketTimeoutException: timeout
May 02 09:29:07 raspberrypi java[24231]:         at okio.Okio$4.newTimeoutException(Okio.java:232)
May 02 09:29:07 raspberrypi java[24231]:         at okio.AsyncTimeout.exit(AsyncTimeout.java:285)
May 02 09:29:07 raspberrypi java[24231]:         at okio.AsyncTimeout$2.read(AsyncTimeout.java:241)
May 02 09:29:07 raspberrypi java[24231]:         at okio.RealBufferedSource.indexOf(RealBufferedSource.java:355)
May 02 09:29:07 raspberrypi java[24231]:         at okio.RealBufferedSource.readUtf8LineStrict(RealBufferedSource.java:227)
May 02 09:29:07 raspberrypi java[24231]:         at okhttp3.internal.http1.Http1Codec.readHeaderLine(Http1Codec.java:215)
May 02 09:29:07 raspberrypi java[24231]:         at okhttp3.internal.http1.Http1Codec.readResponseHeaders(Http1Codec.java:189)
May 02 09:29:07 raspberrypi java[24231]:         at okhttp3.internal.http.CallServerInterceptor.intercept(CallServerInterceptor.java:88)
May 02 09:29:07 raspberrypi java[24231]:         at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:147)
May 02 09:29:07 raspberrypi java[24231]:         at okhttp3.internal.connection.ConnectInterceptor.intercept(ConnectInterceptor.java:45)
May 02 09:29:07 raspberrypi java[24231]:         at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:147)
May 02 09:29:07 raspberrypi java[24231]:         at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:121)
May 02 09:29:07 raspberrypi java[24231]:         at okhttp3.internal.cache.CacheInterceptor.intercept(CacheInterceptor.java:93)
May 02 09:29:07 raspberrypi java[24231]:         at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:147)
May 02 09:29:07 raspberrypi java[24231]:         at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:121)
May 02 09:29:07 raspberrypi java[24231]:         at okhttp3.internal.http.BridgeInterceptor.intercept(BridgeInterceptor.java:93)
May 02 09:29:07 raspberrypi java[24231]:         at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:147)
May 02 09:29:07 raspberrypi java[24231]:         at okhttp3.internal.http.RetryAndFollowUpInterceptor.intercept(RetryAndFollowUpInterceptor.java:126)
May 02 09:29:07 raspberrypi java[24231]:         at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:147)
May 02 09:29:07 raspberrypi java[24231]:         at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:121)
May 02 09:29:07 raspberrypi java[24231]:         at org.influxdb.impl.BasicAuthInterceptor.intercept(BasicAuthInterceptor.java:22)
May 02 09:29:07 raspberrypi java[24231]:         at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:147)
May 02 09:29:07 raspberrypi java[24231]:         at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:121)
May 02 09:29:07 raspberrypi java[24231]:         at org.influxdb.impl.GzipRequestInterceptor.intercept(GzipRequestInterceptor.java:53)
May 02 09:29:07 raspberrypi java[24231]:         at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:147)
May 02 09:29:07 raspberrypi java[24231]:         at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:121)
May 02 09:29:07 raspberrypi java[24231]:         at okhttp3.logging.HttpLoggingInterceptor.intercept(HttpLoggingInterceptor.java:144)
May 02 09:29:07 raspberrypi java[24231]:         at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:147)
May 02 09:29:07 raspberrypi java[24231]:         at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:121)
May 02 09:29:07 raspberrypi java[24231]:         at okhttp3.RealCall.getResponseWithInterceptorChain(RealCall.java:200)
May 02 09:29:07 raspberrypi java[24231]:         at okhttp3.RealCall.execute(RealCall.java:77)
May 02 09:29:07 raspberrypi java[24231]:         at retrofit2.OkHttpCall.execute(OkHttpCall.java:180)
May 02 09:29:07 raspberrypi java[24231]:         at org.influxdb.impl.InfluxDBImpl.execute(InfluxDBImpl.java:788)
May 02 09:29:07 raspberrypi java[24231]:         ... 11 more
May 02 09:29:07 raspberrypi java[24231]: Caused by: java.net.SocketTimeoutException: Read timed out
May 02 09:29:07 raspberrypi java[24231]:         at java.net.SocketInputStream.socketRead0(Native Method)
May 02 09:29:07 raspberrypi java[24231]:         at java.net.SocketInputStream.socketRead(SocketInputStream.java:116)
May 02 09:29:07 raspberrypi java[24231]:         at java.net.SocketInputStream.read(SocketInputStream.java:170)
May 02 09:29:07 raspberrypi java[24231]:         at java.net.SocketInputStream.read(SocketInputStream.java:141)
May 02 09:29:07 raspberrypi java[24231]:         at okio.Okio$2.read(Okio.java:140)
May 02 09:29:07 raspberrypi java[24231]:         at okio.AsyncTimeout$2.read(AsyncTimeout.java:237)
May 02 09:29:07 raspberrypi java[24231]:         ... 41 more

Scrin commented 5 years ago

That's a timeout from connecting to InfluxDB. Are you running InfluxDB locally or on another machine/server/cloud?

fobin commented 5 years ago

Locally on the Pi. I'll restart it to see if it happens again, but this is the first time I've seen that.

Scrin commented 5 years ago

Interesting. Have you checked whether you have enough free disk space and free memory, and that there's nothing hogging enormous amounts of CPU? These issues are starting to sound like the problem is somewhere else: something broken, or even a damaged filesystem.

fobin commented 5 years ago

/dev/root 15G 3.8G 9.8G 28% /

There's plenty of free space on the card and free memory is fine too, but it's still an SD card, so there might be something wonky with it. At least now, after a restart, it has worked for a while.

DG12 commented 5 years ago

Notice that the most recent error, "SEVERE: Batch could not be sent. Data will be lost", is not the same as the "uncaught exception" in the initial post.

I have the "SEVERE: Batch could not be sent" ... timeout problems on a Raspberry Pi Zero W on a daily basis, oddly often just after midnight. I am checking with the InfluxDB community regarding this; there are other reported, seemingly related issues. I also SOMETIMES see:

kern.err kernel: INFO: task kworker/0:2:20918 blocked for more than 120 seconds.
kern.err kernel: Tainted: G C 4.14.98+ #1200
kern.err kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
kern.info kernel: kworker/0:2 D 0 20918 2 0x00000000
kern.info kernel: Workqueue: events_freezable mmc_rescan
kern.warning kernel: [] (schedule) from [] (schedule+0x50/0xa8)
kern.warning kernel: [] (schedule) from [] (mmc_claim_host+0x98/0x210)
kern.warning kernel: [] (__mmc_claim_host) from [] (mmc_get_card+0x30/0x34)
kern.warning kernel: [] (mmc_get_card) from [] (mmc_sd_detect+0x20/0x70)
kern.warning kernel: [] (mmc_sd_detect) from [] (mmc_rescan+0x288/0x490)
kern.warning kernel: [] (mmc_rescan) from [] (process_one_work+0x11c/0x398)
kern.warning kernel: [] (process_one_work) from [] (worker_thread+0x3c/0x544)
kern.warning kernel: [] (worker_thread) from [] (kthread+0x120/0x15c)
kern.warning kernel: [] (kthread) from [] (ret_from_fork+0x14/0x28)

I run vmstat in a cron and see stuff like:

procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
 2  8      0  21504     84  35108    0    0    29    19   80   48 19  1 79  1  0

In /etc/sysctl.conf I have:

vm.swappiness = 5
vm.dirty_background_ratio = 5
vm.dirty_ratio = 80

Interesting that influxdb itself never seems to complain.

Scrin: I haven't checked your most recent code, but could you revise the priority on this to be err (instead of info) and omit the stack trace (since it doesn't really help and clutters up the log)? Thank you.

Scrin commented 5 years ago

> Scrin: I haven't checked your most recent code but could you revise the priority on this to be err (instead of info) and omit the stack trace (since it doesn't really help and clutters up the log) Thank you.

Priority of what is info? :thinking:

I can create a custom catch and logging for the timeout issue to get rid of the stack trace.
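
For illustration, here is a hedged sketch of what such a custom catch could look like around the InfluxDB write call, using the `org.influxdb.InfluxDBIOException` type visible in the logs above; the class and method names are hypothetical rather than RuuviCollector's actual code.

```java
import org.influxdb.InfluxDB;
import org.influxdb.InfluxDBIOException;
import org.influxdb.dto.Point;

// Hypothetical wrapper: log a timeout as a single line instead of a full stack trace.
public class QuietInfluxWriter {

    private final InfluxDB influxDB;

    public QuietInfluxWriter(InfluxDB influxDB) {
        this.influxDB = influxDB;
    }

    public void save(Point point) {
        try {
            influxDB.write(point);
        } catch (InfluxDBIOException e) {
            // Typically wraps java.net.SocketTimeoutException when the DB is slow or unreachable.
            System.err.println("Failed to write measurement to InfluxDB: " + e.getMessage());
        }
    }
}
```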

DG12 commented 5 years ago

I thought you coded the log entries:


> 2019-05-03 00:11:21 pi93graf user.info RuuviCollector[361]: May 03, 2019 12:11:19 AM org.influxdb.impl.**BatchProcessor write**
> 2019-05-03 00:11:22 pi93graf user.info RuuviCollector[361]: SEVERE: Batch could not be sent. Data will be lost
> 2019-05-03 00:11:22 pi93graf user.info RuuviCollector[361]: org.influxdb.InfluxDBIOException: java.net.SocketTimeoutException: timeout
> 2019-05-03 00:11:22 pi93graf user.info RuuviCollector[361]: #011at org.influxdb.impl.InfluxDBImpl.execute(InfluxDBImpl.java:514)
> 2019-05-03 00:11:22 pi93graf user.info RuuviCollector[361]: #011at org.influxdb.impl.InfluxDBImpl.write(InfluxDBImpl.java:312)
> ...
> 2019-05-03 00:11:22 pi93graf user.info RuuviCollector[361]: #011at java.lang.Thread.run(Thread.java:745)
> 2019-05-03 00:11:23 pi93graf user.info RuuviCollector[361]: Caused by: java.net.SocketTimeoutException: timeout
> 20
> 
Scrin commented 5 years ago

That's actually on "severe" level, not info, and it originates from the InfluxDB library itself, which does its own default error handling and logging right now.
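
If the goal is only to keep that library-level message (and its stack trace) out of syslog, one option is to adjust the logger the client uses. A hedged sketch, assuming the influxdb-java client logs through java.util.logging under its class name (the SEVERE prefix suggests it does); the logger name below is an assumption based on the class seen in the log output.

```java
import java.util.logging.Level;
import java.util.logging.Logger;

// Hedged sketch: silence (or re-level) the influxdb-java batch logger at startup.
public class QuietInfluxLogging {
    public static void main(String[] args) {
        Logger batchLogger = Logger.getLogger("org.influxdb.impl.BatchProcessor"); // assumed logger name
        batchLogger.setLevel(Level.OFF); // or a higher threshold, depending on preference
    }
}
```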

DG12 commented 5 years ago

Oops. syslog doesn't even have a "severe" level so it comes out at info. Not a biggie, just a nuisance. Thank you.

Have you noticed the "Batch could not be sent" ... timeout problems in your environment? Any idea why these might frequently occur just after midnight?

DG12 commented 5 years ago

This morning's 2 sets of issues:

grep SEVER user.info |cut -c11-20|column -c 80
00:11:22 00:14:58 00:20:03 00:28:03 05:12:30
00:11:39 00:16:20 00:20:18 00:28:03 05:12:40
00:12:05 00:16:36 00:20:33 00:28:04 05:13:06
00:12:21 00:16:47 00:21:42 00:28:04 05:13:17
00:12:37 00:17:00 00:21:55 00:28:05 05:13:28
00:12:54 00:17:27 00:22:06 05:10:40 05:13:58
00:13:09 00:17:38 00:22:18 05:11:22 05:14:09
00:13:28 00:17:49 00:22:47 05:11:23 05:14:19
00:13:59 00:18:03 00:23:58 05:11:33 05:14:31
00:14:10 00:18:44 00:26:34 05:11:44 05:14:41
00:14:22 00:19:04 00:28:02 05:11:55 05:17:28
00:14:32 00:19:41 00:28:02 05:12:07 05:18:03
00:14:46 00:19:47 00:28:03 05:12:18
/var/log >

Thanks again

fobin commented 5 years ago

For me it's been running just fine for the last few days.

Scrin commented 5 years ago

So some of the changes may have fixed this issue, somehow. Let's hope that's the case. I'll close this for now and reopen it if the issue appears again.

anon8675309 commented 2 years ago

Wow, reading this thread has made me realize that I'm not crazy; this has happened to someone before! I am sad to report that it's happening to me in the 0.2.7 release. Right down to the "requiring a restart", and then once that was fixed I got an InfluxDB error.

Out of Bounds Read

This is the fix that I used to deal with the out of bounds issue in this thread:

--- RuuviCollector-0.2.7/src/main/java/fi/tkgwf/ruuvi/utils/Utils.java 2021-07-26 10:59:55.000000000 -0500
+++ RuuviCollector-0.2.7-modified/src/main/java/fi/tkgwf/ruuvi/utils/Utils.java        2022-07-22 21:49:48.767713994 -0500
@@ -61,6 +61,9 @@
         }

         String[] terms = line.split (" ");
+        if (terms.length <= 13) {
+            return null;
+        }
         StringBuilder sb = new StringBuilder();
         for (int i = 13; i >= 8; i--) {
             sb.append(terms[i]);

I can make a proper pull request when I get time, or feel free to just copy in those three lines if you'd like.

Also, to speak to the root cause of this issue, this problem arose as soon as I got a new housemate. My suspicion is that they have some device (likely Apple) that is broadcasting some funky Bluetooth beacons that are messing things up. Looking at the code, I'm afraid I don't have a test case to contribute, which is a shame because I'm not sure what line of data could possibly cause this issue. At any rate, the patch above fixes it.
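
A regression test could still be derived from this thread itself: the "Offending line" in the earlier log (`> 04 0E 05 01 0F 20 00 80`) is exactly the kind of short line that trips the old parser. Below is a hedged JUnit sketch, assuming the project uses JUnit 4 and that `Utils.getMacFromLine` is static, takes the raw hcidump line as a `String`, and is expected to return null for lines too short to contain a MAC.

```java
import static org.junit.Assert.assertNull;

import fi.tkgwf.ruuvi.utils.Utils;
import org.junit.Test;

// Hedged regression-test sketch; the offending line is taken verbatim from the log earlier in this thread.
public class GetMacFromLineTest {

    @Test
    public void tooShortLineReturnsNullInsteadOfThrowing() {
        assertNull(Utils.getMacFromLine("> 04 0E 05 01 0F 20 00 80"));
    }
}
```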

Database Issue

In my setup, the database is not on the Raspberry Pi; it's on a dedicated VM. At midnight, the logs roll over for all VMs at the same time, which spikes the I/O and causes some delays (an issue I've been meaning to address but just haven't gotten to yet). This is what caused the database to be temporarily unavailable.

Here's what that looks like in the logs:

Jul 23 00:01:22 ruuvi1.example.com java[1286]: Jul 23, 2022 12:01:22 AM org.influxdb.impl.BatchProcessor write
Jul 23 00:01:22 ruuvi1.example.com java[1286]: SEVERE: Batch could not be sent. Data will be lost
Jul 23 00:01:22 ruuvi1.example.com java[1286]: org.influxdb.InfluxDBIOException: java.net.SocketTimeoutException: timeout
Jul 23 00:01:22 ruuvi1.example.com java[1286]:         at org.influxdb.impl.InfluxDBImpl.execute(InfluxDBImpl.java:800)
Jul 23 00:01:22 ruuvi1.example.com java[1286]:         at org.influxdb.impl.InfluxDBImpl.write(InfluxDBImpl.java:455)
Jul 23 00:01:22 ruuvi1.example.com java[1286]:         at org.influxdb.impl.OneShotBatchWriter.write(OneShotBatchWriter.java:22)
Jul 23 00:01:22 ruuvi1.example.com java[1286]:         at org.influxdb.impl.BatchProcessor.write(BatchProcessor.java:321)
Jul 23 00:01:22 ruuvi1.example.com java[1286]:         at org.influxdb.impl.BatchProcessor$1.run(BatchProcessor.java:269)
Jul 23 00:01:22 ruuvi1.example.com java[1286]:         at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
Jul 23 00:01:22 ruuvi1.example.com java[1286]:         at java.util.concurrent.FutureTask.run(FutureTask.java:266)
Jul 23 00:01:22 ruuvi1.example.com java[1286]:         at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
Jul 23 00:01:22 ruuvi1.example.com java[1286]:         at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
Jul 23 00:01:22 ruuvi1.example.com java[1286]:         at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
Jul 23 00:01:22 ruuvi1.example.com java[1286]:         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
Jul 23 00:01:22 ruuvi1.example.com java[1286]:         at java.lang.Thread.run(Thread.java:745)
Jul 23 00:01:22 ruuvi1.example.com java[1286]: Caused by: java.net.SocketTimeoutException: timeout

This is appropriate, really. The problem is that collection then ceases for all time. RuuviCollector doesn't exit, nor does it attempt to reconnect to the database. I found issue #8 that was opened about this problem. I see the decision was that it should do an unclean exit, which sounds great. The issue was closed, but I don't see a merge request, so I can't find the patch that fixed that issue to see if it made it into a release.

So I tried finding the code myself. Since I'm not familiar with the code base, I'll document what I found so it is easier to see where I got off track. It looks like Main.java:106 calls PersistenceService::store. When using InfluxDB, this results in a call to InfluxDB::write, which puts this in the queue for the batchProcessor.

Coming from the other side, I can see that the write method is what is catching the exception. This means it's effectively being swallowed, as this is a void method and nothing is re-thrown. This is arguably a design flaw on InfluxDB's part, as the caller should have some way of at least detecting failed writes, if not being forced to deal with them (e.g., via an exception). However, I'm more interested in fixing our immediate problem than trying to get the library changed.

To work around this immediately, it appears we can set batchEnabled to false, which will cause a write failure to raise an exception. That exception will likely get caught by the inner catch in our Main function, which will just spam the logs but never exit uncleanly, but I'll wait until I see that error before I add any additional error handling to run().
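
As a sketch of the two approaches mentioned above, assuming a reasonably recent influxdb-java client: with batching left disabled, `InfluxDB.write()` throws on failure so the caller sees it; alternatively, batching can stay enabled with an exception handler registered so failed batches are at least logged by our own code. The host, credentials, and thresholds below are made up.

```java
import org.influxdb.BatchOptions;
import org.influxdb.InfluxDB;
import org.influxdb.InfluxDBFactory;

// Hedged sketch of batch-failure handling options with influxdb-java; not RuuviCollector's actual code.
public class BatchFailureHandling {
    public static void main(String[] args) {
        InfluxDB influxDB = InfluxDBFactory.connect("http://localhost:8086", "user", "pass");

        // Option A: don't call enableBatch() at all; each write() then throws on failure,
        // so the caller (e.g. Main.run) can decide to log, retry, or exit.

        // Option B: keep batching, but register an exception handler so a failed
        // batch is not silently swallowed by the BatchProcessor.
        influxDB.enableBatch(BatchOptions.DEFAULTS
                .actions(100)          // flush after 100 points
                .flushDuration(1000)   // or after 1 second
                .exceptionHandler((points, throwable) ->
                        System.err.println("Batch write to InfluxDB failed: " + throwable)));
    }
}
```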

Seeing as how I'm now way off topic for this thread, I'll most certainly open up a new issue in the event that I want to change the error handling to bail at the first write failure.

anon8675309 commented 2 years ago

I saw my first database timeout and can confirm that the data does not stop coming into the database (at least, it didn't this time).

Here's what it looks like in the logs:

Jul 23 14:08:28 ruuvi1.example.com java[8433]: 2022-07-23 14:08:28.152 INFO  [Main] Successfully reading data from hcidump                                                                                                    
Jul 23 17:43:30 ruuvi1.example.com java[8433]: 2022-07-23 17:43:30.744 WARN  [Main] Uncaught exception while handling measurements from MAC address "C681BCFCC323", if this repeats and this is not a Ruuvitag, try blacklisting it                                                                                                           
Jul 23 17:43:30 ruuvi1.example.com java[8433]: org.influxdb.InfluxDBIOException: java.net.SocketTimeoutException: timeout                                                                                                     
Jul 23 17:43:30 ruuvi1.example.com java[8433]:         at org.influxdb.impl.InfluxDBImpl.execute(InfluxDBImpl.java:800)                                                                                                       
Jul 23 17:43:30 ruuvi1.example.com java[8433]:         at org.influxdb.impl.InfluxDBImpl.write(InfluxDBImpl.java:455)                                                                                                         
Jul 23 17:43:30 ruuvi1.example.com java[8433]:         at org.influxdb.impl.InfluxDBImpl.write(InfluxDBImpl.java:430)                                                                                                         
Jul 23 17:43:30 ruuvi1.example.com java[8433]:         at org.influxdb.impl.InfluxDBImpl.write(InfluxDBImpl.java:408)                                                                                                         
Jul 23 17:43:30 ruuvi1.example.com java[8433]:         at fi.tkgwf.ruuvi.db.InfluxDBConnection.save(InfluxDBConnection.java:56)                                                                                               
Jul 23 17:43:30 ruuvi1.example.com java[8433]:         at java.util.Optional.ifPresent(Optional.java:159)     
Jul 23 17:43:30 ruuvi1.example.com java[8433]:         at fi.tkgwf.ruuvi.service.PersistenceService.store(PersistenceService.java:33)                                                                                         
Jul 23 17:43:30 ruuvi1.example.com java[8433]:         at java.util.Optional.ifPresent(Optional.java:159)     
Jul 23 17:43:30 ruuvi1.example.com java[8433]:         at fi.tkgwf.ruuvi.Main.run(Main.java:107)              
Jul 23 17:43:30 ruuvi1.example.com java[8433]:         at fi.tkgwf.ruuvi.Main.run(Main.java:70)               
Jul 23 17:43:30 ruuvi1.example.com java[8433]:         at fi.tkgwf.ruuvi.Main.main(Main.java:30)              
Jul 23 17:43:30 ruuvi1.example.com java[8433]: Caused by: java.net.SocketTimeoutException: timeout            
Jul 23 17:43:30 ruuvi1.example.com java[8433]:         at okio.Okio$4.newTimeoutException(Okio.java:232)      
Jul 23 17:43:30 ruuvi1.example.com java[8433]:         at okio.AsyncTimeout.exit(AsyncTimeout.java:285)       
Jul 23 17:43:30 ruuvi1.example.com java[8433]:         at okio.AsyncTimeout$2.read(AsyncTimeout.java:241)     
Jul 23 17:43:30 ruuvi1.example.com java[8433]:         at okio.RealBufferedSource.indexOf(RealBufferedSource.java:355)