Try to replicate InfluxDB sync issue

RaiBnod commented 1 year ago

https://dash.nube-iiot.com/precise-air/d/PmXrWQDaa/sap-heartbeats?orgId=1&from=1675953722225&to=1676160086990

https://user-images.githubusercontent.com/6800775/223656813-fbda68d9-bd93-48c6-abcd-c16b145d79cd.mov

The InfluxDB data for a certain period of time is blank, even though the device and point-server have been up for more than 3 months, generating history at 15-minute intervals.

This could possibly happen when:

The device is offline for more than 25 hrs, so 1 hr of data gets deleted, leaving a blank history for that period, or
The device or point-server gets turned off

But there is no sign that the device went offline during that time period. Also, it has been up for more than 3 months.

So, try to replicate this case with InfluxDB going down while the values are being written, or try to find other possible causes.
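One way to start replicating this is a small offline simulation. The sketch below is hypothetical and does not claim to show how point-server actually syncs: `FlakyWriter` is a made-up stand-in for an InfluxDB client that is unavailable during some ticks, and `run_sync` is a made-up sync loop. It only illustrates that a gap appears when pending rows are dropped after a failed write, and does not appear when they are kept for the next attempt.

```python
class FlakyWriter:
    """Hypothetical stand-in for an InfluxDB client; fails while 'down'."""

    def __init__(self, down_ticks):
        self.down_ticks = set(down_ticks)
        self.stored = []

    def write(self, tick, rows):
        if tick in self.down_ticks:
            raise ConnectionError("InfluxDB unavailable")
        self.stored.extend(rows)


def run_sync(writer, ticks, keep_on_failure):
    """Produce one history row per tick and try to sync all pending rows."""
    pending = []
    for tick in ticks:
        pending.append(tick)
        try:
            writer.write(tick, list(pending))
        except ConnectionError:
            if not keep_on_failure:
                pending.clear()  # rows dropped although the write failed
            continue
        pending.clear()  # rows dropped only after a confirmed write
    return writer.stored


# InfluxDB is "down" during ticks 3 and 4 (made-up values).
lossy = run_sync(FlakyWriter({3, 4}), range(8), keep_on_failure=False)
safe = run_sync(FlakyWriter({3, 4}), range(8), keep_on_failure=True)
print(lossy)  # [0, 1, 2, 5, 6, 7]
print(safe)   # [0, 1, 2, 3, 4, 5, 6, 7]
```

If the real sync behaves like the `keep_on_failure=False` branch anywhere, restarts of InfluxDB during a write would be enough to leave gaps without the device ever going offline.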

RaiBnod commented 1 year ago

The device has point-server version v2.0.7, where 200 rows of history get stored for each point before they are cleaned up. At 15-minute intervals, 200 rows mean 15*200/60 = 50 hours. So, this hints that the device's internet connection was also fine.
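The same arithmetic as a small helper, for checking other history intervals. The function name is made up for illustration; the 200-row limit and the 15-minute interval are the values mentioned above.

```python
def buffer_window_hours(rows_per_point: int, interval_minutes: float) -> float:
    """Hours of history covered by a per-point buffer with a fixed row count."""
    return rows_per_point * interval_minutes / 60


print(buffer_window_hours(200, 15))  # 50.0
```

Note that the window shrinks for points that write history more often: with the same 200 rows, a point at a 5-minute interval would only cover about 16.7 hours.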

A couple of crucial points on this are:

The InfluxDB server gets restarted at random periods of time
Different points with different frequencies of history writes have gaps at the same time.

1. InfluxDB server logs:

root@ubuntu-s-1vcpu-1gb-sgp1-01:~# cat syslog.7 |grep restart
Mar  3 07:00:16 ubuntu-s-1vcpu-1gb-sgp1-01 systemd[1]: influxdb.service: Service hold-off time over, scheduling restart.
Mar  3 07:00:16 ubuntu-s-1vcpu-1gb-sgp1-01 systemd[1]: influxdb.service: Scheduled restart job, restart counter is at 6175.
Mar  3 07:12:24 ubuntu-s-1vcpu-1gb-sgp1-01 systemd[1]: influxdb.service: Service hold-off time over, scheduling restart.
Mar  3 07:12:24 ubuntu-s-1vcpu-1gb-sgp1-01 systemd[1]: influxdb.service: Scheduled restart job, restart counter is at 6176.
Mar  3 08:00:20 ubuntu-s-1vcpu-1gb-sgp1-01 systemd[1]: influxdb.service: Service hold-off time over, scheduling restart.
Mar  3 08:00:20 ubuntu-s-1vcpu-1gb-sgp1-01 systemd[1]: influxdb.service: Scheduled restart job, restart counter is at 6177.
Mar  3 08:12:25 ubuntu-s-1vcpu-1gb-sgp1-01 systemd[1]: influxdb.service: Service hold-off time over, scheduling restart.
Mar  3 08:12:25 ubuntu-s-1vcpu-1gb-sgp1-01 systemd[1]: influxdb.service: Scheduled restart job, restart counter is at 6178.
Mar  3 09:00:18 ubuntu-s-1vcpu-1gb-sgp1-01 systemd[1]: influxdb.service: Service hold-off time over, scheduling restart.
Mar  3 09:00:18 ubuntu-s-1vcpu-1gb-sgp1-01 systemd[1]: influxdb.service: Scheduled restart job, restart counter is at 6179.
Mar  3 09:12:24 ubuntu-s-1vcpu-1gb-sgp1-01 systemd[1]: influxdb.service: Service hold-off time over, scheduling restart.
Mar  3 09:12:24 ubuntu-s-1vcpu-1gb-sgp1-01 systemd[1]: influxdb.service: Scheduled restart job, restart counter is at 6180.
Mar  3 10:00:21 ubuntu-s-1vcpu-1gb-sgp1-01 systemd[1]: influxdb.service: Service hold-off time over, scheduling restart.
Mar  3 10:00:21 ubuntu-s-1vcpu-1gb-sgp1-01 systemd[1]: influxdb.service: Scheduled restart job, restart counter is at 6181.
Mar  3 10:12:34 ubuntu-s-1vcpu-1gb-sgp1-01 systemd[1]: influxdb.service: Service hold-off time over, scheduling restart.
Mar  3 10:12:34 ubuntu-s-1vcpu-1gb-sgp1-01 systemd[1]: influxdb.service: Scheduled restart job, restart counter is at 6182.
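To compare these restarts against the gaps on the dashboard, the restart times can be pulled out of the syslog with a short script. This is a rough sketch: the regular expression is written against the log lines above, syslog lines carry no year, so `year=2023` is an assumption, and the function name is made up.

```python
import re
from datetime import datetime

# Matches only the "Scheduled restart job" lines for influxdb.service.
RESTART_RE = re.compile(
    r"^(?P<ts>\w{3}\s+\d+\s+\d{2}:\d{2}:\d{2})\s+\S+\s+systemd\[\d+\]:\s+"
    r"influxdb\.service: Scheduled restart job, "
    r"restart counter is at (?P<count>\d+)\."
)


def parse_restarts(lines, year=2023):
    """Return (timestamp, restart_counter) for each scheduled restart."""
    restarts = []
    for line in lines:
        match = RESTART_RE.match(line)
        if match:
            ts = datetime.strptime(
                f"{year} {match.group('ts')}", "%Y %b %d %H:%M:%S"
            )
            restarts.append((ts, int(match.group("count"))))
    return restarts


SAMPLE = """\
Mar  3 07:00:16 ubuntu-s-1vcpu-1gb-sgp1-01 systemd[1]: influxdb.service: Service hold-off time over, scheduling restart.
Mar  3 07:00:16 ubuntu-s-1vcpu-1gb-sgp1-01 systemd[1]: influxdb.service: Scheduled restart job, restart counter is at 6175.
Mar  3 07:12:24 ubuntu-s-1vcpu-1gb-sgp1-01 systemd[1]: influxdb.service: Scheduled restart job, restart counter is at 6176.
Mar  3 08:00:20 ubuntu-s-1vcpu-1gb-sgp1-01 systemd[1]: influxdb.service: Scheduled restart job, restart counter is at 6177.
""".splitlines()

restarts = parse_restarts(SAMPLE)
# Seconds between consecutive restarts.
intervals = [
    (b[0] - a[0]).total_seconds() for a, b in zip(restarts, restarts[1:])
]
print(intervals)  # [728.0, 2876.0]
```

In the log above the restarts land at roughly :00 and :12 of every hour, which is worth checking against the start of each gap.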

2. Points stored at different frequencies are getting gaps at the same time:

https://user-images.githubusercontent.com/6800775/224295067-7e813136-c0c6-4948-a2f9-fb72a6a0a4a0.mov
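The "same time" observation can also be checked numerically instead of by eye. The sketch below assumes the timestamps of each point have already been exported from InfluxDB into Python lists; the helper names and the sample data are made up for illustration.

```python
from datetime import datetime, timedelta


def find_gaps(timestamps, interval, tolerance=1.5):
    """Return (start, end) pairs where consecutive samples are further
    apart than tolerance * interval."""
    ordered = sorted(timestamps)
    limit = interval * tolerance
    return [(a, b) for a, b in zip(ordered, ordered[1:]) if b - a > limit]


def overlap(gap_a, gap_b):
    """Return the shared part of two gaps, or None if they do not overlap."""
    start = max(gap_a[0], gap_b[0])
    end = min(gap_a[1], gap_b[1])
    return (start, end) if start < end else None


# Synthetic data: a 15-minute point and a 5-minute point, both missing
# everything strictly between 08:00 and 09:00.
base = datetime(2023, 3, 3, 7, 0)
hole = (datetime(2023, 3, 3, 8, 0), datetime(2023, 3, 3, 9, 0))


def series(step_minutes, count):
    points = [base + timedelta(minutes=step_minutes * i) for i in range(count)]
    return [t for t in points if not (hole[0] < t < hole[1])]


slow_gaps = find_gaps(series(15, 17), timedelta(minutes=15))
fast_gaps = find_gaps(series(5, 49), timedelta(minutes=5))
print(overlap(slow_gaps[0], fast_gaps[0]))
```

Gaps that overlap across points with unrelated write frequencies point towards a shared cause on the server side rather than a per-point problem on the device.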

RaiBnod commented 1 year ago

One more example to test:

https://github.com/NubeIO/rubix-point-server/assets/6800775/44fd9800-36ee-4431-8f48-e0f85f97a97e

data.txt [rename it to data.db]

RaiBnod commented 1 year ago

From ChatGPT:

In influxdb 1.x, what happens if it gets restarted just after we insert the record successfully? It's not persisting the value.

In InfluxDB 1.x, a record that was acknowledged as written should normally still be there after the server restarts, because InfluxDB uses a write-ahead log (WAL) for data ingestion. If data appears to be lost around a restart, the more likely explanation is that the write never succeeded in the first place, for example because it was sent while the server was down or still starting up.

Here's how the process generally works:

When you insert data into InfluxDB, the points are first appended to the Write-Ahead Log (WAL), an append-only file on disk, and added to an in-memory cache. With the default wal-fsync-delay of 0s, the WAL is fsynced on every write before success is returned.
Periodically, or when the cache reaches a certain size, InfluxDB snapshots the cache into TSM files, which are the main database files. Compaction later merges and optimizes those TSM files.
If the server restarts before the cache has been snapshotted, the WAL files are read back at startup to rebuild the cache, so acknowledged writes should still be available after the restart. Writes sent while the server is unavailable fail, and they are lost unless the client retries them.

To minimize data loss in such scenarios, you can configure the following settings in the influxdb.conf configuration file:

wal-fsync-delay: This setting controls how long a write waits before the WAL is fsynced to disk. The default of 0s fsyncs on every write; a larger value batches fsync calls, which can help on slow disks but delays when data reaches the disk.
wal-dir: You can configure the path to the directory where the Write-Ahead Log is stored. Ensure that this directory is on a reliable storage medium to minimize the risk of data loss.
cache-snapshot-write-cold-duration: This setting controls how long a shard must go without receiving writes before the engine snapshots its cache to a new TSM file. It affects when cached data is moved into TSM files, not whether a write reaches the WAL.
commit-timeout: As far as the documented options go, this is not one of the [data] settings in InfluxDB OSS 1.x and it does not control WAL flushing, so it is unlikely to help here.

Please note that even with these settings, there is always some level of risk of data loss during a restart or crash scenario. Therefore, it is crucial to have appropriate backup and high availability strategies in place to safeguard against data loss in critical environments. In InfluxDB 2.x, there are improvements and changes in the data storage mechanism, and it is recommended to consider using the latest version if possible.
