EarthScope / ringserver

Apache License 2.0
30 stars 16 forks source link

Clean data served with a lot of gaps #24

Closed jschaeff closed 4 years ago

jschaeff commented 4 years ago

We have a ringserver version: 2018.078 running in MSeedScan mode:

MSeedScan /home/sysop/work/out StateFile=/home/sysop/scan.state InitCurrentState=y

The directory scanned by the ringserver is actual realtime data and looks clean. For instance:

$ msi -tg work/out/2020/FR/ILLK/HHZ.D/FR.ILLK.00.HHZ.D.2020.065 
   Source                Start sample             End sample        Gap  Hz  Samples
FR_ILLK_00_HHZ    2020,065,00:00:01.755000 2020,065,15:10:36.420000  ==  200 10926934
Total: 1 trace(s) with 1 segment(s)

We observe a lot of gaps when gathering data form rtserve.resif.fr For instance:

$ slinktool -S "FR_ILLK:HHZ" -SDS /tmp rtserve.resif.fr:18000

then

$ msi -tg /tmp/2020/FR/ILLK/HHZ.D/FR.ILLK.00.HHZ.D.2020.065 |grep -v -
   Source                Start sample             End sample        Gap  Hz  Samples
FR_ILLK_00_HHZ    2020,065,00:00:01.755000 2020,065,01:11:02.205000  ==  200 852091
FR_ILLK_00_HHZ    2020,065,04:44:45.140000 2020,065,05:09:31.510000 1023 200 297275
FR_ILLK_00_HHZ    2020,065,06:26:00.510000 2020,065,06:31:34.900000 6.585 200 66879
FR_ILLK_00_HHZ    2020,065,09:15:58.205001 2020,065,09:22:22.420001 384.5 200 76844
FR_ILLK_00_HHZ    2020,065,09:42:18.045000 2020,065,10:05:51.135000 36.51 200 282619
FR_ILLK_00_HHZ    2020,065,11:04:38.240000 2020,065,11:04:39.795000 619.5 200 312
FR_ILLK_00_HHZ    2020,065,11:14:27.850000 2020,065,11:53:56.905001 588.1 200 473812
FR_ILLK_00_HHZ    2020,065,12:38:02.060001 2020,065,12:38:03.500001 191.2 200 289
FR_ILLK_00_HHZ    2020,065,12:46:57.960000 2020,065,12:48:11.985000 534.5 200 14806

Doing a lot of tests from several clients, we ruled out potential client problems. They all gather data with gaps, although the gaps are not the same. We also run a slinktool on localhost:18000 to rule out network issues. We also got gaps.

Restarting the ringserver and deleting the scan.state file does not change anything.

We found out that the packetbuf itself contains a lot of gaps, but it does not fit with the gaps collected by the clients;

$ msi -M FR_ILLK_00_HHZ -tg ~/ring/packetbuf 
<snip/>
FR_ILLK_00_HHZ    2020,065,14:51:31.885000 2020,065,14:51:33.555000 35.98 200 335
FR_ILLK_00_HHZ    2020,065,14:51:36.845000 2020,065,14:51:38.595000 3.29 200 351
FR_ILLK_00_HHZ    2020,065,14:52:41.130000 2020,065,14:52:42.930000 62.53 200 361
FR_ILLK_00_HHZ    2020,065,14:52:57.175000 2020,065,14:53:00.740000 14.24 200 714
FR_ILLK_00_HHZ    2020,065,14:53:54.709999 2020,065,14:53:56.699999 53.97 200 399
FR_ILLK_00_HHZ    2020,065,14:54:07.995000 2020,065,14:54:09.910000 11.3 200 384
FR_ILLK_00_HHZ    2020,065,14:55:19.374900 2020,065,14:55:19.374900 69.46 200 0
FR_ILLK_00_HHZ    2020,065,14:56:09.700000 2020,065,14:56:11.510000 50.33 200 363
FR_ILLK_00_HHZ    2020,065,14:56:54.209999 2020,065,14:56:55.994999 42.7 200 358
FR_ILLK_00_HHZ    2020,065,14:57:07.975000 2020,065,14:57:09.740000 11.98 200 354
FR_ILLK_00_HHZ    2020,065,14:57:55.950000 2020,065,14:57:57.875000 46.21 200 386
FR_ILLK_00_HHZ    2020,065,14:58:18.515000 2020,065,14:58:20.405000 20.64 200 379
FR_ILLK_00_HHZ    2020,065,14:59:05.150000 2020,065,14:59:06.875000 44.74 200 346
FR_ILLK_00_HHZ    2020,065,14:59:30.760000 2020,065,14:59:32.435000 23.89 200 336
FR_ILLK_00_HHZ    2020,065,15:00:05.790000 2020,065,15:00:07.680000 33.35 200 379
FR_ILLK_00_HHZ    2020,065,15:00:15.270000 2020,065,15:00:17.210000 7.59 200 389
Total: 1 trace(s) with 906 segment(s)

(complete output file attached)

Our ringbuffer size is 8GB. The age of the oldest data in the packetbuf seems to tell that it's enough.

Do you have any idea what's going wrong ? (or are the gaps in the packetbuffer normal, in which case, the problem is elsewhere ?)

chad-earthscope commented 4 years ago

The packet buffer is not plain miniSEED, and so cannot be scanned with msi or other such tools.

Does it make any difference if the client is local, on the host running ringserver?

No errors in the log file? Even with increased verbosity?

chad-earthscope commented 4 years ago

I see you tried the local connection. Hmm, not good.

No errors in the log at all?

I would suspect that something is going wrong with scanning of the data, but it's hard to tell at this point.

chad-earthscope commented 4 years ago

Note that the combination of removing the state file and InitCurrentState=y option, will cause the scanning to skip any old data on a restart of ringserver.

jschaeff commented 4 years ago

Thank you for your quick answer. I'm going to run a ringserver with -vv, on port 18002, connect a local client to it and look at the logs.

Note that we run 2 instances of ringserver (one on 18000 for public data and one on 18001 for restricted data).

I can't tell since when we have this issue, but we had some similar problems some month ago and could'nt fix it. Now I'm investigating for real :)

I'll report here when I get something new.

jschaeff commented 4 years ago

Hi Chad, could it be that our packetbuf size is too small (or too big) ? We have it at 8GB, and we scan a directory of 191GB of SDS data. almost 20000 files.

I've started a ringserver only scanning one station, and it ships clean data, no gaps detected. Over the WE, I'll let it run and report back here.

chad-earthscope commented 4 years ago

could it be that our packetbuf size is too small (or too big) ? We have it at 8GB, and we scan a directory of 191GB of SDS data. almost 20000 files.

Ring buffer size should only be a problem if it is scanning more than the buffer size in a short period of time. In your case, the buffer will only have the last 8GB of data scanned. Are you adding more than 8GB per day (or shorter period) to the SDS repository you're scanning?

Note, you cannot use a ring buffer larger than 8GB and still export data via SeedLink with the default packet size of 512 bytes. This is due to a combination of how ringserver maps SeedLink packet IDs and the maximum SeedLink packet ID of 16,777,215.

jschaeff commented 4 years ago

Good morning Chad, As for now, we ingest approx 2GB/hour and keep the data 7 days.

We tested today by collecting the realtime data from all sources in hourly rotated files, and cleaning all h-2 files. It seems to be much better. I'm going to migrate our production configuration tomorrow and hope to fix the issue.

Cheers,

jonathan

chad-earthscope commented 4 years ago

Good morning. OK, that is a relatively large flow of data, and, I am presuming, a lot of files. There is only one scanning thread per scan config option. It may be that the internal scanner cannot keep up with the combination of many, many files and data rate.

To see how long an internal scanner takes to make a recursive pass through the specified directory you can see the "Scan time" (in seconds) via the server status page at: http://localhost:18000/status

By default you may not be able to see that page from a remote host and you'll need to add trusted addresses to allow other hosts to access that page.

If the scan time is very large, many minutes, then it's probably a good idea to break up the scanning or reduce the data being scanned (as you have done).

Alternative to removing the older data as you've done, one other suggestion:

Run multiple MSeedScan options to parallelize the scanning. If you have top level directories that spilt the data set somehow, i.e. between networks, this can be used to divide the scanning between multiple threads. Be careful to use different state files for each thread. For example:

MSeedScan /data/<NET1>/  StateFile=scan.NET1.state  InitCurrentState=y
MSeedScan /data/<NET2>/  StateFile=scan.NET2.state  InitCurrentState=y
jschaeff commented 4 years ago

I enabled HTTP protocol, I didn't knew about the status page, it's very helpful, thank you ! (Now, I've read all your documentation :) ). Now, with a scan process per network, the scan time is very short. Sometimes it takes 1s, but on average it's around the millisec.

Here is how it looks like:

ringserver/2018.078
Organization: RESIF Ring Server
Server start time (UTC): 2020-03-10 14:25:20
Ring version: 1
Ring size: 8589934592
Packet size: 512
Max packet ID: 16777215
Max packets: 13591662
Memory mapped ring: TRUE
Volatile ring: FALSE
Total connections: 194
Total streams: 1931
TX packet rate: 2529.1
TX byte rate: 1294879.4
RX packet rate: 0.0
RX byte rate: 0.0
Earliest packet: 3009696
  Create: 2020-03-10 08:37:25.014504  Data start: 2020-01-24 04:34:59.910000  Data end: 2020-01-24 04:35:04.020000
Latest packet: 16601357
  Create: 2020-03-10 14:44:57.815431  Data start: 2020-03-10 14:44:53.243130  Data end: 2020-03-10 14:44:56.683130

Server threads:
  DataLink SeedLink HTTP, Port: 18000
  Mini-SEED Scanner
    Directory: /home/sysop/work/out/2020/3C
    Max recursion: -1
    State file: /home/sysop/work/scan.3C.state
    Match: 
    Reject: 
    Scan time: 1.00048
    Packet rate: 0
    Byte rate: 0
  Mini-SEED Scanner
    Directory: /home/sysop/work/out/2020/8C
    Max recursion: -1
    State file: /home/sysop/work/scan.8C.state
    Match: 
    Reject: 
    Scan time: 1.00025
    Packet rate: 0
    Byte rate: 0
  Mini-SEED Scanner
    Directory: /home/sysop/work/out/2020/CL
    Max recursion: -1
    State file: /home/sysop/work/scan.CL.state
    Match: 
    Reject: 
    Scan time: 0.001971
    Packet rate: 2536.78
    Byte rate: 1.29883e+06
  Mini-SEED Scanner
    Directory: /home/sysop/work/out/2020/FR
    Max recursion: -1
    State file: /home/sysop/work/scan.FR.state
    Match: 
    Reject: 
    Scan time: 0.014774
    Packet rate: 270.746
    Byte rate: 138622
  Mini-SEED Scanner
    Directory: /home/sysop/work/out/2020/G
    Max recursion: -1
    State file: /home/sysop/work/scan.G.state
    Match: 
    Reject: 
    Scan time: 0.004674
    Packet rate: 2139.5
    Byte rate: 1.09542e+06
  Mini-SEED Scanner
    Directory: /home/sysop/work/out/2020/GL
    Max recursion: -1
    State file: /home/sysop/work/scan.GL.state
    Match: 
    Reject: 
    Scan time: 0.001474
    Packet rate: 6784.26
    Byte rate: 3.47354e+06
  Mini-SEED Scanner
    Directory: /home/sysop/work/out/2020/MQ
    Max recursion: -1
    State file: /home/sysop/work/scan.MQ.state
    Match: 
    Reject: 
    Scan time: 0.001219
    Packet rate: 7383.1
    Byte rate: 3.78015e+06
  Mini-SEED Scanner
    Directory: /home/sysop/work/out/2020/MT
    Max recursion: -1
    State file: /home/sysop/work/scan.MT.state
    Match: 
    Reject: 
    Scan time: 0.001169
    Packet rate: 9409.75
    Byte rate: 4.81779e+06
  Mini-SEED Scanner
    Directory: /home/sysop/work/out/2020/ND
    Max recursion: -1
    State file: /home/sysop/work/scan.ND.state
    Match: 
    Reject: 
    Scan time: 0.001417
    Packet rate: 4234.3
    Byte rate: 2.16796e+06
  Mini-SEED Scanner
    Directory: /home/sysop/work/out/2020/PF
    Max recursion: -1
    State file: /home/sysop/work/scan.PF.state
    Match: 
    Reject: 
    Scan time: 0.005829
    Packet rate: 9950.25
    Byte rate: 5.09453e+06
  Mini-SEED Scanner
    Directory: /home/sysop/work/out/2020/RA
    Max recursion: -1
    State file: /home/sysop/work/scan.RA.state
    Match: 
    Reject: 
    Scan time: 0.039066
    Packet rate: 13464.4
    Byte rate: 6.89377e+06
  Mini-SEED Scanner
    Directory: /home/sysop/work/out/2020/RD
    Max recursion: -1
    State file: /home/sysop/work/scan.RD.state
    Match: 
    Reject: 
    Scan time: 1.00052
    Packet rate: 0
    Byte rate: 0
  Mini-SEED Scanner
    Directory: /home/sysop/work/out/2020/WI
    Max recursion: -1
    State file: /home/sysop/work/scan.WI.state
    Match: 
    Reject: 
    Scan time: 0.002687
    Packet rate: 3349.46
    Byte rate: 1.71492e+06
jschaeff commented 4 years ago

I've run tests with multiple scan processes. It seems much better. I still have ONE messy station, in the FR network and can't get why we stream gapped data (FR_ILLK). For the records, I keep now 2 days of data, but as I understand it, a large retention should not bother the scanners. Am I right ? This is our realtime data storage size in MB:

1188    3C
304 8C
904 CL
13143   FR
1045    G
781 GL
576 MQ
1064    MT
2796    ND
1933    PF
8243    RA
168 RD
976 WI

And the number of files:

3C: 8
8C: 3
CL: 20
FR: 135
G: 33
GL: 17
MQ: 12
MT: 7
ND: 6
PF: 40
RA: 91
RD: 9
WI: 11
chad-earthscope commented 4 years ago

For the records, I keep now 2 days of data, but as I understand it, a large retention should not bother the scanners. Am I right ?

Mostly. The scanning tracks files in two (potentially three) states:

  1. Active - modified within the last 7200 seconds (120 minutes), check each scan
  2. Idle - modified more than 7200 seconds ago, check every 60th scan

So the I/O is reduced for old data significantly, mostly avoiding a stat() call. The scanning still has to hop through the directory structures to find entries. For now, these values are hard coded in ringserver.

There is a third state called Quiet, where, after a certain number of seconds of not being modified, the file will never be checked again. This is currently turned off in ringserver, but could be make configurable if you believe it would help you.

jschaeff commented 4 years ago

Thank you for your help ! We greatly enhanced the quality of the RT data by splitting the scans. The remaining gaps might come from the chaotic station transmission itself. We have plans to simplify our realtime service based on this work.

chad-earthscope commented 4 years ago

Hi @jschaeff, Note that there is a new release (https://github.com/iris-edu/ringserver/releases/tag/v2020.075) that improves SeedLink connection resuming. In some cases clients connecting and reconnecting repeatedly may get duplicate data (maybe gaps too).