SeisComP3 / seiscomp3

SeisComP is a seismological software for data acquisition, processing, distribution and interactive analysis.
Other
111 stars 88 forks source link

Potential problem with seiscomp3 fdsnws module #163

Closed rannof closed 6 years ago

rannof commented 6 years ago

Hi, I'm not sure if its a feature or bug nor if it is a fixable issue with obspy or seiscomp3. When using obspy to download stream data from an fdsnws server spawned by seiscomp3 (seiscomp3-jakarta-2017.334.01-centos7-x86_64), the server is hanged with CLOSE_WAIT on the ports. After several reads, the server reaches its client limit and freeze. seiscomp status shows it works OK. ============== example of client side ================

from obspy.clients.fdsn import Client as fdsnclient
.
.
.
def get_data(self):
    for tr_id in tr_id_list: # loop over several trace ids
        self.flink = fdsnclient(self.fdsnserver, timeout=30)
        stream = self.flink.get_waveforms(tr_id.network, tr_id.station, tr_id.location, tr_id.channel, self.starttime, self.endtime)
    .
    .
    .
(exit python script, re-run it after a while)

==============================================

=========== output on server ============


> netstat -a | grep 8181
tcp       27      0 0.0.0.0:8181            0.0.0.0:*               LISTEN     
tcp      230      0 ws-fdsn:8181            clientip:54342 CLOSE_WAIT 
tcp      611      0 ws-fdsn:8181             clientip:58324 CLOSE_WAIT
.
.
.
(similar entries as the limit of clients on the module)

====================================

A possible workaround (not sure it really solves the problem): ============== No problem code ==========

def get_data(self):
    from obspy.clients.fdsn import Client as fdsnclient # import within the function)
    for tr_id in tr_id_list: # loop over several trace ids
        self.flink = fdsnclient(self.fdsnserver, timeout=30)
        stream = self.flink.get_waveforms(tr_id.network, tr_id.station, tr_id.location, tr_id.channel, self.starttime, self.endtime)
    del self.flink # delete the client object

=====================================

ObsPy client version, Python version and Platform (Windows, OSX, Linux ...) obspy v 1.0.2 (possibly also older and newer versions), osx, linux (windows not tested) How did you install ObsPy and Python (pip, anaconda, from source ...) pip, anaconda If possible please supply a short, self contained, correct example that demonstrates the issue. If this is a regression (used to work in an earlier version of ObsPy), please note when it used to work. Do not know. similar issue was added to obspy github page: #2078.

gempa-jabe commented 6 years ago

What record source have you configured with your fdsnws?

rannof commented 6 years ago

global.cfg:

# Name of the recordstream service implementation.
recordstream.service = combined
# Service specific parameters like a IP address or a file name to use.
recordstream.source = slink/127.0.0.1:18000;sdsarchive/[PATHTOARCHIVE]

fdsnws.cfg:

# localhost.
listenAddress = 0.0.0.0
# Number of maximum simultaneous requests.
connections = 200
recordBulkSize = 1024
gempa-jabe commented 6 years ago

You should never use a realtime stream with fdsnws. The Seedlink connection will block until data for requested channels are available. That does not play well with fdsnws. Use exclusively sdsarchive or slink with options ?timeout=1&retries=0.

rannof commented 6 years ago

I'll try that. Thanks!

gempa-jabe commented 6 years ago

I would like to ask you to post non developer questions in the forums at https://forum.seiscomp3.org next time. Thank you.

gempa-jabe commented 6 years ago

Related to obspy/obspy#2078.

rannof commented 6 years ago

Sure, I thought it is development related.

petrrr commented 6 years ago

Are there any news on this issue?

We at our datacenter observe a serious service stability problem with fdsnws dataselect. The problem appeared after the update of Seiscomp3 to version 2017.334.03. Since then the service hangs regularly, creating a disservice, requiring a restart of the service (only FDSN).

Our problem seems to be exactly the one reported here.

gempa-jabe commented 6 years ago

The answer has been posted above. In short: don't use blocking streaming sources such as Seedlink for FDSNWS.

rannof commented 6 years ago

I can confirm that after adding the retry and timeout parameters to the seedlink server settings, no hangouts were reported.

Ran Novitsky Nof (PhD)

Seismologist

Seismological Division

The Geological Survey of Israel

Web: http://ran.rnof.info/

On Mar 1, 2018 10:49 AM, "Jan Becker" notifications@github.com wrote:

The answer has been posted above. In short: don't use blocking streaming sources such as Seedlink for FDSNWS.

— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub https://github.com/SeisComP3/seiscomp3/issues/163#issuecomment-369519179, or mute the thread https://github.com/notifications/unsubscribe-auth/AKICcvLULTVFOD74_b5xjbk8rGtgIsiLks5tZ7Y2gaJpZM4SKS-v .

petrrr commented 6 years ago

I was quite sure we have no blocking streams configured and I rechecked. Most of the stuff below is commented, the active one is a normal Posix filesystem.

#recordstream.service = sdsarchive
#recordstream.source = /home/sysop/seiscomp3/var/lib/archive

# ok per verce
recordstream.service = sdsarchive
recordstream.source = /mnt/seedstore_nfs/archive

#recordstream.service = slink
#recordstream.source = //

#recordstream.service = combined
#recordstream.source = slink//;sdsarchive//mnt/seedstore_nfs/archive

#recordstream.source = sdsarchive//mnt/seedstore_nfs/archive;sdsarchive//home/sysop/seiscomp3/var/lib/archive??rtMax=1800
petrrr commented 6 years ago

Looks like a corrupt file might cause the problem, however the archive has not changed with the update and the service was resilient before.

cat /home/sysop/.seiscomp3/log/fdsnws.log 
2018/03/01 08:04:37 [debug/SDSARCHIVE] SDS request: 2016,08,31,21,54,40 2016,09,01,03,04,41 IV BDI BHE
2018/03/01 08:04:37 [debug/SDSARCHIVE] + /mnt/archive_aufs/2016/IV/BDI/BHE.D/IV.BDI..BHE.D.2016.244 (init:1)
2018/03/01 08:04:37 [debug/SDSARCHIVE] + /mnt/archive_aufs/2016/IV/BDI/BHE.D/IV.BDI..BHE.D.2016.245 (not found)
2018/03/01 08:04:37 [debug/SDSARCHIVE] SDS request: 2016,08,31,21,54,40 2016,09,01,03,04,41 IV BDI BHE
2018/03/01 08:04:37 [debug/SDSARCHIVE] + /mnt/archive_aufs/2016/IV/BDI/BHE.D/IV.BDI..BHE.D.2016.244 (init:1)
2018/03/01 08:04:37 [debug/SDSARCHIVE] + /mnt/archive_aufs/2016/IV/BDI/BHE.D/IV.BDI..BHE.D.2016.245 (not found)
2018/03/01 08:04:37 [debug/SDSARCHIVE] SDS request: 2016,08,31,21,54,40 2016,09,01,03,04,41 IV BDI BHE
2018/03/01 08:04:37 [debug/SDSARCHIVE] + /mnt/archive_aufs/2016/IV/BDI/BHE.D/IV.BDI..BHE.D.2016.244 (init:1)
2018/03/01 08:04:37 [debug/SDSARCHIVE] + /mnt/archive_aufs/2016/IV/BDI/BHE.D/IV.BDI..BHE.D.2016.245 (not found)
2018/03/01 08:04:37 [debug/SDSARCHIVE] SDS request: 2016,08,31,21,54,40 2016,09,01,03,04,41 IV BDI BHE
gempa-jabe commented 6 years ago

Please check the CPU load and if possible, upload the corrupt file for further inspection.

petrrr commented 6 years ago

Here the file:

The hypothesis is that the error handling might cause some infinite loop. But we have no clue which piece of code this corresponds to.

gempa-jabe commented 6 years ago

I have an idea. Can you check the CPU load? I suspect that fdsnws runs at 100%.

gempa-jabe commented 6 years ago

I can confirm the issue and am working on a fix.

gempa-jabe commented 6 years ago

I have committed a patch: d6d19a435c9ab04a49e728d20f0c141d0bcac851. You can apply it to the current release source (it maintains the ABI) and give it a try. I have successfully tested it with your file.

vlauciani commented 6 years ago

Dear all

I recompiled our version 2017.334.03, using the patch on ./src/trunk/libs/seiscomp3/io/recordstream/sdsarchive.cpp file; to test in production server which binary I have to substitute?

gempa-jabe commented 6 years ago

You have to substitute libseiscomp3_core.so and restart fdsnws.

vlauciani commented 6 years ago

Hi Jan

When I substitute the library libseiscomp3_core.so, running seiscomp stop and seiscomp start I receive this error message:

$ seiscomp stop
error: /home/sysop/seiscomp3/etc/init/arclinkproxy.py: No module named _System
error: /home/sysop/seiscomp3/etc/init/trunk.py: No module named _System
error: /home/sysop/seiscomp3/etc/init/arclink.py: No module named _Core
slmon is not running
shutting down slarchive
shutting down seedlink
scwfparam is not running
scvsmaglog is not running
scvsmag is not running
scvoice is not running
scsohlog is not running
screloc is not running
scqc is not running
scproclat is not running
scmag is not running
scm is not running
scimport is not running
scimex is not running
scevtlog is not running
scevent is not running
scenvelope is not running
scdb is not running
scautopick is not running
scautoloc is not running
scamp is not running
scalert is not running
ql2sc is not running
fdsnws is not running
ew2sc3 is not running
scmaster is not running
shutting down spread
$
jsaul commented 6 years ago

When I substitute the library libseiscomp3_core.so, running seiscomp stop and seiscomp start I receive this error message:

Which SC3 version are you using?

vlauciani commented 6 years ago

Which SC3 version are you using?

the 2017.334.03. Should I use the 2017.334.04?

gempa-jabe commented 6 years ago

You can check with ldd _System.so and _Core.so (in seiscomp3/lib/python/seiscomp3) if libseiscomp3_core.so can be found. Both patch releases are binary compatible, at least they are supposed to be.

You can also check the output of e.g.

seiscomp exec scsohlog --debug

or any other Python app. It should print why it fails. I have currently not concrete idea.

vlauciani commented 6 years ago

This is the output, after the patch is applied:

gempa-jabe commented 6 years ago

Are you sure that you compiled the lib for the correct OS? It looks like a binary issue with your installation and the library you copied over.

vlauciani commented 6 years ago

Are you sure that you compiled the lib for the correct OS? It looks like a binary issue with your installation and the library you copied over.

Hi Jan, thank you for you feedback, I found the compilation problem. Now seems to be ok and the library is in production; I'll test it for a few days and then I let you know.

vlauciani commented 6 years ago

Hi Jan

The dataselect service works perfectly; thank you for the patch. I think you can close the issue.

Valentino

gempa-jabe commented 6 years ago

I am glad to hear that. Thanks for the feedback.

petrrr commented 6 years ago

@gempa-jabe: We again observe regular hangs. The file is now a different one, but the behavior seems to be rather similar. It looks like some routine enters an infinite loop.

Here the output of the log:

cat /home/sysop/.seiscomp3/log/fdsnws.log
[...]
2018/03/14 08:46:39 [debug/SDSARCHIVE] SDS request: 2010,01,06,16,48,30 2010,01,06,16,54,31 IV ARVD BHE 
2018/03/14 08:46:39 [debug/SDSARCHIVE] + /mnt/archive_aufs/2010/IV/ARVD/BHE.D/IV.ARVD..BHE.D.2010.006 (init:1)
2018/03/14 08:46:39 [debug/SDSARCHIVE] SDS request: 2010,01,06,16,48,30 2010,01,06,16,54,31 IV ARVD BHE 
2018/03/14 08:46:39 [debug/SDSARCHIVE] + /mnt/archive_aufs/2010/IV/ARVD/BHE.D/IV.ARVD..BHE.D.2010.006 (init:1)
2018/03/14 08:46:39 [debug/SDSARCHIVE] SDS request: 2010,01,06,16,48,30 2010,01,06,16,54,31 IV ARVD BHE 
2018/03/14 08:46:39 [debug/SDSARCHIVE] + /mnt/archive_aufs/2010/IV/ARVD/BHE.D/IV.ARVD..BHE.D.2010.006 (init:1)
2018/03/14 08:46:39 [debug/SDSARCHIVE] SDS request: 2010,01,06,16,48,30 2010,01,06,16,54,31 IV ARVD BHE 
2018/03/14 08:46:39 [debug/SDSARCHIVE] + /mnt/archive_aufs/2010/IV/ARVD/BHE.D/IV.ARVD..BHE.D.2010.006 (init:1)
[...]

The referenced file seems to be corrupt:

sysop@eida:~$ qmerge -n /mnt/archive_aufs/2010/IV/ARVD/BHE.D/IV.ARVD..BHE.D.2010.006
partial block of 163 bytes at end of /mnt/archive_aufs/2010/IV/ARVD/BHE.D/IV.ARVD..BHE.D.2010.006
ARVD.IV.BHE. rate=20 (2010.006 00:00:00.8450 to 2010.006 00:13:24.1452) : 16066 points, 0.2 msec correction, (min,max,max_step = 0.0,0.3,0.2 msec)

Should we reopen this issue or open a new one?

gempa-jabe commented 6 years ago

I can have a look if you attach the file.

petrrr commented 6 years ago

Sure! I just waited to understand if we continue here.

Here the file, causing the new problem.

IV.ARVD..BHE.D.2010.006.zip

gempa-jabe commented 6 years ago

Works for me. Are you sure that you have applied the patch?

gempa-jabe commented 6 years ago

I closed the issue because the behaviour you described was identical to the initial issue. If you still observe hangs and if you can confirm that you applied the fix, then please reopen the issue.

petrrr commented 6 years ago

@gempa-jabe: According to internal communication of my colleague @vlauciani who took are of this the patch was applied. I'll check back with him and try to understand better and we let you know.

vlauciani commented 6 years ago

Hi @gempa-jabe

The patch was applied and the "original" hang was solved; now we have an other error that hang the service; the ~/.seiscomp3/log/fdsnws.log is completely full with lines like this:

. . .
2018/03/22 07:48:47 [error/SDSARCHIVE] file read exception: Invalid miniSEED header
2018/03/22 07:48:47 [error/SDSARCHIVE] file read exception: Invalid miniSEED header
2018/03/22 07:48:47 [error/SDSARCHIVE] file read exception: Invalid miniSEED header
2018/03/22 07:48:47 [error/SDSARCHIVE] file read exception: Invalid miniSEED header
2018/03/22 07:48:47 [error/SDSARCHIVE] file read exception: Invalid miniSEED header
2018/03/22 07:48:47 [error/SDSARCHIVE] file read exception: Invalid miniSEED header
2018/03/22 07:48:47 [error/SDSARCHIVE] file read exception: Invalid miniSEED header
2018/03/22 07:48:47 [error/SDSARCHIVE] file read exception: Invalid miniSEED header
2018/03/22 07:48:47 [error/SDSARCHIVE] file read exception: Invalid miniSEED header
. . .

I cannot understand which MSEED (or request) generate this error beacuse all logs availabe are full of these lines:

$ ll ~/.seiscomp3/log/fdsnws*
-rw-r--r-- 1 sysop sysop  21020503 Mar 22 08:00 /home/sysop/.seiscomp3/log/fdsnws.log
-rw-r--r-- 1 sysop sysop 104857620 Mar 22 07:48 /home/sysop/.seiscomp3/log/fdsnws.log.1
-rw-r--r-- 1 sysop sysop 104857620 Mar 22 07:48 /home/sysop/.seiscomp3/log/fdsnws.log.2
-rw-r--r-- 1 sysop sysop 104857620 Mar 22 07:47 /home/sysop/.seiscomp3/log/fdsnws.log.3
-rw-r--r-- 1 sysop sysop 104857620 Mar 22 07:47 /home/sysop/.seiscomp3/log/fdsnws.log.4
-rw-r--r-- 1 sysop sysop 104857620 Mar 22 07:46 /home/sysop/.seiscomp3/log/fdsnws.log.5
-rw-r--r-- 1 sysop sysop 104857620 Mar 22 07:46 /home/sysop/.seiscomp3/log/fdsnws.log.6
-rw-r--r-- 1 sysop sysop 104857620 Mar 22 07:45 /home/sysop/.seiscomp3/log/fdsnws.log.7
$

Any idea?

gempa-jabe commented 6 years ago

Any idea?

I don't have an idea which file on your side produces such an output. There should be a debug output before the error indicating the file which is currently read. Can you identify it or did you turn off debug logging?

petrrr commented 6 years ago

@gempa-jabe: Indeed as @vlauciani wrote above our problem is that we lost the information about which file exactly causes the problem, so we cannot inspect or test. And this already happened before. However, we do not loose this information because the debug level logging was turned off.

Instead the information (I think there are few lines with the request, streams, etc. in the log) is simple flushed out of the logs. The related error message ([error/SDSARCHIVE] file read exception: Invalid miniSEED header) is produced at such a high rate, that all available logfiles were filled within few minutes. Just observe the modification time of the files above. And for comparison: The problem was first detected yesterday night at 21:55:37 UTC.

So what we could do to conserve the information without then blocking the server completely?

We'll also try to recover some useful information independently from SC3, but we are not sure that is possible.

BTW: I think we are now dealing with a slightly different problem. Should we move to a separate issue?

gempa-jabe commented 6 years ago

I also think that this is a different issue. This issue dealt with an infinite loop which was fixed. Now we are dealing with either corrupt files or a bad file read index. Please open a new issue for that.