MetPX / sarracenia

https://MetPX.github.io/sarracenia
GNU General Public License v2.0
45 stars 22 forks source link

"Name or service not known", even though dd5.weather.gc.ca resolves #488

Closed endlisnis closed 2 years ago

endlisnis commented 2 years ago

subscribe_rolf_01.log.gz

Shortly after 19:00 tonight (but not all at the same time; some were off by more than 10 minutes), all 4 threads of my sr3 instance started reporting "Name or service not known" on (I assume from the context) "dd5.weather.gc.ca". This was persistent for at least an hour before I noticed. I tried running "ping dd5.weather.gc.ca" in a different window, and it had no trouble resolving the name.

I restarted the service and it had no trouble resolving the name after the restart.

petersilva commented 2 years ago

looking at the log:

2022-03-26 10:16:27,392 [INFO] sarracenia.flowcb.log after_accept accepted: (lag: 6.77 ) https://dd5.weather.gc.ca /observations/swob-ml/20220326/CWBU/2022-03-26-1415-CWBU-AUTO-minute-swob.xml
2022-03-26 10:16:27,393 [INFO] sarracenia.flowcb.log after_accept accepted: (lag: 6.77 ) https://dd5.weather.gc.ca /observations/swob-ml/20220326/CWBZ/2022-03-26-1415-CWBZ-AUTO-minute-swob.xml
2022-03-26 10:16:27,393 [INFO] sarracenia.flowcb.log after_accept accepted: (lag: 3.73 ) https://dd5.weather.gc.ca /observations/swob-ml/20220326/CWJR/2022-03-26-1415-CWJR-AUTO-minute-swob.xml
2
2022-03-26 10:16:27,393 [INFO] sarracenia.flowcb.log after_accept accepted: (lag: 3.73 ) https://dd5.weather.gc.ca /observations/swob-ml/20220326/CWJR/2022-03-26-1415-CWJR-AUTO-minute-swob.xml
2022-03-26 10:21:34,492 [WARNING] sarracenia.transfer.https __open__ unable to open https://dd5.weather.gc.ca//observations/swob-ml/20220326/CWJR/2022-03-26-1415-CWJR-AUTO-minute-swob.xml
 .
 .
 .
2022-03-26 11:02:27,889 [ERROR] sarracenia.moth.amqp __getSetup connecting to: q_anonymous.rolf.20211222, durable: True, expire: 14400.0, auto_delete=False
2022-03-26 11:02:27,890 [ERROR] sarracenia.moth.amqp __getSetup AMQP getSetup failed to dd.weather.gc.ca with failed to resolve broker hostname
2022-03-26 11:03:34,274 [INFO] sarracenia.flowcb.log after_work downloaded ok: /home/rolf/weather/swobMirror/observations/swob-ml/20220326/CAWR/2022-03-26-1415-CAWR-AUTO-minute-swob.xml
2022-03-26 11:03:34,275 [INFO] sarracenia.flowcb.log after_work downloaded ok: /home/rolf/weather/swobMirror/observations/swob-ml/20220326/CWSR/2022-03-26-1415-CWSR-AUTO-minute-swob.xml
   .
   .
   .
022-03-26 11:03:36,830 [INFO] sarracenia.flowcb.log after_accept accepted: (lag: 1716.07 ) https://dd5.weather.gc.ca /observations/swob-ml/20220326/COGI/2022-03-26-1434-COGI-AUTO-minute-swob.xml
2022-03-26 11:03:36,831 [INFO] sarracenia.flowcb.log after_accept accepted: (lag: 1716.07 ) https://dd5.weather.gc.ca /observations/swob-ml/20220326/CVOU/2022-03-26-1434-CVOU-AUTO-minute-swob.xml
2022-03-26 11:03:36,831 [INFO] sarracenia.flowcb.log after_accept accepted: (lag: 1716.06 ) https://dd5.weather.gc.ca /observations/swob-ml/20220326/CWKH/2022-03-26-1434-CWKH-AUTO-minute-swob.xml

so there was a 40 minute time when sr3 could not get a hostname for the server. And then it started working again. Prior to the problem lag was 6 seconds... afterwards... somewhat unsurprisingly, it was nearly half an hour (>28 minutes) It worked to catch up.


2022-03-26 11:15:58,493 [INFO] sarracenia.flowcb.log after_accept accepted: (lag: 3.64 ) https://dd5.weather.gc.ca /observations/swob-ml/20220326/CYBB/2022-03-26-1500-CYBB-MAN-COR1-swob.xml
2022-03-26 11:15:58,493 [INFO] sarracenia.flowcb.log after_accept accepted: (lag: 1.62 ) https://dd5.weather.gc.ca /observations/swob-ml/20220326/CWSV/2022-03-26-1515-CWSV-AUTO-minute-swob.xml
2022-03-26 11:15:58,493 [INFO] sarracenia.flowcb.log after_accept accepted: (lag: 1.62 ) https://dd5.weather.gc.ca /observations/swob-ml/20220326/CWRM/2022-03-26-1515-CWRM-AUTO-minute-swob.xml
2022-03-26 11:15:58,493 [INFO] sarracenia.flowcb.log after_accept accepted: (lag: 1.61 ) https://dd5.weather.gc.ca /observations/swob-ml/20220326/CWQS/2022-03-26-1500-CWQS-AUTO-swob.xml

it has caught up again... 12 minutes after restoral. Then it happens again...

2022-03-26 19:03:19,596 [INFO] sarracenia.flowcb.log after_accept accepted: (lag: 1.72 ) https://dd5.weather.gc.ca /observations/swob-ml/20220326/CWOK/2022-03-26-2301-CWOK-AUTO-minute-swob.xml
2022-03-26 19:04:15,810 [ERROR] sarracenia.transfer.https __open__ Download failed 5 https://dd5.weather.gc.ca//observations/swob-ml/20220326/CWQC/2022-03-26-2259-CWQC-AUTO-minute-swob.xml
2022-03-26 19:04:15,810 [ERROR] sarracenia.transfer.https __open__ Failed to reach server. Reason: [Errno -2] Name or service not known
2022-03-26 19:04:15,810 [WARNING] sarracenia.flow download failed to write tmp/2022-03-26-2259-CWQC-AUTO-minute-swob.xml: <urlopen error [Errno -2] Name or service not known>
 .
 . 
 .

022-03-26 20:17:30,852 [INFO] sarracenia.flowcb.log __init__ subscribe initialized with: {'after_work', 'after_accept', 'on_housekeeping'}
2022-03-26 20:17:30,852 [INFO] sarracenia.flow run callbacks loaded: ['sarracenia.flowcb.gather.message.Message', 'sarracenia.flowcb.retry.Retry', 'sarracenia.flowcb.housekeeping.resources.Resources', 'rxpipe_gzip.RxPipe_gzip', 'sarracenia.flowcb.log.Log']
2022-03-26 20:17:30,852 [INFO] sarracenia.flow run pid: 2099530 subscribe/rolf instance: 1
2022-03-26 20:17:32,866 [INFO] sarracenia.flowcb.log after_accept accepted: (lag: 1574.09 ) https://dd5.weather.gc.ca /observations/swob-ml/20220326/CWLE/2022-03-26-2350-CWLE-AUTO-minute-swob.xml

this time for over an hour. I think the app is taking one connection failure, and re-using the failed connection for an hour instead of throwing it away and setting up a new one. I will look.

petersilva commented 2 years ago

please git pull & restart

endlisnis commented 2 years ago

After the restart, my log got much more verbose:

2022-03-27 11:08:03,444 [INFO] sarracenia.flowcb.log post posted {'sundew_extension': 'DMS:WXO_RENAMED_SWOB2:MSC:XML::20220327150759', 'from_cluster': 'DDSR.CMC', 'to_clusters': 'DDSR.CMC,DDI.CMC,CMC,SCIENCE,EDM',
 'filename': 'msg_ddsr-WXO-DD_0dcaddf4b27f7fec67f8a16ddd0a7a5d:DMS:WXO_RENAMED_SWOB2:MSC:XML::20220327150759', 'source': 'WXO-DD', 'mtime': '20220327T150800.962', 'atime': '20220327T150800.962', 'subtopic': ['home
', 'rolf', 'weather', 'swobMirror', 'observations', 'swob-ml', '20220327', 'CVQZ'], '_deleteOnPost': {'data_checksum', 'new_path', 'new_subtopic', 'new_file', 'exchange', 'new_dir', 'onfly_checksum', 'subtopic', '
local_offset', 'new_inflight_path', 'report', 'new_relPath', 'new_baseUrl'}, 'pubTime': '20220327T150800.962', 'baseUrl': 'https://dd5.weather.gc.ca', 'relPath': 'home/rolf/weather/swobMirror/observations/swob-ml/
20220327/CVQZ/2022-03-27-1507-CVQZ-AUTO-minute-swob.xml', 'integrity': {'method': 'md5', 'value': 'wwENYG6E4CsbBhgPI6oqZg=='}, 'size': 9421, 'exchange': 'xpublic', 'local_offset': 0, 'new_dir': '/home/rolf/weather
/swobMirror/observations/swob-ml/20220327/CVQZ', 'new_file': '2022-03-27-1507-CVQZ-AUTO-minute-swob.xml', 'new_baseUrl': 'https://dd5.weather.gc.ca', 'new_relPath': 'home/rolf/weather/swobMirror/observations/swob-
ml/20220327/CVQZ/2022-03-27-1507-CVQZ-AUTO-minute-swob.xml', 'new_subtopic': ['home', 'rolf', 'weather', 'swobMirror', 'observations', 'swob-ml', '20220327', 'CVQZ'], 'new_inflight_path': 'tmp/2022-03-27-1507-CVQZ
-AUTO-minute-swob.xml', 'new_path': '/home/rolf/weather/swobMirror/observations/swob-ml/20220327/CVQZ/2022-03-27-1507-CVQZ-AUTO-minute-swob.xml', 'onfly_checksum': {'method': 'md5', 'value': 'wwENYG6E4CsbBhgPI6oqZ
g=='}, 'data_checksum': '1B2M2Y8AsgTpgAmY7PhCfg==', 'report': {'code': 201, 'message': 'Download successful /home/rolf/weather/swobMirror/observations/swob-ml/20220327/CVQZ/2022-03-27-1507-CVQZ-AUTO-minute-swob.xm
l'}}
petersilva commented 2 years ago

oh... yeah... that... um the default for when to post messages changed, and now includes when posting... but.. you aren't posting... that's odd. ok meantime, a line like this:


logEvents after_accept,after_work,on_housekeeping

will change it back.

petersilva commented 2 years ago

logEvents change reverted. you can just git pull and don't need the config line anymore.

petersilva commented 2 years ago

so we'll wait a few days, but I think we did fix the particular report you were seeing. Please let me know if it recurs.