NOAA-GSL / VxIngest

Other
2 stars 0 forks source link

vxmonitor showing record difference indicates database timeout problem #330

Closed randytpierce closed 5 months ago

randytpierce commented 6 months ago

We have gotten a couple of prometheus alerts indicating record count differences. Looking into one and comparing the timestamp to the archives it leads to this archive file ...

/data-ingest/data/xfer/archive/success-job_v01_metar_ctc_ceiling_model_ops_1708078206.tar.gz

and opening that archive file with vim I look at this log ...

20240216101006/job_v01_metar_ctc_ceiling_model_ops-2024-02-16T10:10:06.log

and I find these errors ....

2024-02-16T10:16:40+0000 [ERROR] <VxIngestManager-22> (vxingest.ctc_to_cb.ctc_builder): CTCModelObsBuilderV01 Error getting model document: UnAmbiguousTimeoutException(<ec=14, category=couchbase.common, message=unambiguous_timeout (14), context=KeyValueErrorContext:{'retry_attempts': 0, 'key': 'DD:V01:METAR:HRRR_OPS:1708063200:17', 'bucket_name': 'vxdata', 'scope_name': '_default', 'collection_name': 'METAR', 'opaque': 51}, C Source=/couchbase-python-client/src/kv_ops.cxx:212>)
2024-02-16T10:16:40+0000 [INFO] <VxIngestManager-22> (vxingest.ctc_to_cb.ctc_builder): Looking up observation document: DD:V01:METAR:obs:1708063200
2024-02-16T10:16:40+0000 [ERROR] <VxIngestManager-24> (vxingest.ctc_to_cb.ctc_builder): CTCModelObsBuilderV01 Error getting model document: UnAmbiguousTimeoutException(<ec=14, category=couchbase.common, message=unambiguous_timeout (14), context=KeyValueErrorContext:{'retry_attempts': 0, 'key': 'DD:V01:METAR:HRRR_OPS:1708063200:24', 'bucket_name': 'vxdata', 'scope_name': '_default', 'collection_name': 'METAR', 'opaque': 55}, C Source=/couchbase-python-client/src/kv_ops.cxx:212>) 
2024-02-16T10:16:40+0000 [INFO] <VxIngestManager-24> (vxingest.ctc_to_cb.ctc_builder): Looking up observation document: DD:V01:METAR:obs:1708063200
2024-02-16T10:16:41+0000 [ERROR] <VxIngestManager-20> (vxingest.ctc_to_cb.ctc_builder): CTCModelObsBuilderV01 problem getting obs document: UnAmbiguousTimeoutException(<ec=14, category=couchbase.common, message=unambiguous_timeout (14), context=KeyValueErrorContext:{'retry_attempts': 0, 'key': 'DD:V01:METAR:obs:1708066800', 'bucket_name': 'vxdata', 'scope_name': '_default', 'collection_name': 'METAR', 'opaque': 90}, C Source=/couchbase-python-client/src/kv_ops.cxx:212>)
Traceback (most recent call last):
  File "/app/vxingest/ctc_to_cb/ctc_builder.py", line 375, in handle_fcstValidEpochs
    _obs_doc = self.load_spec["collection"].get(obs_id)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/app/.venv/lib/python3.11/site-packages/couchbase/collection.py", line 176, in get
    return self._get_internal(key, **final_args)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/app/.venv/lib/python3.11/site-packages/couchbase/logic/wrappers.py", line 138, in wrapped_fn
    raise e
  File "/app/.venv/lib/python3.11/site-packages/couchbase/logic/wrappers.py", line 116, in wrapped_fn
    raise ErrorMapper.build_exception(ret)
couchbase.exceptions.UnAmbiguousTimeoutException: UnAmbiguousTimeoutException(<ec=14, category=couchbase.common, message=unambiguous_timeout (14), context=KeyValueErrorContext:{'retry_attempts': 0, 'key': 'DD:V01:METAR:obs:1708066800', 'bucket_name': 'vxdata', 'scope_name': '_default', 'collection_name': 'METAR', 'opaque': 90}, C Source=/couchbase-python-client/src/kv_ops.cxx:212>)
2024-02-16T10:16:41+0000 [INFO] <VxIngestManager-23> (vxingest.ctc_to_cb.ctc_builder): Looking up model document: DD:V01:METAR:HRRR_OPS:1708063200:36

These indicate a timeout problem talking to the database on adb-cb1. This also happened with connections and we fixed that problem with a retry. Perhaps the same thing is needed here, although it would be good to know why we are getting timeouts talking to the database.

randytpierce commented 6 months ago

We encountered an alert on Feb 22 and investigation turned up... I did look into the log message for this error and indeed it is due to a query timeout while querying for a region (metadata). The log file is in this archive (you can find the appropriate archive in /data-ingest/data/xfer/archive by comparing the dates of the tar.gz files) ... /data-ingest/data/xfer/archive/success-job_v01_metar_partial__sums_surface_model_hrrrrap130_1708611006.tar.gz

And this is the error... 2024-02-22T14:25:32+0000 [ERROR] (vxingest.partial_sums_to_cb.partial_sums_builder): PartialSumsSurfaceModelObsBuilderV01: Exception with builder: error: UnAmbiguousTimeoutException(<ec=14, category=couchbase.common, message=unambiguous_timeout (14), context=QueryErrorContext({'last_dispatched_to': '137.75.164.72:8093', 'last_dispatched_from': '172.19.0.2:49034', 'retry_attempts': 0, 'client_context_id': 'e273cf-d29c-ab48-fc8f-41c30db8743250', 'method': 'POST', 'path': '/query/service', 'http_status': 0, 'http_body': '', 'first_error_code': 0, 'first_error_message': '', 'statement': "SELECT geo.bottom_right.lat as br_lat,\n geo.bottom_right.lon as br_lon,\n geo.top_left.lat as tl_lat,\n geo.top_left.lon as tl_lon\n FROM vxdata._default.METAR\n WHERE type='MD'\n and docType='region'\n and subset='COMMON'\n and version='V01'\n and name='E_US'", 'parameters': '{"client_context_id":"e273cf-d29c-ab48-fc8f-41c30db8743250","metrics":false,"readonly":true,"statement":"SELECT geo.bottom_right.lat as br_lat,\n geo.bottom_right.lon as br_lon,\n geo.top_left.lat as tl_lat,\n geo.top_left.lon as tl_lon\n FROM vxdata._default.METAR\n WHERE type=\'MD\'\n and docType=\'region\'\n and subset=\'COMMON\'\n and version=\'V01\'\n and name=\'E_US\'","timeout":"119500ms"}', 'context_type': 'QueryErrorContext'}), C Source=/couchbase-python-client/src/n1ql.cxx:277>) ...

From which I extracted the queries...

SELECT geo.bottom_right.lat as br_lat, geo.bottom_right.lon as br_lon, geo.top_left.lat as tl_lat, geo.top_left.lon as tl_lon FROM vxdata._default.METAR WHERE type = 'MD' and docType = 'region' and subset = 'COMMON' and version = 'V01' and name = 'E_US'

AND

SELECT geo.bottom_right.lat as br_lat, geo.bottom_right.lon as br_lon, geo.top_left.lat as tl_lat, geo.top_left.lon as tl_lon FROM vxdata._default.METAR WHERE type = 'MD' AND docType='region' AND subset='COMMON' AND version='V01'

which are valid queries and work otherwise.

So we are seeing query timeouts against adb-cb1 from the python client in the ingest code.

randytpierce commented 5 months ago

There was some retry code added to a couple of problematic areas but this problem seems to have been resolved. We are not getting these alerts anymore.