NOAA-OWP / hydrovis


Errors in container owp-hml-ingester_hml_cleaner in both hv-vpp-prod-data-ingest EC2 instances #775

Open DrixTabligan-NOAA opened 2 months ago

DrixTabligan-NOAA commented 2 months ago

We are seeing errors in the logs related to postgres connection failure for the container owp-hml-ingester_hml_cleaner.

A sample of the logs is below:

Jun 14 07:10:01 ip-10-27-63-150 docker/owp-hml-ingester_hml_cleaner_1[4424]: 2024-06-14 07:10:01 UTC b45eab75d2ac hml_cleaner.py[1656]: ERROR: hml_cleaner.py[main[line 91]]: __init__.py[connect[line 127]]: FATAL: password authentication failed for user "rfc_fcst_user" : password retrieved from file "/root/.pgpass" : FATAL: no pg_hba.conf entry for host "10.27.63.150", user "rfc_fcst_user", database "rfcfcst", no encryption :
Jun 14 08:10:01 ip-10-27-63-150 docker/owp-hml-ingester_hml_cleaner_1[4424]: 2024-06-14 08:10:01 UTC b45eab75d2ac hml_cleaner.py[1662]: ERROR: hml_cleaner.py[main[line 91]]: __init__.py[connect[line 127]]: FATAL: password authentication failed for user "rfc_fcst_user" : password retrieved from file "/root/.pgpass" : FATAL: no pg_hba.conf entry for host "10.27.63.150", user "rfc_fcst_user", database "rfcfcst", no encryption :
Jun 14 09:10:01 ip-10-27-63-150 docker/owp-hml-ingester_hml_cleaner_1[4424]: 2024-06-14 09:10:01 UTC b45eab75d2ac hml_cleaner.py[1668]: ERROR: hml_cleaner.py[main[line 91]]: __init__.py[connect[line 127]]: FATAL: password authentication failed for user "rfc_fcst_user" : password retrieved from file "/root/.pgpass" : FATAL: no pg_hba.conf entry for host "10.27.63.150", user "rfc_fcst_user", database "rfcfcst", no encryption :

The security groups for the RDS instance already include the /23. Since RDS has been configured to require SSL, the most likely cause is that the client is attempting to connect without encryption.

This ticket was created because the container, judging by its name, is supposed to perform cleanup. Since it is failing, uncleaned entries/rows will most likely build up.

AndersNilssonNoaa commented 2 months ago

The current hypothesis is that this is a side effect of the EC2 root volume filling up due to unrotated docker logs.

DrixTabligan-NOAA commented 2 months ago

It looks like this is not related to disk space.

Even after the cleanup, with disk utilization at 23% (77% free), the errors are still recurring.

This has been observed for the past three hours.

Jun 14 16:10:02 ip-10-27-63-72 docker/owp-hml-ingester_hml_cleaner_1[4431]: 2024-06-14 16:10:02 UTC da4a48a05317 hml_cleaner.py[9]: ERROR: hml_cleaner.py[main[line 91]]: __init__.py[connect[line 127]]: FATAL:  password authentication failed for user "rfc_fcst_user" : password retrieved from file "/root/.pgpass" : FATAL:  no pg_hba.conf entry for host "10.27.63.72", user "rfc_fcst_user", database "rfcfcst", no encryption :
Jun 14 17:10:01 ip-10-27-63-72 docker/owp-hml-ingester_hml_cleaner_1[4431]: 2024-06-14 17:10:01 UTC da4a48a05317 hml_cleaner.py[15]: ERROR: hml_cleaner.py[main[line 91]]: __init__.py[connect[line 127]]: FATAL:  password authentication failed for user "rfc_fcst_user" : password retrieved from file "/root/.pgpass" : FATAL:  no pg_hba.conf entry for host "10.27.63.72", user "rfc_fcst_user", database "rfcfcst", no encryption :
Jun 14 18:10:01 ip-10-27-63-72 docker/owp-hml-ingester_hml_cleaner_1[4431]: 2024-06-14 18:10:01 UTC da4a48a05317 hml_cleaner.py[21]: ERROR: hml_cleaner.py[main[line 91]]: __init__.py[connect[line 127]]: FATAL:  password authentication failed for user "rfc_fcst_user" : password retrieved from file "/root/.pgpass" : FATAL:  no pg_hba.conf entry for host "10.27.63.72", user "rfc_fcst_user", database "rfcfcst", no encryption :

AndersNilssonNoaa commented 2 months ago

The next test is to see whether we can establish psql connections from one of the hosts, using the authentication information from the container's ~root/.pgpass file.
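
Before running that test, it may help to confirm which .pgpass entry the client would actually pick up. The following is a minimal sketch, not the ingester's code: `lookup_password` is a hypothetical helper that approximates the documented .pgpass rules (fields `hostname:port:database:username:password`, `*` as a wildcard, `\:` and `\\` as escapes, first matching line wins). The host name in the usage note is a placeholder.

```python
def _split_fields(line):
    """Split a .pgpass line on unescaped ':' and drop the escaping backslashes."""
    fields, current, i = [], [], 0
    while i < len(line):
        ch = line[i]
        if ch == "\\" and i + 1 < len(line):
            # An escaped character is taken literally.
            current.append(line[i + 1])
            i += 2
        elif ch == ":":
            fields.append("".join(current))
            current = []
            i += 1
        else:
            current.append(ch)
            i += 1
    fields.append("".join(current))
    return fields


def lookup_password(pgpass_text, host, port, dbname, user):
    """Return the password of the first matching entry, or None if none match."""
    wanted = (host, str(port), dbname, user)
    for line in pgpass_text.splitlines():
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        fields = _split_fields(line)
        if len(fields) < 5:
            continue  # malformed entry
        if all(f == "*" or f == w for f, w in zip(fields[:4], wanted)):
            # Everything after the fourth field is the password.
            return ":".join(fields[4:])
    return None
```

For example, reading the container's file and calling `lookup_password(text, "db.example", 5432, "rfcfcst", "rfc_fcst_user")` returns the password that would be sent for that connection, which can then be compared against the expected credential (avoid printing it into logs).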

AndersNilssonNoaa commented 2 months ago

The root cause is a failure in the setup script: the .pgpass file currently contains the wrong information. The setup script would normally update .pgpass, but an existing .pgpass file is erroneously present, and the script will not overwrite it. The fix is to remove the .pgpass file from the ingester source tar file so that the setup script generates it correctly.
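
To guard against this recurring, a check along the following lines could be run against the source archive before deployment. This is only a sketch under assumptions: `find_pgpass_members` is a hypothetical helper, and the archive path and layout are not taken from the actual build. It uses Python's standard `tarfile` module to list any member whose file name is `.pgpass`.

```python
import posixpath
import tarfile


def find_pgpass_members(tar_path):
    """Return the names of archive members whose base name is '.pgpass'."""
    # "r:*" lets tarfile detect the compression (plain, gzip, bz2, xz).
    with tarfile.open(tar_path, "r:*") as archive:
        return [
            name
            for name in archive.getnames()
            if posixpath.basename(name.rstrip("/")) == ".pgpass"
        ]
```

An empty list means the archive ships no .pgpass file, so the setup script is free to generate one; a non-empty list names the stray members that should be removed from the tar file.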