NOAA-GSL / VxIngest

ingest process monitoring seems to be not working #232

Closed randytpierce closed 9 months ago

randytpierce commented 1 year ago

The ingest process monitoring appears to be broken. Metrics are being generated, but they do not look correct to me. The node_exporter service is running, and its textfile collector is pointed at the metrics directory (--collector.textfile.directory=/data/common/job_metrics), but I don't see any data in the Grafana dashboard for the ingest processes. This seems important, so I'm working on it now.

randytpierce commented 1 year ago

Using the command "sudo journalctl -u node_exporter.service -r" to look at the node_exporter.service log output I can see errors like the following...

Aug 29 19:02:12 adb-cb1.gsd.esrl.noaa.gov node_exporter[1766]: ts=2023-08-29T19:02:12.413Z caller=textfile.go:219 level=error collector=textfile msg="failed to collect textfile data" file=job_v01_metar_ctc_sum_model_hrrrrap130_adb_cb1.prom err="failed to parse textfile data from \"/data/common/job_metrics/job_v01_metar_ctc_sum_model_hrrrrap130_adb_cb1.prom\": text format parsing error in line 8: expected float as value, got \"\""

Aug 29 19:01:57 adb-cb1.gsd.esrl.noaa.gov node_exporter[1766]: ts=2023-08-29T19:01:57.419Z caller=textfile.go:219 level=error collector=textfile msg="failed to collect textfile data" file=job_v01_metar_netcdf_obs_adb_cb1.prom err="failed to parse textfile data from \"/data/common/job_metrics/job_v01_metar_netcdf_obs_adb_cb1.prom\": text format parsing error in line 8: expected float as value, got \"\""

These errors appear for all of the following job metrics files:

job_v01_metar_grib2_model_rapops130_adb_cb1.prom
job_v01_metar_grib2_model_hrrr_adb_cb1.prom
job_v01_metar_ctc_sum_model_hrrrrap130_adb_cb1.prom
job_v01_metar_netcdf_obs_adb_cb1.prom

That is pretty much all of the ingest jobs that matter, so there is a bug in the scraper somewhere.
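To get the set of affected files without reading the journal by eye, the error lines can be summarized with a small script. This is a minimal sketch that assumes the log lines have the same shape as the node_exporter errors quoted above; `ERROR_RE` and `failing_textfiles` are hypothetical names introduced only for illustration.

```python
import re

# Pulls the file=... field and the reported line number out of a
# node_exporter textfile collector error. The pattern is an assumption
# based on the two log lines quoted in this issue.
ERROR_RE = re.compile(
    r'file=(?P<file>\S+\.prom) err=.*?parsing error in line (?P<line>\d+)'
)


def failing_textfiles(journal_text):
    """Return a sorted, de-duplicated list of (file name, line number) pairs."""
    found = set()
    for match in ERROR_RE.finditer(journal_text):
        found.add((match.group("file"), int(match.group("line"))))
    return sorted(found)
```

The text passed in would be the output of the journalctl command shown earlier, for example captured to a file first. Repeated errors for the same file collapse to a single entry, which makes it easier to see whether every job is affected or only some.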

randytpierce commented 1 year ago

Opening /data/common/job_metrics/job_v01_metar_ctc_sum_model_hrrrrap130_adb_cb1.prom and looking at line 8 shows that there is no number at the end of the line:

job_v01_metar_ctc_sum_model_hrrrrap130_adb_cb1{ingest_id="ingest_recorded_record_count",log_file="/data/temp_tar/tmp.Frz81qGaqL/job_v01_metar_ctc_sum_model_hrrrrap130-2023-08-29:19:00:02.log"}

This line is the recorded record count. I'm looking into why the value is missing.
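A quick way to check the other .prom files for the same symptom is to scan each one for sample lines that have no value after the label set. The following is a rough sketch, not node_exporter's own parser: it only approximates the Prometheus text format (comment lines start with `#`, sample lines are `name{labels} value [timestamp]`), and `lines_missing_value` is a hypothetical helper name.

```python
def lines_missing_value(prom_text):
    """Return 1-based line numbers of sample lines with no parseable float value."""
    bad = []
    for lineno, line in enumerate(prom_text.splitlines(), start=1):
        stripped = line.strip()
        if not stripped or stripped.startswith("#"):
            continue  # skip blank lines and HELP/TYPE/comment lines
        if "}" in stripped:
            # Everything after the last closing brace should be "value [timestamp]".
            rest = stripped.rpartition("}")[2].split()
        else:
            # No label set: the line should be "name value [timestamp]".
            rest = stripped.split()[1:]
        try:
            float(rest[0])
        except (IndexError, ValueError):
            bad.append(lineno)
    return bad
```

Running this over the contents of each file in /data/common/job_metrics would flag a line like the one above, where the metric name and labels are present but the value is empty.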

bonnystrong commented 1 year ago

Did you get this figured out?

randytpierce commented 1 year ago

I'm working on it at the moment. The ingest is running, but the monitoring is not.

randytpierce commented 1 year ago

This was caused by a bug in the import docs routine. It is fixed now.

randytpierce commented 10 months ago

Discovered another problem with this. Some of the scraped fields are being parsed incorrectly because we have prepended some fields to the newer log messages. The parsing needs to be adjusted to match.
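One way to make the scraping less sensitive to prepended fields is to search for a labelled value anywhere in the line instead of relying on field positions. The sketch below is illustrative only: the actual VxIngest log format is not shown in this issue, so the `<label>: <integer>` message shape, the sample lines, and the `extract_count` helper are all assumptions.

```python
import re


def extract_count(line, label):
    """Return the integer following "<label>:" anywhere in the line, or None.

    The "<label>: <integer>" shape is hypothetical; adjust it to the real messages.
    """
    match = re.search(re.escape(label) + r"\s*:\s*(\d+)", line)
    return int(match.group(1)) if match else None


# Hypothetical old-style and new-style messages; the new one has extra leading fields.
old_line = "2023-08-29 19:00:02 INFO recorded record count: 42"
new_line = "adb-cb1 1766 2023-08-29 19:00:02 INFO recorded record count: 42"
```

Because the match is anchored on the label rather than on a column index, both lines yield the same value. When the function returns None, the caller can skip that metric instead of writing a sample line with no value, which is the condition node_exporter rejected earlier in this issue.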