Graylog2 / graylog2-server

Free and open log management
https://www.graylog.org
Other
7.22k stars 1.05k forks source link

Negative numbers for gl2_processing_duration_ms #19696

Open damianharouff opened 2 weeks ago

damianharouff commented 2 weeks ago

While reviewing gl2_processing_duration_ms, it was noted that some messages have a negative processing time associated with them:

image

image

Doesn't seem to have any specific case or reason, also happens across multiple sources, some processed by Illuminate, some not:

image

Happy to give Graylogger access to my cluster for further troubleshooting if not able to replicate in a dev environment.

Your Environment

damianharouff commented 2 weeks ago

It also seems to be only "large" negative numbers, e.g. it's not a few ms negative, it's minimum ~ -10000 ms:

image

drewmiranda-gl commented 2 weeks ago

I just checked my environment and i also have negative values.

image

Probably unrelated, i see that gl2_processing_duration_ms is showing 2 different types:

image
bernd commented 1 week ago

@damianharouff @drewmiranda-gl Can you post the full message (ideally from OpenSearch) for an affected message so we can see all the timestamp and timing-related fields?

damianharouff commented 1 week ago

@bernd absolutely, here's a few, let me know if you need more:

https://paste.cekkent.net/92504773-3595-4a7a-ace4-e863f6d6a340/JKmdJEU1 https://paste.cekkent.net/945a30ab-c805-426c-92df-35c434f0d9c4/4bA2VmGn https://paste.cekkent.net/6ba7587e-d311-4459-99e0-b2bbc1a812d9/vFVYzflR https://paste.cekkent.net/87f52147-e46e-48a6-ac60-1606953674f9/RWYvXGyu https://paste.cekkent.net/5ed6011a-7741-4ad0-8f27-2c8edac00c70/N8DHPfdY

bernd commented 1 week ago

@damianharouff Thank you! :pray: That helped!

Looking at the first message. :eyes: The problem is that the receive timestamp is after the processing timestamp. So we need to figure out how that happens.

"gl2_receive_timestamp":    "2024-06-26 13:25:30.462",
"gl2_processing_timestamp": "2024-06-26 13:25:18.192",
"gl2_processing_duration_ms": -12270,
drewmiranda-gl commented 1 week ago

For what its worth, in my environment this only happened on a specific day and never before and never since.

image

Sanitized messages:

{
  "gl2_accounted_message_size": 405,
  "gl2_receive_timestamp": "2024-06-17 01:36:00.129",
  "gl2_remote_ip": "192.168.0.1",
  "gl2_remote_port": 55834,
  "streams": [
    "63068dc98a735a37e8d535f9"
  ],
  "gl2_message_id": "01J0HXCE41012S9S3M40133D5X",
  "gl2_source_input": "6328e4593521fe2ae2ee110c",
  "gl2_processing_timestamp": "2024-06-17 01:36:00.128",
  "gl2_source_node": "7a72f5ea-ccb3-4b6f-858b-bab2d1aaf7b5",
  "gl2_forwarder_input": "62bb56a525abac169687b4ab",
  "gl2_processing_duration_ms": -1,
  "timestamp": "2024-06-17T01:36:00.129Z"
}
{
  "gl2_accounted_message_size": 819,
  "gl2_receive_timestamp": "2024-06-15 17:50:53.277",
  "gl2_remote_ip": "192.168.0.108",
  "gl2_remote_port": 45175,
  "streams": [
    "000000000000000000000001"
  ],
  "gl2_message_id": "01J0EGC28V000GFMPF9M1X5KG0",
  "gl2_source_input": "63112596964e770359618042",
  "gl2_processing_timestamp": "2024-06-15 17:50:53.246",
  "gl2_source_node": "7a72f5ea-ccb3-4b6f-858b-bab2d1aaf7b5",
  "gl2_forwarder_input": "62bb56a525abac169687b4ab",
  "gl2_processing_duration_ms": -31,
  "timestamp": "2024-06-15T17:50:53.211Z"
}

If it matters, all of my ingest comes from forwarders.

damianharouff commented 1 week ago

I was wondering about that, and thought it was maybe due to incorrect time on one of the nodes, but GL and OSearch nodes are all showing a plausible time:

damian@Damians-MBP ansible % ansible -i hosts graylog -m shell -a "date"
192.168.69.201 | CHANGED | rc=0 >>
Wed Jun 26 02:14:15 PM UTC 2024
192.168.69.211 | CHANGED | rc=0 >>
Wed Jun 26 02:14:15 PM UTC 2024
192.168.69.210 | CHANGED | rc=0 >>
Wed Jun 26 02:14:15 PM UTC 2024
192.168.69.200 | CHANGED | rc=0 >>
Wed Jun 26 02:14:15 PM UTC 2024
192.168.69.202 | CHANGED | rc=0 >>
Wed Jun 26 02:14:16 PM UTC 2024
192.168.69.213 | CHANGED | rc=0 >>
Wed Jun 26 02:14:18 PM UTC 2024
192.168.69.212 | CHANGED | rc=0 >>
Wed Jun 26 02:14:18 PM UTC 2024
damian@Damians-MBP ansible %

Slight seconds skew is due to ansible running serially.

bernd commented 1 week ago

If it matters, all of my ingest comes from forwarders.

Yeah, I think it does. Damian's messages are also ingested via the Forwarder.

Thanks, folks! I think we provided enough context and info so the team can start debugging.

drewmiranda-gl commented 1 week ago

Posting some time skew info since is suspect this is contributing if not the cause

Negative processing time as seen in graylog:

image

Time skew for graylog server:

image

Time skew for applicable forwarder:

image
damianharouff commented 1 week ago

The statement about Forwarders prompted me to look at my forwarder box, and....

image

...turns out my forwarder box was 12 seconds in the future. I installed and activated ntp on it, and no more messages with a negative processing duration timestamp. It does make me wonder why only some of my messages were negative and not all, since all of my ingest is through this one forwarder.