DataDog / dd-agent

Datadog Agent Version 5
https://docs.datadoghq.com/

haproxy logs -> dogstream #75

Closed alq666 closed 10 years ago

alq666 commented 12 years ago

What we need:

List of requests (including some patterns, but discarding everything after ?param=value): https://gist.github.com/2919020

Log sample (ignore the first 2 fields, which are added by syslog):

2012-06-12T17:56:11+00:00 localhost haproxy[21732]: 127.0.0.1:40074 [12/Jun/2012:17:56:10.548] dogarchive-frontend dogarchive-backend/i-8bcc18ed:9105 14/0/0/283/575 200 107 - - ---- 232/162/162/5/0 0/0 "POST /intake HTTP/1.1"
2012-06-12T17:56:11+00:00 localhost haproxy[21732]: 127.0.0.1:40079 [12/Jun/2012:17:56:10.636] dogarchive-frontend dogarchive-backend/i-8bcc18ed:9103 10/0/1/75/488 200 107 - - ---- 231/161/161/6/0 0/0 "POST /intake HTTP/1.1"
2012-06-12T17:56:11+00:00 localhost haproxy[21732]: 127.0.0.1:40095 [12/Jun/2012:17:56:10.728] dogarchive-frontend dogarchive-backend/i-23518d45:9102 7/0/1/354/456 200 107 - - ---- 230/160/160/5/0 0/0 "POST /intake HTTP/1.1"
2012-06-12T17:56:11+00:00 localhost haproxy[21732]: 10.125.74.200:28632 [12/Jun/2012:17:56:11.069] public dogdispatcher/i-8bcc18ed:9001 2/0/0/40/115 202 151 - - ---- 229/69/1/0/0 0/0 "POST /intake/ HTTP/1.1"
2012-06-12T17:56:11+00:00 localhost haproxy[21732]: 127.0.0.1:40103 [12/Jun/2012:17:56:10.864] dogarchive-frontend dogarchive-backend/i-23518d45:9103 4/0/1/70/331 200 107 - - ---- 230/160/159/7/0 0/0 "POST /intake HTTP/1.1"
2012-06-12T17:56:11+00:00 localhost haproxy[21732]: 10.125.74.200:28629 [12/Jun/2012:17:56:10.914] public dogdispatcher/i-8bcc18ed:9000 89/0/1/198/308 202 151 - - ---- 233/70/2/0/0 0/0 "POST /intake/ HTTP/1.1"
2012-06-12T17:56:11+00:00 localhost haproxy[21732]: 10.125.74.200:28633 [12/Jun/2012:17:56:11.079] public dogdispatcher/i-6529e303:9000 139/0/4/11/174 202 151 - - ---- 235/70/1/0/0 0/0 "POST /intake/ HTTP/1.1"
2012-06-12T17:56:11+00:00 localhost haproxy[21732]: 10.125.74.200:28635 [12/Jun/2012:17:56:11.185] public dogweb/i-23518d45 33/0/4/45/82 200 424 - - ---- 235/69/0/0/0 0/0 "GET /reports/v1/agents HTTP/1.1"
2012-06-12T17:56:11+00:00 localhost haproxy[21732]: 10.125.74.200:28636 [12/Jun/2012:17:56:11.222] public dogdispatcher/i-6529e303:9001 31/0/1/110/144 202 166 - - ---- 237/69/1/0/0 0/0 "POST /api/v1/series?api_key=REDACTED HTTP/1.1"
2012-06-12T17:56:11+00:00 localhost haproxy[21732]: 10.125.74.200:28634 [12/Jun/2012:17:56:11.103] public dogdispatcher/i-23518d45:9000 81/0/10/189/281 202 151 - - ---- 239/69/0/0/0 0/0 "POST /intake?api_key=REDACTED HTTP/1.1"

If you need to extract page list again:

gawk '{u = $(NF-1); split(u, p, "?"); req[p[1]]=1;} END {for (r in req) {print r}}' /mnt/log/haproxy_1.log
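Where gawk is not available, roughly the same extraction can be sketched in Python. This is only a sketch: it assumes, like the gawk one-liner, that the request URL is always the second-to-last whitespace-separated field, and the function name `unique_paths` is made up for illustration.

```python
def unique_paths(lines):
    """Return the distinct request paths, with any ?query string dropped."""
    paths = set()
    for line in lines:
        fields = line.split()
        if len(fields) < 2:
            continue  # blank or truncated line
        # fields[-2] is the URL, the equivalent of $(NF-1) in the gawk version
        paths.add(fields[-2].split("?", 1)[0])
    return sorted(paths)
```

It accepts any iterable of lines, so an open file handle on /mnt/log/haproxy_1.log can be passed directly. Unlike the gawk version, the result is sorted, which makes successive runs easier to diff.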

Lastly, the haproxy timings are the first group of five /-separated numbers (Tq/Tw/Tc/Tr/Tt), for example:

33/0/4/45/82
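A minimal sketch of pulling that group out of a raw log line, assuming the HTTP log format used in the sample above. The names `TIMERS_RE` and `extract_timers` are illustrative, and the Tq/Tw/Tc/Tr/Tt labels come from the HAProxy documentation excerpt quoted below.

```python
import re

TIMER_NAMES = ("Tq", "Tw", "Tc", "Tr", "Tt")

# Five slash-separated integers forming one whitespace-delimited field.
# The first four may be -1; Tt may carry a leading "+" (option logasap).
TIMERS_RE = re.compile(r"(?<!\S)(-?\d+)/(-?\d+)/(-?\d+)/(-?\d+)/(\+?\d+)(?!\S)")


def extract_timers(line):
    """Return {"Tq": ..., ..., "Tt": ...} in milliseconds, or None if absent."""
    match = TIMERS_RE.search(line)  # search() stops at the first such group
    if match is None:
        return None
    return {name: int(value) for name, value in zip(TIMER_NAMES, match.groups())}
```

The connection counters later in the line (for example 235/69/0/0/0) have the same shape, so the sketch relies on `search()` returning the leftmost match.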

>>> Feb  6 12:14:14 localhost \
      haproxy[14389]: 10.0.1.2:33317 [06/Feb/2009:12:14:14.655] http-in \
      static/srv1 10/0/30/69/109 200 2750 - - ---- 1/1/1/1/0 0/0 {1wt.eu} \
      {} "GET /index.html HTTP/1.1"
  Field   Format                                Extract from the example above
      1   process_name '[' pid ']:'                            haproxy[14389]:
      2   client_ip ':' client_port                             10.0.1.2:33317
      3   '[' accept_date ']'                       [06/Feb/2009:12:14:14.655]
      4   frontend_name                                                http-in
      5   backend_name '/' server_name                             static/srv1
      6   Tq '/' Tw '/' Tc '/' Tr '/' Tt*                       10/0/30/69/109
      7   status_code                                                      200
      8   bytes_read*                                                     2750
      9   captured_request_cookie                                            -
     10   captured_response_cookie                                           -
     11   termination_state                                               ----
     12   actconn '/' feconn '/' beconn '/' srv_conn '/' retries*    1/1/1/1/0
     13   srv_queue '/' backend_queue                                      0/0
     14   '{' captured_request_headers* '}'                           {1wt.eu}
     15   '{' captured_response_headers* '}'                                {}
     16   '"' http_request '"'                      "GET /index.html HTTP/1.1"
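The field table above maps fairly directly onto a regular expression. The following is a sketch, not a complete HAProxy log parser: the group names are made up for illustration, the two captured-header blocks are treated as optional because the sample lines in this issue do not contain them, and the syslog prefix is skipped by searching rather than matching from the start of the line.

```python
import re

HAPROXY_HTTP_RE = re.compile(
    r'(?P<process>\w+)\[(?P<pid>\d+)\]:\s+'                  # field 1
    r'(?P<client_ip>[\d.]+):(?P<client_port>\d+)\s+'         # field 2
    r'\[(?P<accept_date>[^\]]+)\]\s+'                        # field 3
    r'(?P<frontend>\S+)\s+'                                  # field 4
    r'(?P<backend>[^/\s]+)/(?P<server>\S+)\s+'               # field 5
    r'(?P<timers>[-+/\d]+)\s+'                               # field 6
    r'(?P<status>-?\d+)\s+'                                  # field 7
    r'(?P<bytes_read>\+?\d+)\s+'                             # field 8
    r'(?P<req_cookie>\S+)\s+(?P<resp_cookie>\S+)\s+'         # fields 9-10
    r'(?P<term_state>\S+)\s+'                                # field 11
    r'(?P<conns>[+/\d]+)\s+'                                 # field 12
    r'(?P<queues>\d+/\d+)\s+'                                # field 13
    r'(?:\{(?P<req_headers>[^}]*)\}\s+)?'                    # field 14 (optional)
    r'(?:\{(?P<resp_headers>[^}]*)\}\s+)?'                   # field 15 (optional)
    r'"(?P<request>[^"]*)"'                                  # field 16
)


def parse_line(line):
    """Return a dict of string fields, or None if the line does not match."""
    match = HAPROXY_HTTP_RE.search(line)
    if match is None:
        return None
    fields = match.groupdict()
    # Split 'METHOD /path?query HTTP/x.y' and drop the query string.
    method, _, rest = fields["request"].partition(" ")
    url = rest.rsplit(" ", 1)[0]
    fields["method"] = method
    fields["path"] = url.split("?", 1)[0]
    return fields
```

All values are left as strings; converting the status code or the timers to integers is left to the caller.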

8.4. Timing events

Timers are a great help in troubleshooting network problems. All values are reported in milliseconds (ms). These timers should be used in conjunction with the session termination flags. In TCP mode with "option tcplog" set on the frontend, 3 control points are reported under the form "Tw/Tc/Tt", and in HTTP mode, 5 control points are reported under the form "Tq/Tw/Tc/Tr/Tt".

These timers provide valuable indications of the causes of trouble. Since the TCP protocol defines retransmit delays of 3, 6, 12... seconds, we know for sure that timers close to multiples of 3s are nearly always related to lost packets due to network problems (wires, negotiation, congestion). Moreover, if "Tt" is close to a timeout value specified in the configuration, it often means that a session has been aborted on timeout.
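Both heuristics can be sketched as simple checks on a timer value in milliseconds. The function names and the 200 ms tolerance are assumptions chosen for illustration; the paragraph above only says "close to".

```python
RETRANSMIT_STEP_MS = 3000  # retransmit delays of 3, 6, 12... s are multiples of 3 s


def looks_like_retransmit(timer_ms, tolerance_ms=200):
    """True if timer_ms lies within tolerance_ms of a positive multiple of 3 s."""
    if timer_ms < RETRANSMIT_STEP_MS - tolerance_ms:
        return False  # covers -1 and ordinary sub-3 s timers
    remainder = timer_ms % RETRANSMIT_STEP_MS
    return min(remainder, RETRANSMIT_STEP_MS - remainder) <= tolerance_ms


def near_timeout(tt_ms, timeout_ms, tolerance_ms=200):
    """True if Tt is within tolerance_ms of a configured timeout."""
    return abs(tt_ms - timeout_ms) <= tolerance_ms
```

A positive result is only a hint to go and look at the session termination flags, not a diagnosis on its own.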

Most common cases :

Other noticeable HTTP log cases ('xx' means any value to be ignored) :

Tq/Tw/Tc/Tr/+Tt: The "option logasap" is present on the frontend and the log was emitted before the data phase. All the timers are valid except "Tt", which is shorter than reality.

-1/xx/xx/xx/Tt: The client was not able to send a complete request in time, or it aborted too early. Check the session termination flags, then the "timeout http-request" and "timeout client" settings.

Tq/-1/xx/xx/Tt: It was not possible to process the request, maybe because servers were out of order, or because the request was invalid or forbidden by ACL rules. Check the session termination flags.

Tq/Tw/-1/xx/Tt: The connection could not be established on the server. Either the server actively refused it or it timed out after Tt-(Tq+Tw) ms. Check the session termination flags, then check the "timeout connect" setting. Note that the tarpit action might return similar-looking patterns, with "Tw" equal to the time the client connection was maintained open.

Tq/Tw/Tc/-1/Tt: The server has accepted the connection but did not return a complete response in time, or it closed its connection unexpectedly after Tt-(Tq+Tw+Tc) ms. Check the session termination flags, then check the "timeout server" setting.
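The cases above amount to finding the first timer reported as -1, reading left to right. A sketch of that lookup follows; the function name `diagnose` and the summary strings are illustrative paraphrases of the cases above, not HAProxy output.

```python
TIMER_NAMES = ("Tq", "Tw", "Tc", "Tr", "Tt")

# Short paraphrase of what a -1 in each position means.
ABORTED_PHASE = {
    "Tq": "client did not send a complete request in time, or aborted early",
    "Tw": "request could not be processed (servers down, invalid or forbidden)",
    "Tc": "connection to the server could not be established",
    "Tr": "server did not return a complete response in time, or closed early",
}


def diagnose(raw_timers):
    """Map a 'Tq/Tw/Tc/Tr/Tt' string to (timer_name, explanation)."""
    parts = raw_timers.split("/")
    if len(parts) != len(TIMER_NAMES):
        raise ValueError("expected 5 timers, got %r" % raw_timers)
    # The first -1 decides the case; later values are the 'xx' to be ignored.
    for name, value in zip(TIMER_NAMES[:-1], parts[:-1]):
        if int(value) == -1:
            return name, ABORTED_PHASE[name]
    if parts[-1].startswith("+"):
        return "Tt", "logged early (option logasap); Tt is shorter than reality"
    return None, "no phase reported as -1"
```

For instance, `diagnose("5/0/-1/-1/3001")` points at Tc, which is the cue to check the session termination flags and the "timeout connect" setting.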

alq666 commented 12 years ago

@clofresh feel free to weigh in

alq666 commented 10 years ago

Dogstream is not performant enough to handle high throughput.