deepflowio / deepflow

eBPF Observability - Distributed Tracing and Profiling
https://deepflow.io
Apache License 2.0
2.82k stars 310 forks source link

[BUG] deepflow l7flow trace info is incomplete #6728

Open kill51216 opened 3 months ago

kill51216 commented 3 months ago

Search before asking

DeepFlow Component

Agent

What you expected to happen

image 当我运行Agent采集一个api接口到mysql的所有执行过程时,我通过dashboard无法获取到完整的调用链信息。 api仅关联上了1-2个sql,而实际调用过程还是包含好几个SQL的(例如:SELECT id, minip, maxip, continent, areacode,比较重要就没有关联上。 我的应用程序是一个java的程序运行在docker里面。 然后我就开启debug,查看日志中的信息。

 [^[[38;5;27m2024-06-03 16:41:13.929053 +08:00^[[0m] ^[[38;5;27mDEBUG^[[0m [src/flow_generator/protocol_logs/sql/mysql.rs:244] ^[[38;5;27mextract id from sql SELECT  i      d, token, token_public, user_id, created_date, update_date, effective_date, expire_date,
12622                product_id, available, counts, scene_counts, is_test, notified, test_count,points_counts,description, daily_limit
12623         FROM tokens
12624         WHERE token_public = 'TVIA1CQCMvV3fIAK4Cdo5CUjaxicVyQCInsLcvgwAdXMGOl9yaltyx9RFdHeEl0I' and product_id = 29
12625         order by daily_limit desc^[[0m
12626 [^[[38;5;27m2024-06-03 16:41:13.929091 +08:00^[[0m] ^[[38;5;27mDEBUG^[[0m [src/flow_generator/protocol_logs/sql/mysql.rs:269] ^[[38;5;27mextracted trace_id=None span_      id=None^[[0m
...
12808 [^[[38;5;27m2024-06-03 16:41:14.072648 +08:00^[[0m] ^[[38;5;27mDEBUG^[[0m [src/sender/uniform_sender.rs:508] ^[[38;5;27m2-protolog-to-collector-sender sender send ite      m l7_log: BoxAppProtoLogsData(MetaAppProto { base_info: AppProtoLogsBaseInfo { start_time: 1717404073.918621914s, end_time: 1717404073.918621914s, flow_id: 6869616292      337287252, tap_port: LMAC@none@d2:9d:ed:7a, signal_source: Packet, vtap_id: 4, tap_type: Cloud, tap_side: Rest, biz_type: 0, head: AppProtoHead { proto: MySQL, msg_ty      pe: Request, rrt: 0 }, mac_src: 00:00:00:00:00:00, mac_dst: 00:00:00:00:00:00, ip_src: 172.20.0.3, ip_dst: 192.168.8.50, l3_epc_id_src: 1, l3_epc_id_dst: 1, port_src:       60808, port_dst: 3306, req_tcp_seq: 1314349146, resp_tcp_seq: 0, gpid_0: 0, gpid_1: 0, ebpf_type: None, process_id_0: 0, process_id_1: 0, process_kname_0: "", proces      s_kname_1: "", syscall_trace_id_request: 0, syscall_trace_id_response: 0, syscall_trace_id_thread_0: 0, syscall_trace_id_thread_1: 0, syscall_coroutine_0: 0, syscall_      coroutine_1: 0, syscall_cap_seq_0: 0, syscall_cap_seq_1: 0, protocol: IpProtocol(6), is_vip_interface_src: false, is_vip_interface_dst: false, pod_id_0: 0, pod_id_1:       0 }, direction: ClientToServer, direction_score: 255, l7_info: MysqlInfo(MysqlInfo { msg_type: Request, is_tls: false, protocol_version: 10, server_version: "", serve      r_thread_id: 0, command: 3, context: "SELECT  id, token, token_public, user_id, created_date, update_date, effective_date, expire_date,\n               product_id, av      ailable, counts, scene_counts, is_test, notified, test_count,points_counts,description, daily_limit\n        FROM tokens\n        WHERE token_public = 'TVIA1CQCMvV3fI      AK4Cdo5CUjaxicVyQCInsLcvgwAdXMGOl9yaltyx9RFdHeEl0I' and product_id = 29\n        order by daily_limit desc", response_code: 0, error_code: None, affected_rows: 0, err      or_message: "", status: Ok, rrt: 0, statement_id: 0, captured_request_byte: 401, captured_response_byte: 0, trace_id: None, span_id: None, is_on_blacklist: false }) }      )^[[0m
...
12839 [^[[38;5;27m2024-06-03 16:41:14.095165 +08:00^[[0m] ^[[38;5;27mDEBUG^[[0m [src/flow_generator/protocol_logs/sql/mysql.rs:269] ^[[38;5;27mextracted trace_id=None span_      id=None^[[0m
12840 [^[[38;5;27m2024-06-03 16:41:14.095195 +08:00^[[0m] ^[[38;5;27mDEBUG^[[0m [src/flow_generator/protocol_logs/sql/mysql.rs:244] ^[[38;5;27mextract id from sql UPDATE to      kens
12841          SET counts = counts - 1
12842         WHERE token = 'TVIA1CQCMvV3fIAK4Cdo5CUjaxicVyQCInsLcvgwAdXMGOl9yaltyx9RFdHeEl0I'^[[0m
12843 [^[[38;5;27m2024-06-03 16:41:14.095227 +08:00^[[0m] ^[[38;5;27mDEBUG^[[0m [src/flow_generator/protocol_logs/sql/mysql.rs:269] ^[[38;5;27mextracted trace_id=None span_      id=None^[[0m
12844 [^[[38;5;27m2024-06-03 16:41:14.095251 +08:00^[[0m] ^[[38;5;27mDEBUG^[[0m [src/flow_generator/protocol_logs/sql/mysql.rs:244] ^[[38;5;27mextract id from sql UPDATE to      kens
12845          SET counts = counts - 1
12846         WHERE token = 'TVIA1CQCMvV3fIAK4Cdo5CUjaxicVyQCInsLcvgwAdXMGOl9yaltyx9RFdHeEl0I'^[[0m
12847 [^[[38;5;27m2024-06-03 16:41:14.095274 +08:00^[[0m] ^[[38;5;27mDEBUG^[[0m [src/flow_generator/protocol_logs/sql/mysql.rs:269] ^[[38;5;27mextracted trace_id=None span_      id=None^[[0m
...
13074 [^[[38;5;27m2024-06-03 16:41:15.233956 +08:00^[[0m] ^[[38;5;27mDEBUG^[[0m [src/sender/uniform_sender.rs:508] ^[[38;5;27m2-protolog-to-collector-sender sender send ite      m l7_log: BoxAppProtoLogsData(MetaAppProto { base_info: AppProtoLogsBaseInfo { start_time: 1717404074.085365s, end_time: 1717404074.094001s, flow_id: 5044072245550011      70, tap_port: eBPF@3532, signal_source: EBPF, vtap_id: 4, tap_type: Cloud, tap_side: ClientProcess, biz_type: 0, head: AppProtoHead { proto: MySQL, msg_type: Session,       rrt: 8636 }, mac_src: 00:00:00:00:00:00, mac_dst: 00:00:00:00:00:00, ip_src: 172.20.0.3, ip_dst: 192.168.8.50, l3_epc_id_src: 1, l3_epc_id_dst: 1, port_src: 60808, p      ort_dst: 3306, req_tcp_seq: 1314349981, resp_tcp_seq: 190092739, gpid_0: 0, gpid_1: 0, ebpf_type: TracePoint, process_id_0: 3532, process_id_1: 0, process_kname_0: "j      ava", process_kname_1: "", syscall_trace_id_request: 449519604096423298, syscall_trace_id_response: 449519604096423299, syscall_trace_id_thread_0: 15027, syscall_trac      e_id_thread_1: 15027, syscall_coroutine_0: 0, syscall_coroutine_1: 0, syscall_cap_seq_0: 22, syscall_cap_seq_1: 23, protocol: IpProtocol(6), is_vip_interface_src: fal      se, is_vip_interface_dst: false, pod_id_0: 0, pod_id_1: 0 }, direction: ClientToServer, direction_score: 255, l7_info: MysqlInfo(MysqlInfo { msg_type: Request, is_tls      : false, protocol_version: 0, server_version: "", server_thread_id: 0, command: 3, context: "UPDATE tokens\n         SET counts = counts - 1 \n        WHERE token = '      TVIA1CQCMvV3fIAK4Cdo5CUjaxicVyQCInsLcvgwAdXMGOl9yaltyx9RFdHeEl0I'", response_code: 0, error_code: None, affected_rows: 1, error_message: "", status: Ok, rrt: 0, state      ment_id: 0, captured_request_byte: 141, captured_response_byte: 52, trace_id: None, span_id: None, is_on_blacklist: false }) })^[[0m
13075 [^[[38;5;27m2024-06-03 16:41:15.234010 +08:00^[[0m] ^[[38;5;27mDEBUG^[[0m [src/sender/uniform_sender.rs:508] ^[[38;5;27m2-protolog-to-collector-sender sender send ite      m l7_log: BoxAppProtoLogsData(MetaAppProto { base_info: AppProtoLogsBaseInfo { start_time: 1717404073.896353s, end_time: 1717404074.097716s, flow_id: 1441192543653614      62, tap_port: eBPF@3532, signal_source: EBPF, vtap_id: 4, tap_type: Cloud, tap_side: ServerProcess, biz_type: 0, head: AppProtoHead { proto: Http1, msg_type: Session,       rrt: 201363 }, mac_src: 00:00:00:00:00:00, mac_dst: 00:00:00:00:00:00, ip_src: 172.20.0.1, ip_dst: 172.20.0.3, l3_epc_id_src: 1, l3_epc_id_dst: 1, port_src: 30708, p      ort_dst: 19003, req_tcp_seq: 1687651067, resp_tcp_seq: 1527825521, gpid_0: 0, gpid_1: 0, ebpf_type: TracePoint, process_id_0: 0, process_id_1: 3532, process_kname_0:       "", process_kname_1: "java", syscall_trace_id_request: 161289227944711483, syscall_trace_id_response: 449519604096423299, syscall_trace_id_thread_0: 5297, syscall_tra      ce_id_thread_1: 15027, syscall_coroutine_0: 0, syscall_coroutine_1: 0, syscall_cap_seq_0: 0, syscall_cap_seq_1: 1, protocol: IpProtocol(6), is_vip_interface_src: fals      e, is_vip_interface_dst: false, pod_id_0: 0, pod_id_1: 0 }, direction: ClientToServer, direction_score: 255, l7_info: HttpInfo(HttpInfo { headers_offset: 0, is_req_en      d: false, is_resp_end: false, rrt: 0, proto: Http1, is_tls: false, msg_type: Request, raw_data_type: RawProtocol, stream_id: None, version: V1_0, trace_id: "", span_i      d: "", method: Get, path: "/ip/geo/v1/city/?key=TVIA1CQCMvV3fIAK4Cdo5CUjaxicVyQCInsLcvgwAdXMGOl9yaltyx9RFdHeEl0I&ip=1.193.4.3&area=single&coordsys=BD09", host: "api.i      pplus360.com", user_agent: None, referer: None, client_ip: None, x_request_id_0: "", x_request_id_1: "", req_content_length: None, resp_content_length: None, status_c      ode: 200, status: Ok, endpoint: Some("/ip/geo"), custom_result: None, custom_exception: None, captured_request_byte: 402, captured_response_byte: 530, attributes: [],       is_on_blacklist: false }) })^[[0m
12787 [^[[38;5;27m2024-06-03 16:41:14.072262 +08:00^[[0m] ^[[38;5;27mDEBUG^[[0m [src/flow_generator/protocol_logs/sql/mysql.rs:244] ^[[38;5;27mextract id from sql SELECT
12788                         id, minip, maxip, continent, areacode,asnumber,adcode,radius,country, province, city, bd_lon AS lng, bd_lat AS lat,zipcode,timezone,accuracy,       owner,source,isp
12789                 FROM youxin_geoip_ask_city
12790                 WHERE minip<=29426691 ORDER BY minip DESC limit 1^[[0m
...
14071 [^[[38;5;27m2024-06-03 16:41:21.920383 +08:00^[[0m] ^[[38;5;27mDEBUG^[[0m [src/sender/uniform_sender.rs:508] ^[[38;5;27m2-protolog-to-collector-sender sender send ite      m l7_log: BoxAppProtoLogsData(MetaAppProto { base_info: AppProtoLogsBaseInfo { start_time: 1717404080.837692s, end_time: 1717404080.837787s, flow_id: 2882344424412173      50, tap_port: eBPF@28458, signal_source: EBPF, vtap_id: 4, tap_type: Cloud, tap_side: ServerProcess, biz_type: 0, head: AppProtoHead { proto: Http2, msg_type: Session      , rrt: 95 }, mac_src: 00:00:00:00:00:00, mac_dst: 00:00:00:00:00:00, ip_src: 172.17.0.2, ip_dst: 172.17.0.2, l3_epc_id_src: 1, l3_epc_id_dst: 1, port_src: 59240, port      _dst: 7848, req_tcp_seq: 1895073802, resp_tcp_seq: 1299502451, gpid_0: 0, gpid_1: 0, ebpf_type: TracePoint, process_id_0: 0, process_id_1: 28458, process_kname_0: "",       process_kname_1: "grpc-default-wo", syscall_trace_id_request: 521577198134351215, syscall_trace_id_response: 521577198134351215, syscall_trace_id_thread_0: 28778, sy      scall_trace_id_thread_1: 28778, syscall_coroutine_0: 0, syscall_coroutine_1: 0, syscall_cap_seq_0: 151, syscall_cap_seq_1: 152, protocol: IpProtocol(6), is_vip_interf      ace_src: false, is_vip_interface_dst: false, pod_id_0: 0, pod_id_1: 0 }, direction: ClientToServer, direction_score: 0, l7_info: HttpInfo(HttpInfo { headers_offset: 0      , is_req_end: false, is_resp_end: false, rrt: 0, proto: Http2, is_tls: false, msg_type: Request, raw_data_type: RawProtocol, stream_id: Some(53207), version: V2, trac      e_id: "", span_id: "", method: Post, path: "", host: "", user_agent: None, referer: None, client_ip: None, x_request_id_0: "", x_request_id_1: "", req_content_length:       Some(54), resp_content_length: Some(24), status_code: 200, status: Ok, endpoint: None, custom_result: None, custom_exception: None, captured_request_byte: 86, captur      ed_response_byte: 56, attributes: [], is_on_blacklist: false }) })^[[0m
14072 [^[[38;5;27m2024-06-03 16:41:21.920436 +08:00^[[0m] ^[[38;5;27mDEBUG^[[0m [src/sender/uniform_sender.rs:508] ^[[38;5;27m2-protolog-to-collector-sender sender send ite      m l7_log: BoxAppProtoLogsData(MetaAppProto { base_info: AppProtoLogsBaseInfo { start_time: 1717404074.013605s, end_time: 1717404074.013605s, flow_id: 5044072245550011      73, tap_port: eBPF@3532, signal_source: EBPF, vtap_id: 4, tap_type: Cloud, tap_side: ClientProcess, biz_type: 0, head: AppProtoHead { proto: MySQL, msg_type: Request,       rrt: 0 }, mac_src: 00:00:00:00:00:00, mac_dst: 00:00:00:00:00:00, ip_src: 172.20.0.3, ip_dst: 192.168.8.50, l3_epc_id_src: 1, l3_epc_id_dst: 1, port_src: 60810, port      _dst: 3306, req_tcp_seq: 2855153563, resp_tcp_seq: 0, gpid_0: 0, gpid_1: 0, ebpf_type: TracePoint, process_id_0: 3532, process_id_1: 0, process_kname_0: "java", proce      ss_kname_1: "", syscall_trace_id_request: 449519604096423296, syscall_trace_id_response: 0, syscall_trace_id_thread_0: 15027, syscall_trace_id_thread_1: 0, syscall_co      routine_0: 0, syscall_coroutine_1: 0, syscall_cap_seq_0: 14, syscall_cap_seq_1: 0, protocol: IpProtocol(6), is_vip_interface_src: false, is_vip_interface_dst: false,       pod_id_0: 0, pod_id_1: 0 }, direction: ClientToServer, direction_score: 255, l7_info: MysqlInfo(MysqlInfo { msg_type: Request, is_tls: false, protocol_version: 0, ser      ver_version: "", server_thread_id: 0, command: 3, context: "SELECT\n\t\t\tid, minip, maxip, continent, areacode,asnumber,adcode,radius,country, province, city, bd_lon       AS lng, bd_lat AS lat,zipcode,timezone,accuracy, owner,source,isp\n\t\tFROM youxin_geoip_ask_city\n\t\tWHERE minip<=29426691 ORDER BY minip DESC limit 1", response_co      de: 0, error_code: None, affected_rows: 0, error_message: "", status: Ok, rrt: 0, statement_id: 0, captured_request_byte: 253, captured_response_byte: 0, trace_id: No      ne, span_id: None, is_on_blacklist: false }) })^[[0m

我也大致看了一遍代码,相关代码应该是parser.rs 中MetaAppProto -- meta_packet.rs -- socket.c 内。 详细agent日志信息: [Uploading agent_cu.log…]()

  1. 主要是想确认一下为什么会有sql语句在调用链路时缺失,是否可以通过配置改善(比如是否哪里的超时时间?)
  2. 另外通过链路追踪api调用使用数据库中的(syscall_trace_id_response,syscall_tracd_id_request,req_tcp_seq)进行关联是否还有其他方式?
  3. 观察发现syscall_tracd_id_request和response有些时候是一个完全相等的id是什么情况? syscall_trace_id_request和response是完全呈链式(a-b,b-c)还是会出现(a-b, (b+1)-c)这样的情况么?

How to reproduce

采集JAVA服务的业务程序,通过页面查看或者数据库查看。

DeepFlow version

docker-compose部署的6.5.7

Name: deepflow-ctl community edition Branch: v6.5.7 CommitID: 9e76a00361a49647ed80419c47f714e27cc06254 RevCount: 10379 Compiler: go version go1.20.14 linux/amd64 CompileTime: 2024-05-24 02:26:56

DeepFlow agent list

root@node949:~# deepflow-ctl agent list 2024/06/03 08:39:06 open /run/secrets/kubernetes.io/serviceaccount/ca.crt: no such file or directory ID NAME TYPE CTRL_IP CTRL_MAC STATE GROUP EXCEPTIONS REVISION UPGRADE_REVISION
4 ubuntu-W1 CHOST_VM 192.168.8.157 92:12:d2:62:87:15 NORMAL legacy-host

Kubernetes CNI

No response

Operation-System/Kernel version

root@node949:~# awk -F '=' '/PRETTY_NAME/ { print $2 }' /etc/os-release "Ubuntu 22.04.4 LTS"

Anything else

No response

Are you willing to submit a PR?

Code of Conduct

1473371932 commented 3 months ago

Hello, have you been unable to retrieve the complete SQL after several attempts? If the issue persists despite multiple attempts, please visit the DeepFlow System and check the DeepFlow Alert Analysis Dashboard to see if any data is available. This dashboard can check whether there is packet loss. If there is data, we will first solve the packet loss problem and then check again whether we can capture all SQL

kill51216 commented 3 months ago

是的,多次尝试后无法检索到完整的SQL,DeepFlow Alert Analysis里内数据都是No data. 日志中有debug信息有Packet类型的打印。 火狐截图_2024-06-04T09-56-49 715Z 火狐截图_2024-06-04T09-57-03 599Z

kill51216 commented 3 months ago

这个问题会和java 程序使用的是多线程并发&jdbc连接池操作mysql有关么?

taloric commented 3 months ago

@kill51216

  1. 主要是想确认一下为什么会有sql语句在调用链路时缺失,是否可以通过配置改善(比如是否哪里的超时时间?) 这个问题会和java 程序使用的是多线程并发&jdbc连接池操作mysql有关么?

是的,程序内多线程可见这个文档 的 #当前限制 中第一项说明,看下是否类似此场景(接收请求、向下游发送请求位于不同的线程) 可以看下mysql 文档,通过注释实现关联

  1. 另外通过链路追踪api调用使用数据库中的(syscall_trace_id_response,syscall_tracd_id_request,req_tcp_seq)进行关联是否还有其他方式?

分具体情况,如果是过 nginx 类网关可开启 x-request-id 能力串联网关出入 如果使用了 otel 一类的 apm 插桩可使用 trace_id 进行关联

  1. 观察发现syscall_tracd_id_request和response有些时候是一个完全相等的id是什么情况? syscall_trace_id_request和response是完全呈链式(a-b,b-c)还是会出现(a-b, (b+1)-c)这样的情况么?

syscall_tracd_id_request/response 完全相等一般发生在【请求终点】,比如一个 mysql 服务接收到请求之后,立刻在同线程内原地处理完毕并返回响应,这种时候就是一致的。 syscall_tracd_id_request/response 在同一应用内呈链式,在不同应用间无关系,形如: 请求链:(|a, a+1| -> |x, x+1| -> |k|) 响应链:(|u+1, u| <- |p+1, p| <- |k|)