deepflowio / deepflow

eBPF Observability - Distributed Tracing and Profiling
https://deepflow.io
Apache License 2.0

[BUG] distributed tracing flame broken when APISIX request-id plugin enabled #5981

Closed curu closed 6 months ago

curu commented 6 months ago

Search before asking

DeepFlow Component

Grafana Dashboard

What you expected to happen

Demo link topology:

client(no_agent) ---gre_tunnel--->cvm-lab-ts4 --> cvm-lab-c7(apisix) --> 172.16.48.88(grpc server/no agent)

A demo client app sets the x-request-id header via gRPC metadata:

    for i in range(1):
        response = stub.SendString(example_pb2.StringRequest(text="helloworld %d" % i), metadata=(('x-request-id', 'req-%d-%d' % (time.time(),i)),))

When I enable the request-id plugin in APISIX, the flow log works correctly, but the flame graph panel is broken, with the following error message:

the expected status code returns 200, the actual return is 500, the parameter data is map[db:[flow_log] sql:[SELECT `session_length` AS `Session Total Bytes`, `request_length` AS `Request Total Bytes`, `response_length` AS `Response Total Bytes`, `sql_affected_rows` AS `SQL Affected Rows`, `direction_score` AS `Direction Score`, `response_duration` AS `Response Delay`, `metrics`, toString(_id), time, region_0, region_1, az_0, az_1, host_0, host_1, chost_0, chost_1, vpc_0, vpc_1, l2_vpc_0, l2_vpc_1, subnet_0, subnet_1, router_0, router_1, dhcpgw_0, dhcpgw_1, lb_0, lb_1, natgw_0, natgw_1, redis_0, redis_1, rds_0, rds_1, pod_cluster_0, pod_cluster_1, pod_ns_0, pod_ns_1, pod_node_0, pod_node_1, pod_service_0, pod_service_1, Enum(pod_group_type_0), Enum(pod_group_type_1), pod_group_0, pod_group_1, pod_0, pod_1, service_0, service_1, Enum(resource_gl0_type_0), Enum(resource_gl0_type_1), resource_gl0_0, resource_gl0_1, Enum(resource_gl1_type_0), Enum(resource_gl1_type_1), resource_gl1_0, resource_gl1_1, Enum(resource_gl2_type_0), Enum(resource_gl2_type_1), resource_gl2_0, resource_gl2_1, Enum(auto_instance_type_0), Enum(auto_instance_type_1), auto_instance_0, auto_instance_1, Enum(auto_service_type_0), Enum(auto_service_type_1), auto_service_0, auto_service_1, gprocess_0, gprocess_1, k8s.label_0, k8s.label_1, k8s.annotation_0, k8s.annotation_1, k8s.env_0, k8s.env_1, attribute, cloud.tag_0, cloud.tag_1, os.app_0, os.app_1, ip_0, ip_1, Enum(is_ipv4), is_internet_0, is_internet_1, Enum(protocol), Enum(tunnel_type), client_port, Enum(server_port), req_tcp_seq, resp_tcp_seq, Enum(l7_protocol), l7_protocol_str, Enum(is_tls), version, Enum(type), request_type, request_domain, request_resource, request_id, Enum(response_status), response_code, response_exception, response_result, events, app_service, app_instance, endpoint, process_id_0, process_id_1, process_kname_0, process_kname_1, trace_id, span_id, parent_span_id, Enum(span_kind), x_request_id_0, x_request_id_1, 
http_proxy_client, syscall_trace_id_request, syscall_trace_id_response, syscall_thread_0, syscall_thread_1, syscall_coroutine_0, syscall_coroutine_1, syscall_cap_seq_0, syscall_cap_seq_1, flow_id, toString(start_time) AS `start_time`, toString(end_time) AS `end_time`, Enum(signal_source), tap, vtap, Enum(nat_source), tap_port, tap_port_name, Enum(tap_port_type), Enum(tap_side), region_id_0, region_id_1, az_id_0, az_id_1, host_id_0, host_id_1, chost_id_0, chost_id_1, vpc_id_0, vpc_id_1, l2_vpc_id_0, l2_vpc_id_1, subnet_id_0, subnet_id_1, router_id_0, router_id_1, dhcpgw_id_0, dhcpgw_id_1, lb_id_0, lb_id_1, natgw_id_0, natgw_id_1, redis_id_0, redis_id_1, rds_id_0, rds_id_1, pod_cluster_id_0, pod_cluster_id_1, pod_ns_id_0, pod_ns_id_1, pod_node_id_0, pod_node_id_1, pod_service_id_0, pod_service_id_1, pod_group_id_0, pod_group_id_1, pod_id_0, pod_id_1, service_id_0, service_id_1, resource_gl0_id_0, resource_gl0_id_1, resource_gl1_id_0, resource_gl1_id_1, resource_gl2_id_0, resource_gl2_id_1, auto_instance_id_0, auto_instance_id_1, auto_service_id_0, auto_service_id_1, gprocess_id_0, gprocess_id_1, tap_id, vtap_id FROM l7_flow_log where order by `start_time`]], and the data is
Status: 400
Object
OPT_STATUS:"FAIL"
DESCRIPTION:"syntax error at position 3035 near 'order'"
result:null
debug:null
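
The failing statement ends with `FROM l7_flow_log where order by`, that is, a `where` keyword with no condition after it, which matches the reported syntax error near 'order'. One plausible reading (an assumption, not confirmed against the deepflow-app source) is that the list of trace filters was empty when the query string was assembled. A minimal Python sketch of that failure mode and of a guard against it follows; `build_query_naive` and `build_query_guarded` are hypothetical helpers, not DeepFlow functions:

```python
def build_query_naive(table, filters, order_by):
    # Always emits the WHERE keyword, even when there are no conditions.
    condition = " OR ".join(filters)
    return f"SELECT * FROM {table} where {condition} order by `{order_by}`"


def build_query_guarded(table, filters, order_by):
    # Emit the WHERE clause only when at least one condition exists.
    parts = [f"SELECT * FROM {table}"]
    if filters:
        parts.append("where " + " OR ".join(filters))
    parts.append(f"order by `{order_by}`")
    return " ".join(parts)


# With an empty filter list the naive builder yields a dangling "where".
naive_sql = build_query_naive("l7_flow_log", [], "start_time")
guarded_sql = build_query_guarded("l7_flow_log", [], "start_time")
print(naive_sql)
print(guarded_sql)
```

Whether an empty filter list should return all rows or no rows is a separate design decision; the guard only ensures the statement stays syntactically valid.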


Disabling the request-id plugin in APISIX solved the problem.

How to reproduce

  1. Set up a demo gRPC client and server with APISIX in between, and enable the request-id plugin.
  2. Have the gRPC client set the x-request-id header via gRPC metadata.
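
For step 1, the APISIX route might look roughly like the payload built below. This is a sketch under assumptions: the route id, the `/*` URI pattern, the upstream port `50051`, and the Admin API address are placeholders, and only the upstream IP comes from the topology above. The `request-id` attribute names follow the APISIX plugin documentation as I understand it and should be checked against the APISIX version in use. The snippet only builds and prints the JSON body; it does not call the Admin API.

```python
import json

# Hypothetical Admin API endpoint; the port and route id depend on the deployment.
ADMIN_URL = "http://127.0.0.1:9180/apisix/admin/routes/1"

route = {
    "uri": "/*",  # placeholder: match every gRPC method
    "plugins": {
        "request-id": {
            "header_name": "X-Request-Id",
            "include_in_response": True,
        }
    },
    "upstream": {
        "scheme": "grpc",
        "type": "roundrobin",
        # IP taken from the topology; port 50051 is an assumed placeholder.
        "nodes": {"172.16.48.88:50051": 1},
    },
}

payload = json.dumps(route, indent=2)
print(f"PUT {ADMIN_URL}")
print(payload)
```

Removing the `request-id` entry from `plugins` corresponds to the workaround of disabling the plugin.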

DeepFlow version

Name: deepflow-server community edition
Branch: v6.4
CommitID: b0e5ecf4a1b3ead78ef1a0cebef700d6774eef4a
RevCount: 9762
Compiler: go version go1.20.14 linux/amd64
CompileTime: 2024-03-29 13:43:36

Name: deepflow-agent community edition
Branch: v6.5.3
CommitId: cfb3560378f755d37bb9b9c1e41305483b0eff4e
RevCount: 9935
Compiler: rustc 1.75.0 (82e1608df 2023-12-21)
CompileTime: 2024-03-27 02:15:56

DeepFlow agent list

    deepflow-ctl agent list
    ID  NAME             TYPE      CTRL_IP        CTRL_MAC           STATE   GROUP    EXCEPTIONS  REVISION     UPGRADE_REVISION
    1   172.16.49.16-V2  K8S_VM    172.16.49.16   52:54:00:66:87:0a  NORMAL  default              v6.4 9760
    3   cvm-lab-ts4-W4   CHOST_VM  172.16.112.22  52:54:00:d3:c7:8f  NORMAL  cvm                  v6.5.3 9935
    4   cvm-lab-c7-W5    CHOST_VM  172.16.82.231  52:54:00:c8:81:2e  NORMAL  cvm                  v6.5.3 9935

Kubernetes CNI

No response

Operation-System/Kernel version

No response

Anything else

No response

Are you willing to submit a PR?

Code of Conduct

taloric commented 6 months ago

@curu Hello, this case (APISIX with x-request-id) has been fixed in the latest version of deepflow-app. You can update deepflow-app to the latest image version and try again. Feel free to reach out if anything blocks you :)