alibaba / clusterdata

cluster data collected from production clusters in Alibaba for cluster management research

Question about 'UNAVAILABLE' and 'UNKNOWN' values in MSCallGraph data #204

Closed dufanrong closed 3 months ago

dufanrong commented 8 months ago

Description:

Hello,

I hope this message finds you well. Firstly, I would like to express my sincere gratitude for open-sourcing the 'cluster-trace-microservices-v2022' dataset. It has been instrumental in my research efforts.

I have been exploring the MSCallGraph data and noticed that the 'um' column contains numerous 'UNAVAILABLE' and 'UNKNOWN' entries. I have a few questions about these values, illustrated with the following snippet from the dataset:

| timestamp | traceid | service | rpc_id | rpctype | um | uminstanceid | interface | dm | dminstanceid | rt |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 666816332 | T_20087312301 | S_38528029 | 0 | http | USER | USER | 7H3KUQVcpx | MS_3386 | MS_3386_POD_1 | 3.0 |
| 666816333 | T_20087312301 | S_38528029 | 0.1 | rpc | UNAVAILABLE | UNAVAILABLE | 7H3KUQVcpx | MS_70648 | MS_70648_POD_170 | 1.0 |
| 666816333 | T_20087312301 | S_38528029 | 0.1.1 | http | UNKNOWN | UNAVAILABLE | 7H3KUQVcpx | MS_38945 | MS_38945_POD_107 | 0.0 |
| 666816333 | T_20087312301 | S_38528029 | 0.1.1.1 | http | MS_38945 | MS_38945_POD_107 | MaC4sWx7iJ | MS_30732 | MS_30732_POD_33 | 0.0 |

My questions are:

  1. What is the reason for the presence of 'UNAVAILABLE' and 'UNKNOWN' in the 'um' column? Specifically, how are records like the one above generated where 'rpc_id' seems complete, but 'um' is 'UNKNOWN' or 'UNAVAILABLE'?

  2. Is it possible to infer 'um' based on 'dm'? For example, can we repair the data by substituting 'um' based on the corresponding 'dm'? The repaired data might look like this:

| timestamp | traceid | service | rpc_id | rpctype | um | uminstanceid | interface | dm | dminstanceid | rt |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 666816332 | T_20087312301 | S_38528029 | 0 | http | USER | USER | 7H3KUQVcpx | MS_3386 | MS_3386_POD_1 | 3.0 |
| 666816333 | T_20087312301 | S_38528029 | 0.1 | rpc | MS_3386 | MS_3386_POD_1 | 7H3KUQVcpx | MS_70648 | MS_70648_POD_170 | 1.0 |
| 666816333 | T_20087312301 | S_38528029 | 0.1.1 | http | MS_70648 | MS_70648_POD_170 | 7H3KUQVcpx | MS_38945 | MS_38945_POD_107 | 0.0 |
| 666816333 | T_20087312301 | S_38528029 | 0.1.1.1 | http | MS_38945 | MS_38945_POD_107 | MaC4sWx7iJ | MS_30732 | MS_30732_POD_33 | 0.0 |

  3. How is the entry point MS determined in a trace? If a trace contains a record with 'rpc_id=0', is the entry point considered to be 'USER'? When there is no 'rpc_id=0' record and 'rpc_id' starts from '0.1' with 'um' being 'UNAVAILABLE' or 'UNKNOWN', as in the two examples below, is the 'rpc_id=0.1' record considered the entry point MS?

| timestamp | traceid | service | rpc_id | rpctype | um | uminstanceid | interface | dm | dminstanceid | rt |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 666803251 | T_14180572390 | S_38528029 | 0.1 | rpc | UNAVAILABLE | UNAVAILABLE | 7H3KUQVcpx | MS_70648 | MS_70648_POD_162 | 20.0 |
| 666803254 | T_14180572390 | S_38528029 | 0.1.1 | http | UNKNOWN | UNAVAILABLE | 7H3KUQVcpx | MS_38945 | MS_38945_POD_97 | 3.0 |
| 666803254 | T_14180572390 | S_38528029 | 0.1.1.1 | http | MS_38945 | MS_38945_POD_97 | ihrQqyYug4 | MS_30732 | MS_30732_POD_0 | 3.0 |

| timestamp | traceid | service | rpc_id | rpctype | um | uminstanceid | interface | dm | dminstanceid | rt |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 666796398 | T_1445790167 | S_156482560 | 0.1 | http | UNKNOWN | UNAVAILABLE | bHiJXTZtx1 | MS_40912 | MS_40912_POD_290 | 2.0 |
| 666796398 | T_1445790167 | S_156482560 | 0.1.1 | mc | MS_40912 | MS_40912_POD_290 | ZaYxnA3U_f | MS_58269 | MS_58269_POD_25 | 1.0 |
| 666796399 | T_1445790167 | S_156482560 | 0.1.2 | mc | MS_40912 | MS_40912_POD_290 | ZaYxnA3U_f | MS_58269 | MS_58269_POD_5 | 0.0 |
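
To make the repair in question 2 and the entry-point reading in question 3 concrete, here is a minimal Python sketch. It is only an illustration of the idea being asked about, not a confirmed interpretation of the dataset. It assumes that the parent of a call is the record whose 'rpc_id' is the same string with the last dotted component removed, that all rows passed in belong to one trace, and that each 'rpc_id' appears at most once per trace. The names `parent_rpc_id`, `repair_um`, and `entry_rows` are hypothetical helpers, not part of any released tooling.

```python
# Sketch only: helper names are hypothetical; column names follow the
# MSCallGraph header shown in the snippets above.
PLACEHOLDERS = {"UNKNOWN", "UNAVAILABLE"}


def parent_rpc_id(rpc_id):
    """Return the rpc_id one level up ('0.1.1' -> '0.1'), or None for a root such as '0'."""
    head, sep, _ = rpc_id.rpartition(".")
    return head if sep else None


def repair_um(rows):
    """Fill placeholder um/uminstanceid values from the parent call's dm/dminstanceid.

    Assumes one trace per call and unique rpc_id values within it.
    Rows whose parent record is missing are returned unchanged.
    """
    by_id = {r["rpc_id"]: r for r in rows}
    repaired = []
    for r in rows:
        r = dict(r)  # copy so the input rows are not modified
        parent = by_id.get(parent_rpc_id(r["rpc_id"]))
        if parent is not None:
            if r["um"] in PLACEHOLDERS:
                r["um"] = parent["dm"]
            if r["uminstanceid"] in PLACEHOLDERS:
                r["uminstanceid"] = parent["dminstanceid"]
        repaired.append(r)
    return repaired


def entry_rows(rows):
    """Return rows whose parent rpc_id is absent from the trace (candidate entry points)."""
    ids = {r["rpc_id"] for r in rows}
    return [r for r in rows if parent_rpc_id(r["rpc_id"]) not in ids]


# First three records of trace T_20087312301, reduced to the relevant columns.
trace = [
    {"rpc_id": "0", "um": "USER", "uminstanceid": "USER",
     "dm": "MS_3386", "dminstanceid": "MS_3386_POD_1"},
    {"rpc_id": "0.1", "um": "UNAVAILABLE", "uminstanceid": "UNAVAILABLE",
     "dm": "MS_70648", "dminstanceid": "MS_70648_POD_170"},
    {"rpc_id": "0.1.1", "um": "UNKNOWN", "uminstanceid": "UNAVAILABLE",
     "dm": "MS_38945", "dminstanceid": "MS_38945_POD_107"},
]
repaired = repair_um(trace)
```

Applied to the first snippet, this yields the repaired values shown under question 2. For the traces that start at 'rpc_id=0.1', the '0.1' record has no parent record to copy from, so its 'um' stays as a placeholder and `entry_rows` returns it as the only candidate entry point. Whether that reading is correct is exactly what question 3 asks.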

I would appreciate any insights you can provide on these matters. Thank you for your time and for maintaining this valuable dataset.

Best regards