cvkem / jaeger_stats

Parse Jaeger-json files in order to collect trace statistics
Apache License 2.0

jaeger_stats

jaeger_stats is a library project focused on handling and analyzing Jaeger traces. Jaeger traces provide very detailed information, which is very useful for detailed issue analysis. However, they can also be a valuable source of information on how processes run in a complex microservices landscape, and can give insight into how the landscape and the pressure on the individual services evolve over time.

jaeger_stats also contains a few tools (executables) built on top of the library to showcase how the library can be used; the tools can also be used directly.

How to run an analysis

You can run the tool on a folder of Jaeger-trace files via the command:

trace_analysis  <data_folder> 

Here data_folder can be an absolute or a relative path; however, the expansion of '~' to a home-folder is not supported. The path-encoding needs to match the conventions of your system (Windows or Linux/Unix/Mac).

The tool reads all JSON files in the folder (assuming these are valid Jaeger-trace files), processes them and computes statistics. Each JSON file can contain one or more traces. Output will be generated in the following folders:

Traces are deduplicated before analysis based on the 'trace_id', so if the folder contains files with overlapping traces, this overlap is removed.
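The deduplication step can be illustrated with a minimal sketch. The `Trace` struct and the `dedup_traces` function below are hypothetical stand-ins and not the types used inside jaeger_stats; the sketch assumes that the first occurrence of a 'trace_id' is the one that is kept.

```rust
use std::collections::HashSet;

/// Hypothetical, minimal stand-in for a parsed Jaeger trace.
#[derive(Debug, Clone, PartialEq)]
struct Trace {
    trace_id: String,
    source_file: String,
}

/// Keep the first occurrence of every trace_id, preserving input order.
/// (Assumption: which duplicate survives does not matter for the statistics.)
fn dedup_traces(traces: Vec<Trace>) -> Vec<Trace> {
    let mut seen = HashSet::new();
    traces
        .into_iter()
        // HashSet::insert returns false when the id was already present.
        .filter(|t| seen.insert(t.trace_id.clone()))
        .collect()
}

fn main() {
    let traces = vec![
        Trace { trace_id: "a1".into(), source_file: "batch1.json".into() },
        Trace { trace_id: "b2".into(), source_file: "batch1.json".into() },
        Trace { trace_id: "a1".into(), source_file: "batch2.json".into() },
    ];
    for t in dedup_traces(traces) {
        println!("{} (from {})", t.trace_id, t.source_file);
    }
}
```

Using a set of already-seen ids keeps the pass linear in the number of traces, which matters when a folder holds many batches of roughly 1000 traces each.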

When you run trace_analysis with the --help flag you see:

$ trace_analysis --help
Parsing and analyzing Jaeger traces

Usage: trace_analysis [OPTIONS] <INPUT>

Arguments:
  <INPUT>  

Options:
      --caching-process <CACHING_PROCESS>

  -c, --call-chain-folder <CALL_CHAIN_FOLDER>
          The default source for call-chain information is a sub-folder'CallChain' located in the current folder [default: CallChain/]
  -z, --timezone-minutes <TIMEZONE_MINUTES>
          [default: 120]
  -f, --comma-float

  -t, --trace-output

  -o, --output-ext <OUTPUT_EXT>
          The output-extension determines the output-types are 'json' and 'bincode' (which is also used as the file-extension) [default: json]
  -h, --help
          Print help
  -V, --version
          Print version

The options are:

Contents of the files with statistics

The statistics files, such as 'Stats/cummulative_trace_stats.csv', use ';' as the column separator. This file is divided into four sections:

  1. Generic information, such as the list of trace_ids, the start_times of these traces and the average duration of these traces.
  2. Process-information: Lists all processes (services) in the call-chain and shows the number of inbound and outbound calls on each service. However, it does not contain any details on the operation being called.
  3. Process/operation: Lists statistics such as call-frequency, average time, max time, etc. for each process/service.
  4. Call-chain: Lists statistics for the full call-chain and also shows whether a service is a leaf-node or contains further downstream calls. Please note that the execution time of a service/operation includes the execution time of all downstream calls performed. However, if all heavy lifting is done in leaf-nodes, the sum of the average times of the leaf-nodes should come close to the average trace duration.

Correction of call-chains

Jaeger tracing spans are sent over UDP, a protocol that does not give strong delivery guarantees. Occasionally a span is lost, which results in an incomplete trace and thus broken call-chains in the trace. This is where the '-c' option comes in, as in: trace_analysis <data_folder> -c <data_folder>/CallChain. Here the CallChain output produced by the first run of the tool (which only shows complete chains) is used in subsequent runs to correct call-chains that are incomplete due to missing spans. However, the preferred option is to set up a separate folder to contain the call-chains and point '--call-chain-folder' (or '-c') to this folder.

The call-chain corrections are only applied:

Correction of operations (path parameters)

Path parameters might wreak havoc on our analysis, as path parameters make each URL unique while we are looking for averages over a number of invocations. Therefore the system corrects the URLs by extracting the parameters, for example an order number, and replacing them with a symbolic value such as '{ORDER}'. However, these replacements are currently hardcoded and we need to take some steps to make this configurable.
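To make the idea concrete, here is a minimal sketch of such a replacement. It is not the hardcoded logic of jaeger_stats: the function name `normalize_path` is hypothetical, and the rule used here (every purely numeric path segment is treated as a parameter) is an assumption chosen for illustration.

```rust
/// Replace every purely numeric path segment with a symbolic placeholder.
/// Assumption: parameters are recognisable as all-digit segments; the real
/// tool uses its own hardcoded rules.
fn normalize_path(path: &str, placeholder: &str) -> String {
    path.split('/')
        .map(|segment| {
            let is_numeric =
                !segment.is_empty() && segment.chars().all(|c| c.is_ascii_digit());
            if is_numeric { placeholder } else { segment }
        })
        .collect::<Vec<&str>>()
        .join("/")
}

fn main() {
    // Two distinct URLs collapse to the same operation name.
    println!("{}", normalize_path("/orders/12345/items", "{ORDER}"));
    println!("{}", normalize_path("/orders/67890/items", "{ORDER}"));
}
```

After this normalisation both calls are counted under the same operation, so averages are computed over all invocations instead of over a single URL.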

Computation of the rates (request/second)

If data is provided in large batches it is possible to compute the rate from the data. However, we do not want to assume that all files with traces fall in the same time period. Therefore we compute frequencies from the times between subsequent calls and drop the num_files largest intervals, as these might correspond to gaps in between files. Based on this time the rate is computed as a frequency by the formula f=1/T, where T is the duration in seconds between subsequent calls.
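The computation can be sketched as follows. This is an illustration under stated assumptions, not the code of jaeger_stats: the function name `estimate_rate` is hypothetical, start times are assumed to be in microseconds, and T is assumed to be the mean of the intervals that remain after dropping the num_files largest ones.

```rust
/// Estimate the request rate (calls per second) from call start times.
/// Assumptions: start times are in microseconds, and T is the mean of the
/// intervals left after dropping the `num_files` largest ones.
fn estimate_rate(mut start_times_us: Vec<i64>, num_files: usize) -> Option<f64> {
    start_times_us.sort_unstable();
    // Time between each pair of subsequent calls.
    let mut intervals: Vec<i64> = start_times_us
        .windows(2)
        .map(|pair| pair[1] - pair[0])
        .collect();
    // Drop the largest intervals, which may be gaps in between files.
    intervals.sort_unstable();
    let keep = intervals.len().saturating_sub(num_files);
    intervals.truncate(keep);

    let total_us: i64 = intervals.iter().sum();
    if intervals.is_empty() || total_us == 0 {
        return None; // not enough data to compute a rate
    }
    let mean_seconds = total_us as f64 / intervals.len() as f64 / 1_000_000.0;
    Some(1.0 / mean_seconds) // f = 1/T
}

fn main() {
    // Two batches, one call per second, with a long gap between the batches.
    let s = 1_000_000; // one second in microseconds
    let times = vec![0, s, 2 * s, 3 * s, 1000 * s, 1001 * s, 1002 * s];
    match estimate_rate(times, 2) {
        Some(rate) => println!("estimated rate: {rate} req/s"),
        None => println!("not enough data"),
    }
}
```

In the example the 997-second gap between the two batches is among the dropped intervals, so it does not depress the estimated rate.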

Extracting Jaeger JSON data

In the Jaeger web-based front end it is possible to make a selection of traces. After these traces have been returned you have two methods to extract the JSON files:

  1. Click on a single trace and in the right-top of the page select Download as 'JSON'.
  2. Open the developers tools and navigate to the network-tab. Now fire the request:
    1. Navigate to the response page. It might take some time to download the data and to transform and pretty-print the JSON. Select the full response and copy-paste it to a file
    2. Right-click on the response and select 'Copy Curl-URL' (for your system). Paste this URL in a console and redirect the output to a file.

Using method 2.1 you can get approximately 1000 traces in a batch. The batch will be available as pretty-printed JSON in UTF8.

Method 2.2 allows you to select 1000 traces or more. However, the output is a single line of raw JSON (not pretty-printed) and the file is encoded in UTF-16-LE with BOM. The 'trace_analysis' tool can handle these files and will do an in-memory conversion to UTF8 before processing. Beware that this is a non-streaming conversion, so the full file is in memory twice.
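A non-streaming conversion of this kind can be sketched with the standard library alone. The function name `decode_utf16le` is hypothetical and the sketch is not the implementation used by 'trace_analysis'; it only shows why the raw bytes and the decoded text are both held in memory.

```rust
/// Decode UTF-16-LE bytes (with or without BOM) into a UTF-8 `String`.
/// Both the input bytes and the decoded text are held in memory at once.
fn decode_utf16le(bytes: &[u8]) -> Result<String, String> {
    // Skip the byte-order mark (FF FE) when present.
    let body = if bytes.starts_with(&[0xFF, 0xFE]) { &bytes[2..] } else { bytes };
    if body.len() % 2 != 0 {
        return Err("odd number of bytes in UTF-16 input".to_string());
    }
    // Combine each pair of bytes into one little-endian 16-bit code unit.
    let units: Vec<u16> = body
        .chunks_exact(2)
        .map(|pair| u16::from_le_bytes([pair[0], pair[1]]))
        .collect();
    String::from_utf16(&units).map_err(|e| e.to_string())
}

fn main() {
    // "{}" encoded as UTF-16-LE with a leading BOM.
    let raw = [0xFF, 0xFE, b'{', 0x00, b'}', 0x00];
    match decode_utf16le(&raw) {
        Ok(text) => println!("{text}"),
        Err(err) => eprintln!("conversion failed: {err}"),
    }
}
```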

Using the stitch tool to merge results of different runs

The stitch tool is used to take a series of trace_analysis outputs and stitch them together to a single time-series analysis. The inputs are defined in a file 'input.stitch'.

The collected (time-series) output is written to a file 'stitched.csv' (default), which can easily be read into Microsoft Excel. The output contains (fine-grained) metrics-data as a time-series for all:

Next to the detailed output, a file is generated that shows the anomalies (outliers) that have been detected.

When you run 'stitch' with the --help flag (or -h) you see:

$ stitch -h
Stitching results of different runs of trace_analysis into a single CSV for visualization in Excel

Usage: stitch [OPTIONS]

Options:
  -s, --stitch-list <STITCH_LIST>                      [default: input.stitch]
  -o, --output <OUTPUT>                                [default: stitched.csv]
  -a, --anomalies <ANOMALIES>                          [default: anomalies.csv]
  -c, --comma-float                                    
  -d, --drop-count <DROP_COUNT>                        [default: 0]
      --scaled-slope-bound <SCALED_SLOPE_BOUND>        [default: 0.05]
      --st-num-points <ST_NUM_POINTS>                  [default: 5]
      --scaled-st-slope-bound <SCALED_ST_SLOPE_BOUND>  [default: 0.05]
      --l1-dev-bound <L1_DEV_BOUND>                    [default: 2]
  -h, --help                                           Print help
  -V, --version                                        Print version

The options are:

An example of an input-file ('input.stitch') is:

#  comment line: this line is fully ignored
/home/ceesvk/jaeger/batch/Stats/cummulative_trace_stats.json       # an absolute path
../../jaeger/get_order/Stats/cummulative_trace_stats.json    # a relative path
% ../../jaeger/post_order/Stats/cummulative_trace_stats.json  # This line is showing up as an empty column due to the % in front

# yet another comment (empty line above is ignored)

Beware that ALL files in the 'input.stitch' should exist and should be valid input files, otherwise the 'stitch' program will terminate with no output.
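The rules that can be read from the example file (a '#' starts a comment, blank lines are skipped, and a leading '%' yields an empty column) can be sketched as a small parser. The names `StitchEntry` and `parse_stitch_list` are hypothetical, and the sketch assumes that paths never contain a '#'; the real 'stitch' tool may treat edge cases differently.

```rust
/// One line of an 'input.stitch' file after removing comments.
#[derive(Debug, PartialEq)]
enum StitchEntry {
    /// Path to a trace_analysis output file.
    File(String),
    /// A line prefixed with '%': kept as an empty column in the output.
    EmptyColumn,
}

/// Parse the contents of a stitch list.
/// Assumption: '#' never occurs inside a path, so everything after the
/// first '#' on a line is treated as a comment.
fn parse_stitch_list(content: &str) -> Vec<StitchEntry> {
    content
        .lines()
        .filter_map(|line| {
            let line = line.split('#').next().unwrap_or("").trim();
            if line.is_empty() {
                None // comment-only or blank line
            } else if line.starts_with('%') {
                Some(StitchEntry::EmptyColumn)
            } else {
                Some(StitchEntry::File(line.to_string()))
            }
        })
        .collect()
}

fn main() {
    let content = "# a comment\n\
                   ../get_order/Stats/cummulative_trace_stats.json  # relative path\n\
                   % ../post_order/Stats/cummulative_trace_stats.json\n";
    for entry in parse_stitch_list(content) {
        println!("{entry:?}");
    }
}
```

Keeping the '%' lines as explicit entries, rather than dropping them, preserves the column position of the skipped run in the stitched time-series.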

Extracting traces with the show_traces tool

When extracting datasets via Curl or other tools, the Jaeger system returns up to 1000 traces in a single file. This file is in UTF-16-LE encoding instead of UTF-8 and is a JSON file in a compact (minimized) format. Thus it is difficult to read these files, or to extract data out of them. For this purpose we provide the show_traces tool. It reads all Jaeger traces in a folder and then writes them, one file per trace, to the folder 'Jaeger'. If you are only interested in a few specific traces, you can provide the trace-ids of these traces as a comma-separated list.

When you run show_traces with the --help flag you see:

Show the Jaeger-traces, or a selection of jaeger-traces, as Pretty-printed JSON in UTF-8 format

Usage: show_traces [OPTIONS] <INPUT>

Arguments:
  <INPUT>  

Options:
  -t, --trace-ids <TRACE_IDS>                The default sources is the current folder [default: ]
  -z, --timezone-minutes <TIMEZONE_MINUTES>  [default: 120]
  -h, --help                                 Print help
  -V, --version                              Print version

How to install the Jaeger_stats tools

The Jaeger_stats tooling is deployed to pypi.org as a Python project via an automated GitHub CI/CD pipeline. Thus the tools can be installed easily on Windows, Mac and Linux via the following command:

pip install jaeger_stats

If you need pre-releases of the tool you need to use:

pip install --pre --force-reinstall jaeger_stats

How to build trace_analysis (in Rust)

The tool is included in the examples folder and can be built via the command:

cargo build --example trace_analysis

The 'trace_analysis' executable can be found in 'target/debug/examples/trace_analysis'.

In case you need to process a large volume of traces you might aim for the more performant 'release' build (which also drops some run-time checks). To build a release version use:

cargo build --release --example trace_analysis

The 'trace_analysis' executable can be found in 'target/release/examples/trace_analysis'.

You can also install the tool from a checkout of the repository via

cargo install --path . --example trace_analysis

On Linux this will deploy a release version of 'trace_analysis' in the folder '$HOME/.cargo/bin/', which is assumed to be included in your PATH.

License

This project is primarily distributed under the terms of both the MIT license and the Apache License (Version 2.0), same as the Rust language.