DataDog / dd-trace-php

Datadog PHP Clients
https://docs.datadoghq.com/tracing/setup/php
Other
481 stars 149 forks source link

perf(profiling): remove execute_internal hook #2719

Open morrisonlevi opened 2 weeks ago

morrisonlevi commented 2 weeks ago

PROF-9982

Description

Background and Motivation

For wall- and CPU- time, we walk the call stack in the VM interrupt handler. Unfortunately, if an internal function call is the leaf frame at the time the interrupt was triggered, it gets popped off the call stack before the VM interrupt handler is triggered. To work around this, we add the zend_execute_internal hook and manually call the routine to collect the call stack while the internal frame was still on the call stack.

Using the zend_execute_internal hook has performance costs today. These costs are made even worse on the upcoming PHP 8.4 release because it adds a new optimization for frameless calls, and setting zend_execute_internal disables this optimization.

The goal of this task is to avoid installing this zend_execute_internal hook, while not regressing on performance, and without losing functionality--we definitely need that leaf frame. A lot of functions show up there, like regular expression and database related functions.

Implementation

This does some... uh... let's say "clever" things to avoid using the zend_execute_internal hook. In thread B which triggers the interrupt, it also saves the current value of thread A's EG(current_execute_data). Then when thread A handles its interrupt, it compares that against EG(vm_stack_top). If they are equal, then that's used as the top of the stack instead of the execute_data passed to the interrupt handler by the engine, and otherwise we just use execute_data as normal.

Reviewer checklist

pr-commenter[bot] commented 2 weeks ago

Benchmarks

Benchmark execution time: 2024-06-28 02:26:52

Comparing candidate commit 6d1ae69cf96f4d31fbcdb0ecf5e04988af578504 in PR branch levi/vm_stack_top with baseline commit 44a07ba98de2062986a33cc1592a2b87533441a0 in branch master.

Found 0 performance improvements and 0 performance regressions! Performance is the same for 27 metrics, 9 unstable metrics.

codecov-commenter commented 5 days ago

Codecov Report

All modified and coverable lines are covered by tests :white_check_mark:

Project coverage is 79.52%. Comparing base (44a07ba) to head (6d1ae69).

Additional details and impacted files [![Impacted file tree graph](https://app.codecov.io/gh/DataDog/dd-trace-php/pull/2719/graphs/tree.svg?width=650&height=150&src=pr&token=eXio8H7vwF&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=DataDog)](https://app.codecov.io/gh/DataDog/dd-trace-php/pull/2719?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=DataDog) ```diff @@ Coverage Diff @@ ## master #2719 +/- ## ============================================ + Coverage 79.24% 79.52% +0.28% Complexity 2216 2216 ============================================ Files 201 201 Lines 22595 22595 ============================================ + Hits 17905 17969 +64 + Misses 4690 4626 -64 ``` | [Flag](https://app.codecov.io/gh/DataDog/dd-trace-php/pull/2719/flags?src=pr&el=flags&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=DataDog) | Coverage Δ | | |---|---|---| | [tracer-extension](https://app.codecov.io/gh/DataDog/dd-trace-php/pull/2719/flags?src=pr&el=flag&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=DataDog) | `78.82% <ø> (ø)` | | | [tracer-php](https://app.codecov.io/gh/DataDog/dd-trace-php/pull/2719/flags?src=pr&el=flag&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=DataDog) | `80.54% <ø> (+0.69%)` | :arrow_up: | Flags with carried forward coverage won't be shown. [Click here](https://docs.codecov.io/docs/carryforward-flags?utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=DataDog#carryforward-flags-in-the-pull-request-comment) to find out more. [see 6 files with indirect coverage changes](https://app.codecov.io/gh/DataDog/dd-trace-php/pull/2719/indirect-changes?src=pr&el=tree-more&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=DataDog) ------ [Continue to review full report in Codecov by Sentry](https://app.codecov.io/gh/DataDog/dd-trace-php/pull/2719?dropdown=coverage&src=pr&el=continue&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=DataDog). > **Legend** - [Click here to learn more](https://docs.codecov.io/docs/codecov-delta?utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=DataDog) > `Δ = absolute (impact)`, `ø = not affected`, `? = missing data` > Powered by [Codecov](https://app.codecov.io/gh/DataDog/dd-trace-php/pull/2719?dropdown=coverage&src=pr&el=footer&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=DataDog). Last update [44a07ba...6d1ae69](https://app.codecov.io/gh/DataDog/dd-trace-php/pull/2719?dropdown=coverage&src=pr&el=lastupdated&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=DataDog). Read the [comment docs](https://docs.codecov.io/docs/pull-request-comments?utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=DataDog).
morrisonlevi commented 4 days ago

It took quite a few randomized test runs but I managed to get a crash. Here's the top of the stack:

#0  datadog_php_profiling::profiling::stack_walking::detail::StringCache::get_or_insert (slot=0, f=..., self=<optimized out>)
    at profiling/src/profiling/stack_walking.rs:103
#1  datadog_php_profiling::profiling::stack_walking::detail::handle_function_cache_slot (func=0x7fbf16da3600, string_cache=<optimized out>)
    at profiling/src/profiling/stack_walking.rs:284
#2  datadog_php_profiling::profiling::stack_walking::detail::collect_call_frame (string_set=0x565364b5dba8, execute_data=<optimized out>)
    at profiling/src/profiling/stack_walking.rs:242
#3  datadog_php_profiling::profiling::stack_walking::detail::collect_stack_sample::{{closure}} (cell=<optimized out>) at profiling/src/profiling/stack_walking.rs:202
#4  std::thread::local::LocalKey<T>::try_with (self=<optimized out>, f=...) at /rustc/eb26296b556cef10fb713a38f3d16b9886080f26/library/std/src/thread/local.rs:270
#5  std::thread::local::LocalKey<T>::with (f=...) at /rustc/eb26296b556cef10fb713a38f3d16b9886080f26/library/std/src/thread/local.rs:246
#6  datadog_php_profiling::profiling::stack_walking::detail::collect_stack_sample (top_execute_data=<optimized out>) at profiling/src/profiling/stack_walking.rs:195
#7  0x00007fbf1d960303 in datadog_php_profiling::profiling::Profiler::collect_time (self=0x7fbf1deee698 <datadog_php_profiling::PROFILER+8>, 
    execute_data=0x23880000, interrupt_count=537535084) at profiling/src/profiling/mod.rs:693
#8  0x00007fbf1d961a1c in datadog_php_profiling::wall_time::ddog_php_prof_interrupt_function::{{closure}} (cell=<optimized out>) at profiling/src/wall_time.rs:76
#9  std::thread::local::LocalKey<T>::try_with (self=<optimized out>, f=...) at /rustc/eb26296b556cef10fb713a38f3d16b9886080f26/library/std/src/thread/local.rs:270
#10 std::thread::local::LocalKey<T>::with (f=...) at /rustc/eb26296b556cef10fb713a38f3d16b9886080f26/library/std/src/thread/local.rs:246
#11 datadog_php_profiling::wall_time::ddog_php_prof_interrupt_function (execute_data=0x7fbf1fc14090) at profiling/src/wall_time.rs:36
#12 0x00007fbf1eddc0fc in dd_uhook_end (invocation=44259, execute_data=0x7fbf1fc14090, retval=0x7ffde6a66b10, auxiliary=0x7fbf1fdb5550, dynamic=0x7fbf16aab1c0)
    at /home/circleci/datadog/tmp/build_extension/ext/hook/uhook.c:383

Basically, the tracer calls the profiler's interrupt function before calling a closure in an end hook. The profiler crashes when looking at the run time cache of the "risky" frame.

Havnen't yet identified why the run time cache is invalid.