DataDog / dd-trace-py

Datadog Python APM Client
https://ddtrace.readthedocs.io/

could not get cpu sample #7085

Open zdyj3170101136 opened 9 months ago

zdyj3170101136 commented 9 months ago

Summary of problem

Could not get CPU samples.

Which version of dd-trace-py are you using?

1.20.2

Which version of pip are you using?

./pip --version
pip 23.2.1 from /home/data/server/tokenization_mys/venv/lib/python3.7/site-packages/pip (python 3.7)

Which libraries and their versions are you using?

[root@plat-sh-data-testing-common-pic-infer-demo001 bin]# ./pip freeze
amqp==2.6.1
APScheduler==3.10.4
attrs==23.1.0
backcall==0.2.0
backports.zoneinfo==0.2.1
billiard==3.6.4.0
blinker==1.6.2
bytecode==0.13.0
cattrs==23.1.2
celery==4.4.7
certifi==2023.7.22
charset-normalizer==3.2.0
click==8.1.7
dacite==1.7.0
ddsketch==2.0.4
ddtrace==1.20.2
decorator==5.1.1
Deprecated==1.2.14
envier==0.4.0
et-xmlfile==1.1.0
exceptiongroup==1.1.3
Flask==1.1.1
Flask-APScheduler==1.12.2
Flask-Sockets==0.2.1
gevent==21.12.0
gevent-websocket==0.10.1
greenlet==1.1.3.post0
gunicorn==21.2.0
h2==4.1.0
hanlp-downloader==0.0.25
hpack==4.0.0
hyperframe==6.0.1
idna==3.4
importlib-metadata==6.7.0
inotify==0.2.10
ipython==7.34.0
itsdangerous==1.1.0
jedi==0.19.0
Jinja2==2.11.3
JPype1==0.7.0
kombu==4.6.11
MarkupSafe==1.1.1
matplotlib-inline==0.1.6
neo-core==0.0.55
neo_metrics==0.0.3
netifaces==0.10.9
nose==1.3.7
numpy==1.19.0
nvidia-ml-py==12.535.108
openpyxl==3.0.10
opentelemetry-api==1.20.0
oss2==2.1.0
packaging==23.1
pandas==1.3.5
parso==0.8.3
pexpect==4.8.0
pickleshare==0.7.5
prometheus-client==0.17.1
prometheus-flask-exporter==0.21.0
prompt-toolkit==3.0.39
protobuf==4.24.3
psutil==5.9.5
ptyprocess==0.7.0
Pygments==2.16.1
pyhanlp==0.1.84
PyMySQL==1.1.0
python-dateutil==2.8.2
pytz==2023.3
PyYAML==6.0.1
requests==2.31.0
setproctitle==1.1.10
six==1.16.0
SQLAlchemy==1.3.24
traitlets==5.9.0
typeguard==2.11.0
typing_extensions==4.7.1
tzlocal==5.0.1
urllib3==1.26.0
vine==1.3.0
wcwidth==0.2.6
Werkzeug==1.0.1
wrapt==1.15.0
xmltodict==0.13.0
xxhash==3.3.0
zipp==3.15.0
zope.event==5.0
zope.interface==6.0

How can we reproduce your problem?

none

python version

Python 3.7.7

start command

from ddtrace.profiling import Profiler

prof = Profiler(
    url="http://parca-collector.private.mihoyo.com:7070",
    enable_code_provenance=True,  # provide library version
)
prof.start()  # Should be as early as possible, e.g. before other imports, to ensure everything is profiled

What is the result that you get?

The resulting pprof does not have CPU samples, only memory samples. (Attachment: auto.pprof.zip)

What is the result that you expected?

The pprof should have CPU samples.

more information

I tested more ddtrace versions: 1.18.6 is OK but 1.19.0 is not.

emmettbutler commented 9 months ago

Thank you for the report.

@DataDog/profiling-python

P403n1x87 commented 9 months ago

@zdyj3170101136 It looks like the profiles are not being uploaded to Datadog. Please note that some profiles might not contain CPU/Wall time samples. These metrics would then show up in aggregates, which are done automatically by Datadog.

1.18.6 is ok but 1.19.0 is not ok

Was this checked on multiple pprofs per version? It might be that the pprof generated with 1.18 had CPU samples by chance. Would you be able to generate more pprofs to verify that the data you are looking for is consistently missing from them?

Thanks 🙏

zdyj3170101136 commented 9 months ago

@zdyj3170101136 It looks like the profiles are not being uploaded to Datadog. Please note that some profiles might not contain CPU/Wall time samples. These metrics would then show up in aggregates, which are done automatically by Datadog.

1.18.6 is ok but 1.19.0 is not ok

Was this checked on multiple pprofs per version? It might be that the pprof generated with 1.18 had CPU samples by chance. Would you be able to generate more pprofs to verify that the data you are looking for is consistently missing from them?

Thanks 🙏

I uploaded the profiles to a local HTTP server and checked them; 1.18.6 consistently has no CPU samples. Here is my code:

package main

import (
    "io"
    "log"
    "mime"
    "mime/multipart"
    "net/http"
    "os"
    "strconv"
    "strings"
    "sync/atomic"
    "time"

    googleprofile "github.com/google/pprof/profile"
)

var w int32

func main() {
    http.HandleFunc("/", func(writer http.ResponseWriter, request *http.Request) {
        atomic.AddInt32(&w, 1)
        log.Println(request.URL.Path)
        r, err := parseProfilingInputRequest(request)
        log.Println(err)
        if err != nil {
            // guard: r is nil when parsing fails
            return
        }
        for _, profile := range r.profiles {
            os.WriteFile(profile.name+strconv.Itoa(int(atomic.LoadInt32(&w))), profile.data, 0666)
            p, err := googleprofile.ParseData(profile.data)
            if err == nil {
                for i := range p.SampleType {
                    var ok bool
                    for _, sample := range p.Sample {
                        if sample.Value[i] != 0 {
                            ok = true
                        }
                    }
                    if ok {
                        log.Println(*p.SampleType[i])
                    }
                }
            }

        }
    })
    http.ListenAndServe(":8081", nil)
}

// should sync with gopkg.in/!data!dog/dd-trace-go.v1@v1.37.0/profiler/upload.go
func parseProfilingInputRequest(req *http.Request) (*profilingInputRequest, error) {
    var result profilingInputRequest

    _, params, err := mime.ParseMediaType(req.Header.Get("Content-Type"))
    if err != nil {
        return nil, err
    }
    mr := multipart.NewReader(req.Body, params["boundary"])
    for {
        p, err := mr.NextPart()
        if err == io.EOF {
            break
        }
        if err != nil {
            return nil, err
        }
        slurp, err := io.ReadAll(p)
        if err != nil {
            return nil, err
        }

        switch k := p.FormName(); k {
        case "version":
            result.version = string(slurp)
        case "family":
            result.family = string(slurp)
        case "start":
            result.start, err = parseTime(slurp)
        case "end":
            result.end, err = parseTime(slurp)
        case "tags[]":
            // python: [service:python_profiling_tutorial.py runtime-id:95c63439b302484984ef541a7b94c365 host:plat-sh-infra-prod-jaeger-collector003 language:python runtime:CPython runtime_version:3.7.7 profiler_version:1.5.3]
            result.tags = append(result.tags, string(slurp))
        case "runtime-id":
            result.runtimeID = string(slurp)
        default:
            result.profiles = append(result.profiles, profile{
                // fmt.Sprintf("data[%s]", name)
                name: strings.TrimPrefix(strings.TrimSuffix(k, "]"), "data["),
                data: slurp,
            })
        }
        if err != nil {
            return nil, err
        }
    }
    result.containerID = req.Header.Get("Datadog-Container-ID")
    return &result, nil
}

func parseTime(data []byte) (time.Time, error) {
    return time.Parse(time.RFC3339, string(data))
}

type profile struct {
    name string
    data []byte
}

type profilingInputRequest struct {
    start, end      time.Time
    version, family string

    containerID string // go
    runtimeID   string // python
    profiles    []profile
    tags        []string
}

Here is the output:

2023/10/07 10:35:48 /profiling/v1/input
2023/10/07 10:35:48 <nil>
2023/10/07 10:35:48 {alloc-samples count 0 0}
2023/10/07 10:35:48 {alloc-space bytes 0 0}
2023/10/07 10:35:48 {heap-space bytes 0 0}
2023/10/07 10:36:41 /profiling/v1/input
2023/10/07 10:36:41 <nil>
2023/10/07 10:36:41 {alloc-samples count 0 0}
2023/10/07 10:36:41 {alloc-space bytes 0 0}
2023/10/07 10:36:41 {heap-space bytes 0 0}
2023/10/07 10:37:40 /profiling/v1/input
2023/10/07 10:37:40 <nil>
2023/10/07 10:37:40 {alloc-samples count 0 0}
2023/10/07 10:37:40 {alloc-space bytes 0 0}
2023/10/07 10:37:40 {heap-space bytes 0 0}

zdyj3170101136 commented 9 months ago

On a typical JSON benchmark everything is fine on both 1.18.6 and 1.19.0.

I start my app using gunicorn; maybe this is the reason.

zdyj3170101136 commented 9 months ago

I was wrong; the new tests show that 1.12.0 is not OK and 1.11.0 is OK.

zdyj3170101136 commented 9 months ago

@P403n1x87 Comparing v1.11.0 and v1.12.0, I found the reason: https://github.com/DataDog/dd-trace-py/compare/v1.11.0...v1.12.0#diff-e8d0f6cf5f21b81feffc53d55c36e84e6abd4b10f06112f794c9c686d4a30560

In v1.12.0 it ignores the thread when the thread name is None, whereas v1.11.0 does not. (Screenshot attached.)

Here is the pprof sample label value in v1.11.0: thread name: Anonymous Thread 140664801060608

And in this line of code it returns an 'Anonymous Thread' prefix as the thread name when the thread name is None: https://github.com/DataDog/dd-trace-py/blob/fa3a486f09a168bb5cc94bbf3bc5494116aad796/ddtrace/profiling/exporter/pprof.pyx#L122 (Screenshot attached.)

So in my application all thread names are None, and as a result ddtrace profiling ignores the threads and outputs nothing.
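
To illustrate, here is a rough sketch (not the actual ddtrace source, just the behaviour described in the linked diff) of how the two versions treat an unnamed thread:

def thread_label_v1_11(thread_id, thread_name):
    # Older behaviour: an unnamed thread still gets a synthetic label,
    # so its samples are kept in the pprof output.
    if thread_name is None:
        return "Anonymous Thread %d" % thread_id
    return thread_name

def thread_label_v1_12(thread_id, thread_name):
    # Newer behaviour: unnamed threads are skipped entirely, so an app
    # where every thread name is None ends up with empty CPU profiles.
    if thread_name is None:
        return None  # caller drops the sample
    return thread_name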

P403n1x87 commented 8 months ago

in my application, all of thread name is none

May I check with you your use-case for having thread names set to None? The reason we discard threads with no name is because we assume that it is an indication of a thread object being initialised, as by default every thread gets a name.
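
For reference, the default naming behaviour on plain CPython (no gevent) can be checked with:

import threading

print(threading.current_thread().name)   # "MainThread"
t = threading.Thread(target=print)
print(t.name)                            # e.g. "Thread-1" (newer CPython may append the target name)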

zdyj3170101136 commented 8 months ago

in my application, all of thread name is none

May I check with you your use-case for having thread names set to None? The reason we discard threads with no name is because we assume that it is an indication of a thread object being initialised, as by default every thread gets a name.

We use gunicorn and gevent. Maybe this is the reason?

P403n1x87 commented 8 months ago

@zdyj3170101136 thanks for the extra details. In our experience with gunicorn and gevent (at least with the versions that we support), we don't seem to have issues with thread names. In fact, when gevent monkey-patching is used, we would expect the application to have a single thread (the MainThread), and the work distributed across tasks. These would be greenlets that are generally given a name by the gevent hub. The profiler has support for tasks (both asyncio tasks and greenlets) and we would report the name of the tasks.

Would you be able to provide us with the high-level structure of the application in question, to give us an idea of how gunicorn + gevent are being used? Based on the list of dependencies that you provided earlier, I'd assume it is to serve a Flask application. If you happen to have a small reproducer that you could share with us, that would be great 🙏.
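
Something as small as the following sketch (placeholder module and endpoint names, not your actual code) would already help:

from ddtrace.profiling import Profiler

prof = Profiler()
prof.start()

from flask import Flask

app = Flask(__name__)

@app.route("/")
def index():
    return "ok"

# typically launched with something like: gunicorn --worker-class gevent app:app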

sumerc commented 4 months ago

Hi,

How about the following:

from gevent import monkey
monkey.patch_all()

from ddtrace.profiling import Profiler
from ddtrace.profiling.collector import stack

p = Profiler()
p.start()

stack_collector = None
for c in p._collectors:
    if isinstance(c, stack.StackCollector):
        stack_collector = c
        break

import time
while True:
    r = stack_collector.collect()
    print(r)
    time.sleep(1.0)

Possible root cause

If we call monkey.patch_all(), it monkey-patches the threading library. The stack sampler then gets the stacks via sys._currentframes(), and _threading.get_thread_name(pthread_id) is used to look up the thread name from the threading library using those thread ids. We get None, which results in empty profiles. This is because threading._active and sys._currentframes() report different thread ids.
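
A minimal check along these lines (a sketch based on the description above, not code from the profiler itself) shows the mismatch:

from gevent import monkey
monkey.patch_all()  # patch threading first, as in the reproducer above

import sys
import threading

# ids of the real OS threads, as the sampler sees them via sys._currentframes()
frame_ids = set(sys._currentframes().keys())
# ids known to the (monkey-patched) threading module
active_ids = set(threading._active.keys())

print("sys._currentframes ids:", frame_ids)
print("threading._active ids: ", active_ids)
print("overlap:", frame_ids & active_ids)  # per the above, these sets do not match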

What might be the best fix here?

sumerc commented 4 months ago

Additional finding:

When I put import ddtrace.auto at the top, before gevent monkey-patching, it works.
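
That is, the ordering that works for me is roughly:

import ddtrace.auto  # imported before anything else, in particular before gevent

from gevent import monkey
monkey.patch_all()

# ... rest of the application ...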

sumerc commented 4 months ago

@P403n1x87 , pinging maybe you missed my message above?

P403n1x87 commented 4 months ago

@P403n1x87 , pinging maybe you missed my message above?

Sorry @sumerc, I thought from your last comment that you had it figured out. As documented here, ddtrace.auto needs to be imported as the very first thing when using gevent without ddtrace-run.

sumerc commented 3 months ago

Oh ok! I just wanted to ensure this is the way for profiling gevent apps.

One final thing: don't you think this should work with:

import ddtrace.profiling.auto

Because, if I manually use the Profiler APIs in my code, do I still need to import ddtrace.auto? I am asking because ddtrace.auto seems to do lots of things under the hood, and I'm not sure everything is necessary for profiling too.

P403n1x87 commented 3 months ago

Because, if I manually use Profiler APIs in my code, do I still need to include ddtrace.auto? I am asking because ddtrace.auto seems to be doing lots of things under the hood. Not sure if everything is necessary for profiling too.

That's a good point. The current design is to have ddtrace.auto do pretty much the same job as the ddtrace-run wrapper command. So you're right that this does more than ddtrace.profiling.auto would do. Part of the job is also that of ensuring that we set up the runtime in a way that gives us compatibility with gevent. So currently ddtrace.auto would have to be imported as the very first thing, before ddtrace.profiling.auto. The recommendation though is to use ddtrace-run whenever possible.
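
For completeness, the wrapper form would be something along the lines of the following, with app:app standing in for the actual gunicorn entry point:

DD_PROFILING_ENABLED=true ddtrace-run gunicorn --worker-class gevent app:app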