Open zdyj3170101136 opened 9 months ago
Thank you for the report.
@DataDog/profiling-python
@zdyj3170101136 It looks like the profiles are not being uploaded to Datadog. Please note that some profiles might not contain CPU/Wall time samples. These metrics would then show up in aggregates, which are done automatically by Datadog.
1.18.6 is ok but 1.19.0 is not ok
Was this checked on multiple pprofs per versions? It might be that the pprof generated with 1.18 had CPU samples by chance. Would you be able to generate more pprofs to verify that the data you are looking for is consistently missing from them?
Thanks 🙏
I uploaded the profiles to a local HTTP server and checked them: 1.19.0 consistently has no CPU samples. Here is my code:
```go
package main

import (
	"io"
	"log"
	"mime"
	"mime/multipart"
	"net/http"
	"os"
	"strconv"
	"strings"
	"sync/atomic"
	"time"

	googleprofile "github.com/google/pprof/profile"
)

var w int32

func main() {
	http.HandleFunc("/", func(writer http.ResponseWriter, request *http.Request) {
		atomic.AddInt32(&w, 1)
		log.Println(request.URL.Path)
		r, err := parseProfilingInputRequest(request)
		log.Println(err)
		if err != nil {
			return // avoid dereferencing a nil result below
		}
		for _, profile := range r.profiles {
			os.WriteFile(profile.name+strconv.Itoa(int(atomic.LoadInt32(&w))), profile.data, 0666)
			p, err := googleprofile.ParseData(profile.data)
			if err == nil {
				// Log each sample type that has at least one non-zero value.
				for i := range p.SampleType {
					var ok bool
					for _, sample := range p.Sample {
						if sample.Value[i] != 0 {
							ok = true
						}
					}
					if ok {
						log.Println(*p.SampleType[i])
					}
				}
			}
		}
	})
	http.ListenAndServe(":8081", nil)
}

// should sync with gopkg.in/!data!dog/dd-trace-go.v1@v1.37.0/profiler/upload.go
func parseProfilingInputRequest(req *http.Request) (*profilingInputRequest, error) {
	var result profilingInputRequest
	_, params, err := mime.ParseMediaType(req.Header.Get("Content-Type"))
	if err != nil {
		return nil, err
	}
	mr := multipart.NewReader(req.Body, params["boundary"])
	for {
		p, err := mr.NextPart()
		if err == io.EOF {
			break
		}
		if err != nil {
			return nil, err
		}
		slurp, err := io.ReadAll(p)
		if err != nil {
			return nil, err
		}
		switch k := p.FormName(); k {
		case "version":
			result.version = string(slurp)
		case "family":
			result.family = string(slurp)
		case "start":
			result.start, err = parseTime(slurp)
		case "end":
			result.end, err = parseTime(slurp)
		case "tags[]":
			// python: [service:python_profiling_tutorial.py runtime-id:95c63439b302484984ef541a7b94c365 host:plat-sh-infra-prod-jaeger-collector003 language:python runtime:CPython runtime_version:3.7.7 profiler_version:1.5.3]
			result.tags = append(result.tags, string(slurp))
		case "runtime-id":
			result.runtimeID = string(slurp)
		default:
			result.profiles = append(result.profiles, profile{
				// fmt.Sprintf("data[%s]", name)
				name: strings.TrimPrefix(strings.TrimSuffix(k, "]"), "data["),
				data: slurp,
			})
		}
		if err != nil {
			return nil, err
		}
	}
	result.containerID = req.Header.Get("Datadog-Container-ID")
	return &result, nil
}

func parseTime(data []byte) (time.Time, error) {
	return time.Parse(time.RFC3339, string(data))
}

type profile struct {
	name string
	data []byte
}

type profilingInputRequest struct {
	start, end      time.Time
	version, family string
	containerID     string // go
	runtimeID       string // python
	profiles        []profile
	tags            []string
}
```
Here is the output:

```
2023/10/07 10:35:48 /profiling/v1/input
2023/10/07 10:35:48 <nil>
2023/10/07 10:35:48 {alloc-samples count 0 0}
2023/10/07 10:35:48 {alloc-space bytes 0 0}
2023/10/07 10:35:48 {heap-space bytes 0 0}
2023/10/07 10:36:41 /profiling/v1/input
2023/10/07 10:36:41 <nil>
2023/10/07 10:36:41 {alloc-samples count 0 0}
2023/10/07 10:36:41 {alloc-space bytes 0 0}
2023/10/07 10:36:41 {heap-space bytes 0 0}
2023/10/07 10:37:40 /profiling/v1/input
2023/10/07 10:37:40 <nil>
2023/10/07 10:37:40 {alloc-samples count 0 0}
2023/10/07 10:37:40 {alloc-space bytes 0 0}
2023/10/07 10:37:40 {heap-space bytes 0 0}
```
On a typical JSON benchmark everything is fine on both 1.18.6 and 1.19.0.
I start my app using gunicorn; maybe this is the reason.
I was wrong: a new round of tests suggests that 1.12.0 is not ok and 1.11.0 is ok.
@P403n1x87 Comparing v1.11.0 and v1.12.0, I found the reason: https://github.com/DataDog/dd-trace-py/compare/v1.11.0...v1.12.0#diff-e8d0f6cf5f21b81feffc53d55c36e84e6abd4b10f06112f794c9c686d4a30560
In v1.12.0 the exporter ignores a thread when its thread name is None, while v1.11.0 does not.
Here is the pprof sample label value in v1.11.0: thread name: Anonymous Thread 140664801060608.
At this line of code, v1.11.0 returns an 'Anonymous Thread' prefix as the thread name when the thread name is None:
https://github.com/DataDog/dd-trace-py/blob/fa3a486f09a168bb5cc94bbf3bc5494116aad796/ddtrace/profiling/exporter/pprof.pyx#L122
In my application every thread name is None, so the ddtrace profiler ignores all threads and produces an empty profile.
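The behavioural difference can be sketched as follows (these function names are illustrative, not the actual ddtrace code; see the linked diff for the real implementation):

```python
def thread_label_v1_11(thread_name, thread_id):
    """v1.11.0-style behaviour: synthesize a fallback name when none is set."""
    if thread_name is None:
        return "Anonymous Thread %d" % thread_id
    return thread_name


def thread_label_v1_12(thread_name, thread_id):
    """v1.12.0-style behaviour: return None so the caller drops the thread."""
    if thread_name is None:
        return None
    return thread_name


print(thread_label_v1_11(None, 140664801060608))  # Anonymous Thread 140664801060608
print(thread_label_v1_12(None, 140664801060608))  # None -> samples discarded
```

With every thread unnamed, the v1.12.0-style behaviour discards every sample, which matches the empty profiles observed above.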
> in my application, all of thread name is none

May I check with you: what is your use case for having thread names set to None? The reason we discard threads with no name is that we assume it indicates a thread object that is still being initialised, since by default every thread gets a name.
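The "by default every thread gets a name" assumption is easy to confirm with the standard library:

```python
import threading

# A Thread created without an explicit name still gets a default one,
# e.g. "Thread-1" (newer CPython versions may append the target name).
t = threading.Thread(target=lambda: None)
print(t.name)

# An explicit name is preserved as-is.
named = threading.Thread(target=lambda: None, name="worker")
print(named.name)
```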
We use gunicorn and gevent. Maybe this is the reason?
@zdyj3170101136 thanks for the extra details. In our experience with gunicorn and gevent (at least with the versions that we support), we don't seem to have issues with thread names. In fact, when gevent monkey-patching is used, we would expect the application to have a single thread (the MainThread) and the work distributed across tasks. These would be greenlets, which are generally given a name by the gevent hub. The profiler has support for tasks (both asyncio tasks and greenlets), and we report the names of the tasks.
Would you be able to provide us with the high-level structure of the application in question, to give us an idea of how gunicorn + gevent are being used? Based on the list of dependencies that you provided earlier, I'd assume it serves a Flask application. If you happen to have a small reproducer that you could share with us, that would be great 🙏.
Hi,
How about the following:

```python
from gevent import monkey
monkey.patch_all()

from ddtrace.profiling import Profiler
from ddtrace.profiling.collector import stack

p = Profiler()
p.start()

stack_collector = None
for c in p._collectors:
    if isinstance(c, stack.StackCollector):
        stack_collector = c
        break

import time
while True:
    r = stack_collector.collect()
    print(r)
    time.sleep(1.0)
```
If we call monkey.patch_all(), this monkey-patches the threading library. The stack sampler then gets the stacks via sys._currentframes(), and _threading.get_thread_name(pthread_id) is used to look up the thread_name in the threading library for those tids. We get None, which results in empty profiles. This happens because threading._active and sys._currentframes report different thread ids.
What might be the best fix here?
Additional finding: when I put import ddtrace.auto at the top, before gevent monkey-patches, it works.
@P403n1x87 , pinging maybe you missed my message above?
Sorry @sumerc, I thought from your last comment that you had it figured out. As documented here, ddtrace.auto needs to be imported as the very first thing when using gevent without ddtrace-run.
Oh ok! I just wanted to confirm that this is the way to profile gevent apps.
One final thing: don't you think this should also work with import ddtrace.profiling.auto? Because, if I manually use the Profiler APIs in my code, do I still need to include ddtrace.auto? I am asking because ddtrace.auto seems to be doing lots of things under the hood, and I'm not sure everything is necessary for profiling too.
> Because, if I manually use Profiler APIs in my code, do I still need to include ddtrace.auto? I am asking because ddtrace.auto seems to be doing lots of things under the hood. Not sure if everything is necessary for profiling too.
That's a good point. The current design is for ddtrace.auto to do pretty much the same job as the ddtrace-run wrapper command. So you're right that this does more than ddtrace.profiling.auto would do. Part of that job is also ensuring that we set up the runtime in a way that gives us compatibility with gevent. So currently ddtrace.auto has to be imported as the very first thing, before ddtrace.profiling.auto. The recommendation, though, is to use ddtrace-run whenever possible.
Summary of problem
Could not get CPU samples.
Which version of dd-trace-py are you using?
1.20.2
Which version of pip are you using?
./pip --version
pip 23.2.1 from /home/data/server/tokenization_mys/venv/lib/python3.7/site-packages/pip (python 3.7)
Which libraries and their versions are you using?
How can we reproduce your problem?
none
python version
Python 3.7.7
start command
What is the result that you get?
The resulting pprof has no CPU samples, only memory samples. auto.pprof.zip
What is the result that you expected?
CPU samples should be present.
More information
I tested more ddtrace versions: 1.18.6 is ok but 1.19.0 is not ok.