DataDog / dd-trace-py

Datadog Python APM Client
https://ddtrace.readthedocs.io/

rq integration logs an exception when tracing a job with result_ttl=0 #7985

Closed: beepboop271 closed this issue 10 months ago

beepboop271 commented 10 months ago

Summary of problem

When using the rq tracing integration, the rq worker emits the following exception in a warning log each time it processes a job with result_ttl=0:

Failed to set text tag 'job.status'
Traceback (most recent call last):
  File "/workspace/test/.venv/lib/python3.9/site-packages/ddtrace/span.py", line 385, in set_tag_str
    self._meta[key] = ensure_text(value, errors="replace")
  File "/workspace/test/.venv/lib/python3.9/site-packages/six.py", line 955, in ensure_text
    raise TypeError("not expecting type '%s'" % type(s))
TypeError: not expecting type '<class 'NoneType'>'

(More info after the reproduction steps. TL;DR: when result_ttl=0, the job is deleted as part of worker.perform_job, so when ddtrace tries to get the status of the job in traced_perform_job, it gets None, which cannot be passed to set_tag_str.)
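For context, the TypeError in the traceback comes from six.ensure_text, which Span.set_tag_str passes the value through and which only accepts str/bytes. A standalone illustration, calling six directly outside of ddtrace:

from six import ensure_text

ensure_text("finished", errors="replace")  # ok: returns 'finished'
ensure_text(None, errors="replace")        # TypeError: not expecting type '<class 'NoneType'>'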

Which version of dd-trace-py are you using?

2.3.3

Which version of pip are you using?

pip 23.3.2 python 3.9.9

Which libraries and their versions are you using?

pip freeze:

async-timeout==4.0.3
attrs==23.1.0
bytecode==0.15.1
cattrs==23.2.3
click==8.1.7
ddsketch==2.0.4
ddtrace==2.3.3
Deprecated==1.2.14
envier==0.5.0
exceptiongroup==1.2.0
importlib-metadata==6.11.0
opentelemetry-api==1.22.0
protobuf==4.25.1
redis==5.0.1
rq==1.15.1
six==1.16.0
typing_extensions==4.9.0
wrapt==1.16.0
xmltodict==0.13.0
zipp==3.17.0

How can we reproduce your problem?

Minimal reproduction:

  1. Fresh venv
  2. pip install ddtrace rq
  3. touch app.py myjobs.py

app.py

from redis import Redis
from rq import Queue
from myjobs import do_work

q = Queue(connection=Redis())
q.enqueue(do_work)
q.enqueue(do_work, result_ttl=0)

myjobs.py

import time

def do_work():
    time.sleep(1)

#############
# since the exception in question is being caught and logged,
# wrap the logger to see where the log came from

import logging
import traceback

factory = logging.getLogRecordFactory()

def wrap(*args, **kwargs):
    record = factory(*args, **kwargs)
    if record.exc_info:
        print("--------------")
        print("logged an exception at:")
        traceback.print_stack(limit=8)
        print("--------------")
    return record

logging.setLogRecordFactory(wrap)

  4. Start up redis-server (the version doesn't really matter here; I have 5.0.7)
  5. In the same directory as app.py and myjobs.py, start up ddtrace-run rq worker
  6. Run python app.py to send the two jobs to the worker (it makes no difference whether app.py itself runs under ddtrace or not)

What is the result that you get?

In the terminal running ddtrace-run rq worker:

23:02:12 Worker rq:worker:088aa9cc27674a1b814b0f06e09739c0 started with PID 77438, version 1.15.1
23:02:12 Subscribing to channel rq:pubsub:088aa9cc27674a1b814b0f06e09739c0
23:02:12 *** Listening on default...
failed to send, dropping 13 traces to intake at http://localhost:8126/v0.4/traces after 3 retries
23:02:15 default: myjobs.do_work() (c07877e6-ae37-4a09-8365-fb4c9d7036ae)
23:02:16 default: Job OK (c07877e6-ae37-4a09-8365-fb4c9d7036ae)
23:02:16 Result is kept for 500 seconds
23:02:16 default: myjobs.do_work() (74a20256-15eb-4248-9b45-f3a54aefc795)
23:02:17 default: Job OK (74a20256-15eb-4248-9b45-f3a54aefc795)
23:02:17 Result discarded immediately
--------------
logged an exception at:
  File "/workspace/test/.venv/lib/python3.9/site-packages/rq/worker.py", line 1310, in main_work_horse
    self.perform_job(job, queue)
  File "/workspace/test/.venv/lib/python3.9/site-packages/ddtrace/contrib/trace_utils.py", line 341, in wrapper
    return func(mod, pin, wrapped, instance, args, kwargs)
  File "/workspace/test/.venv/lib/python3.9/site-packages/ddtrace/contrib/rq/__init__.py", line 197, in traced_perform_job
    span.set_tag_str("job.status", job.get_status())
  File "/workspace/test/.venv/lib/python3.9/site-packages/ddtrace/span.py", line 389, in set_tag_str
    log.warning("Failed to set text tag '%s'", key, exc_info=True)
  File "/home/coder/.pyenv/versions/3.9.9/lib/python3.9/logging/__init__.py", line 1458, in warning
    self._log(WARNING, msg, args, **kwargs)
  File "/home/coder/.pyenv/versions/3.9.9/lib/python3.9/logging/__init__.py", line 1587, in _log
    record = self.makeRecord(self.name, level, fn, lno, msg, args,
  File "/home/coder/.pyenv/versions/3.9.9/lib/python3.9/logging/__init__.py", line 1556, in makeRecord
    rv = _logRecordFactory(name, level, fn, lno, msg, args, exc_info, func,
  File "/workspace/test/./myjobs.py", line 19, in wrap
    traceback.print_stack(limit=8)
--------------
Failed to set text tag 'job.status'
Traceback (most recent call last):
  File "/workspace/test/.venv/lib/python3.9/site-packages/ddtrace/span.py", line 385, in set_tag_str
    self._meta[key] = ensure_text(value, errors="replace")
  File "/workspace/test/.venv/lib/python3.9/site-packages/six.py", line 955, in ensure_text
    raise TypeError("not expecting type '%s'" % type(s))
TypeError: not expecting type '<class 'NoneType'>'

The first job does not cause anything unusual. However, for the second job with result_ttl=0, a warning log Failed to set text tag 'job.status' with the attached exception is emitted. The warning comes from set_tag_str, which was called by span.set_tag_str("job.status", job.get_status()) inside traced_perform_job:

https://github.com/DataDog/dd-trace-py/blob/4577f757cc8ab750d1fdbfae1fb767dcbacf6f22/ddtrace/contrib/rq/__init__.py#L192-L200

https://github.com/DataDog/dd-trace-py/blob/v2.4.0/ddtrace/contrib/rq/__init__.py#L192-L200

In the worker output above, you can see the job result is "discarded immediately". It seems that when result_ttl=0 (i.e. the results of a job aren't needed, only its side effects), rq entirely skips updating the status of a successful job (it never sets it to "finished") and immediately deletes the job.

This behaviour isn't really documented, but here is a bit of related documentation: https://python-rq.org/docs/results/#return-value-ttl

You can see this behaviour in the source code of rq's worker.perform_job; this is the case in 1.15.1, and it has been like this for a while.

So by the time ddtrace finishes the wrapped perform_job and gets to the finally block, the job has already been deleted.
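As a quick way to observe this (a sketch, assuming redis-server and an rq worker are running, and reusing do_work from myjobs.py above): once the result_ttl=0 job finishes, rq deletes it, so a refreshed status read comes back as None.

from redis import Redis
from rq import Queue

from myjobs import do_work

q = Queue(connection=Redis())
job = q.enqueue(do_work, result_ttl=0)

# ...after the worker has processed the job...
print(job.get_status())               # None: the job hash has been deleted from redis
print(job.get_status(refresh=False))  # stale in-memory value instead (e.g. "queued" on this side)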

What is the result that you expected?

No exceptions caused (raised/logged).

For example, set_tag could be used rather than set_tag_str, since the status might be None. I'm not sure there is another way to come up with a status tag that is more useful than None: for example, if job.get_status(refresh=False) were used in this situation, it would just return the stale value "started", because, as mentioned above, the "finished" status never gets set on the job.
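To illustrate, a minimal sketch of one shape such a fix could take (purely illustrative, not an actual patch; the helper name is hypothetical, and span/job stand in for the objects used in the linked traced_perform_job snippet):

# Hypothetical helper (not part of ddtrace): tag the job status on a span,
# tolerating the case where rq has already deleted the job.
def tag_job_status(span, job):
    status = job.get_status()
    if status is not None:
        span.set_tag_str("job.status", status)
    # else: the job was deleted (e.g. result_ttl=0); skip the tag or use a placeholder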

mabdinur commented 10 months ago

Hi @beepboop271,

Thanks for the detailed bug report. When the job status is not available, we can just default to "None"; I've opened a PR to fix this behavior.