DataDog / dd-trace-py

Datadog Python APM Client
https://ddtrace.readthedocs.io/

SystemError: null argument to internal routine #8132

Closed dannosaur closed 7 months ago

dannosaur commented 8 months ago

Summary of problem

Random crashes in ddtrace. Unknown cause. Because the affected code is Cython, our error logger can't provide any context beyond the stack trace itself. I'm not sure what causes it, nor how to work around it.

Which version of dd-trace-py are you using?

2.3.3 (we will upgrade to the latest 2.4.x in production within the next week to validate whether this issue remains in later versions, and I'll report back once that's happened).

Which version of pip are you using?

23.3.1

Which libraries and their versions are you using?

`pip freeze` ably==2.0.0 amqp==5.2.0 annotated-types==0.6.0 anyio==4.2.0 arnparse==0.0.2 asgiref==3.7.2 async-timeout==4.0.3 attrs==23.2.0 Automat==22.10.0 balena-sdk==12.7.0 billiard==4.2.0 boto3==1.28.44 botocore==1.31.85 Brotli==1.1.0 bytecode==0.15.1 CacheControl==0.13.1 cachetools==5.3.2 cattrs==23.2.3 celery==5.3.6 certifi==2023.11.17 cffi==1.16.0 channels==4.0.0 chardet==5.2.0 charset-normalizer==3.3.2 click==8.1.7 click-didyoumean==0.3.0 click-plugins==1.1.1 click-repl==0.3.0 constantly==23.10.4 cronex==0.1.3.1 cryptography==41.0.7 cssselect2==0.7.0 datadog-api-client==2.20.0 ddsketch==2.0.4 ddtrace==2.3.3 defusedxml==0.7.1 Deprecated==1.2.14 dicttoxml2==2.0.0 diff-match-patch==20230430 Django==4.2.9 django-ace==1.19.0 django-admin-autocomplete-list-filter @ git+https://github.com/jzmiller1/django-admin-autocomplete-list-filter.git@239fca057b9aa29e92806fbaf2bb955f9fa8bedd django-appconf==1.0.6 django-axes==6.0.4 django-cleanup==7.0.0 django-constance==2.9.1 django-cors-headers==4.0.0 django-datadog-logger==0.6.2 django-encrypted-json-fields==1.0.4 django-extensions==3.2.3 django-filter==23.2 django-health-check==3.17.0 django-imagekit==4.1.0 django-import-export==3.2.0 django-ipware==6.0.3 django-json-widget @ git+https://github.com/jzmiller1/django-json-widget.git@32c6acf9dd3cf27b43e0dd38a01e2af8e6c2ae10 django-large-image==0.9.0 django-map-widgets==0.4.1 django-otp==1.2.1 django-picklefield==3.1 django-relativedelta==2.0.0 django-split-settings==1.0.1 django-storages==1.13.2 django-templated-mail==1.1.1 django-threadlocals==0.10 django-timezone-field==5.1 django-treebeard==4.7 django-vectortiles==0.2.0 djangorestframework==3.14.0 djangorestframework-camel-case==1.3.0 djangorestframework-gis==1.0 drf-nested-routers==0.91 drf-spectacular==0.24.2 drf-writable-nested==0.6.2 elasticsearch==7.13.4 elasticsearch-dsl==7.4.0 elementpath==4.1.5 envier==0.5.0 et-xmlfile==1.1.0 filelock==3.13.1 firebase-admin==6.2.0 fluent-logger==0.10.0 fonttools==4.47.0 fpdf2==2.5.0 GDAL==3.6.2 geographiclib==1.52 geopy==2.2.0 google-api-core==2.15.0 google-api-python-client==2.111.0 google-auth==2.15.0 google-auth-httplib2==0.2.0 google-cloud-core==2.4.1 google-cloud-firestore==2.14.0 google-cloud-storage==2.11.0 google-crc32c==1.1.2 google-resumable-media==2.7.0 googleapis-common-protos==1.62.0 googlemaps==4.10.0 grpcio==1.56.0 grpcio-status==1.49.1 gunicorn==21.2.0 h11==0.14.0 h2==4.1.0 h3==3.7.6 hiredis==2.3.2 hpack==4.0.0 html5lib==1.1 httpcore==0.16.3 httplib2==0.22.0 httpx==0.23.3 humanize==4.9.0 hyperframe==6.0.1 hyperlink==21.0.0 idna==3.6 importlib-metadata==6.11.0 incremental==22.10.0 inflection==0.5.1 isodate==0.6.1 jmespath==1.0.1 JSON-log-formatter==0.5.2 jsonschema==4.20.0 jsonschema-specifications==2023.12.1 jwcrypto==1.5.1 kombu==5.3.4 large-image==1.19.3 large-image-source-gdal==1.19.3 Markdown==3.1.1 MarkupPy==1.14 mercantile==1.2.1 methoddispatch==3.0.2 msgpack==1.0.7 numpy==1.26.2 odfpy==1.4.1 openpyxl==3.1.2 opentelemetry-api==1.22.0 packaging==21.3 palettable==3.3.3 parameterized==0.8.1 phonenumbers==8.12.49 pilkit==3.0 Pillow==10.0.1 prompt-toolkit==3.0.43 proto-plus==1.23.0 protobuf==4.25.1 psutil==5.9.7 psycopg==3.1.9 psycopg-binary==3.1.9 pyasn1==0.5.1 pyasn1-modules==0.3.0 pycparser==2.21 pycurl==7.45.2 pydantic==2.1.1 pydantic_core==2.4.0 pydyf==0.8.0 pyee==9.1.1 pyhumps==3.5.3 PyJWT==2.6.0 pynamodb==5.5.0 pyOpenSSL==23.3.0 pyotp==2.9.0 pyparsing==2.4.7 pyphen==0.14.0 pyproj==3.6.1 pysaml2==7.4.2 python-dateutil==2.8.2 python-dotenv==1.0.0 python-ipware==2.0.1 
python-json-logger==2.0.7 python-redis-lock==3.7.0 pytz==2021.1 PyYAML==6.0.1 qrcode==7.3.1 redis==4.5.4 referencing==0.32.0 requests==2.31.0 requests-aws4auth==1.0.1 rfc3986==1.5.0 rpds-py==0.16.2 rsa==4.9 s3transfer==0.6.2 semver==2.13.0 sentry-sdk==1.39.1 service-identity==23.1.0 simplejson==3.17.6 six==1.16.0 sniffio==1.3.0 sqlparse==0.4.4 tablib==3.5.0 tenacity==8.0.1 timezonefinder==6.2.0 tinycss2==1.2.1 twilio==7.17.0 Twisted==23.10.0 typing_extensions==4.9.0 tzdata==2023.4 uritemplate==4.1.1 urllib3==1.26.18 uvicorn==0.24.0.post1 vine==5.1.0 wcwidth==0.2.12 weasyprint==59.0 webencodings==0.5.1 websockets==10.4 wrapt==1.16.0 xlrd==2.0.1 xlwt==1.3.0 xmlschema==2.5.1 xmltodict==0.13.0 zipp==3.17.0 zope.interface==6.1 zopfli==0.2.3

How can we reproduce your problem?

No idea, unfortunately.

What is the result that you get?

SystemError: null argument to internal routine
  File "ddtrace/internal/periodic.py", line 56, in run
    self._target()
  File "ddtrace/profiling/collector/__init__.py", line 42, in periodic
    for events in self.collect():
  File "ddtrace/profiling/collector/stack.pyx", line 514, in ddtrace.profiling.collector.stack.StackCollector.collect
  File "ddtrace/profiling/collector/stack.pyx", line 352, in ddtrace.profiling.collector.stack.stack_collect
  File "ddtrace/profiling/collector/_traceback.pyx", line 54, in ddtrace.profiling.collector._traceback.pyframe_to_frames
  File "ddtrace/profiling/collector/_traceback.pyx", line 92, in ddtrace.profiling.collector._traceback.pyframe_to_frames
  File "ddtrace/profiling/collector/_traceback.pyx", line 20, in ddtrace.profiling.collector._traceback._extract_class_name

What is the result that you expected?

No error

Open to providing more info if you let me know what you need. This is a production environment; while I have seen the issue crop up in our dev environments, it is nowhere near as frequent there.

P403n1x87 commented 8 months ago

@dannosaur thanks for reporting this. What version of Python is being used here? Judging from the attached traceback, I'd unfortunately expect the problem to occur with the latest version of the library too, but if you are going to verify, we would love to hear the outcome! Can I also confirm that this is a Celery-based Django application (inferred from the pip freeze)? Thanks.

dannosaur commented 7 months ago

We're running Python 3.11.7. It is indeed a Celery-based Django application: Celery 5.3.6 configured with the AWS SQS broker (boto3-based), if that matters.
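For context, the broker side of that setup looks roughly like the sketch below (a minimal sketch only; the app name, region, and queue URL are placeholders rather than our real configuration):

```python
# Minimal sketch of a Celery app using the SQS broker via boto3.
# App name, region, and queue URL below are placeholders.
from celery import Celery

app = Celery("myapp", broker="sqs://")  # credentials resolved by boto3 (env vars / IAM role)
app.conf.broker_transport_options = {
    "region": "us-east-1",  # placeholder region
    "predefined_queues": {
        "celery": {"url": "https://sqs.us-east-1.amazonaws.com/123456789012/celery"},
    },
}
```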

P403n1x87 commented 7 months ago

@dannosaur Thanks for the extra details 🙏

dannosaur commented 7 months ago

Alright, I've recently deployed ddtrace 2.4.1 to our production environment, and as predicted the error persists. It happened pretty quickly post-deployment as well, and it happens several times a day. If you need me to run any test builds, I'm more than happy to.

P403n1x87 commented 7 months ago

@dannosaur thanks for confirming. Based on the docs for SystemError, I'd be inclined to think that this problem is internal to the interpreter, and potentially caused by the combination of Celery + Cython. We can add handling for this exception to prevent the profiler from crashing, hoping this is the only site where the problem occurs.
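For illustration, the kind of guard we have in mind looks roughly like this (a sketch only, with hypothetical names; the real change lives inside the Cython stack collector, not in a standalone function like this):

```python
# Sketch of the exception handling described above: a SystemError raised by
# the interpreter while inspecting a frame degrades to a missing class name
# instead of crashing the profiler's collector thread.
def guarded_class_name(frame, extract_class_name):
    try:
        return extract_class_name(frame)
    except SystemError:
        # Interpreter-level failure for this frame; fall back to an empty
        # class name so the rest of the stack is still collected.
        return ""
```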

P403n1x87 commented 7 months ago

@dannosaur I have prepared #8184 to address this. It would be greatly appreciated if you could test this and confirm that it resolves the issue for you. Please let us know if it is OK for you to install directly from the branch, or if you would like us to provide you with a wheel (in which case we would need to know the target architecture and CPython version).

dannosaur commented 7 months ago

We should have all the necessary compilers in our container to build the wheels from source, so I can install the branch directly.
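Something along these lines should work on our end (the ref below is a placeholder until I know the exact branch or commit to use):

```sh
# Install dd-trace-py from source at a given git ref (placeholder ref)
pip install "git+https://github.com/DataDog/dd-trace-py.git@<branch-or-commit>"
```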

dannosaur commented 7 months ago

I've deployed the main branch (specific commit where this got merged) to our dev environment and it hasn't raised this error since. This will go to our production environment next week (as part of our regular release cadence) where I'll be able to report back something more concrete.

We have a couple of other internal ddtrace-related errors coming through our Sentry that could be related; I'm interested to know whether this fix prevents the crashing overall, or whether there's something else going on as well. If this fix doesn't resolve those issues, I'll raise separate issues for the individual errors we're seeing.

P403n1x87 commented 7 months ago

> I've deployed the main branch (specific commit where this got merged) to our dev environment and it hasn't raised this error since. This will go to our production environment next week (as part of our regular release cadence) where I'll be able to report back something more concrete.

Thanks for confirming that these crashes seem resolved.

> We have a couple of other internal ddtrace-related errors coming through our Sentry that could be related; I'm interested to know whether this fix prevents the crashing overall, or whether there's something else going on as well. If this fix doesn't resolve those issues, I'll raise separate issues for the individual errors we're seeing.

The fix prevents the SystemError from occurring, but it is likely to result in incomplete profiling data, in the form of missing class names for the frames that caused the issue in the first place. We expect these to be very rare occurrences, so the impact on profiling data should be negligible. We would be very interested in seeing what the other internal errors you mentioned look like, to determine whether they are related to this or to other profiling issues, and we would be glad if you could share those details as well. Many thanks again for the collaboration so far! 🙏

dannosaur commented 7 months ago

@P403n1x87 I've just filed issues #8128 and #8129 with the stack traces of the other errors we're seeing out of ddtrace. #8128 is far more common than the other one, and, like this issue, the stack traces carry no context due to the Cython nature of this package, nor do I know how they can be reproduced. All I know is that these errors are coming from our web frontend (gunicorn).