edx / edx-arch-experiments

A plugin to include applications under development by the architecture team at edx
GNU Affero General Public License v3.0

RecursionError observed in gettext calls #674

Open · timmc-edx opened this issue 3 months ago

timmc-edx commented 3 months ago

We're observing bursts of RecursionError in the LMS relating to Python's gettext module.

The earliest known case was May 15, but we don't know how far back these actually go, due to telemetry limitations.

Looking at the June 6 burst using the query env:prod service:edx-edxapp-lms status:error error.type:builtins.RecursionError, I do not see any obvious precipitating incident, although the end of the burst coincided with a deploy. There is no correlation with code owner, IP address, user ID, route, org, or course ID. Three hosts were involved, across two AZs.

The two instances I've spot-checked during the June 6 burst appear to involve this frame repeating:

  File "/usr/lib/python3.8/gettext.py", line 497, in gettext
    return self._fallback.gettext(message)

The earliest one I can find (trace 66587fa500000000d95d5070b24064bd on May 30 at 13:31:17.356 UTC) is a little different but still involves gettext and fallbacks:

  File "/usr/lib/python3.8/gettext.py", line 268, in add_fallback
    self._fallback.add_fallback(fallback)
  File "/usr/lib/python3.8/gettext.py", line 268, in add_fallback
    self._fallback.add_fallback(fallback)
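
For context on the mechanism: both frames come from gettext's fallback chain, which is traversed recursively. add_fallback() walks to the end of the existing chain before appending (the gettext.py:268 frame), and a lookup that misses delegates to the fallback one stack frame per link (the same delegation pattern as the gettext.py:497 frame). The following is a minimal, purely illustrative sketch using stdlib NullTranslations objects (not our actual translation setup) that reproduces both flavors of the error once the chain gets long enough:

    import gettext

    # 1) add_fallback() recurses down the existing chain before appending,
    #    so building a very long chain this way blows the stack during
    #    construction (matches the gettext.py:268 frames).
    head = gettext.NullTranslations()
    try:
        for _ in range(5000):
            head.add_fallback(gettext.NullTranslations())
    except RecursionError:
        print("RecursionError raised inside add_fallback()")

    # 2) A chain that stays under the recursion limit survives construction,
    #    but an untranslated lookup then recurses once per link and fails
    #    the same way (same fallback delegation as the gettext.py:497 frame).
    head = gettext.NullTranslations()
    node = head
    for _ in range(2000):
        nxt = gettext.NullTranslations()
        node._fallback = nxt  # build the chain directly, just for illustration
        node = nxt
    try:
        head.gettext("some untranslated message")
    except RecursionError:
        print("RecursionError raised inside gettext()")

If something in production were repeatedly adding fallbacks to (or otherwise growing the chain on) a long-lived translation object, either of these call paths would eventually exceed the recursion limit, which would be consistent with the tracebacks above.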

To investigate

robrap commented 3 months ago

@timmc-edx: Is there any chance this could interfere with Datadog-related changes, and should it be prioritized? Or do you think it is fine to wait? Note that having the APM Profiler in place would be nice if this were a memory-related issue.

timmc-edx commented 3 months ago

I suspect it's an unrelated issue. It's troubling, but probably not our top priority.

timmc-edx commented 2 months ago

We recently saw another burst of RecursionError. It occurred on 7/28 UTC, a Sunday in the US, and there were no deploys around the start. The logs-based timeline below shows a sustained burst starting around 21:30 and slowly trailing off until around 03:00 the next day, with spikes at roughly 15 minutes past each hour (perhaps some automated job):

[Screenshot: logs-based timeline of RecursionError volume during the 7/28 burst]

A zoom-in of the start time shows the entire ramp-up occurring from 21:35 to 21:40:

[Screenshot: zoomed-in view of the ramp-up between 21:35 and 21:40]

Note that the numbers are larger in this case, probably because logs are sampled and filtered less heavily than traces.

timmc-edx commented 2 months ago

One of the strangest things is that each occurrence of this issue has been limited to a few hosts at a time, yet those hosts are hit almost simultaneously and are spread across availability zones.

[Screenshot: RecursionError volume broken out by host and availability zone]

A zoom-in of the start of the 7/28 event indicates that it affected the three hosts progressively, but in close succession (1-2 minutes apart):

[Screenshot: per-host onset at the start of the 7/28 event]

This suggests the hosts were either responding to the same network event or hitting some resource limit at around the same time.

(For reference: The hosts here were i-075567276c7863e51, i-06f30d88537d08101916, i-002f78eb6f8519bd9819)