getsentry / sentry-python

The official Python SDK for Sentry.io
https://sentry.io/for/python/
MIT License

Missing `in_progress` checkin with Cron Monitoring #3279

Open SoerenWeber opened 1 month ago

SoerenWeber commented 1 month ago

How do you use Sentry?

Sentry Saas (sentry.io)

Version

2.8.0

Steps to Reproduce

I use Cron Monitoring to watch the execution of scheduled, long-running jobs, managed by RQ. The monitoring is done by using the monitor context manager:

from sentry_sdk.crons import monitor

with monitor(monitor_slug=settings.BACKUP_MONITOR_SLUG):
    pass  # the long-running backup job runs here

To mitigate potential connection problems, keep_alive is enabled in the sentry_sdk.init call. The same project runs in multiple instances on DigitalOcean infrastructure.
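
For reference, a minimal sketch of the kind of init call described above (the DSN is a placeholder, not the actual configuration; keep_alive, debug, and traces_sample_rate=0 match what the debug logs below show):

import sentry_sdk

sentry_sdk.init(
    dsn="https://<key>@<org>.ingest.us.sentry.io/<project>",  # placeholder DSN
    traces_sample_rate=0,  # transactions are discarded, as seen in the logs
    keep_alive=True,       # enabled to mitigate connection problems
    debug=True,            # enabled while investigating this issue
)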

Expected Result

The monitor should send a check-in both for the start and end of execution. These check-ins should be visible on sentry.io.

Actual Result

The in_progress check-ins are pretty consistently missed, resulting in monitor failure. After the job has run, the ok check-in is sent successfully and resolves the failed monitor (see the "Recent check-ins" screenshot on sentry.io).

After enabling debug in sentry_sdk.init, it seems that the in_progress check-in just isn't sent:

Jul 11 02:00:32 [worker] {"event": "default: project.tasks.create_external_backup() (a85d1596-974d-4b9b-8acb-01c4ce58e693)"}
 Jul 11 02:00:33 [worker] [sentry] DEBUG: [Tracing] Discarding <queue.task.rq> transaction <project.tasks.create_external_backup> because traces_sample_rate is set to 0
 Jul 11 02:00:33 [worker] Backup started.
 Jul 11 02:50:00 [worker] Backup done.
 Jul 11 02:50:00 [worker] [sentry] DEBUG: Sending envelope [envelope with 1 items (monitor)] project:[redacted].ingest.us.sentry.io
 Jul 11 02:50:00 [worker] {"event": "default: Job OK (a85d1596-974d-4b9b-8acb-01c4ce58e693)"}
 Jul 11 02:50:00 [worker] {"event": "Result will never expire, clean up result key manually"}
 Jul 11 02:50:00 [worker] [sentry] DEBUG: Discarding transaction because sampled = False
 Jul 11 02:50:00 [worker] [sentry] DEBUG: Flushing HTTP transport
 Jul 11 02:50:00 [worker] [sentry] DEBUG: background worker got flush request
 Jul 11 02:50:00 [worker] [sentry] DEBUG: Sending envelope [envelope with 1 items (monitor)] project:[redacted].ingest.us.sentry.io
 Jul 11 02:50:00 [worker] [sentry] DEBUG: Sending envelope [envelope with 1 items (internal)] project:[redacted].ingest.us.sentry.io
 Jul 11 02:50:00 [worker] [sentry] DEBUG: background worker flushed

Interestingly, there does seem to be some kind of networking issue because a bit earlier I found these entries, unrelated to the job mentioned above:

Jul 11 01:35:33 [worker] [sentry] DEBUG: Discarding transaction because sampled = False
 Jul 11 01:35:33 [worker] [sentry] DEBUG: Flushing HTTP transport
 Jul 11 01:35:33 [worker] [sentry] DEBUG: background worker got flush request
 Jul 11 01:35:33 [worker] [sentry] DEBUG: 2 event(s) pending on flush
 Jul 11 01:35:35 [worker] [sentry] ERROR: flush timed out, dropped 2 events
 Jul 11 01:35:35 [worker] [sentry] DEBUG: background worker flushed

Any guidance would be much appreciated! 🙏

szokeasaurusrex commented 1 month ago

@SoerenWeber, thanks for reporting this issue!

Has this ever worked properly for you, perhaps in a previous SDK version? If so, which SDK version was this still working with?

SoerenWeber commented 1 month ago

@szokeasaurusrex, thank you for your fast reaction!

I never got this working reliably, unfortunately. It seemed that adding the keep_alive flag improved the reliability slightly a few months ago, but the results were inconclusive (some check-ins seemed to succeed, but I could not find a pattern there).

szokeasaurusrex commented 1 month ago

@SoerenWeber, you're welcome! Are you able to share your source code repo with me (if it is public), or do you have a minimal reproduction you are able to share?

If not, that is okay; in that case, I can see whether I can reproduce the bug from the information you have already provided.

SoerenWeber commented 1 month ago

@szokeasaurusrex, I know it is a hassle, but I cannot share the code with you and do not have an MRE at the moment, sorry. I will try to provide one in the coming weeks, though.

I would greatly appreciate you looking into this, if you have the capacity. Otherwise, I will update this issue once I have been able to narrow down the problem space.

szokeasaurusrex commented 1 month ago

@SoerenWeber After taking a closer look at your logs and after having tried to reproduce the issue, I wanted to share some of the thoughts I have.

From your logs, it does not look like the SDK is trying to send the in_progress check-in at all before your task starts. If the in_progress check-in event had been dropped for whatever reason, we would expect to see a log message somewhere mixed in with the logs below stating that an event was dropped:

Jul 11 02:00:32 [worker] {"event": "default: project.tasks.create_external_backup() (a85d1596-974d-4b9b-8acb-01c4ce58e693)"}
 Jul 11 02:00:33 [worker] [sentry] DEBUG: [Tracing] Discarding <queue.task.rq> transaction <project.tasks.create_external_backup> because traces_sample_rate is set to 0
 Jul 11 02:00:33 [worker] Backup started.

The fact that there is no "event dropped" message here leads me to believe that the start of the task might not be instrumented at all, either because the SDK is not initialized or because the start of the task is not wrapped with the monitor decorator/context manager.

The other curiosity is that after the backup has finished, we see not one, but two check-ins being sent to Sentry:

Jul 11 02:50:00 [worker] Backup done.
 Jul 11 02:50:00 [worker] [sentry] DEBUG: Sending envelope [envelope with 1 items (monitor)] project:[redacted].ingest.us.sentry.io
 Jul 11 02:50:00 [worker] {"event": "default: Job OK (a85d1596-974d-4b9b-8acb-01c4ce58e693)"}
 Jul 11 02:50:00 [worker] {"event": "Result will never expire, clean up result key manually"}
 Jul 11 02:50:00 [worker] [sentry] DEBUG: Discarding transaction because sampled = False
 Jul 11 02:50:00 [worker] [sentry] DEBUG: Flushing HTTP transport
 Jul 11 02:50:00 [worker] [sentry] DEBUG: background worker got flush request
 Jul 11 02:50:00 [worker] [sentry] DEBUG: Sending envelope [envelope with 1 items (monitor)] project:[redacted].ingest.us.sentry.io
 Jul 11 02:50:00 [worker] [sentry] DEBUG: Sending envelope [envelope with 1 items (internal)] project:[redacted].ingest.us.sentry.io
 Jul 11 02:50:00 [worker] [sentry] DEBUG: background worker flushed

Typically, at the end of a monitored task, there would only be one check-in sent to indicate the task's success (or failure). The fact that we have two check-ins after the backup makes me think that perhaps the monitored portion is only being started after the backup completes, like so in pseudocode:

def backup_task():
    long_running_backup()
    print("Backup done")
    with sentry_sdk.monitor(...):
        quick_finishing_tasks()

The above snippet would be expected to produce output similar to what you are observing.


Please double check how you are instrumenting your monitor to make sure that (1) the SDK is initialized before hitting the monitor block, and that (2) the entire task (especially the long-running portion) is wrapped with monitor; a sketch of the intended pattern follows.
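
For contrast with the pseudocode above, the intended instrumentation would look roughly like this (a sketch using the same illustrative names; the slug is a placeholder):

from sentry_sdk.crons import monitor

# sentry_sdk.init(...) must already have run (e.g. at worker startup) by this point

def backup_task():
    with monitor(monitor_slug="my-backup-monitor"):  # placeholder slug
        long_running_backup()    # the long-running portion stays inside the monitor
        quick_finishing_tasks()  # any finishing work as well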

Once you have checked (1) and (2), if you are still experiencing the problem, it would be helpful to see the check-in data the SDK is trying to send to Sentry. You can get this information by adding a before_send function that prints/logs check-in events to your SDK configuration, like so:

import sentry_sdk

def print_check_in_event(event, hint):
    if event.get("type") == "check_in":
        print(f"Sending check-in event: {event}")  # Or log instead of printing

    return event

# The function must be defined before it is passed to init
sentry_sdk.init(..., before_send=print_check_in_event)

Please share the logs generated by running your code with the above SDK config.

SoerenWeber commented 1 month ago

Thank you very much for your thoughts on this, @szokeasaurusrex. They were helpful in narrowing the problem down a bit, I think.

To address your points: The backup task is completely wrapped with the monitor context manager. Actually, this is the whole task:

import subprocess
from sentry_sdk.crons import monitor

def create_external_backup():
    with monitor(monitor_slug=settings.BACKUP_MONITOR_SLUG):
        subprocess.run(["/backup.sh"])

I also verified that the Sentry SDK is initialized before the task runs.


Things get more interesting when introspecting the check-ins with before_send, as you suggested, though! At first, the logs looked like this:

[worker] [2024-07-15 18:15:44] {"event": "default: showbase.challenges.tasks.create_external_backup() (a280ea23-1484-4c9e-bab6-248ce573bdb3)"}
[...]
[worker] [2024-07-15 18:15:44] {"event": "Sending check-in event: {'type': 'check_in', 'monitor_slug': 'external-backup-staging', 'check_in_id': '77ab738bff024cfcac4ee19aed659bd5', 'status': 'in_progress', 'duration': None, 'environment': 'production', 'release': '1.22.0', 'event_id': '4d882010f46849189fe771850db17ffa', 'timestamp': '2024-07-15T18:15:44.645075Z', 'contexts': {'trace': {'trace_id': '4b35fd8bc10940b197da6b7d001b2036', 'span_id': '884e48e4338443ba', 'parent_span_id': None, 'op': 'queue.task.rq', 'description': None, 'dynamic_sampling_context': {'trace_id': '4b35fd8bc10940b197da6b7d001b2036', 'environment': 'production', 'release': '1.22.0', 'public_key': '6751d6d635604a71a616c5a44859fa9c', 'transaction': 'showbase.challenges.tasks.create_external_backup', 'sample_rate': '0.0', 'sampled': 'false'}}}, 'server_name': 'worker-f54d7d8f7-69j2q', 'sdk': {'name': 'sentry.python.django', 'version': '1.45.0', 'packages': [{'name': 'pypi:sentry-sdk', 'version': '1.45.0'}], 'integrations': ['aiohttp', 'argv', 'atexit', 'boto3', 'dedupe', 'django', 'excepthook', 'logging', 'modules', 'redis', 'rq', 'stdlib', 'threading']}, 'platform': 'python'}", "timestamp": "2024-07-15T18:15:44.652992Z", "logger": "config.settings.production.base", "level": "info"}
[worker] [2024-07-15 18:15:44]  [sentry] DEBUG: Sending envelope [envelope with 1 items (monitor)] project:4504222074470400 host:o951136.ingest.us.sentry.io
[worker] [2024-07-15 18:15:44] Database backup started.
[worker] [2024-07-15 18:16:00] Database backup done.
[worker] [2024-07-15 18:16:00] {"event": "Sending check-in event: {'type': 'check_in', 'monitor_slug': 'external-backup-staging', 'check_in_id': '77ab738bff024cfcac4ee19aed659bd5', 'status': 'ok', 'duration': 15.860358043995802, 'environment': 'production', 'release': '1.22.0', 'event_id': '223b32743f3b4754bdfd6065c9f91ff1', 'timestamp': '2024-07-15T18:16:00.505422Z', 'contexts': {'trace': {'trace_id': '4b35fd8bc10940b197da6b7d001b2036', 'span_id': '884e48e4338443ba', 'parent_span_id': None, 'op': 'queue.task.rq', 'description': None, 'dynamic_sampling_context': {'trace_id': '4b35fd8bc10940b197da6b7d001b2036', 'environment': 'production', 'release': '1.22.0', 'public_key': '6751d6d635604a71a616c5a44859fa9c', 'transaction': 'showbase.challenges.tasks.create_external_backup', 'sample_rate': '0.0', 'sampled': 'false'}}}, 'server_name': 'worker-f54d7d8f7-69j2q', 'sdk': {'name': 'sentry.python.django', 'version': '1.45.0', 'packages': [{'name': 'pypi:sentry-sdk', 'version': '1.45.0'}], 'integrations': ['aiohttp', 'argv', 'atexit', 'boto3', 'dedupe', 'django', 'excepthook', 'logging', 'modules', 'redis', 'rq', 'stdlib', 'threading']}, 'platform': 'python'}", "timestamp": "2024-07-15T18:16:00.506193Z", "logger": "config.settings.production.base", "level": "info"}
[worker] [2024-07-15 18:16:00]  [sentry] DEBUG: Sending envelope [envelope with 1 items (monitor)] project:4504222074470400 host:o951136.ingest.us.sentry.io
[worker] [2024-07-15 18:16:00] {"event": "default: Job OK (a280ea23-1484-4c9e-bab6-248ce573bdb3)"}
[worker] [2024-07-15 18:16:00] {"event": "Result will never expire, clean up result key manually"}
[worker] [2024-07-15 18:16:00]  [sentry] DEBUG: Discarding transaction because sampled = False
[worker] [2024-07-15 18:16:00]  [sentry] DEBUG: Flushing HTTP transport
[worker] [2024-07-15 18:16:00]  [sentry] DEBUG: background worker got flush request
[worker] [2024-07-15 18:16:00]  [sentry] DEBUG: Sending envelope [envelope with 1 items (internal)] project:4504222074470400 host:o951136.ingest.us.sentry.io
[worker] [2024-07-15 18:16:00]  [sentry] DEBUG: background worker flushed

After a few hours (and a connection loss of RQ to Redis, which might give a hint), it changed to this pattern:

[worker] [2024-07-15 19:15:44] {"event": "default: showbase.challenges.tasks.create_external_backup() (a280ea23-1484-4c9e-bab6-248ce573bdb3)"}
[...]
[worker] [2024-07-15 19:15:44] {"event": "Sending check-in event: {'type': 'check_in', 'monitor_slug': 'external-backup-staging', 'check_in_id': '5511ba25c37d4d26b1ca49d9321c2d27', 'status': 'in_progress', 'duration': None, 'environment': 'production', 'release': '1.22.0', 'event_id': 'f139ef1d5e944ca48c99ff47691db665', 'timestamp': '2024-07-15T19:15:44.802919Z', 'contexts': {'trace': {'trace_id': '16e51ec2852449678c15514a449b15f3', 'span_id': '9aaedda7e75d9823', 'parent_span_id': None, 'op': 'queue.task.rq', 'description': None, 'dynamic_sampling_context': {'trace_id': '16e51ec2852449678c15514a449b15f3', 'environment': 'production', 'release': '1.22.0', 'public_key': '6751d6d635604a71a616c5a44859fa9c', 'transaction': 'showbase.challenges.tasks.create_external_backup', 'sample_rate': '0.0', 'sampled': 'false'}}}, 'server_name': 'worker-f54d7d8f7-69j2q', 'sdk': {'name': 'sentry.python.django', 'version': '1.45.0', 'packages': [{'name': 'pypi:sentry-sdk', 'version': '1.45.0'}], 'integrations': ['aiohttp', 'argv', 'atexit', 'boto3', 'dedupe', 'django', 'excepthook', 'logging', 'modules', 'redis', 'rq', 'stdlib', 'threading']}, 'platform': 'python'}", "timestamp": "2024-07-15T19:15:44.805474Z", "logger": "config.settings.production.base", "level": "info"}
[worker] [2024-07-15 19:15:44] Database backup started.
[worker] [2024-07-15 19:15:51] Database backup done.
[worker] [2024-07-15 19:15:51] {"event": "Sending check-in event: {'type': 'check_in', 'monitor_slug': 'external-backup-staging', 'check_in_id': '5511ba25c37d4d26b1ca49d9321c2d27', 'status': 'ok', 'duration': 6.815664633002598, 'environment': 'production', 'release': '1.22.0', 'event_id': 'be85b83fabd34bfdad0e94eb5183a5d0', 'timestamp': '2024-07-15T19:15:51.618632Z', 'contexts': {'trace': {'trace_id': '16e51ec2852449678c15514a449b15f3', 'span_id': '9aaedda7e75d9823', 'parent_span_id': None, 'op': 'queue.task.rq', 'description': None, 'dynamic_sampling_context': {'trace_id': '16e51ec2852449678c15514a449b15f3', 'environment': 'production', 'release': '1.22.0', 'public_key': '6751d6d635604a71a616c5a44859fa9c', 'transaction': 'showbase.challenges.tasks.create_external_backup', 'sample_rate': '0.0', 'sampled': 'false'}}}, 'server_name': 'worker-f54d7d8f7-69j2q', 'sdk': {'name': 'sentry.python.django', 'version': '1.45.0', 'packages': [{'name': 'pypi:sentry-sdk', 'version': '1.45.0'}], 'integrations': ['aiohttp', 'argv', 'atexit', 'boto3', 'dedupe', 'django', 'excepthook', 'logging', 'modules', 'redis', 'rq', 'stdlib', 'threading']}, 'platform': 'python'}", "timestamp": "2024-07-15T19:15:51.619533Z", "logger": "config.settings.production.base", "level": "info"}
[worker] [2024-07-15 19:15:51]  [sentry] DEBUG: Sending envelope [envelope with 1 items (monitor)] project:4504222074470400 host:o951136.ingest.us.sentry.io
[worker] [2024-07-15 19:15:51] {"event": "default: Job OK (a280ea23-1484-4c9e-bab6-248ce573bdb3)"}
[worker] [2024-07-15 19:15:51] {"event": "Result will never expire, clean up result key manually"}
[worker] [2024-07-15 19:15:51]  [sentry] DEBUG: Discarding transaction because sampled = False
[worker] [2024-07-15 19:15:51]  [sentry] DEBUG: Flushing HTTP transport
[worker] [2024-07-15 19:15:51]  [sentry] DEBUG: background worker got flush request
[worker] [2024-07-15 19:15:51]  [sentry] DEBUG: Sending envelope [envelope with 1 items (monitor)] project:4504222074470400 host:o951136.ingest.us.sentry.io
[worker] [2024-07-15 19:15:51]  [sentry] DEBUG: Sending envelope [envelope with 1 items (internal)] project:4504222074470400 host:o951136.ingest.us.sentry.io
[worker] [2024-07-15 19:15:51]  [sentry] DEBUG: background worker flushed

Note that the in_progress check-in apparently isn't sent directly after the respective before_send call; instead, its sending trails the ok check-in after the backup task has finished.

Around the time of this change, another type of log message started to appear (related to other tasks), indicating a timeout while flushing the HTTP transport:

[worker] [2024-07-15 21:11:45]  [sentry] DEBUG: Flushing HTTP transport
[worker] [2024-07-15 21:11:45]  [sentry] DEBUG: background worker got flush request
[worker] [2024-07-15 21:11:45]  [sentry] DEBUG: 2 event(s) pending on flush
[worker] [2024-07-15 21:11:47]  [sentry] ERROR: flush timed out, dropped 2 events

Do you have an idea what might be happening here? I will gladly help to further debug this.

szokeasaurusrex commented 1 month ago

Hmm I am not really sure what is going on, but yeah, I guess it looks like after the connection loss, the check-ins just are not getting sent by the transport when they should be sent.

Before the connection loss and change in the logs, were the check-ins showing up as being successful in Sentry, or were they still being marked as missed?

SoerenWeber commented 1 month ago

Sorry for the late response! I tried to get a better understanding of what is actually happening here and was not successful. I will need some more time to isolate this problem.

For testing purposes, I created a task that sleeps for 5 minutes every 10 minutes (*/10) and does nothing else. Cron Monitoring is set up with a grace period of 1 minute. At first glance, most check-ins seem to work just fine, but there is a cluster of missed check-ins around midnight every day (see screenshot).
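
In code, the test task looks roughly like this (reconstructed from the description above and the logs below; the slug and prints are taken from the log output):

import time
from sentry_sdk.crons import monitor

def create_external_backup():  # the existing task/monitor reused for the test
    with monitor(monitor_slug="external-backup-staging"):
        print("Starting task")
        time.sleep(5 * 60)  # sleep for 5 minutes, do nothing else
        print("Task completed")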

I cannot see any difference between the successful and missed check-ins in the log. In both cases, they seem to follow the pattern we found before:

[worker] [2024-07-22 00:00:14] {"event": "default: showbase.challenges.tasks.create_external_backup() (acc6b5ff-747d-42d3-bd86-bbbeaf87c5ad)"}
[worker] [2024-07-22 00:00:14]  [sentry] DEBUG: [Tracing] Discarding <queue.task.rq> transaction <showbase.challenges.tasks.create_external_backup> because traces_sample_rate is set to 0
[worker] [2024-07-22 00:00:14]  [sentry] DEBUG: [Profiling] Discarding profile because transaction is discarded.
[worker] [2024-07-22 00:00:14] {"event": "Sending check-in event: {'type': 'check_in', 'monitor_slug': 'external-backup-staging', 'check_in_id': 'e534ae766e144a78937a1693ce2629be', 'status': 'in_progress', 'duration': None, 'environment': 'production', 'release': '1.22.0', 'event_id': '5a21b89719e1444292cd0e8feabe2be1', 'timestamp': '2024-07-22T00:00:14.298236Z', 'contexts': {'trace': {'trace_id': 'fc8ea0eacb0e471088faad61c0808849', 'span_id': 'bd8a14a102a60da2', 'parent_span_id': None, 'op': 'queue.task.rq', 'description': None, 'dynamic_sampling_context': {'trace_id': 'fc8ea0eacb0e471088faad61c0808849', 'environment': 'production', 'release': '1.22.0', 'public_key': '6751d6d635604a71a616c5a44859fa9c', 'transaction': 'showbase.challenges.tasks.create_external_backup', 'sample_rate': '0.0', 'sampled': 'false'}}}, 'server_name': 'worker-76f46dd8bb-gjvxx', 'sdk': {'name': 'sentry.python.django', 'version': '1.45.0', 'packages': [{'name': 'pypi:sentry-sdk', 'version': '1.45.0'}], 'integrations': ['aiohttp', 'argv', 'atexit', 'boto3', 'dedupe', 'django', 'excepthook', 'logging', 'modules', 'redis', 'rq', 'stdlib', 'threading']}, 'platform': 'python'}", "timestamp": "2024-07-22T00:00:14.303865Z", "logger": "config.settings.production.base", "level": "info"}
[worker] [2024-07-22 00:00:14] Starting task
[worker] [2024-07-22 00:05:14] Task completed
[worker] [2024-07-22 00:05:14] {"event": "Sending check-in event: {'type': 'check_in', 'monitor_slug': 'external-backup-staging', 'check_in_id': 'e534ae766e144a78937a1693ce2629be', 'status': 'ok', 'duration': 300.01391560200136, 'environment': 'production', 'release': '1.22.0', 'event_id': 'eb710d79a72744abbb8333c1df73c955', 'timestamp': '2024-07-22T00:05:14.312188Z', 'contexts': {'trace': {'trace_id': 'fc8ea0eacb0e471088faad61c0808849', 'span_id': 'bd8a14a102a60da2', 'parent_span_id': None, 'op': 'queue.task.rq', 'description': None, 'dynamic_sampling_context': {'trace_id': 'fc8ea0eacb0e471088faad61c0808849', 'environment': 'production', 'release': '1.22.0', 'public_key': '6751d6d635604a71a616c5a44859fa9c', 'transaction': 'showbase.challenges.tasks.create_external_backup', 'sample_rate': '0.0', 'sampled': 'false'}}}, 'server_name': 'worker-76f46dd8bb-gjvxx', 'sdk': {'name': 'sentry.python.django', 'version': '1.45.0', 'packages': [{'name': 'pypi:sentry-sdk', 'version': '1.45.0'}], 'integrations': ['aiohttp', 'argv', 'atexit', 'boto3', 'dedupe', 'django', 'excepthook', 'logging', 'modules', 'redis', 'rq', 'stdlib', 'threading']}, 'platform': 'python'}", "timestamp": "2024-07-22T00:05:14.313097Z", "logger": "config.settings.production.base", "level": "info"}
[worker] [2024-07-22 00:05:14]  [sentry] DEBUG: Sending envelope [envelope with 1 items (monitor)] project:4504222074470400 host:o951136.ingest.us.sentry.io
[worker] [2024-07-22 00:05:14] {"event": "default: Job OK (acc6b5ff-747d-42d3-bd86-bbbeaf87c5ad)"}
[worker] [2024-07-22 00:05:14] {"event": "Result will never expire, clean up result key manually"}
[worker] [2024-07-22 00:05:14]  [sentry] DEBUG: Discarding transaction because sampled = False
[worker] [2024-07-22 00:05:14]  [sentry] DEBUG: Flushing HTTP transport
[worker] [2024-07-22 00:05:14]  [sentry] DEBUG: background worker got flush request
[worker] [2024-07-22 00:05:14]  [sentry] DEBUG: Sending envelope [envelope with 1 items (monitor)] project:4504222074470400 host:o951136.ingest.us.sentry.io
[worker] [2024-07-22 00:05:14]  [sentry] DEBUG: Sending envelope [envelope with 1 items (internal)] project:4504222074470400 host:o951136.ingest.us.sentry.io
[worker] [2024-07-22 00:05:14]  [sentry] DEBUG: background worker flushed

Note that it seems as if the in_progress check-in trails again, but most check-ins are not marked as missed in Sentry (even though the run time of 5 minutes is longer than the grace period of 1 minute).

I noticed another curious thing in the "Recent check-ins" section in Sentry (see screenshot): the start timestamp sometimes sits at 5-minute increments (even though the task runs at 10-minute increments), and the "Expected at" column shows the same time for many check-ins.

To me, every step of trying to nail this down led to more confusion. I will try to build a minimal example of this when I have the capacity.

szokeasaurusrex commented 1 month ago

Hi @SoerenWeber, thank you for the update. Yeah, unfortunately I am also unsure what is going on from this information. It is also possible that there is a bug in the Sentry backend, rather than in the SDK.

One small thing to note about what you said here:

Note that it seems as if the in_progress check-in trails again, but most check-ins are not marked as missed in Sentry (even though the run time of 5 minutes is longer than the grace period of 1 minute).

A check-in is marked as "missed" when the task starts more than 1 minute (or custom grace period) after the expected time. A check-in is not marked as "missed" if the task ends after the grace period, as long as it starts before the grace period ends. Therefore, what you are observing here is expected.

If a check-in starts on time but then takes too long to execute, it is marked as "timed out," not "missed." The check-in gets marked as "timed out" if the task takes longer than the "Max runtime" (default 30 minutes).
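
As an aside, these thresholds can also be set from the SDK by passing a monitor_config upsert to monitor, if that is easier than configuring them in the UI. A sketch, assuming the schedule and grace period from your test setup (please double-check the exact keys against the Crons docs):

from sentry_sdk.crons import monitor

with monitor(
    monitor_slug="external-backup-staging",
    monitor_config={
        "schedule": {"type": "crontab", "value": "*/10 * * * *"},
        "checkin_margin": 1,  # grace period, in minutes
        "max_runtime": 30,    # maximum runtime, in minutes
        "timezone": "UTC",
    },
):
    ...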

Also, regarding this:

Here, the start timestamp sometimes sits at 5-minute increments (even though the task runs at 10-minute increments). The "Expected at" column shows the same time for many check-ins.

You may wish to open an issue in the Sentry repository, since this might be a problem with how the server is displaying the check-in information. In your issue, please reference this issue and also include a link to the page in Sentry where you see the strange data.

antonpirker commented 1 month ago

Hey @SoerenWeber !

When I look at the last screenshot you posted, I noticed that the "started" column states the job is started at intervals of between 5 and 15 minutes. This could confuse our server, because we would expect it to run every 10 minutes. So if the first one (at the bottom) runs at 11:00, we would expect the second and third to come in at 11:10 and 11:20. But in reality they are started at 11:15 (too late) and 11:25 (too late for the third one and too early to be the fourth one).

Maybe you can make sure that it is run every 10 minutes and see if this helps?

SoerenWeber commented 1 month ago

Hi @antonpirker, thank you for looking into this! To the best of my understanding, both my missed check-in problem and the confusing times in the screenshot are the result of a single issue I have with the SDK. I will try to paint a picture here:

Does this make sense?


Thank you for your thoughts on this, @szokeasaurusrex.

Regarding the missed check-ins:

A check-in is marked as "missed" when the task starts more than 1 minute (or custom grace period) after the expected time. A check-in is not marked as "missed" if the task ends after the grace period, as long as it starts before the grace period ends. Therefore, what you are observing here is expected.

If my understanding is correct, I would expect the current behavior to be evaluated as a missed check-in. An exemplary timeline would be:

  1. At 10:00 a job starts.
  2. Because of a bug, no in_progress check-in is sent yet.
  3. The job finishes at 10:05.
  4. Both the ok and in_progress check-ins are sent.

Because the in_progress check-in is later than the grace period of 1 minute, the check-in should be marked as missed for at least 4 minutes until the ok check-in is received. Did I get something wrong here?

szokeasaurusrex commented 1 month ago

Because the in_progress check-in is later than the grace period of 1 minute, the check-in should be marked as missed for at least 4 minutes until the ok check-in is received. Did I get something wrong here?

@SoerenWeber I am unsure about the expected behavior here (perhaps @gaprl can provide some insight), but the SDK records the duration of the cron job and sends it with the ok check-in. So, if Sentry only receives the ok check-in for a task, but not the in_progress check-in, it would still be able to deduce whether the job started on time by subtracting the duration from the ok check-in's timestamp.
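
Your earlier logs illustrate this: the ok check-in at 2024-07-15T18:16:00.505422Z carries a duration of about 15.86 seconds, which points back to a start time of about 18:15:44.645Z, i.e. the in_progress check-in's timestamp (to within microseconds). A quick check with the values from that log excerpt:

from datetime import datetime, timedelta

ok_timestamp = datetime.fromisoformat("2024-07-15T18:16:00.505422+00:00")
duration = timedelta(seconds=15.860358043995802)
print(ok_timestamp - duration)  # 2024-07-15 18:15:44.645064+00:00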

The check-in should definitely be getting marked as "missed" from one minute after the expected start time (assuming no in_progress check-in is received). But, it is quite possible that the check-in changes from "missed" to "okay" if the SDK sends an ok check-in indicating that the task did, in fact, start on time.

gaprl commented 1 month ago

Hey @SoerenWeber, could you please share the URL for the monitor in question? I'd be glad to take a look and see if I can figure things out from our end.

Feel free to reply to this message or email me at crons-feedback@sentry.io. Thanks.

SoerenWeber commented 1 month ago

Thanks, @gaprl! I emailed you with the monitor URL.

Regarding the late in_progress check-in, I took a quick look at the SDK source code. My current hypothesis is that something goes awry in the BackgroundWorker (maybe the thread is killed). From my cursory reading, this could result in a situation in which something is added to the queue that is not immediately processed by the thread. The next time something is added to the queue, another thread would be started that resumes processing all events.
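
To illustrate the hypothesis, here is a simplified sketch of the queue-plus-worker-thread pattern (not the actual SDK code): if the worker thread dies after an item has been queued, that item just sits in the queue until a later submit notices the dead thread, starts a new one, and the backlog is drained. Applied to the logs above, the in_progress check-in would stay queued until the ok check-in is submitted, at which point both monitor envelopes go out back to back.

import queue
import threading

class SketchWorker:
    """Simplified illustration of a background worker, not the SDK implementation."""

    def __init__(self):
        self._queue = queue.Queue()
        self._thread = None

    def _target(self):
        # Process queued callbacks until the thread dies or is killed.
        while True:
            callback = self._queue.get()
            callback()
            self._queue.task_done()

    def _ensure_thread(self):
        # Start a worker thread if none is running (e.g. it was killed earlier).
        if self._thread is None or not self._thread.is_alive():
            self._thread = threading.Thread(target=self._target, daemon=True)
            self._thread.start()

    def submit(self, callback):
        # If the previous thread died after an earlier submit, the old item is
        # still in the queue and only gets processed once this call revives the worker.
        self._ensure_thread()
        self._queue.put(callback)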

szokeasaurusrex commented 1 month ago

Okay, thanks for that information. Let's see what @gaprl finds in the Sentry backend before we continue investigating from the SDK side.