getsentry / sentry

Developer-first error tracking and performance monitoring
https://sentry.io

Crons: check-ins as events #43285

Open dcramer opened 1 year ago

dcramer commented 1 year ago

Problem Statement

We'd like to offer a set of functionality commonly associated with Sentry events, such as node data (contexts, tags, environment), as well as attachments. While we could brute-force that into the check-in models, it'd be a lot cleaner to just create a new event dataset.
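To make the proposal concrete, a check-in modeled as a Sentry-style event might carry the same node data as other events. This is a minimal sketch for discussion only; every field name here is an assumption, not the actual schema:

```python
# Illustrative shape of a cron check-in modeled as a Sentry-style event.
# All field names are assumptions for discussion, not the real dataset schema.
checkin_event = {
    "event_id": "a" * 32,
    "type": "check_in",                # hypothetical event type for the new dataset
    "monitor_slug": "nightly-backup",  # akin to a transaction name
    "status": "error",                 # e.g. ok / error / missed
    "environment": "production",
    "tags": {"host": "worker-3"},
    "contexts": {
        "trace": {"trace_id": "b" * 32},
    },
    # Large blobs like stdout/stderr would live in attachments,
    # referenced from the event rather than inlined in tags/contexts:
    "attachments": ["stderr.txt"],
}
```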

Solution Brainstorm

Some thoughts on implementation details and progressive options.

End state:

The path to get there can involve a few duct-tape steps:

getsantry[bot] commented 1 year ago

Routing to @getsentry/crons for triage ⏲️

dcramer commented 1 year ago

also FWIW, it's totally OK (aka don't ask permission) to duct-tape this in today and remove it later - it will let us validate UX and other concerns in parallel. We should just make sure search & storage + the new issue platform work can support this use case, but to me this is a standard Sentry Future use case.

volokluev commented 1 year ago

You say this is supposed to be validating a new UX. Is there a description I can read somewhere for what that UX actually looks like?

dcramer commented 1 year ago

@volokluev publishing stderr and debug information with a cron failure

nikhars commented 1 year ago

> independent of these, we could create a new snuba dataset for these checkins, with the goal to remove the checkin instances - we'd keep the check-in API still, meaning attachments would just be rewritten, nodestore rewritten, etc to go to the universal datasets

The idea of removing check-in events from Snuba is something ClickHouse does not support well. ClickHouse is not designed for mutable datasets. (We do support mutability on the errors table today, but that is restricted to 1-2 mutations per second and is a human-triggered operation, as compared to something like a cron job, which would trigger this much more often.)

dcramer commented 1 year ago

> independent of these, we could create a new snuba dataset for these checkins, with the goal to remove the checkin instances - we'd keep the check-in API still, meaning attachments would just be rewritten, nodestore rewritten, etc to go to the universal datasets

> The idea of removal of checkin events from Snuba is something which Clickhouse does not support well. Clickhouse is not designed for mutable datasets. (We do support mutability on the errors table today but that is restricted to 1-2 mutations per second and is a human generated operation as compared to something like cron job which would trigger this to happen more often)

we don't need check-ins removed outside of normal TTLs (check-ins are the events associated with a monitor, which is akin to a transaction name)

for clarity, I mean we could remove the check-ins Postgres table we have today

evanh commented 1 year ago

I don't think putting stdout/stderr into context or tags is a good idea, particularly if we want to be able to search those fields in some way. Those are arbitrarily large text objects, and searching them in the tags/contexts columns will be extremely slow. That implies storing them in a different way, depending on how we expect users to want to use those fields.

dcramer commented 1 year ago

@evanh no one is talking about stdout/stderr in context/tags - we're going to put those into attachments
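The division of labor being settled here (large blobs as attachments, small searchable fields as tags/contexts) can be sketched in Python. The attachment dict shape below is an assumption for illustration, not Sentry's actual envelope format:

```python
import subprocess
import sys

def run_job_with_capture(cmd):
    """Run a cron job and capture stderr so it can travel with the
    check-in as an attachment instead of as a searchable tag/context."""
    proc = subprocess.run(cmd, capture_output=True, text=True)
    attachments = []
    if proc.stderr:
        # This dict shape is illustrative only, not the SDK's envelope format.
        attachments.append({
            "filename": "stderr.txt",
            "content_type": "text/plain",
            "data": proc.stderr.encode("utf-8"),
        })
    status = "ok" if proc.returncode == 0 else "error"
    return status, attachments

# Example: a failing job whose stderr we want attached to the check-in.
status, attachments = run_job_with_capture(
    [sys.executable, "-c", "import sys; sys.exit('backup failed')"]
)
```

The check-in event itself would then keep only small, indexable fields, with the stderr blob referenced as an attachment.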

evanh commented 1 year ago

> no one is talking about stdout/stderr in context/tags - we're going to put those into attachments

My mistake. OK then I don't have much else to add. This new Dataset seems like a good idea, assuming we don't need to delete check-ins outside of TTLs.

getsantry[bot] commented 1 year ago

Routing to @getsentry/product-owners-crons for triage ⏲️