getsentry / sentry

Developer-first error tracking and performance monitoring
https://sentry.io

Crons: check-ins as events #43285

Open dcramer opened 1 year ago

dcramer commented 1 year ago

Problem Statement

We'd like to offer a set of functionality commonly associated with Sentry events, such as node data (contexts, tags, environment), as well as attachments. While we could brute-force that into the check-in models, it'd be a lot cleaner to just create a new event dataset.
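To make the proposal concrete, a check-in modeled as a Sentry-style event might carry the same node data as other events. This is a minimal sketch for discussion only; every field name here is an assumption, not the actual schema:

```python
# Illustrative shape of a cron check-in modeled as a Sentry-style event.
# All field names are assumptions for discussion, not the real dataset schema.
checkin_event = {
    "event_id": "a" * 32,
    "type": "check_in",                # hypothetical event type for the new dataset
    "monitor_slug": "nightly-backup",  # akin to a transaction name
    "status": "error",                 # e.g. ok / error / missed
    "environment": "production",
    "tags": {"host": "worker-3"},
    "contexts": {
        "trace": {"trace_id": "b" * 32},
    },
    # Large blobs like stdout/stderr would live in attachments,
    # referenced from the event rather than inlined in tags/contexts:
    "attachments": ["stderr.txt"],
}
```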

Solution Brainstorm

Some thoughts on implementation details and progressive options.

End state:

The path to get there can involve a few duct-tape steps:

getsantry[bot] commented 1 year ago

Routing to @getsentry/crons for triage ⏲️

dcramer commented 1 year ago

also FWIW, it's totally OK (aka don't ask permission) to duct-tape this in today and remove it later - it will let us validate UX and other concerns in parallel. We should just make sure search & storage + the new issue platform work can support this use case, but to me this is a standard Sentry Future use case.

volokluev commented 1 year ago

You say this is supposed to be validating a new UX. Is there a description I can read somewhere for what that UX actually looks like?

dcramer commented 1 year ago

@volokluev publishing stderr and debug information with a cron failure

nikhars commented 1 year ago

> independent of these, we could create a new snuba dataset for these checkins, with the goal to remove the checkin instances - we'd keep the check-in API still, meaning attachments would just be rewritten, nodestore rewritten, etc to go to the universal datasets

The idea of removing check-in events from Snuba is something ClickHouse does not support well. ClickHouse is not designed for mutable datasets. (We do support mutability on the errors table today, but that is restricted to 1-2 mutations per second and is a human-triggered operation, as compared to something like a cron job, which would trigger this much more often.)

dcramer commented 1 year ago

> independent of these, we could create a new snuba dataset for these checkins, with the goal to remove the checkin instances - we'd keep the check-in API still, meaning attachments would just be rewritten, nodestore rewritten, etc to go to the universal datasets

> The idea of removal of checkin events from Snuba is something which Clickhouse does not support well. Clickhouse is not designed for mutable datasets. (We do support mutability on the errors table today but that is restricted to 1-2 mutations per second and is a human generated operation as compared to something like cron job which would trigger this to happen more often)

we don't need check-ins removed outside of normal TTLs (check-ins are the events associated with a monitor, which is akin to a transaction name)

for clarity, I mean we could remove the check-ins Postgres table we have today

evanh commented 1 year ago

I don't think putting stdout/stderr into context or tags is a good idea, particularly if we want to be able to search those fields in some way. Those are arbitrarily large text objects, and searching them in the tags/contexts columns will be extremely slow. That implies storing them in a different way, depending on how we expect users to want to use those fields.

dcramer commented 1 year ago

@evanh no one is talking about stdout/stderr in context/tags - we're going to put those into attachments
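The division of labor being settled here (large blobs as attachments, small searchable fields as tags/contexts) can be sketched in Python. The attachment dict shape below is an assumption for illustration, not Sentry's actual envelope format:

```python
import subprocess
import sys

def run_job_with_capture(cmd):
    """Run a cron job and capture stderr so it can travel with the
    check-in as an attachment instead of as a searchable tag/context."""
    proc = subprocess.run(cmd, capture_output=True, text=True)
    attachments = []
    if proc.stderr:
        # This dict shape is illustrative only, not the SDK's envelope format.
        attachments.append({
            "filename": "stderr.txt",
            "content_type": "text/plain",
            "data": proc.stderr.encode("utf-8"),
        })
    status = "ok" if proc.returncode == 0 else "error"
    return status, attachments

# Example: a failing job whose stderr we want attached to the check-in.
status, attachments = run_job_with_capture(
    [sys.executable, "-c", "import sys; sys.exit('backup failed')"]
)
```

The check-in event itself would then keep only small, indexable fields, with the stderr blob referenced as an attachment.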

evanh commented 1 year ago

> no one is talking about stdout/stderr in context/tags - we're going to put those into attachments

My mistake. OK then I don't have much else to add. This new Dataset seems like a good idea, assuming we don't need to delete check-ins outside of TTLs.

getsantry[bot] commented 1 year ago

Routing to @getsentry/product-owners-crons for triage ⏲️