darklang / dark

Darklang main repo, including language, backend, and infra
https://darklang.com

Move traces from the DB into cloud storage #3954

Open pbiggar opened 2 years ago

pbiggar commented 2 years ago

Plan:

Thoughts:

pbiggar commented 2 years ago

One issue here is that we need to "patch" traces when users are in the UI and press:

Possibly we could make the buttons create new traces, but it makes more sense to have them change the current trace if it hasn't been fully executed already.

My sense is:

pbiggar commented 2 years ago

edit: move to later

pbiggar commented 2 years ago

It actually doesn't make sense to handle multiple iterations in the format: to support iterations (and nested iterations, etc.) the format would need to change enough that it could no longer represent the existing data, and the client would need to change too. So instead this project should focus on getting the data we have into storage.

pbiggar commented 2 years ago

To see what I mean about the nested values, consider this program:

```
handler a:
  [1,2,3,4] |> List.map (\i -> b i)
fn b(i):
  if i % 2 == 0
  then b (i + 1)
  else c i
fn c(i):
  Date::now ()
```

So the function_results we want to store represent the callgraphs:

```
a (loop iteration 0) -> b -> c -> Date::now
a (loop iteration 1) -> b -> b -> c -> Date::now
a (loop iteration 2) -> b -> c -> Date::now
a (loop iteration 3) -> b -> b -> c -> Date::now
```

So to handle this, we need to actually represent the callgraph in the format. Which is fine, but then what do we do with the existing data?
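To make the problem concrete, here is a sketch (in Python, purely illustrative; this is not the actual Darklang trace format) of what a callgraph-aware function_results format might look like: each stored result carries its children, so repeated calls in different iterations stay distinguishable.

```python
from dataclasses import dataclass, field


@dataclass
class FunctionResult:
    fnname: str            # e.g. "b", "c", "Date::now"
    argument_hash: str     # hash of the arguments, as stored today
    result: object         # the recorded return value
    children: list["FunctionResult"] = field(default_factory=list)


# Loop iteration 1 of the example above: a -> b -> b -> c -> Date::now
iteration_1 = FunctionResult(
    fnname="b", argument_hash="hash(2)", result=None,
    children=[
        FunctionResult("b", "hash(3)", None, children=[
            FunctionResult("c", "hash(3)", None, children=[
                FunctionResult("Date::now", "hash()", None),
            ]),
        ]),
    ],
)
```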

pbiggar commented 2 years ago

Looking at the data in the DB and the way traces are returned, we need to handle traces for functions and traces for handlers.

Both have inputs:

Then the other important data is function_results, which are stored with enough info to identify the caller (id, fnname, argument hash), but not quite enough to know which iteration of the call is the right one.

We can avoid having function_arguments in the stored trace format by relying on the inputs and running analysis on them. However, we need to know which user functions are called in the traces. We do store that during analysis, so I think we should be able to attach that data as metadata?
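As a sketch of that metadata idea (all names hypothetical; note the later comment that GCS object metadata can be read back per object but not searched), attaching the called tlids when the trace is uploaded might look like:

```python
from google.cloud import storage

client = storage.Client()
bucket = client.bucket("dark-traces")  # hypothetical bucket name

blob = bucket.blob("canvasID/traces/traceID")  # hypothetical object layout
# Record which user functions this trace touched, as found during analysis.
blob.metadata = {"called_tlids": "914271,881236"}  # hypothetical tlids
blob.upload_from_string('{"input": "..."}', content_type="application/json")
```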

pbiggar commented 2 years ago

We could look at the traces of the handlers pointing to this function, but we'd need to do some reasonably involved analysis to get to the root callers, and we wouldn't have the traces that used to call the functions.

pbiggar commented 2 years ago

I don't see how we could do that while also supporting queries for traces.

pbiggar commented 2 years ago

We can use DaysSinceCustomTime for TTLs: https://cloud.google.com/storage/docs/lifecycle

We can do holds on specific traces: https://cloud.google.com/storage/docs/holding-objects#set-object-hold
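For concreteness, here's a minimal sketch (Python, google-cloud-storage; bucket and object names are hypothetical) of those two features: a DaysSinceCustomTime lifecycle rule, bumping an object's custom time to reset its TTL, and holding a specific trace.

```python
from datetime import datetime, timezone

from google.cloud import storage

client = storage.Client()
bucket = client.bucket("dark-traces")  # hypothetical bucket name

# Delete objects 14 days after their custom time (DaysSinceCustomTime).
bucket.add_lifecycle_delete_rule(days_since_custom_time=14)
bucket.patch()

# "Save" a trace by bumping its custom time, resetting the TTL clock.
blob = bucket.blob("canvasID/traces/traceID")  # hypothetical object layout
blob.custom_time = datetime.now(timezone.utc)
blob.patch()

# Hold a specific trace so the lifecycle rule can't delete it.
blob.temporary_hold = True
blob.patch()
```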

pbiggar commented 2 years ago

When we load traces for a function, we need to have stored somewhere which traces have been saved for that function. We cannot use object metadata to search for this, as GCS doesn't support that.

The most obvious thing is to save it in the DB. So, for example, we could have a table keyed by (traceID, tlid), letting us query by function and maybe order by date. It could also be a single row per traceID, with an array/set of tlids as a field.

The challenge is how to delete this data. It seems the answer is that you can get pubsub notifications for object deletion.

So in that case we'd delete the function association data when a trace is deleted.
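A sketch of that cleanup path, assuming a Pub/Sub subscription wired to the bucket's OBJECT_DELETE notifications (project, subscription, and the DB helper are hypothetical):

```python
from google.cloud import pubsub_v1

subscriber = pubsub_v1.SubscriberClient()
subscription = subscriber.subscription_path("dark-project", "trace-deletions")


def on_message(message):
    # GCS notifications carry the event type and object name as attributes.
    if message.attributes.get("eventType") == "OBJECT_DELETE":
        trace_object = message.attributes["objectId"]  # e.g. "canvasID/traces/traceID"
        delete_trace_function_rows(trace_object)  # hypothetical DB helper
    message.ack()


future = subscriber.subscribe(subscription, callback=on_message)
future.result()  # block forever, processing deletions as they arrive
```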

pbiggar commented 2 years ago

Question:

But we still need to store metadata to ensure we have arguments for any functions. Given that we are traversing functions anyway, surely we should just have the same simple mechanism, being:

Then we can find the last 10 traces for each tlid and mark them to be saved, allowing GCS to delete the others.

This does store data in the DB, but I think we're not in a great position to avoid that.

The alternative would be to store each trace N times, where N is the number of functions.

A final alternative would be to store traces as `{canvasID}/traces/{traceID}` and store metadata as `{canvasID}/tlid/{tlid}/{timestamp}/{traceID}`.

Then we could list them and sort them using the timestamp, getting the latest 10 for each.
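Under that layout, fetching the latest 10 might look like this sketch (hypothetical names; relies on the timestamp path segment sorting lexicographically, e.g. RFC 3339):

```python
from google.cloud import storage

client = storage.Client()
bucket = client.bucket("dark-traces")  # hypothetical bucket name


def latest_traces(canvas_id: str, tlid: str, n: int = 10) -> list[str]:
    # Metadata objects are named {canvasID}/tlid/{tlid}/{timestamp}/{traceID},
    # so a prefix listing plus a lexicographic sort gives chronological order.
    prefix = f"{canvas_id}/tlid/{tlid}/"
    names = sorted(blob.name for blob in client.list_blobs(bucket, prefix=prefix))
    return [name.rsplit("/", 1)[1] for name in names[-n:]]
```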

Essentially, it does seem like storing traces keyed by just canvasID and traceID is good. The remaining question is whether we want to use the DB or cloud storage for the garbage collection mechanism.

The garbage collection is slow and bad because we do big selects that take big locks, and we struggle to delete in bulk. But if our approach is to garbage-collect traces based on what's in use, then GCS will delete the traces itself, and the deletion notifications can call back into the DB to delete the trace metadata we're storing.

However, if the trace metadata is in the bucket, then the trace metadata can garbage collect itself, and we can traverse it to get the latest.

Ultimately, I think using a DB for this is simpler.

pbiggar commented 2 years ago

OK, so here's how it will all work:

pbiggar commented 2 years ago
pbiggar commented 2 years ago

For the execute buttons:

pbiggar commented 2 years ago

For 404s:

pbiggar commented 2 years ago

tracking

- Initial setup
- save data to the local emulator
- Test can fetch uploaded traces and trace metadata
- spike loading data in the client
- Design for 404s
- support execute_handler button
- support execute function buttons
- add garbage collection
- finalize new implementation rough edges
- monitoring
- local support
- migrate existing users
- delete old tables

save for later

- new format
- optimizations

pbiggar commented 1 year ago

Idea for reducing the amount of stuff that uses the DB. We'd like to use cloud storage for more metadata, specifically for finding the last 10 traces for each handler in a canvas.

This will mean we can stay out of the DB more. It seems like the only metadata we need is the tlids of the functions that are called, so that users can see traces for those functions. The other roles could be handled directly on GCS.

To find the last 10 traces for a single TLID, we could:

Things in the DB now:

Most things can be done with the DB, and it's probably even more convenient to do them with the DB, but for later scale we'd really prefer to keep as much as possible out of the DB and go directly to cloud storage.

To make this usable, I propose the following format changes from my initial version:

pbiggar commented 1 year ago

Latest thinking:

But if we assume this for the future, then the changes we can make now are:

What we gain:

What we lose:

pbiggar commented 1 year ago
pbiggar commented 1 year ago
pbiggar commented 1 year ago
StachuDotNet commented 1 year ago

(context: garbage collection)

> a job that goes through all canvases and tlids and saves the last 10 by updating the TTL

We'll need a high degree of confidence that this runs regularly and without issue, otherwise we could lose users' traces, right? Reminder to self: ensure we have thorough testing (and pagerduty) around this functionality.
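For reference, a sketch of that job under the DaysSinceCustomTime scheme sketched earlier (all helpers hypothetical): re-stamp the custom time on the newest 10 traces per tlid, so only the rest age out.

```python
from datetime import datetime, timezone

from google.cloud import storage

client = storage.Client()
bucket = client.bucket("dark-traces")  # hypothetical bucket name

for canvas_id in all_canvas_ids():                       # hypothetical DB helper
    for tlid in all_tlids(canvas_id):                    # hypothetical DB helper
        for trace_id in latest_traces(canvas_id, tlid):  # see the earlier sketch
            blob = bucket.blob(f"{canvas_id}/traces/{trace_id}")
            blob.custom_time = datetime.now(timezone.utc)  # reset the TTL clock
            blob.patch()
```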


(no response needed, just thinking out loud) I wonder how this will play out once Dark implements "caching." Maybe those will turn out to have special traces of function calls, that can be referenced in 'normal' traces. :thinking:

pbiggar commented 1 year ago

> (no response needed, just thinking out loud) I wonder how this will play out once Dark implements "caching." Maybe those will turn out to have special traces of function calls, that can be referenced in 'normal' traces. 🤔

I'm not sure what "caching" means. The plan is certainly for unit tests to be special traces.

StachuDotNet commented 1 year ago

The idea is mostly off topic, so I created a discussion for the base idea: https://github.com/darklang/dark/discussions/4673

Given that idea, the results of the 'cached' value don't necessarily need to be copied into every relevant trace. The above was me thinking through (vaguely) that it might be useful for these 'cache traces' to be stored separately. I'm not sure how the 'main' traces would reference those, though.

StachuDotNet commented 8 months ago

a few things don't yet work in the new trace system

(move this to a new issue)