pbiggar opened 2 years ago
One issue here is that we need to "patch" traces when users are in the UI and press:
Possibly we could make the buttons create new traces, but it makes more sense to have them change the current trace if it hasn't been fully executed already.
My sense is:
edit: move to later
It actually doesn't make sense to handle multiple iterations in the format, because in order to handle iterations (and nested iterations, etc.) the format would need to be different enough that it couldn't represent the existing data. It would also need to change in the client. So instead this project should focus on getting the data we have into storage.
To see what I mean about the nested values, consider this program:
```
handler a:
  [1,2,3,4] |> List.map (\i -> b i)

fn b(i):
  if i % 2 == 0
  then b (i + 1)
  else c i

fn c(i):
  date::now ()
```
So the `function_results` we want to store represent these callgraphs:
```
a (loop iteration 0) -> b -> c -> date::now
a (loop iteration 1) -> b -> b -> c -> date::now
a (loop iteration 2) -> b -> c -> date::now
a (loop iteration 3) -> b -> b -> c -> date::now
```
So to handle this, we need to actually represent the callgraph in the format. Which is fine, but then what do we do with the existing data?
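To make that concrete, here's a minimal sketch (a hypothetical shape, not the actual Dark wire format) of what a callgraph-aware `function_results` entry could look like — each result records the calls it made, so nested and repeated iterations are unambiguous:

```python
from dataclasses import dataclass, field
from typing import List

# Hypothetical sketch only; the real Dark trace format would differ.
@dataclass
class FunctionResult:
    tlid: int            # which expression made the call
    fn_name: str         # e.g. "date::now"
    arg_hash: str        # hash of the arguments
    result: object       # the value returned
    # calls made while evaluating this call
    children: List["FunctionResult"] = field(default_factory=list)

@dataclass
class Trace:
    trace_id: str
    input: dict          # the event/request
    # one root per loop iteration, e.g. four roots for the
    # List.map over [1,2,3,4] in the example above
    roots: List[FunctionResult] = field(default_factory=list)
```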
Looking at the data in the DB and the way traces are returned, we need to handle traces for functions and traces for handlers.
Both have inputs (`event` / `request`), saved in `stored_events`. The other important data is `function_results`, which are stored with enough info to identify the caller (id, fnname, argument hash), but not quite enough to know which iteration of the call is the right one.
We can avoid having `function_arguments` in the stored trace format by relying on the inputs and running the analysis on them. However, we need to know which user functions are called in the traces. We do store that during analysis, so I think we should be able to attach that data as metadata?
We could look at the traces of the handlers pointing to this function, but we'd need to do some reasonably involved analysis to get to the root callers, and we wouldn't have traces from handlers that used to call the function.
I don't see how we could do that while also supporting queries for traces.
We can use DaysSinceCustomTime for TTLs: https://cloud.google.com/storage/docs/lifecycle
We can do holds on specific traces: https://cloud.google.com/storage/docs/holding-objects#set-object-hold
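As a reference point, a minimal sketch of both mechanisms using the Python GCS client (bucket and object names are placeholders):

```python
from datetime import datetime, timezone
from google.cloud import storage

client = storage.Client()
bucket = client.bucket("dark-traces")  # placeholder bucket name

# Lifecycle rule: delete objects 30 days after their customTime
# (the DaysSinceCustomTime condition from the docs above).
bucket.add_lifecycle_delete_rule(days_since_custom_time=30)
bucket.patch()

# Per-object: refresh customTime to push deletion out, or set a
# temporary hold to prevent deletion entirely.
blob = bucket.get_blob("canvasID/traceID")  # placeholder object key
blob.custom_time = datetime.now(timezone.utc)
blob.temporary_hold = True
blob.patch()
```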
When we load traces for a function, we need to have stored somewhere which traces have been saved for that function. We can't use object metadata to search for this, as GCS doesn't support searching by metadata.
The most obvious thing is to save it in the DB. For example, we could have a trace/function primary key, so we can query by function, perhaps ordering by date. It could also be a single traceID, with an array/set of tlids as a field in it.
The challenge is how to delete this data. It seems the answer is that we can get Pub/Sub notifications for object deletion, so we'd delete the function-association data when a trace is deleted.
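Sketching what that could look like (the `trace_functions` table and all names here are hypothetical, assuming GCS Pub/Sub notifications delivering `OBJECT_DELETE` events):

```python
import json
import psycopg2

conn = psycopg2.connect("dbname=dark")  # placeholder connection string

# Hypothetical table mapping traces to the toplevels they cover.
with conn.cursor() as cur:
    cur.execute(
        """
        CREATE TABLE IF NOT EXISTS trace_functions (
          canvas_id  UUID NOT NULL,
          tlid       BIGINT NOT NULL,      -- handler/function this trace covers
          trace_id   UUID NOT NULL,
          created_at TIMESTAMPTZ NOT NULL DEFAULT now(),
          PRIMARY KEY (canvas_id, tlid, trace_id)
        )
        """
    )
conn.commit()

def on_gcs_notification(message):
    # GCS Pub/Sub notifications put the event type in the message
    # attributes; the payload is the object resource as JSON.
    if message.attributes.get("eventType") != "OBJECT_DELETE":
        message.ack()
        return
    payload = json.loads(message.data)
    # Object name is "{canvasID}/{traceID}" under the proposed layout.
    canvas_id, trace_id = payload["name"].split("/", 1)
    with conn.cursor() as cur:
        cur.execute(
            "DELETE FROM trace_functions WHERE canvas_id = %s AND trace_id = %s",
            (canvas_id, trace_id),
        )
    conn.commit()
    message.ack()
```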
Question: should traces be stored as `{canvasID}/{tlid}/{timestamp}/{traceID}`, or just `{canvasID}/{traceID}`?
If we store them by tlid/timestamp, then we can easily find the last 10 items just by listing by prefix. But we still need to store metadata to ensure we have arguments for any functions. Given that we are traversing functions anyway, surely we should just have the same simple mechanism, being:
Then we can find the last 10 traces for each tlid and mark them to be saved, allowing GCS to delete the others.
This does store data in the DB, but I think we're not in a great position to avoid that.
The alternative would be to store each trace N times, where N is the number of functions.
A final alternative would be to store traces as `{canvasID}/traces/{traceID}` and store metadata as `{canvasID}/tlid/{tlid}/{timestamp}/{traceID}`.
Then we could list them and sort them using the timestamp, getting the latest 10 for each.
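A sketch of that listing (assuming timestamps in the key are fixed-width and lexicographically sortable, e.g. RFC3339 or zero-padded epoch, so string order is chronological):

```python
from google.cloud import storage

def latest_traces(bucket_name: str, canvas_id: str, tlid: int, n: int = 10):
    # Lists everything under the tlid prefix; fine while trace counts
    # per tlid stay small, since GCS listing is paginated anyway.
    client = storage.Client()
    prefix = f"{canvas_id}/tlid/{tlid}/"
    names = [b.name for b in client.list_blobs(bucket_name, prefix=prefix)]
    return sorted(names)[-n:]
```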
Essentially, it does seem like storing traces by just `{canvasID}/{traceID}` is good. The remaining question is whether we want to use the DB or cloud storage for the garbage-collection mechanism.
The garbage collection today is slow and bad because we do big selects with big locks, and we struggle to delete in bulk. But if our approach is to garbage collect traces based on what's in use, then GCS will delete the traces, and the deletion notifications can call back to the DB to delete the trace metadata we're storing.
However, if the trace metadata is in the bucket, then the trace metadata can garbage collect itself, and we can traverse it to get the latest.
Ultimately, I think using a DB for this is simpler.
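For comparison, the DB side (against the hypothetical `trace_functions` table sketched above) is a single indexed query:

```python
def latest_trace_ids(cur, canvas_id: str, tlid: int, n: int = 10):
    # cur is a psycopg2 cursor; trace_functions is the hypothetical
    # table from the earlier sketch.
    cur.execute(
        """
        SELECT trace_id FROM trace_functions
        WHERE canvas_id = %s AND tlid = %s
        ORDER BY created_at DESC
        LIMIT %s
        """,
        (canvas_id, tlid, n),
    )
    return [row[0] for row in cur.fetchall()]
```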
- a single trace is collected by an execution. The trace contains:
- we store this in cloud storage at `{canvasID}/{traceID}`
- we store metadata about who uses the trace in a new table in our DB
- it's garbage collected as follows:
- `{canvasID}/{traceID}`
- `{canvasID}/{traceID}/{timestamp}/{randomNumber}`
- `{canvasID}/404s/{timestamp}`
- tracking
Idea for reducing the amount of stuff that uses the DB: we'd like to use cloud storage for more metadata, specifically for finding the last 10 traces for each handler in a canvas.
This will mean we can stay out of the DB more. It seems like the only metadata we need is the tlids of functions that are called, so that users can see traces for those functions. The other roles could be done directly on GCS.
To find the last 10 traces for a single TLID, we could:
Things in the DB now:
Most things can be done with the DB, and it's probably even more convenient to do them with the DB, but for later scale we'd really prefer to keep as much as possible out of the DB by going directly to cloud storage.
To make this usable, I propose the following format changes from my initial version:
"{canvasID}/{handlerTLID}/{traceID}/{suffixID}"
Latest thinking:
But if we assume this for the future, then the changes we can make now are:
What we gain:
What we lose:
(context: garbage collection) a job that goes through all canvases and tlids and saves the last 10 by updating the TTL
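A minimal sketch of such a job (assuming the `{canvasID}/tlid/{tlid}/{timestamp}/{traceID}` metadata layout and the DaysSinceCustomTime lifecycle rule from above; all names are placeholders):

```python
from datetime import datetime, timezone
from google.cloud import storage

def refresh_recent_traces(bucket_name: str, canvas_id: str, tlid: int,
                          keep: int = 10):
    # Bump customTime on the newest `keep` traces so the
    # DaysSinceCustomTime lifecycle rule doesn't collect them; older
    # traces keep a stale customTime and age out. (GCS only allows
    # customTime to increase, which this satisfies.)
    client = storage.Client()
    prefix = f"{canvas_id}/tlid/{tlid}/"
    blobs = sorted(client.list_blobs(bucket_name, prefix=prefix),
                   key=lambda b: b.name)
    for blob in blobs[-keep:]:
        blob.custom_time = datetime.now(timezone.utc)
        blob.patch()
```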
We'll need a high degree of confidence that this runs regularly and without issue; otherwise we can lose users' traces, right? Reminder to self: ensure we have thorough testing (and PagerDuty) around this functionality.
(no response needed, just thinking out loud) I wonder how this will play out once Dark implements "caching." Maybe those will turn out to have special traces of function calls that can be referenced in 'normal' traces. 🤔
I'm not sure what "caching" means. The plan is certainly for unit tests to be special traces.
The idea is mostly off topic, so I created a discussion for the base idea: https://github.com/darklang/dark/discussions/4673
Given that idea, the results of the 'cached' value don't necessarily need to be copied into every relevant trace. The above was me thinking through (vaguely) that it might be useful for these 'cache traces' to be stored separately. I'm not sure how the 'main' traces would reference those, though.
A few things don't yet work in the new trace system
(move this to a new issue)
Plan:
Thoughts: