darklang / dark

Darklang main repo, including language, backend, and infra
https://darklang.com

Move traces from the DB into cloud storage #3954

Open pbiggar opened 2 years ago

pbiggar commented 2 years ago

Plan:

Thoughts:

pbiggar commented 2 years ago

One issue here is that we need to "patch" traces when users are in the UI and press:

Possibly we could make the buttons create new traces, but it makes more sense to have them change the current trace if it hasn't been fully executed already.

My sense is:

pbiggar commented 2 years ago

edit: move to later

pbiggar commented 2 years ago

It actually doesn't make sense to handle multiple iterations in the format: to support iterations (and nested iterations, etc.) the format would need to change enough that it could no longer represent the existing data, and the client would need to change too. So instead this project should focus on getting the data we have into storage.

pbiggar commented 2 years ago

To see what I mean about the nested values, consider this program:

```
handler a:
  [1,2,3,4] |> List.map (\i -> b i)
fn b(i):
  if i % 2 == 0
  then b (i + 1)
  else c i
fn c(i):
  Date::now ()
```

So the function_results we want to store represent the callgraphs:

```
a (loop iteration 0) -> b -> c -> Date::now
a (loop iteration 1) -> b -> b -> c -> Date::now
a (loop iteration 2) -> b -> c -> Date::now
a (loop iteration 3) -> b -> b -> c -> Date::now
```

So to handle this, we need to actually represent the callgraph in the format. Which is fine, but then what do we do with the existing data?
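To make the problem concrete, here is a sketch (in Python, purely illustrative; this is not the actual Darklang trace format) of what a callgraph-aware function_results format might look like: each stored result carries its children, so repeated calls in different iterations stay distinguishable.

```python
from dataclasses import dataclass, field


@dataclass
class FunctionResult:
    fnname: str            # e.g. "b", "c", "Date::now"
    argument_hash: str     # hash of the arguments, as stored today
    result: object         # the recorded return value
    children: list["FunctionResult"] = field(default_factory=list)


# Loop iteration 1 of the example above: a -> b -> b -> c -> Date::now
iteration_1 = FunctionResult(
    fnname="b", argument_hash="hash(2)", result=None,
    children=[
        FunctionResult("b", "hash(3)", None, children=[
            FunctionResult("c", "hash(3)", None, children=[
                FunctionResult("Date::now", "hash()", None),
            ]),
        ]),
    ],
)
```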

pbiggar commented 2 years ago

Looking at the data in the DB and the way traces are returned, we need to handle traces for functions and traces for handlers.

Both have inputs:

Then the other important data is function_results, which are stored with enough info to identify the caller (id, fnname, argument hash), but not quite enough to know which iteration of the call is the right one.

We can avoid having function_arguments in the stored trace format by relying on the inputs and running analysis on them. However, we need to know which user functions are called in the traces. We do store that during analysis, so I think we should be able to attach that data as metadata?
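As a sketch of that metadata idea (all names hypothetical; note the later comment that GCS object metadata can be read back per object but not searched), attaching the called tlids when the trace is uploaded might look like:

```python
from google.cloud import storage

client = storage.Client()
bucket = client.bucket("dark-traces")  # hypothetical bucket name

blob = bucket.blob("canvasID/traces/traceID")  # hypothetical object layout
# Record which user functions this trace touched, as found during analysis.
blob.metadata = {"called_tlids": "914271,881236"}  # hypothetical tlids
blob.upload_from_string('{"input": "..."}', content_type="application/json")
```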

pbiggar commented 2 years ago

We could look at the traces of the handlers pointing to this function, but we'd need to do some reasonably involved analysis to get to the root callers, and we wouldn't have the traces that used to call the functions.

pbiggar commented 2 years ago

I don't see how we could do that while also supporting queries for traces.

pbiggar commented 2 years ago

We can use DaysSinceCustomTime for TTLs: https://cloud.google.com/storage/docs/lifecycle

We can do holds on specific traces: https://cloud.google.com/storage/docs/holding-objects#set-object-hold
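For concreteness, here's a minimal sketch (Python, google-cloud-storage; bucket and object names are hypothetical) of those two features: a DaysSinceCustomTime lifecycle rule, bumping an object's custom time to reset its TTL, and holding a specific trace.

```python
from datetime import datetime, timezone

from google.cloud import storage

client = storage.Client()
bucket = client.bucket("dark-traces")  # hypothetical bucket name

# Delete objects 14 days after their custom time (DaysSinceCustomTime).
bucket.add_lifecycle_delete_rule(days_since_custom_time=14)
bucket.patch()

# "Save" a trace by bumping its custom time, resetting the TTL clock.
blob = bucket.blob("canvasID/traces/traceID")  # hypothetical object layout
blob.custom_time = datetime.now(timezone.utc)
blob.patch()

# Hold a specific trace so the lifecycle rule can't delete it.
blob.temporary_hold = True
blob.patch()
```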

pbiggar commented 2 years ago

When we load traces for a function, we need to have stored somewhere which traces have been saved for that function. We cannot use object metadata to search for this, as GCS doesn't support that.

The most obvious thing is to save it in the DB. So, for example, we could have a table keyed by (traceID, tlid), letting us query by function and maybe order by date. It could also be a single row per traceID, with an array/set of tlids as a field.

The challenge is how to delete this data. It seems the answer is that you can get pubsub notifications for object deletion.

So in that case we'd delete the function association data when a trace is deleted.
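A sketch of that cleanup path, assuming a Pub/Sub subscription wired to the bucket's OBJECT_DELETE notifications (project, subscription, and the DB helper are hypothetical):

```python
from google.cloud import pubsub_v1

subscriber = pubsub_v1.SubscriberClient()
subscription = subscriber.subscription_path("dark-project", "trace-deletions")


def on_message(message):
    # GCS notifications carry the event type and object name as attributes.
    if message.attributes.get("eventType") == "OBJECT_DELETE":
        trace_object = message.attributes["objectId"]  # e.g. "canvasID/traces/traceID"
        delete_trace_function_rows(trace_object)  # hypothetical DB helper
    message.ack()


future = subscriber.subscribe(subscription, callback=on_message)
future.result()  # block forever, processing deletions as they arrive
```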

pbiggar commented 2 years ago

Question:

But we still need to store metadata to ensure we have arguments for any functions. Given that we are traversing functions anyway, surely we should just have the same simple mechanism, being:

Then we can find the last 10 traces for each tlid and mark them to be saved, allowing GCS to delete the others.

This does store data in the DB, but I think we're not in a great position to avoid that.

The alternative would be to store each trace N times, where N is the number of functions.

A final alternative would be to store traces as `{canvasID}/traces/{traceID}` and store metadata as `{canvasID}/tlid/{tlid}/{timestamp}/{traceID}`.

Then we could list them and sort them using the timestamp, getting the latest 10 for each.
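Under that layout, fetching the latest 10 might look like this sketch (hypothetical names; relies on the timestamp path segment sorting lexicographically, e.g. RFC 3339):

```python
from google.cloud import storage

client = storage.Client()
bucket = client.bucket("dark-traces")  # hypothetical bucket name


def latest_traces(canvas_id: str, tlid: str, n: int = 10) -> list[str]:
    # Metadata objects are named {canvasID}/tlid/{tlid}/{timestamp}/{traceID},
    # so a prefix listing plus a lexicographic sort gives chronological order.
    prefix = f"{canvas_id}/tlid/{tlid}/"
    names = sorted(blob.name for blob in client.list_blobs(bucket, prefix=prefix))
    return [name.rsplit("/", 1)[1] for name in names[-n:]]
```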

Essentially, it does seem like storing traces keyed by just canvasID and traceID is good. The remaining question is whether we want to use the DB or cloud storage for the garbage collection mechanism.

The garbage collection is slow and bad because we do big selects that take big locks, and we struggle to delete in bulk. But if our approach is to garbage-collect traces based on what's in use, then GCS will delete the traces itself, and the deletion notifications can call back into the DB to delete the trace metadata we're storing.

However, if the trace metadata is in the bucket, then the trace metadata can garbage collect itself, and we can traverse it to get the latest.

Ultimately, I think using a DB for this is simpler.

pbiggar commented 2 years ago

OK, so here's how it will all work:

pbiggar commented 2 years ago
pbiggar commented 2 years ago

For the execute buttons:

pbiggar commented 2 years ago

For 404s:

pbiggar commented 2 years ago

tracking

- Initial setup
- save data to the local emulator
- Test can fetch uploaded traces and trace metadata
- spike loading data in the client
- Design for 404s
- support execute_handler button
- support execute function buttons
- add garbage collection
- finalize new implementation rough edges
- monitoring
- local support
- migrate existing users
- delete old tables

save for later

- new format
- optimizations

pbiggar commented 1 year ago

Idea for reducing the amount of stuff that uses the DB. We'd like to use cloud storage for more metadata, specifically for finding the last 10 traces for each handler in a canvas.

This will mean we can stay out of the DB more. It seems like the only metadata we need is the tlids of the functions that are called, so that users can see traces for those functions. The other roles could be handled directly on GCS.

To find the last 10 traces for a single TLID, we could:

Things in the DB now:

Most things can be done with the DB, and it's probably even more convenient to do them with the DB, but for later scale we'd really prefer to keep as much as possible out of the DB and go directly to cloud storage.

To make this usable, I propose the following format changes from my initial version:

pbiggar commented 1 year ago

Latest thinking:

But if we assume this for the future, then the changes we can make now are:

What we gain:

What we lose:

pbiggar commented 1 year ago
pbiggar commented 1 year ago
pbiggar commented 1 year ago
StachuDotNet commented 1 year ago

(context: garbage collection)

> a job that goes through all canvases and tlids and saves the last 10 by updating the TTL

We'll need a high degree of confidence that this runs regularly and without issue, otherwise we could lose users' traces, right? Reminder to self: ensure we have thorough testing (and pagerduty) around this functionality.
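For reference, a sketch of that job under the DaysSinceCustomTime scheme sketched earlier (all helpers hypothetical): re-stamp the custom time on the newest 10 traces per tlid, so only the rest age out.

```python
from datetime import datetime, timezone

from google.cloud import storage

client = storage.Client()
bucket = client.bucket("dark-traces")  # hypothetical bucket name

for canvas_id in all_canvas_ids():                       # hypothetical DB helper
    for tlid in all_tlids(canvas_id):                    # hypothetical DB helper
        for trace_id in latest_traces(canvas_id, tlid):  # see the earlier sketch
            blob = bucket.blob(f"{canvas_id}/traces/{trace_id}")
            blob.custom_time = datetime.now(timezone.utc)  # reset the TTL clock
            blob.patch()
```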


(no response needed, just thinking out loud) I wonder how this will play out once Dark implements "caching." Maybe those will turn out to have special traces of function calls, that can be referenced in 'normal' traces. :thinking:

pbiggar commented 1 year ago

> (no response needed, just thinking out loud) I wonder how this will play out once Dark implements "caching." Maybe those will turn out to have special traces of function calls, that can be referenced in 'normal' traces. 🤔

I'm not sure what "caching" means. The plan is certainly for unit tests to be special traces.

StachuDotNet commented 1 year ago

The idea is mostly off topic, so I created a discussion for the base idea: https://github.com/darklang/dark/discussions/4673

Given that idea, the results of the 'cached' value don't necessarily need to be copied into every relevant trace. The above was me thinking through (vaguely) that it might be useful for these 'cache traces' to be stored separately. I'm not sure how the 'main' traces would reference those, though.

StachuDotNet commented 8 months ago

a few things don't yet work in the new trace system

(move this to a new issue)