Change the storage system to maintain metrics between sessions

rafeca commented 5 years ago

From https://github.com/atom/telemetry/pull/25, we need to move away from lokijs and change the storage for events not yet sent on the telemetry package:

After looking at this I just got curious about lokijs and how it implemented the storage of events and from what I saw by default lokijs is an in-memory database and does not persist the data by default (so each window currently has its own list of events).

This means that we're currently losing all the events that haven't been sent once the Atom window gets closed, and since we're only sending events once Atom has been opened for 4h, we're losing any event from sessions shorter than 4h.

Using a persistance adapter on LokiJS (as mentioned in the docs) would fix this specific problem, but then we would introduce the concurrency issues when having multiple windows open, since lokijs only syncs the data from the persistance adapter on load:

An important distinction between an in-memory database like lokijs and traditional database systems is that all documents/records are kept in memory and are not loaded as needed. Persistence is therefore only for saving and restoring the state of this in-memory database.

So I see two potential paths forward (there may be more options):

Use a global db instance for all the windows but move away from lokijs and use a system that ensures that multiple concurrent clients are handled correctly (even by accessing IndexedDB directly). This could cause some performance regressions if we have many events, since for every read/write we would need to access the storage system (and serialize/deserialize the data).

Keep using lokijs as an in-memory DB (so each session has its own instance) but persist the data that has not been sent when Atom is closed. Then re-read the unsent data (and delete it from the persistant store immediately) the next time an Atom window gets opened.

The are two potential solutions:

Use dexie as an IndexedDB adapter to store the events between sessions.
Use the IndexedDB APIs directly (this would be more performant but slightly harder to implement).

rafeca commented 5 years ago

I've thought a bit about this and since we don't need any complex querying mechanism (we mostly need to append events when logging stuff and retrieve all the events once a day when sending them to the backend) maybe we can use a different approach for storing the metric events: the raw filesystem.

Basically, we would have different separate files: timing_events.txt, custom_events.txt and counters/<counterName>.txt (fictitious names).

The timing_events.txt/custom_events.txt files would have a separate line for each event with all its metadata, which would get serialized either using JSON or v8.serialize().
For the counters, we would store each counter in a separate file, which would only contain the value of that counter (number of times it has been increased). We could alternatively store all the counters in a single file, but that would force us to have to read this single file on each incrementCounter().

With this solution, this would be the work that the different telemetry methods would have to perform:

addCustomEvent(): Serialize the event and append it to the custom_events.txt file.
addTiming(): Serialize the event and append it to the custom_timings.txt file.
incrementCounter(): Try to read the counters/counterName.txt file. If it exists, increment the correspondant counter, and store the value back to the file. If it doesn't create that file with value 1.
getCustomEvents(): Read the custom_events.txt file, split it by lines and deserialize each line.
getTimings(): Read the timing_events.txt file, split it by lines and deserialize each line.
getCounters(): Find all the counter files (by listing the counters/ folder), read and deserialize each of them.
clearData(): Delete custom_events.txt, timing_events.txt, and the contents of the counters/ folder.

Additionally, since we would access the filesystem asynchronously, we would need some kind of mutex to prevent multiple windows to affect each other. This mutex will be need for the incrementCounter() method and for the transition between the getters and the clearData() method (the mutex is needed in any asynchronous solution that we may want to implement).

Benefits of this approach:

Asynchronous: file system access is async, so we don't block the thread for most of the operations.
The most common methods are really fast: addCustomEvent()/ addTiming() / incrementCounter() do the minimum work possible.
Does not need any external dependencies.

Disadvantages of this approach:

May make the logic a little bit more complex (although it's quite easy to implement).

Thoughts? @jasonrudolph , @nathansobo

jasonrudolph commented 5 years ago

Thoughts? @jasonrudolph , @nathansobo

@rafeca: Thanks for outlining this potential solution. My gut reaction is that this is yet more code for us to maintain, and I'm eager to avoid further growth in the surface area we're maintaining.

Do you have a feel for how much code would be involved in this file-based approach compared to a solution that uses dexie?

nathansobo commented 5 years ago

I also worry about the complexity of interacting directly with the file system. If IndexedDB offers transactions that work across windows, that seems like a more promising path to me.

rafeca commented 5 years ago

Thanks for the feedback! It's fun that both your feedback was posted before my suggestion #TimeZonesAreHard 🤣

Do you have a feel for how much code would be involved in this file-based approach compared to a solution that uses dexie?

I assume that in terms of amount of code it will be quite similar, only slightly less conventional (and probably a bit more convoluted) than just using a DB...

I'll then explore using the IndexedDB APIs directly... if they turn out to be so horrible to use as I remember they were ~7y ago, then I'll explore using lokijs (I'd like to try to avoid adding another dependency that we need to keep updated, etc).

atom / telemetry

Change the storage system to maintain metrics between sessions #26