MegaAntiCheat / masterbase

API/Data Platform for Ingesting, Storing, and Serving Data through Postgres and Litestar

On-demand database exports #73

Closed kavorite closed 1 week ago

kavorite commented 2 weeks ago

The intent is that this endpoint is mutually exclusive with #71 as upstream functionality: one or the other should be chosen for implementation. I'm personally in favor of this model (it is totally not because I got to use previously unexplored coroutine hacks).

kavorite commented 2 weeks ago

This branch lacks compression, but it probably wouldn't be that difficult to implement.

kavorite commented 2 weeks ago

update: the hacks were too hacky

kavorite commented 1 week ago

Implementing the integration test left only one more housekeeping item: finding a way to prevent threads that read and write cached results from racing. That includes unnecessarily overwriting the cache file, reading to the end of an incomplete cache file while it is still being written, and a panoply of other potential race conditions.

There are ways to mitigate this. For instance, renaming a file on POSIX systems is atomic, which can be leveraged to prevent contention. Alternatively, access to the database could be synchronized so that no requester can start an export while another is running.
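As a sketch of the first mitigation: writing to a temporary file in the same directory and then renaming it over the destination means readers either see the old complete file or the new complete file, never a partial one. The function name and paths here are hypothetical, not part of this branch.

```python
import os
import tempfile


def write_cache_atomically(path: str, data: bytes) -> None:
    """Write `data` to `path` without readers ever observing a partial file."""
    # The temp file must live in the same directory as the destination:
    # rename() is only atomic within a single filesystem.
    dir_name = os.path.dirname(path) or "."
    fd, tmp_path = tempfile.mkstemp(dir=dir_name)
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(data)
            f.flush()
            os.fsync(f.fileno())  # ensure the bytes hit disk before the rename
        # os.replace() is an atomic rename on POSIX and overwrites `path`.
        os.replace(tmp_path, path)
    except BaseException:
        os.unlink(tmp_path)
        raise
```

Concurrent writers can still waste work by exporting the same snapshot twice, but with this pattern the loser of the race simply overwrites the cache with an equally valid file.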

Still, concurrent programming is complicated. Even if all of those synchronization issues were easy to resolve (they aren't), the solutions involve SPMC queues (complexity), file-system locks (more complexity), or multiple threads doing random I/O, which is not only complex but accounts for the lion's share of the overhead that would prompt us to avoid concurrent COPY queries in the first place. Every solution I could propose compromises on correctness, throughput, simplicity, or ease of use.

I'm declaring this a case of premature optimization, and all of those potential avenues for enhancement out of scope for this PR. The most straightforward approach, by far, is simply not to hand out analyst authorization to anyone found to abuse this functionality. If we encounter issues, we can implement an endpoint that lets requesters check how many exports are in progress, guard exports with a semaphore, or accept a list of columns to export.

Unless and until such measures prove necessary, this branch begins a new export job every time a snapshot is requested. It delegates synchronization to Postgres, which simplifies the implementation and preserves correctness by keeping each requester's results private. It should still perform reasonably well, provided people don't start too many simultaneous COPY routines. We'll correct any problems that present themselves as we go.
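The fresh-export-per-request model can be sketched like this: each request writes to its own uniquely named file, so exports never read or overwrite each other's output, and the COPY itself runs under Postgres's snapshot semantics. The directory, table name, and helper names are hypothetical; the `conn` parameter is assumed to be an asyncpg connection.

```python
import uuid
from pathlib import Path

# Hypothetical location for export results.
EXPORT_DIR = Path("/tmp/exports")


def private_snapshot_path() -> Path:
    """Every request gets its own output file, so exports never contend."""
    return EXPORT_DIR / f"snapshot-{uuid.uuid4().hex}.csv"


async def export_snapshot(conn) -> Path:
    # `conn` is assumed to be an asyncpg connection. COPY executes within
    # a single Postgres snapshot, so consistency is delegated to the DB
    # rather than coordinated between application threads.
    dest = private_snapshot_path()
    EXPORT_DIR.mkdir(parents=True, exist_ok=True)
    await conn.copy_from_query(
        "SELECT * FROM demo_sessions",  # hypothetical table
        output=str(dest),
        format="csv",
    )
    return dest
```

The trade-off is exactly the one described above: no shared cache means no cache races, at the cost of repeating work when requests overlap.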