diffix / desktop

Cross-platform desktop application for data anonymization with Open Diffix Elm.
https://www.open-diffix.org

Frontend <> Backend communication #20

Closed · cristianberneanu closed this issue 3 years ago

cristianberneanu commented 3 years ago

We need to agree on the way the Frontend communicates with the Backend.

Since transpiling the reference code to JS resulted in poor performance, the anonymization code will stay in dotnet. Furthermore, I don't think it is a good idea to manually build the query AST in JS land. It couples the Frontend and Backend internals too much. Sending a SQL statement feels cleaner.

As input we send the filename, the query statement, and the anonymization settings; as output we get the query result or an error.
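For illustration, the contract could look roughly like this in TypeScript (every name here is hypothetical, not an agreed API):

```typescript
// Sketch of the proposed request/response contract. All identifiers are
// illustrative; the actual shape still needs to be agreed on.
interface AnonymizationRequest {
  fileName: string;                   // path to the CSV file to query
  query: string;                      // SQL statement to execute
  settings: Record<string, unknown>;  // anonymization settings, shape TBD
}

type AnonymizationResponse =
  | { success: true; result: unknown }  // query result (CSV or JSON rows)
  | { success: false; error: string };  // error message from the backend
```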

Option 1: anonymize using the CLI.

We pass the input as command-line arguments and get back the query result (as either CSV or JSON) on the stdout stream, or an error on the stderr stream (see the sketch after the lists below).

PROs:

CONs:
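For concreteness, a minimal sketch of driving the CLI from the Electron side, assuming a hypothetical `OpenDiffix.CLI` binary and made-up flag names:

```typescript
import { execFile } from "child_process";

// Hypothetical Option 1 wrapper: pass the input as command-line arguments,
// resolve with stdout (CSV or JSON), reject with stderr on failure.
// Binary name and flags are illustrative only.
function anonymizeViaCli(
  fileName: string,
  query: string,
  settingsJson: string
): Promise<string> {
  return new Promise((resolve, reject) => {
    execFile(
      "OpenDiffix.CLI",
      ["--file", fileName, "--query", query, "--settings", settingsJson],
      (error, stdout, stderr) => {
        if (error) reject(new Error(stderr || error.message));
        else resolve(stdout);
      }
    );
  });
}
```

One consequence of this model is that a fresh backend process is spawned for every query, so any startup cost is paid each time.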

Option 2: anonymize using IPC.

We will need an additional .NET project in this repository that loads the core reference library and dispatches anonymization requests to it. We pass the input as a JSON object and get back a JSON object with the result or an error. We need to decide whether we use a socket or the process's stdio streams for message exchange (a sketch of the stdio variant follows the lists below).

PROs:

CONs:
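A sketch of what the stdio variant might look like from the frontend, using newline-delimited JSON; the service name and message framing are assumptions, not decisions:

```typescript
import { spawn } from "child_process";
import { createInterface } from "readline";

// Hypothetical Option 2 transport: a long-lived backend process answering
// newline-delimited JSON requests over its stdio streams.
const backend = spawn("OpenDiffix.Service", [], {
  stdio: ["pipe", "pipe", "inherit"],
});
const lines = createInterface({ input: backend.stdout! });

let nextId = 0;
const pending = new Map<number, (response: unknown) => void>();

// Each response is assumed to carry the id of the request it answers.
lines.on("line", (line) => {
  const message = JSON.parse(line);
  pending.get(message.id)?.(message);
  pending.delete(message.id);
});

function sendRequest(payload: object): Promise<unknown> {
  const id = nextId++;
  backend.stdin!.write(JSON.stringify({ id, ...payload }) + "\n");
  return new Promise((resolve) => pending.set(id, resolve));
}
```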


I am slightly in favor of Option 1 (I don't consider its drawbacks too big).

sebastian commented 3 years ago

> I don't think it is a good idea to manually build the query AST in JS land. It couples the Frontend and Backend internals too much. Sending a SQL statement feels cleaner.

Yes, building the AST in JS only made sense as long as the AST could immediately be executed there too.

sebastian commented 3 years ago

I vote for Option 1 too.

I additionally vote for using JSON as the output, as it's easier to consume in the frontend than parsing some CSV output.

We can live without progress reports, and if we need them later we can get hacky then.
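To illustrate the difference: with JSON output the frontend consumes the result in a single call (assuming rows come back as an array of objects, which is itself still open):

```typescript
// Assumed shape: an array of row objects. With JSON, consuming the result
// is one call; CSV would need a parser library plus decisions about
// quoting, separators, and type coercion.
function parseResult(stdout: string): Array<Record<string, unknown>> {
  return JSON.parse(stdout);
}
```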

edongashi commented 3 years ago

Do we drop the JS CSV parser? If yes, do we use the backend to figure out the shape of the data when we load a file? If not, we need to use two different CSV libraries, each of which may have its own tiny differences.

sebastian commented 3 years ago

> Do we drop the JS CSV parser? If yes, do we use the backend to figure out the shape of the data when we load a file? If not, we need to use two different CSV libraries, each of which may have its own tiny differences.

Good point, @edongashi.

We either need another parser for the GUI, or we need to extend the Reference with an endpoint that returns a schema... In either case, as long as we want to support CSV, it seems the CLI interface must be extended to support providing a schema as part of the input too!?

cristianberneanu commented 3 years ago

I say we do the CSV parsing only in the backend/reference tool. To load the initial raw data (including the schema), the frontend could issue a standard `SELECT * FROM 'file_name'` query.
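A sketch of that idea, reusing the hypothetical `anonymizeViaCli` wrapper from the Option 1 sketch above and assuming the result comes back as a JSON array of row objects:

```typescript
// Hypothetical schema discovery: run a plain SELECT * against the file and
// read the column names off the first row of the JSON result.
async function loadSchema(fileName: string): Promise<string[]> {
  const output = await anonymizeViaCli(
    fileName,
    `SELECT * FROM '${fileName}'`,
    "{}" // settings irrelevant for schema discovery in this sketch
  );
  const rows: Array<Record<string, unknown>> = JSON.parse(output);
  return rows.length > 0 ? Object.keys(rows[0]) : [];
}
```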

cristianberneanu commented 3 years ago

This seems settled (at least for now).