Aircloak / aircloak

This repository contains the Aircloak Air frontend as well as the code for our Cloak query and anonymization platform
2 stars 0 forks source link

Replace Erlang ODBC #2106

Closed sebastian closed 6 years ago

sebastian commented 6 years ago

The Erlang ODBC implementation is old, and doesn't seem to have received meaningful updates for years. This wouldn't have been a problem if it worked well. However it:

Using ODBC drivers as a means of integrating new backends is beneficial as database vendors usually have official ODBC drivers we can make use of rather than having to rely on third party Elixir libraries (if they exist at all).


Alternative approaches:

In general I am in favor of moving towards a solution that gives us type safety and makes it hard to produce runtime crashes. Especially if we integrate the solution as a NIF.

A solution based on Rust has the benefit that it could allow for moving further parts of the data processing path into rust later, and we could then keep the data in rust for longer before moving it into Erlang, avoiding unnecessary copying.

This change might make it into 18.2, or 18.3. It's an area where we should experiment and play around before settling on any final solution.

cristianberneanu commented 6 years ago

@sebastian This can be implemented either as a port or as a port driver. Port drivers are unsafe, but they should be faster. I think the current odbc library is implemented as a port and that is one of the main reasons for the slower performance. What is your preferred method? Also, what language should I use for the implementation?

sebastian commented 6 years ago

And why not a dirty scheduled NIF? As far as I understand they allow for async operations with streaming data too.

I would really like for us to use a language with a strong type system that prevents crashes and where I would trust myself to make changes too.

My current favourite candidate for the job would be rust. It doesn't impose runtime overhead and has a great type system. Could you give it a look? There is a library for making rust NIFs that deals with all the encoding and interop out of the box: https://github.com/hansihe/rustler, and here is an ODBC crate too: https://crates.io/crates/odbc

Do you have other suggestions when it comes to choice of language?

sebastian commented 6 years ago

Hm, a port driver might in fact be a better option that a NIF. And it seems certainly a better option than a pure port. However, then we'll have to find an efficient way to deal with a lot of data encoding and decoding ourselves. Maybe Cap'n Proto would be a candidate for that.

cristianberneanu commented 6 years ago

I don't think NIFs are the proper choice for this case. NIFs should be used for CPU-intensive tasks, this will be a IO-intensive task (and maybe secondary, a memory-intensive one). That is why the runtime overhead of the language doesn't matter as much.

sebastian commented 6 years ago

If we get good at rust then we could potentially move other bottlenecks over to rust over time too. I.e. have a port driver for the ODBC layer, and then use rust-based NIFs for parts of the anonymizer that currently slow us down (with the normal caveats of profiling and finding hotspots, and the algo being stable enough etc etc)

(i.e. the point I was trying to make was that it might be good to find a single new language we can choose to integrate, rather than fragmenting our system further. I would personally love for that language to be Haskell, but I don't think it's the most sensible choice, so I'll refrain from suggesting it.)

cristianberneanu commented 6 years ago

and then use rust-based NIFs for parts of the anonymizer

Well, you know how I feel about that, no reason to open that subject again :)

sebastian commented 6 years ago

Haha :) Let's skip that part of the conversation and bring it back to the ODBC driver. Do you have other preferences for language choice besides C/C++ that we should take into consideration?

cristianberneanu commented 6 years ago

Well, I feel most comfortable in C/C++ (and then maybe in C#, but I don't think it would be a great fit here), but I also don't mind at all giving Rust a try here.

obrok commented 6 years ago

If I may - please give rust a try. I think it would be much nicer to branch out into than C/C++

cristianberneanu commented 6 years ago

OK, Rust it is for now :)

sebastian commented 6 years ago

And let's please use the rust code formatter from day 1. Then we avoid later having the discussion we are currently having for elixir.

sebastian commented 6 years ago

And please add rust with asdf too.

cristianberneanu commented 6 years ago

I have a proof-of-concept version of the driver running, but I ran into some problems and decided I should write a progress report before going forward:

The short version is that I am not convinced Rust is the right tool for this job.

The long version:

I found the tooling in the Rust ecosystem good and the language has an interesting set of features. That being said, I also found plenty of shortcomings:

Regarding our specific problem:

For me personally, this was a good experience as I got to try a new ecosystem. But looking back, I think plain C would have been a better choice for the component, as one has more control and the codebase wouldn't be large enough to make the code review very hard.

Looking forward: I can either try to fix the issues in the odbc crate or rewrite the port driver in C. Right now, I suspect both have the same implementation complexity. Personally, even if I am not convinced yet by the attractiveness of Rust, I would prefer to go deeper into it and get a more informed view.

sebastian commented 6 years ago

Thanks for the sobering report.

Personally, even if I am not convinced yet by the attractiveness of Rust, I would prefer to go deeper into it and get a more informed view.

Feel free to dig further and see. If it doesn't improve then we'll have to consider alternatives.

cristianberneanu commented 6 years ago

I implemented a first draft of the port driver, added another data source driver for SQL Server that uses it, and these are the compliance tests values:

# min 5th 50th 95th 99th max
PostgreSQL (postgresql9.4) 1640 26 61 165 1012 3133 10064
MySQL (mysql5.7) 1624 25 61 190 1276 3134 20511
SQLServerRODBC (sqlserver2017_rodbc) 1632 31 80 201 1217 3065 13120
SQLServerTds (sqlserver2017_tds) 1600 28 77 219 5966 18859 49090
SQLServer (sqlserver2017_odbc) 1600 88 156 365 2087 4165 14382

The data transfer can probably be optimized further, not sure if we need to invest more time into it at the moment. My next step is to also implement a data source driver over it for SAP Hana.

Do you want to keep the drivers that use the standard odbc library for a while longer in the codebase (in case some problem appears) or do you want them replaced by the new version?

sebastian commented 6 years ago

Great work! It's nice to see the SQL Server now being more or less on par with the Postgres service in terms of performance!

Do you want to keep the drivers that use the standard odbc library for a while longer in the codebase (in case some problem appears) or do you want them replaced by the new version?

Let's keep them for the time being, but schedule to remove them with the subsequent release. It's good to have a set of variants to use when things get hairy in real deployments.

cristianberneanu commented 6 years ago

With the addition of #2453 I consider this complete, at least for now.