@sebastian This can be implemented either as a port or as a port driver. Port drivers are unsafe, but they should be faster. I think the current `odbc` library is implemented as a port and that is one of the main reasons for the slower performance. What is your preferred method?
Also, what language should I use for the implementation?
And why not a dirty scheduled NIF? As far as I understand they allow for async operations with streaming data too.
I would really like for us to use a language with a strong type system that prevents crashes and where I would trust myself to make changes too.
My current favourite candidate for the job would be rust. It doesn't impose runtime overhead and has a great type system. Could you give it a look? There is a library for making rust NIFs that deals with all the encoding and interop out of the box: https://github.com/hansihe/rustler, and here is an ODBC crate too: https://crates.io/crates/odbc
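To make that concrete, something like the following Rustler sketch is roughly what I have in mind. The module and function names are placeholders, and the exact attribute syntax depends on the Rustler version; treat it as a shape, not an implementation:

```rust
// Rough sketch only; `run_query` and the Elixir module name are placeholders.

// A NIF doing blocking I/O should run on a dirty scheduler, which
// Rustler exposes through the `schedule` attribute.
#[rustler::nif(schedule = "DirtyIo")]
fn run_query(connection_string: String, query: String) -> Result<Vec<String>, String> {
    // A real implementation would open an ODBC connection here
    // (e.g. via the `odbc` crate) and stream the result set back.
    let _ = (connection_string, query);
    Err("not implemented yet".to_string())
}

// Registers the NIF under a (placeholder) Elixir module name.
rustler::init!("Elixir.OdbcNif", [run_query]);
```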
Do you have other suggestions when it comes to choice of language?
Hm, a port driver might in fact be a better option than a NIF. And it certainly seems a better option than a pure port. However, then we'll have to find an efficient way to deal with a lot of data encoding and decoding ourselves. Maybe Cap'n Proto would be a candidate for that.
I don't think NIFs are the proper choice for this case. NIFs should be used for CPU-intensive tasks, while this will be an IO-intensive task (and maybe, secondarily, a memory-intensive one). That is why the runtime overhead of the language doesn't matter as much.
If we get good at rust, then we could potentially move other bottlenecks over to rust over time too. I.e. have a port driver for the ODBC layer, and then use rust-based NIFs for parts of the anonymizer that currently slow us down (with the normal caveats of profiling and finding hotspots, the algorithm being stable enough, etc.).
(i.e. the point I was trying to make was that it might be good to find a single new language we can choose to integrate, rather than fragmenting our system further. I would personally love for that language to be Haskell, but I don't think it's the most sensible choice, so I'll refrain from suggesting it.)
> and then use rust-based NIFs for parts of the anonymizer
Well, you know how I feel about that, no reason to open that subject again :)
Haha :) Let's skip that part of the conversation and bring it back to the ODBC driver. Do you have other preferences for language choice besides C/C++ that we should take into consideration?
Well, I feel most comfortable in C/C++ (and then maybe in C#, but I don't think it would be a great fit here), but I also don't mind at all giving Rust a try here.
If I may: please give Rust a try. I think it would be much nicer to branch out into than C/C++.
OK, Rust it is for now :)
And let's please use the Rust code formatter from day 1. That way we avoid having, later on, the discussion we are currently having for Elixir.
And please add rust with asdf too.
I have a proof-of-concept version of the driver running, but I ran into some problems and decided I should write a progress report before going forward:
The short version is that I am not convinced Rust is the right tool for this job.
The long version:
I found the tooling in the Rust ecosystem good and the language has an interesting set of features. That being said, I also found plenty of shortcomings:
Rust is a complex language and the learning curve is high.
The ecosystem seems immature: there are many samples and libraries written in 2014, 2015 or 2016 that are obsolete now.
The language has high friction: convincing the compiler the code is correct is a continuous struggle, at least for novice programmers. One has to add a lot of annotations to the code, which is annoying (a small illustration follows below). This is similar to static verifiers for C code, which have existed for some time and have similar problems (although the fact that the language was designed with verification in mind attenuates some of them).
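To show the kind of annotations I mean (a generic example, not code from the driver): even a trivial function returning a reference needs explicit lifetime annotations before the borrow checker accepts it.

```rust
// Generic illustration of borrow-checker annotations; not driver code.

// Without the explicit lifetime parameter `'a`, this signature is rejected:
// the compiler cannot tell whether the returned slice borrows from
// `haystack` or from `needle`.
fn longest<'a>(haystack: &'a str, needle: &'a str) -> &'a str {
    if haystack.len() >= needle.len() {
        haystack
    } else {
        needle
    }
}

fn main() {
    let a = String::from("select * from users");
    let b = String::from("users");
    // Both borrows must stay alive as long as the returned reference is used.
    println!("{}", longest(&a, &b));
}
```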
Regarding our specific problem:
In order to interact with the C code, one has to write a lot of `unsafe` code. This negates many of the safety features, leads to the same kind of bugs that plague classical native code and also requires pretty deep C knowledge.
This problem is present on both sides of the component: on the interaction with the Erlang VM (for which we need to generate wrapper libraries from C headers) and on the interaction with the `odbc` libraries (where I already encountered a crash that took me a while to debug: https://users.rust-lang.org/t/undefined-behaviour-after-unsafe-enum-usage/15572). The remaining code is not big enough for the safety benefits to kick in.
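To give a feeling for it, here is a minimal sketch of such an FFI call. The C function and its signature are invented for illustration; the real bindings go against the ODBC and erl_driver headers:

```rust
use std::os::raw::{c_char, c_int};

// Hypothetical C function, for illustration only; in the real driver the
// declarations come from (or are generated from) the ODBC / erl_driver headers.
extern "C" {
    fn fetch_row(handle: *mut c_int, buffer: *mut c_char, buffer_len: c_int) -> c_int;
}

fn read_row(handle: *mut c_int) -> Option<Vec<u8>> {
    let mut buffer = vec![0u8; 4096];
    // The call itself must be wrapped in `unsafe`: the compiler can no longer
    // guarantee that `handle` is valid, that the buffer is large enough, or
    // that the C side respects `buffer_len`. Bugs here look exactly like
    // classical native-code bugs (out-of-bounds writes, use-after-free, ...).
    let written = unsafe {
        fetch_row(handle, buffer.as_mut_ptr() as *mut c_char, buffer.len() as c_int)
    };
    if written < 0 {
        None
    } else {
        buffer.truncate(written as usize);
        Some(buffer)
    }
}
```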
The `odbc` library for Rust has some big shortcomings: besides some basic primitive types, it retrieves everything else as strings, it doesn't handle some edge cases (sometimes returning `null` for large strings), and it doesn't allow non-UTF-8 strings (making retrieving arbitrary binary data impossible).
For me personally, this was a good experience as I got to try a new ecosystem. But looking back, I think plain C would have been a better choice for the component, as one has more control and the codebase wouldn't be large enough to make the code review very hard.
Looking forward: I can either try to fix the issues in the `odbc` crate or rewrite the port driver in C. Right now, I suspect both have the same implementation complexity. Personally, even if I am not convinced yet by the attractiveness of Rust, I would prefer to go deeper into it and get a more informed view.
Thanks for the sobering report.
> Personally, even if I am not convinced yet by the attractiveness of Rust, I would prefer to go deeper into it and get a more informed view.
Feel free to dig further and see. If it doesn't improve then we'll have to consider alternatives.
I implemented a first draft of the port driver and added another data source driver for SQL Server that uses it; these are the compliance test values:
| Driver | # | min | 5th | 50th | 95th | 99th | max |
|---|---|---|---|---|---|---|---|
| PostgreSQL (postgresql9.4) | 1640 | 26 | 61 | 165 | 1012 | 3133 | 10064 |
| MySQL (mysql5.7) | 1624 | 25 | 61 | 190 | 1276 | 3134 | 20511 |
| SQLServerRODBC (sqlserver2017_rodbc) | 1632 | 31 | 80 | 201 | 1217 | 3065 | 13120 |
| SQLServerTds (sqlserver2017_tds) | 1600 | 28 | 77 | 219 | 5966 | 18859 | 49090 |
| SQLServer (sqlserver2017_odbc) | 1600 | 88 | 156 | 365 | 2087 | 4165 | 14382 |
The data transfer can probably be optimized further, but I'm not sure we need to invest more time into it at the moment. My next step is to also implement a data source driver on top of it for SAP HANA.
Do you want to keep the drivers that use the standard `odbc` library for a while longer in the codebase (in case some problem appears) or do you want them replaced by the new version?
Great work! It's nice to see the SQL Server now being more or less on par with the Postgres service in terms of performance!
> Do you want to keep the drivers that use the standard odbc library for a while longer in the codebase (in case some problem appears) or do you want them replaced by the new version?
Let's keep them for the time being, but schedule to remove them with the subsequent release. It's good to have a set of variants to use when things get hairy in real deployments.
With the addition of #2453 I consider this complete, at least for now.
The Erlang ODBC implementation is old and doesn't seem to have received meaningful updates for years. This wouldn't have been a problem if it worked well. However, it doesn't: among other things, its performance is poor.
Using ODBC drivers as a means of integrating new backends is beneficial as database vendors usually have official ODBC drivers we can make use of rather than having to rely on third party Elixir libraries (if they exist at all).
Alternative approaches include using one of the Rust ODBC crates (`odbc-sys`, or `odbc`, a wrapper around `odbc-sys`) in combination with something like Rustler to make it available as a NIF.

In general I am in favor of moving towards a solution that gives us type safety and makes it hard to produce runtime crashes, especially if we integrate the solution as a NIF.
A solution based on Rust has the benefit that it could allow for moving further parts of the data processing path into Rust later, and we could then keep the data in Rust for longer before moving it into Erlang, avoiding unnecessary copying.
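As a rough sketch of what keeping the data in Rust for longer could look like (hypothetical types and function names; the resource registration details differ between Rustler versions), the result set could live behind a Rustler resource and only be turned into Erlang terms in chunks, on demand:

```rust
use rustler::{Env, ResourceArc, Term};
use std::sync::Mutex;

// Hypothetical result set that stays on the Rust side; Erlang only
// holds an opaque resource handle to it.
struct ResultSet {
    rows: Mutex<Vec<Vec<String>>>,
}

// Runs the query but returns an opaque handle instead of copying all
// rows into Erlang terms immediately.
#[rustler::nif(schedule = "DirtyIo")]
fn load_all(query: String) -> ResourceArc<ResultSet> {
    // A real implementation would execute `query` through ODBC here.
    let _ = query;
    ResourceArc::new(ResultSet { rows: Mutex::new(Vec::new()) })
}

// Copies only the requested chunk of rows into the VM when it is needed.
#[rustler::nif]
fn next_rows(set: ResourceArc<ResultSet>, count: usize) -> Vec<Vec<String>> {
    let mut rows = set.rows.lock().unwrap();
    let n = count.min(rows.len());
    rows.drain(..n).collect()
}

// Resource registration at load time; the exact macro varies between
// Rustler versions.
fn load(env: Env, _info: Term) -> bool {
    rustler::resource!(ResultSet, env);
    true
}

rustler::init!("Elixir.OdbcResultSet", [load_all, next_rows], load = load);
```

On the Erlang/Elixir side the handle would just be an opaque reference, so result streaming could be driven incrementally without re-encoding the whole result set on every call.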
This change might make it into 18.2, or 18.3. It's an area where we should experiment and play around before settling on any final solution.