@sebastian This can be implemented either as a port or as a port driver. Port drivers are unsafe, but they should be faster. I think the current `odbc` library is implemented as a port and that is one of the main reasons for the slower performance. What is your preferred method?
Also, what language should I use for the implementation?
And why not a dirty scheduled NIF? As far as I understand they allow for async operations with streaming data too.
I would really like for us to use a language with a strong type system that prevents crashes and where I would trust myself to make changes too.
My current favourite candidate for the job would be rust. It doesn't impose runtime overhead and has a great type system. Could you give it a look? There is a library for making rust NIFs that deals with all the encoding and interop out of the box: https://github.com/hansihe/rustler, and here is an ODBC crate too: https://crates.io/crates/odbc
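To make that concrete, something like the following Rustler sketch is roughly what I have in mind. The module and function names are placeholders, and the exact attribute syntax depends on the Rustler version; treat it as a shape, not an implementation:

```rust
// Rough sketch only; `run_query` and the Elixir module name are placeholders.

// A NIF doing blocking I/O should run on a dirty scheduler, which
// Rustler exposes through the `schedule` attribute.
#[rustler::nif(schedule = "DirtyIo")]
fn run_query(connection_string: String, query: String) -> Result<Vec<String>, String> {
    // A real implementation would open an ODBC connection here
    // (e.g. via the `odbc` crate) and stream the result set back.
    let _ = (connection_string, query);
    Err("not implemented yet".to_string())
}

// Registers the NIF under a (placeholder) Elixir module name.
rustler::init!("Elixir.OdbcNif", [run_query]);
```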
Do you have other suggestions when it comes to choice of language?
Hm, a port driver might in fact be a better option than a NIF. And it certainly seems a better option than a pure port. However, then we'll have to find an efficient way to deal with a lot of data encoding and decoding ourselves. Maybe Cap'n Proto would be a candidate for that.
I don't think NIFs are the proper choice for this case. NIFs should be used for CPU-intensive tasks, while this will be an IO-intensive task (and maybe, secondarily, a memory-intensive one). That is why the runtime overhead of the language doesn't matter as much.
If we get good at rust, then we could potentially move other bottlenecks over to rust over time too. I.e. have a port driver for the ODBC layer, and then use rust-based NIFs for parts of the anonymizer that currently slow us down (with the normal caveats of profiling and finding hotspots, the algorithm being stable enough, etc.).
(i.e. the point I was trying to make was that it might be good to find a single new language we can choose to integrate, rather than fragmenting our system further. I would personally love for that language to be Haskell, but I don't think it's the most sensible choice, so I'll refrain from suggesting it.)
> and then use rust-based NIFs for parts of the anonymizer
Well, you know how I feel about that, no reason to open that subject again :)
Haha :) Let's skip that part of the conversation and bring it back to the ODBC driver. Do you have other preferences for language choice besides C/C++ that we should take into consideration?
Well, I feel most comfortable in C/C++ (and then maybe in C#, but I don't think it would be a great fit here), but I also don't mind at all giving Rust a try here.
If I may: please give Rust a try. I think it would be much nicer to branch out into than C/C++.
OK, Rust it is for now :)
And let's please use the Rust code formatter from day 1. That way we avoid having, later on, the discussion we are currently having for Elixir.
And please add rust with asdf too.
I have a proof-of-concept version of the driver running, but I ran into some problems and decided I should write a progress report before going forward:
The short version is that I am not convinced Rust is the right tool for this job.
The long version:
I found the tooling in the Rust ecosystem good and the language has an interesting set of features. That being said, I also found plenty of shortcomings:
Rust is a complex language and the learning curve is high.
The ecosystem seems immature: there are many samples and libraries written in 2014, 2015 or 2016 that are obsolete now.
The language has high friction: convincing the compiler the code is correct is a continuous struggle, at least for novice programmers. One has to add a lot of annotations to the code, which is annoying (a small illustration follows below). This is similar to static verifiers for C code, which have existed for some time and have similar problems (although the fact that the language was designed with verification in mind attenuates some of them).
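To show the kind of annotations I mean (a generic example, not code from the driver): even a trivial function returning a reference needs explicit lifetime annotations before the borrow checker accepts it.

```rust
// Generic illustration of borrow-checker annotations; not driver code.

// Without the explicit lifetime parameter `'a`, this signature is rejected:
// the compiler cannot tell whether the returned slice borrows from
// `haystack` or from `needle`.
fn longest<'a>(haystack: &'a str, needle: &'a str) -> &'a str {
    if haystack.len() >= needle.len() {
        haystack
    } else {
        needle
    }
}

fn main() {
    let a = String::from("select * from users");
    let b = String::from("users");
    // Both borrows must stay alive as long as the returned reference is used.
    println!("{}", longest(&a, &b));
}
```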
Regarding our specific problem:
In order to interact with the C code, one has to write a lot of `unsafe` code. This negates many of the safety features, leads to the same kind of bugs that plague classical native code and also requires pretty deep C knowledge.
This problem is present on both sides of the component: on the interaction with the Erlang VM (for which we need to generate wrapper libraries from C headers) and on the interaction with the `odbc` libraries (where I already encountered a crash that took me a while to debug: https://users.rust-lang.org/t/undefined-behaviour-after-unsafe-enum-usage/15572). The remaining code is not big enough for the safety benefits to kick in.
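To give a feeling for it, here is a minimal sketch of such an FFI call. The C function and its signature are invented for illustration; the real bindings go against the ODBC and erl_driver headers:

```rust
use std::os::raw::{c_char, c_int};

// Hypothetical C function, for illustration only; in the real driver the
// declarations come from (or are generated from) the ODBC / erl_driver headers.
extern "C" {
    fn fetch_row(handle: *mut c_int, buffer: *mut c_char, buffer_len: c_int) -> c_int;
}

fn read_row(handle: *mut c_int) -> Option<Vec<u8>> {
    let mut buffer = vec![0u8; 4096];
    // The call itself must be wrapped in `unsafe`: the compiler can no longer
    // guarantee that `handle` is valid, that the buffer is large enough, or
    // that the C side respects `buffer_len`. Bugs here look exactly like
    // classical native-code bugs (out-of-bounds writes, use-after-free, ...).
    let written = unsafe {
        fetch_row(handle, buffer.as_mut_ptr() as *mut c_char, buffer.len() as c_int)
    };
    if written < 0 {
        None
    } else {
        buffer.truncate(written as usize);
        Some(buffer)
    }
}
```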
The `odbc` library for Rust has some big shortcomings: besides some basic primitive types, it retrieves everything else as strings, it doesn't handle some edge cases (sometimes returning `null` for large strings), and it doesn't allow non-UTF-8 strings (making retrieving arbitrary binary data impossible).
For me personally, this was a good experience as I got to try a new ecosystem. But looking back, I think plain C would have been a better choice for the component, as one has more control and the codebase wouldn't be large enough to make the code review very hard.
Looking forward: I can either try to fix the issues in the `odbc` crate or rewrite the port driver in C. Right now, I suspect both have the same implementation complexity. Personally, even if I am not convinced yet by the attractiveness of Rust, I would prefer to go deeper into it and get a more informed view.
Thanks for the sobering report.
> Personally, even if I am not convinced yet by the attractiveness of Rust, I would prefer to go deeper into it and get a more informed view.
Feel free to dig further and see. If it doesn't improve then we'll have to consider alternatives.
I implemented a first draft of the port driver and added another data source driver for SQL Server that uses it; these are the compliance test values:
| Driver | # | min | 5th | 50th | 95th | 99th | max |
|---|---|---|---|---|---|---|---|
| PostgreSQL (postgresql9.4) | 1640 | 26 | 61 | 165 | 1012 | 3133 | 10064 |
| MySQL (mysql5.7) | 1624 | 25 | 61 | 190 | 1276 | 3134 | 20511 |
| SQLServerRODBC (sqlserver2017_rodbc) | 1632 | 31 | 80 | 201 | 1217 | 3065 | 13120 |
| SQLServerTds (sqlserver2017_tds) | 1600 | 28 | 77 | 219 | 5966 | 18859 | 49090 |
| SQLServer (sqlserver2017_odbc) | 1600 | 88 | 156 | 365 | 2087 | 4165 | 14382 |
The data transfer can probably be optimized further, but I'm not sure we need to invest more time into it at the moment. My next step is to also implement a data source driver on top of it for SAP HANA.
Do you want to keep the drivers that use the standard `odbc` library for a while longer in the codebase (in case some problem appears) or do you want them replaced by the new version?
Great work! It's nice to see the SQL Server now being more or less on par with the Postgres service in terms of performance!
> Do you want to keep the drivers that use the standard odbc library for a while longer in the codebase (in case some problem appears) or do you want them replaced by the new version?
Let's keep them for the time being, but schedule to remove them with the subsequent release. It's good to have a set of variants to use when things get hairy in real deployments.
With the addition of #2453 I consider this complete, at least for now.
The Erlang ODBC implementation is old and doesn't seem to have received meaningful updates for years. This wouldn't have been a problem if it worked well. However, it doesn't: among other things, its performance is poor.
Using ODBC drivers as a means of integrating new backends is beneficial as database vendors usually have official ODBC drivers we can make use of rather than having to rely on third party Elixir libraries (if they exist at all).
Alternative approaches include using one of the Rust ODBC crates (`odbc-sys`, or `odbc`, a wrapper around `odbc-sys`) in combination with something like Rustler to make it available as a NIF.

In general I am in favor of moving towards a solution that gives us type safety and makes it hard to produce runtime crashes, especially if we integrate the solution as a NIF.
A solution based on Rust has the benefit that it could allow for moving further parts of the data processing path into Rust later, and we could then keep the data in Rust for longer before moving it into Erlang, avoiding unnecessary copying.
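As a rough sketch of what keeping the data in Rust for longer could look like (hypothetical types and function names; the resource registration details differ between Rustler versions), the result set could live behind a Rustler resource and only be turned into Erlang terms in chunks, on demand:

```rust
use rustler::{Env, ResourceArc, Term};
use std::sync::Mutex;

// Hypothetical result set that stays on the Rust side; Erlang only
// holds an opaque resource handle to it.
struct ResultSet {
    rows: Mutex<Vec<Vec<String>>>,
}

// Runs the query but returns an opaque handle instead of copying all
// rows into Erlang terms immediately.
#[rustler::nif(schedule = "DirtyIo")]
fn load_all(query: String) -> ResourceArc<ResultSet> {
    // A real implementation would execute `query` through ODBC here.
    let _ = query;
    ResourceArc::new(ResultSet { rows: Mutex::new(Vec::new()) })
}

// Copies only the requested chunk of rows into the VM when it is needed.
#[rustler::nif]
fn next_rows(set: ResourceArc<ResultSet>, count: usize) -> Vec<Vec<String>> {
    let mut rows = set.rows.lock().unwrap();
    let n = count.min(rows.len());
    rows.drain(..n).collect()
}

// Resource registration at load time; the exact macro varies between
// Rustler versions.
fn load(env: Env, _info: Term) -> bool {
    rustler::resource!(ResultSet, env);
    true
}

rustler::init!("Elixir.OdbcResultSet", [load_all, next_rows], load = load);
```

On the Erlang/Elixir side the handle would just be an opaque reference, so result streaming could be driven incrementally without re-encoding the whole result set on every call.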
This change might make it into 18.2, or 18.3. It's an area where we should experiment and play around before settling on any final solution.