jonas32 opened 3 years ago
In general, the async implementation seems to work. There is still one bug somewhere that causes the CI to get stuck (see https://github.com/aerospike/aerospike-client-rust/runs/3326575266). Please cancel this CI run, since it does not look like it will stop by itself and I'm not able to stop it.
I'm not sure yet where this comes from, since the CI does not give me much info about it and I'm not able to reproduce it locally. Maybe it's related to the CI setup. Please pull this version and try to run the async-std tests locally a few times so I get to see whether it's related to the client or to the CI env.
The async-std CI bug is fixed. It looks like a limitation on concurrent connections in the CI env. Since the lazy static solution with the client instance no longer works due to async, every test needs to open its own instance of the Client. There was a problem in the tests for expressions, where a new client was opened for every single expression.
The benchmark tool is still broken. Sadly, the bencher lib does not support the async setup. Testing against the sync API would be useless since the sync setup is just a block_on wrapper to the async one. That just adds a lot of overhead and slows it down.
Please also check the code and tell me if you find anything that I might have forgotten. I think the optimization is good, but surely there is more potential in it. (Ignore batch reads for now; I'm still working on that. The current implementation is still a more or less hacky workaround to get it working.)
The client API might still be inconsistent between a few functions. That's mainly related to moving values between threads. It removes some comfort features from the related functions (operate, query and batch). I would like to keep it that way so this comfort can stay for the other functions, but that's up to you.
Right now I hardcoded an older Aerospike version (5.5.0.2) in the CI file to make it run without the required partitioning update. That needs to be removed later.
Ready to review from my side.
I'm a little unsure about the other open PRs. This one moved the whole client to the aerospike-core folder. In theory, git tracked the files as moved and not recreated, so I guess it should also be able to correlate the changes. If not, #100 and #107 need to be changed for this. Please don't merge them before this one; I think a conflict here would be much more work... #105 would be broken because the cargo file was not moved and the updated dependencies were partly removed.
@khaf you said you are working on a rework of the partitions. If you already have anything running there, would you please share it as a draft, for example on its own branch, so I can check whether it fits in without a problem or even integrate it already?
I think unit tests and a new benchmark should be implemented, but I will do that later in another PR. I guess there is already enough to review here with 117 changed files.
attempting to try this stuff out with
aerospike = { git = "https://github.com/asynos/aerospike-client-rust/", branch = "async", features = ["rt-async-std"] }
but am getting
the package `...` depends on `aerospike`, with features: `aerospike-macro` but `aerospike` does not have these features.
Can you check if there is something cached in your system? This looks like it tries to load the old, sync client.
@jonas32 clearing the cargo cache didn't do it
it works when I do
aerospike-core = { git = "https://github.com/asynos/aerospike-client-rust/", branch = "async", features = ["rt-async-std"] }
(adding the -core) and then in the code I do
use aerospike_core::{self as aerospike};
Fixed. Looks like cargo does not like it when I load member crates as a dev dependency. It's now always loading aerospike-macro. For me it's compiling correctly now.
@khaf do you have any timeline for when you are going to review this?
I've been mostly sleeping on it, since we are experimenting more with async implementations in our ecosystem and learning new things as we go. The latest case we have learned about is that a difference in timeout quantization between transactions can lead to significant performance penalties. It appears that sharing the same state machine (/ event loop) between transactions with wildly different timeouts should be avoided. To be able to do that, the API should let the user somehow target different event loops. To my understanding, the current API in this PR seems to lack such support, and it's not immediately clear to me whether the underlying design and libraries include such a feature. Do you have some insights to share in this regard?
Can you give me some more info on the case where you discovered that? I don't really understand what the point of having multiple runtimes is. Normally, that should only be a problem if you start blocking somewhere. You can see in multiple spots in this PR that I even removed the timeout parameters for operations, since having timeouts is not really needed with an async runtime. Timeouts make sense when you need to cancel something in case it hangs up, so you "free the space" for something new. Something that would be able to block that space would be blocking and therefore not async.
Sorry, I got distracted again. The issue is that when you use a single event loop (mapped to an epoll instance in the case of Linux) and mix commands with two significantly different timeout values (e.g. one with t and the other t*5), the event queue fills up with the entries with the longer timeout and reduces capacity for the transactions with the shorter one, increasing the 99th percentile latency and decreasing throughput. From our limited benchmarks on EC2, the penalty can be up to 2x. Regarding removing timeouts: of course you'll need them to remove entries from the event queue and to precisely control when and how the flow goes back to the user; otherwise you'll leak entries, including file descriptors for connections and whatnot. Every entry (async request) takes memory and is looped over until it is resolved somehow behind the scenes.
First to go back to your original question about the loops: In general this is not a problem of the client. The client itself never starts any runtime. It just creates futures. Where and how they are executed is handled by the code that calls the client lib. This is something the user has to take care of.
I started to investigate whether the client has similar problems. And it actually does (only under high load). ~~To be exact, strace says that a Mutex gets locked and never unlocked. I'm not 100% sure where and why, but it's probably the connection pool's SharedQueue internal Mutex. I searched the whole codebase multiple times now to see if anything blocks (see the last commit). This is extremely strange behavior. It's not like the queue just slows down, it's a full deadlock on all threads, even freezing the tend thread. There should really not be any reason why this is permanently locked. Have you seen such behavior anywhere, or do you have any idea why this might happen?~~
The problem is exactly what you just described: filling up all the available threads of the runtime with long-running ops (TCP connection startup is blocking), so no futures will be executed anymore. I'll have a look at how to solve that.
The obvious workaround is to give the user the ability to fine tune the client by being able to assign different commands to different event loops. The first step would be to find out if the underlying libraries allow such tweaking. If they do, we can then discuss different mechanisms for the API for convenience and flexibility. Otherwise, we will have to take our case to the library developers.
In general you are right. But I would like to avoid doing that, since it would not really make sense in the Rust ecosystem.
More exact description of the current problem: the TcpStream connection process looks like it's blocking. After connecting it's async, non-blocking and fine, but while connecting it blocks the whole thread. Tokio starts with one event-loop thread per core of the host system; in my case that's 6. As soon as the client opens the 6th connection, all threads have a TCP stream connecting and are fully blocked. Starting more runtimes would not really solve that, since I would have to start 256 runtimes per node. This overhead is not acceptable in my opinion. On the other hand, this problem only exists if you run a single simple runtime on the user side. If you run this for example on top of actix, you will probably never hit this problem. That is because the client library does not decide at all on which runtime the futures are scheduled. This is the decision of the runtime that the user starts. You can manipulate where and how it will be scheduled from the user side, depending on the context where the operation is executed. For example, actix has worker threads with independent runtimes. Even if the Client initially started in the main runtime, the client interaction is done on the runtime of each worker. Since our problem here is the start of new connections under load, this would also always happen under the worker runtime.
Starting any runtimes in the client itself would break a lot of compatibility (for example with actix) and add a huge complexity into the client. I would prefer finding a way to start the threads without blocking. As a workaround you can set the client policy to max_conns = cpu_num-1. And even with that "downgrade" the client is still extremely fast. I wrote more than 10k records in under 1s.
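To illustrate the failure mode described above (a minimal sketch, not code from this PR or the client): a blocking std connect inside an async fn pins a whole runtime worker thread for the duration of the TCP handshake, while the tokio variant suspends the future and leaves the worker free.
use std::io;

// Sketch only: the blocking variant ties up one runtime worker thread for the
// whole TCP handshake, which is what starves the event loops described above.
async fn connect_blocking(addr: &str) -> io::Result<std::net::TcpStream> {
    std::net::TcpStream::connect(addr) // blocks the current worker thread
}

// The async variant registers with the reactor and yields until the socket is
// ready, so other futures keep running on the same thread.
async fn connect_nonblocking(addr: &str) -> io::Result<tokio::net::TcpStream> {
    tokio::net::TcpStream::connect(addr).await
}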
Found the problem. Now the issue is that the client's retries are too fast. It tries to get a connection from the pool 4 times; after that, the operation fails. Since the pool is non-locking now, the 256 conns are nearly instantly started and a few operations will fail within their 4 retries. I could only reproduce it with more than 8k concurrent writes, and it depends on how fast the server is. The simplest solution would be to stop retrying on an iteration basis and just use timeouts. What do you think about that?
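A rough sketch of what retrying against a deadline instead of an iteration count could look like (the pool is abstracted behind a hypothetical try_get closure here; this is not the client's actual code):
use std::time::Duration;
use tokio::time::{sleep, timeout};

// Keep polling the (hypothetical) non-locking pool until an overall deadline
// expires, instead of giving up after a fixed number of iterations.
async fn acquire_with_deadline<T>(
    mut try_get: impl FnMut() -> Option<T>,
    deadline: Duration,
) -> Option<T> {
    timeout(deadline, async {
        loop {
            if let Some(conn) = try_get() {
                return conn;
            }
            // Short back-off so thousands of concurrent writers don't spin.
            sleep(Duration::from_millis(1)).await;
        }
    })
    .await
    .ok()
}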
any update on this?
From my side, it depends on whether the client should be handling runtimes itself or not. In many cases, there is probably a runtime handler in place already, which would make that unnecessary overhead. I have a modified version that is more optimized for single-runtime and external-runtime-handler setups. I just don't want to push it yet, in case the client should handle that itself. This version here would probably fail often under load since it does not have the modifications from my last message yet.
Generally, everything seems to work fine from a performance and stability perspective. I use this modified version in staging already and had one test service deployed to production with that client version. It just probably needs a lot more testing and review time since this version here changed minor details on most client internals.
@jonas32 all sounds good to me, I've been using it for a few months now too; all my unit tests pass and everything works fine, but my deployment isn't under any load or anything.
What sort of tests still need to be written? I can help out with that.
I'm not talking about written tests... That part should be OK. More like testing it in different situations, under load etc., something that is not a totally scripted lab env. Problems like a runtime deadlock under load will not show up in scripted tests; that one, for example, only showed up when I first tried to benchmark it.
From my part, I think I have exhausted all the feedback I could give. I don't want to let the perfect be the enemy of the good. If we have a better design that addresses those issues, we can begin looking at it (maybe in a different PR), but if everyone thinks that the current concept is good enough, we can begin reviewing it and plan to merge.
cargo check produces an error:
error: Please select a runtime from ['rt-tokio', 'rt-async-std']
--> aerospike-rt/src/lib.rs:2:1
|
2 | compile_error!("Please select a runtime from ['rt-tokio', 'rt-async-std']");
| ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Thanks for the review @petar-dambovaliev. I will take a look at your suggestions.
cargo check produces an error
I know about that and this is more or less wanted behavior. The imports change based on the selected runtime and there is no default selection. If you have any idea on how to make this better, I'm happy to hear your suggestions.
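For reference, the gating pattern in question looks roughly like this (a simplified sketch of the idea, not the exact contents of aerospike-rt):
// No default runtime: fail the build with a readable message instead of
// mysterious missing-import errors.
#[cfg(not(any(feature = "rt-tokio", feature = "rt-async-std")))]
compile_error!("Please select a runtime from ['rt-tokio', 'rt-async-std']");

// The rest of the client imports these re-exports, so swapping the runtime is
// just a feature flag.
#[cfg(feature = "rt-tokio")]
pub use tokio::{net::TcpStream, task::spawn, time::sleep};

#[cfg(all(feature = "rt-async-std", not(feature = "rt-tokio")))]
pub use async_std::{net::TcpStream, task::{sleep, spawn}};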
IMO, I haven't seen anybody use async-std. Coupling with tokio is not an issue, in this sense. And my opinion is that sync and async should be entirely separate projects.
IMO, I haven't seen anybody use async-std. Coupling with tokio is not an issue, in this sense. And my opinion is that sync and async should be entirely separate projects.
I agree. But one of the goals was to keep it as modular as possible regarding the runtime (see #102). I also only see tokio or actix-rt (basically tokio) as relevant, but supporting async-std can't be bad, since it's just a swap of imports (see aerospike-rt). And since these three runtimes are the major players, it makes sense to be compatible. Most other database libraries also do that. The reference model for the async implementation was the way sqlx did it. I would be fine with adding rt-tokio to the default features, but not with removing async-std entirely.
The sync version is more or less meant to be deprecated. It's only a compatibility wrapper on top of the async implementation so it won't break the whole API. There are still use cases where that might be relevant. That's also the reason why there are no tests for the sync client: it's just the async one with a built-in await.
@petar-dambovaliev I use async-std :)
Nice work @jonas32, any chance it will be released?
@jonas32 very excited by this patch, it seems tokio is like an even better Rust version of libev, which is where the C client has gone.
It seems to have stalled a little bit, wondering what the future holds.
OK I believe I have neglected this PR long enough. Time to begin merging it, I'll apply my own changes on top. I think it will still take a while to iron everything out, but it's time to get it moving. The following are the necessary steps:
1. error_chain needs to go, and preferably be replaced with thiserror (see the rough sketch below). This does not mean we don't need a kind of chained error type (any command that goes to multiple nodes will require chained errors), but the crate is deprecated and clippy will forever complain.
2. Replace &str fields and arguments with String. (Or maybe we should state this goal as an all-encompassing lifetime simplification.)
3. BatchRead itself is updated and will be a breaking change if released without the update. Good news: 90% done. Not so good news: 90% remaining.
I'll document items as I go through the issues. There is another associated PR that implements value serialization/deserialization, which I think should also go with this one. We need some discussion regarding implicit type conversion during deserialization, in case a user wants an integer and the field type is string (maybe with a feature flag). Thank you everyone for your patience. The Rust client is one of our most important clients and as a result has been subjected to unusual scrutiny from my side. That has in turn stopped the code base from moving forward, stalling the incremental updates. The async feature will open a new era for high-performance data access to Aerospike, so we want to get it mostly right on the first try. Please keep your ideas coming, we appreciate your feedback.
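As a rough illustration of point 1 (a sketch only; the variant names are made up, not a final design):
use thiserror::Error;

#[derive(Error, Debug)]
pub enum ClientError {
    #[error("network error: {0}")]
    Network(#[from] std::io::Error),

    #[error("server error: result code {0}")]
    ServerError(u8),

    // Commands that fan out to multiple nodes still need a chained error
    // that collects one failure per node.
    #[error("command failed on {} node(s)", .0.len())]
    MultiNode(Vec<ClientError>),
}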
Sounds great. I like the idea to remove error_chain, this one is a pain... Let me know if I can help somewhere.
I'm also very happy to do some work to help this happen since we definitely need this patch. @khaf if you were to push some of your in-progress work to a branch, it might be easier for others to jump in.
Regarding (de)serialization. It would be nice to see something capable of directly going from the on-wire msgpack representation to the desired structure and back as the current system based around the Value enum spends a lot of time building these structures that will be read once and thrown away.
For instance, if the Value was a struct containing the on-wire bytes (or even better, a Cow<[u8]>), you could use various from() methods to build it from other types and try_from to deserialize it into the type you want if what's inside can fit into the target.
You could also add fallible iterators to iterate over the values or key-value pairs of what's inside, in place.
It would be a breaking change, but if there is interest, I could give it a shot?
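A minimal sketch of that idea, with the msgpack encoding elided so the example stays self-contained (plain big-endian bytes stand in for the wire format; names are placeholders):
use std::borrow::Cow;
use std::convert::{TryFrom, TryInto};

// Holds the on-wire bytes, either owned or borrowed from the read buffer.
pub struct WireValue<'a> {
    bytes: Cow<'a, [u8]>,
}

impl From<i64> for WireValue<'static> {
    fn from(v: i64) -> Self {
        // Encode on construction (a real version would emit msgpack here).
        WireValue { bytes: Cow::Owned(v.to_be_bytes().to_vec()) }
    }
}

impl<'a> TryFrom<&WireValue<'a>> for i64 {
    type Error = &'static str;

    fn try_from(value: &WireValue<'a>) -> Result<Self, Self::Error> {
        // Decode only when the user asks for a concrete type, and fail if the
        // stored bytes can't represent it.
        let arr: [u8; 8] = value.bytes.as_ref().try_into().map_err(|_| "not an i64")?;
        Ok(i64::from_be_bytes(arr))
    }
}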
I think that's a great idea. That whole parsing to Values is a lot of wasted time when you probably want a struct in the end anyway. It probably does not even have to be a breaking change; it could be an alternative API. On the other hand, this whole PR is a giant breaking change, and the user-facing API probably needs breaking changes in more than one place (looking at point 2 of the list above).
Regarding (de)serialization. It would be nice to see something capable of directly going from the on-wire msgpack representation to the desired structure and back as the current system based around the Value enum spends a lot of time building these structures that will be read once and thrown away.
This is an idea that I explored with the Go client (using code generation, unreleased) a few years ago. It is great for simple CRUD, but a lot of our API does partial commands (delete a key in a map, upsert a single bin, etc.) and those do not benefit from this optimization. Considering we'd need a new API around this feature, I decided that it wasn't worth the complexity. Furthermore, considering that a considerable number of allocations in the Rust client are on the stack, this should not make much difference in performance (obviously I haven't done any benchmarking, just a hunch). Any thoughts?
Hi @khaf, the following variants are all heap based: String(String), Blob(Vec<u8>), List(Vec<Value>), HashMap(HashMap<Value, Value>), OrderedMap(Vec<(Value, Value)>) and GeoJSON(String).
We could potentially make Value into an opaque enum like:
enum Value<'l> {
Owned(Vec<u8>),
Borrowed(&'l[u8]),
List(Vec<Value<'static>>),
OrderedMap(Vec<(Value<'static>, Value<'static>)>),
HashMap(HashMap<Value<'static>, Value<'static>>),
}
This way it would be easy to get an Option<&mut Vec<Value<'static>>>, an Option<&mut Vec<(Value<'static>, Value<'static>)>> or an Option<&mut HashMap<Value<'static>, Value<'static>>> to add and remove elements inside the structure and edit the Value, while generally leaving most of the stuff inside verbatim.
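For illustration, one of those accessors could look like this, building on the enum sketched above (the method name is just a placeholder):
impl<'l> Value<'l> {
    // Expose the inner list for in-place edits when the value really is a
    // list; everything else stays as opaque bytes.
    pub fn as_list_mut(&mut self) -> Option<&mut Vec<Value<'static>>> {
        match self {
            Value::List(items) => Some(items),
            _ => None,
        }
    }
}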
What do you think? Worth prototyping?
Also, I think I'm going to start testing this branch at Darwinium, will let everyone know how it goes.
Hi guys, is this going to be merged any time soon? Having an async client would be super useful.
Hi guys, is this going to be merged any time soon? Having an async client would be super useful.
@vamshiaruru-virgodesigns Not soon, no. I'm still reviewing it and have a lot of design decisions to make; Rust has made strides in making some of the work here easier and more ergonomic, and supporting that requires a lot of study and experimentation. I'm very excited and would love to merge it, but it's a huge commitment to the API that I'm not sure we should make so easily. I promise that it will be merged (or moved to a dedicated async repo) after we get it sorted out in 2023 (hopefully in the first half), but I don't think that'd fit any definition of 'soon'. The good news is that we have new help coming that will relieve me from other commitments and projects and let me work on this project significantly more.
Thank you all for your help, feedback and patience. I promise the end result will be great, so please look forward to it.
Hey @khaf, thanks for clarifying that. I understand why you don't want to commit to the included API that quickly. It is still an API that was designed to be sync from its base and was ported to async while breaking as few things as possible (users only have to add an .await), out of our internal need for an async version quickly. There is surely a lot that could be cleaner because of this.
Also, a lot has changed in the ecosystem and the language itself since the client's initial design. Moving the PR to a separate repository as something like a "legacy version" sounds like a good plan to me. That would give more room to improve the quirks that the client API sometimes has. Also, this huge PR is probably a nightmare to review and test...
I would also be up for helping you research, design and develop a solution that fits the addressed problems better and is cleaner than this PR/API, if that helps you get this client moving again alongside your other commitments.
@jonas32 Thanks Jonas, both for your amazing contributions and understanding. I didn't mean that we'd want to move this PR to another repo to put it to pasture, since it is a great piece of code and unless there is reason to the contrary we'll get it in the repo. No reason to let good work go to waste.
What I meant was that as we learn more about how systems work both locally and at scale, new perspectives emerge regarding the use cases different approaches serve. In this specific case, I meant we may want to preserve the threaded version and move the async version to another repo. It is becoming more clear to me that optimizing for latency and throughput are two different goals and async is geared towards the latter. Some of our customers are all about latency and would gladly sacrifice a bit of throughput for very low 99 percentile tail latencies.
There are many more things that I'd like to revisit, including testing, life cycles, serialization, instrumentation, monitoring, self-balancing and healing among others. As I begin to think about them, I'll file new discussions on the relevant Github section so that we get the ball rolling. Of course in the meanwhile, I'll have to get this client updated to support the latest features of the server to keep the hype up :)
Thank you for your help in keeping this branch up to date and going. I'll take you up on your offer for help, since I've grown quite rusty away from Rust.
We invite everyone to take part as we continue to think about the most efficient ways to store and move data at scale.
Hi, just letting everyone know that my company has started to use @jonas32's branch in production.
It introduces a bug where batch reads on clusters with size > 1 will read batch keys out of order (I've got a pull request fixing this bug pointed towards Jonas' branch).
However, it has brought great improvements to both latency and throughput.
The greater improvements to latency are because, in real usage, it's typically rare to ever do anything just once... we can now do lots of operations from a single thread using join!, and it's far faster and simpler than spinning up threads to do it.
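For illustration, the fan-out pattern looks roughly like this (a sketch that assumes the method signatures match the existing sync API plus .await; key setup and error handling are simplified):
use aerospike::{Bin, Client, Key, Value, WritePolicy};
use futures::future::join_all;

// Sketch only: several independent writes driven concurrently from a single
// task, instead of spinning up one thread per operation.
async fn finalize(client: &Client, keys: &[Key]) {
    let policy = WritePolicy::default();
    let bins = [Bin::new("finalized", Value::Int(1))];
    // All puts run concurrently on the current task; error handling is elided.
    let results = join_all(keys.iter().map(|key| client.put(&policy, key, &bins))).await;
    for result in results {
        result.expect("write failed");
    }
}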
It also brings a greater sense of simplicity to the internal design of the crate as it is no longer responsible for managing thread pools or queues in order to do operations in parallel. I trust this could be leveraged to make further improvements in future.
Regarding 99th percentile latency: my previous company (one of Aerospike's earliest customers) found that the best way to solve this was rapid retries using READ_SEQUENCE. I'm hoping to write a patch based on Jonas' branch to add this as well.
@CMoore-Darwinium thanks for the info. I am also planning on using that branch for one of my personal projects. I wonder, have you tried running the sync client with tokio::spawn_blocking vs just running the async client, to see how much performance difference there is?
@vamshiaruru-virgodesigns we were using spawn_blocking on the sync client until we switched.
I don't really think we have an apples-to-apples comparison because the async version allows us to do some stuff that we weren't doing before. With spawn_blocking we needed to be very disciplined about how many tasks we created, since we had to contend with blocking-pool exhaustion under high traffic.
For the "finalization" part of our transactions it was taking ~30ms to perform one write and maybe 8-9 operations in series. This was also highly variable based on load.
Now we use join_all to perform all the async writes on the same thread. This part is consistently over and done with in 1.2ms-1.5ms.
Granted, this is all very unscientific because obviously there were some major structural code changes to go from sync to async. However, I'd credit the async client for allowing us to make those code changes.
@khaf don't get me wrong there. I didn't read your message as if you were saying that the PR should be moved out of sight or should be skipped. I do understand why you are not comfortable committing to the API that this PR introduces in the longer run. Not focusing on the async client exactly as this PR proposes is totally fine for me, since I agree with the problems that you pointed out before.
@CMoore-Darwinium
It introduces a bug where batch reads on clusters with size > 1 will read batch keys out of order (I've got a pull request fixing this bug pointed towards Jonas' branch).
Oh, I noticed this too while using the branch, but I did not expect it to be ordered at all in the first place. Good to know, though. I guess I can skip fixing it here since it looks like you already did that in your PR.
@vamshiaruru-virgodesigns I'm not sure that would be a fair benchmark between sync and async. The sync client in this branch has no implementation of its own at all; it wraps the async codebase and just blocks on the futures. https://github.com/aerospike/aerospike-client-rust/blob/cdff64b731cf071b87a439e369c3e0b2b4e5abf6/aerospike-sync/src/client.rs#L84 The aerospike_core crate is simply the async client. I didn't check, but I guess it cannot win that test. It just adds extra overhead by blocking on futures (so the user does not have to), as a compatibility layer for the current release. Testing the current master version against the async one would be more accurate. That way, it's still great if you want to go for throughput, but not ideal for the scenario @khaf is talking about, where customers want to go for latency instead.
When I built this, I tested the performance under a few different scenarios. It turned out that the way the thread pool is managed is a big factor. Just opening a tokio runtime and spamming the async functions will not be very efficient and might even decrease performance, at least for short-running operations. Long-running ones would profit again in that case. The same probably goes for the new sync wrapper here. The real strength is in a scenario like actix web, where the webserver handles the runtime and its threads more efficiently. I saw a great performance increase in that case. Really benchmarking the async client and getting solid numbers is not easy, since a lot of factors play into it. That's also the reason why this version does not ship async benchmarks at all currently.
@khaf just letting you know, I have a patch in https://github.com/darwinium-com/aerospike-client-rust to make the client rack aware, as well as introducing replica retry policies. As far as I know, this is the main missing feature in the Rust client and it works now.
We're currently using this async patch, so this patch is a precondition.
Hello @CMoore-Darwinium @jonas32 I created a patch with TLS support for the client from this PR. If anyone can test it with the EE version of Aerospike, that would be great. Also, I want to create a PR with trust-dns-resolver for asynchronous resolving of Aerospike hosts.
Alright, I think the best way to move forward would be to create a branch named async, and merge this PR and all of the work based on it into that branch as I review them. This way we can take some concrete steps towards merging the branch and all other related work into master, while having a central branch for future PRs. Looking forward to feedback.
I'll create that branch by Sunday evening CET at the latest if I don't hear back from anyone.
The async branch is created and can be used for future development. I'm looking into merging @CMoore-Darwinium's and @Hartigan's branches next.
Hey @khaf, sounds like a good plan. Should I fix the stuff you already commented on myself, or should I wait for you to finish?
@jonas32 Don't worry about it, I'll take care of them. If you have some free time and are looking for an interesting problem, I believe the lazy iterator for the serializers is the one.
the lazy iterator for the serializers
Just for clarification, you are talking about something in the direction of impl From<Iterator> for Bins, right?
It's about PR #126. At the moment those derive macros implement:
pub trait ToBins {
/// Convert the struct to a `Bin` Array
fn to_bins(&self) -> Vec<Bin>;
}
That Vec is expensive and unnecessary since you don't need the bins all at the same time. It would be better if we could get this:
fn to_bin_iter(&self, bins: Option<&[String]>) -> impl Iterator<Item = Bin>;
This way the bin is allocated on the stack and it does not matter how big the struct is.
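A hypothetical hand-written version of that shape, ignoring the bin filter argument for brevity (this is not the actual derive output from #126):
use aerospike::{Bin, Value};

struct User {
    name: String,
    age: i64,
}

impl User {
    // Each Bin lives on the stack inside the iterator chain; nothing is
    // collected into a heap-allocated Vec.
    fn to_bin_iter(&self) -> impl Iterator<Item = Bin<'static>> {
        std::iter::once(Bin::new("name", Value::String(self.name.clone())))
            .chain(std::iter::once(Bin::new("age", Value::Int(self.age))))
    }
}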
@jonas32 I'm trying to use the aerospike-sync crate for a library, but I encounter this error for rt-tokio. Are you familiar with the issue?
thread '<unnamed>' panicked at 'there is no reactor running, must be called from the context of a Tokio 1.x runtime'
I see a lot of mentions on the internet, but scarcely anyone provides a workaround that works. How did you test it?
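One commonly suggested workaround for this panic (a hedged sketch, not verified against this branch) is to keep a Tokio 1.x runtime alive and enter its context before using the sync client:
fn main() {
    // Keep the runtime alive for the lifetime of the client; entering its
    // context gives the wrapper's internal block_on a reactor to run on.
    let rt = tokio::runtime::Runtime::new().expect("failed to start Tokio runtime");
    let _guard = rt.enter();

    // ... construct and use the aerospike-sync Client here ...
}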
This first commit includes most of the basic changes to the client to get it running on async-std and tokio. There are surely a few more spots that need work or still block threads. Feel free to have a look. The client is not really working yet; at least I never tested it, since the unit tests are still broken.
Tests are not working yet. There are still a few logical things that need to be changed for async, but that would again be a bigger change in itself (thread pools etc.).