janis91 opened 6 years ago
Hey @janis91 - thanks for moving this here. This is something that @southpolesteve and I would really like to do. It will likely be done in stages (non-session writes are really easy; session consistency reads are pretty complex). We also won't work on this until we've gotten the new SDK released to GA, since we cannot deprecate the old SDK until this one has GA'd, which doubles the cost for us.
Short version: We really want to do this, but it likely won't start until after the new version GA's.
Hi @christopheranderson, you're welcome. Actually, I already expected that this wouldn't make it into the initial version. I like the new approach with async/await-style, promise-based methods, so I think this is already a big step forward. And after that I'm hoping for further improvements :-)
Absolutely. We're also moving all(*) our development onto GitHub, so you should be able to track the improvements as they are in the works. :)
FYI - we're doing some user studies on the new model. If you're interested, email me at chrande (at) microsoft (dot) com. We have a "final" round before our first preview release starting the 23rd.
(*) - There might be "surprise" features that get developed on a private feature branch, but they'll be merged into the main dev/master branches. We'll try to do this as little as possible.
Unfortunately, we are not able to do this at the moment. But we are still looking forward to the release :-)
Ah crap, the only reason I started implementing this SDK was because I thought it supported direct?!
This SDK will support Direct, but it wasn't a requirement for GA. It's the next feature on my queue, though, if that makes you feel better (but it's a heavy feature, so still a bit out).
Any progress?
Also waiting for this. Getting much better performance with the MongoDB driver at the moment, but would prefer to use this SDK for multi-master support. However, current performance is such that it's faster to run a single master in a region on the other side of the planet with the MongoDB driver than to use this SDK in the same region.
@hassellof, can you elaborate on the stack? Which Node.js Mongo library do you use?
@christopheranderson ?
@southpolesteve ?
Sorry for the delay in response here.
No news on Direct Mode support for JS. We're in the final phases of Direct Mode for Java (HTTP is out and live in 2.4.0 and TCP is feature complete but not at the quality bar we need). We'll evaluate how we want to approach the next Direct Mode implementation once we've gotten Java to the place we need it to be.
RE: performance - I'd appreciate any numbers you're willing to share (what do you need/expect vs. what you're currently observing). I've worked with a few customers to help make Gateway mode work fine performance-wise.
Any update?
This issue isn't on the V3 feature list?
@christopheranderson what kind of information/metrics would be valuable to get more movement behind direct mode?
We've already started batching requests to the API in lieu of singletons due to the SNAT port limitations on Azure App Services; we'd really benefit from direct mode as we move forward due to the throughput needed.
@tony-gutierrez @antempus As Chris mentioned in the v3 issue, implementing Direct Mode is a big undertaking for us. 8 months, 3 devs, when we did it for Java.
When I work with other customers on perf problems, we often find Direct Mode is not the solution. Direct mode is primarily an optimization for latency, not throughput.
In the case of port exhaustion, it's likely Direct Mode will make it worse. Consider an API exposing a cross-partition query that hits 50 partitions with maxDegreeOfParallelism = -1. At 20 concurrent API requests, Node will attempt to open 1000 concurrent connections. The Node default for an agent with keepAlive: true is 256, so you'll be port-constrained by the agent. The queries will get slow (surfacing as latency), but the real problem is throughput.
In gateway mode, all connections go to the same host so they are much easier for the agent to reuse. Direct mode will tax the agent even more since each connection can only be reused for the exact same partition. You'll see a lot more stale connections pruned from the server-side and pay more costs for reconnection/TLS/etc.
Some ideas on how to get more perf out of the current SDK:
I would be happy to discuss your specific workloads, help debug the perf issues, and chat alternative partitioning strategies. Drop me an email stfaul@microsoft.com
Steve, while I appreciate the suggestions, I feel like most people who are clamoring for direct mode have probably already tuned to the point of latency being the limiting factor.
Node is not the limiting factor. The ports on App Service are already the limiting factor, so why not let each request complete faster with one less network hop?
We have pretty constant, high DB traffic and have been all over the map with keep-alive settings, custom agents, and port limits. Our current, mostly stable configuration uses agentkeepalive (because MS products close all connections after 120 seconds no matter what, and it's the only agent with a socket TTL setting, which we set to 110 seconds) with the following config:
```js
// attempt to keep global socket usage under 160 for App Service
agentConfig: {
  keepAlive: true,
  maxSockets: 25,            // per host
  maxFreeSockets: 10,        // per host
  timeout: 60000,
  freeSocketTimeout: 30000,  // not used with the standard agent
  socketActiveTTL: 110000
}
```
With this setup we will usually have 3 Cosmos hosts with the sockets maxed. I would much rather have 10 times the hosts (although Cosmos always seems to fit our data into 5 partitions) with fewer sockets allowed and faster execution of queries by id (95% of our DB traffic). But even better would be TCP direct mode, eliminating the overhead of the HTTP agent altogether, as well as the overhead of all the headers and query-string parameters that caused the recent overflow issue.
We stay fast by spreading the load over many more servers than we really need, purely because of the bottleneck of Cosmos connections and the latency of the additional network hop.
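A configuration like the one above is typically wired in roughly as follows. This is a sketch only: it assumes the third-party `agentkeepalive` package and an `agent` option on `CosmosClient`, both of which should be verified against the SDK version in use.

```js
// Sketch: wiring an agentkeepalive HttpsAgent into the Cosmos client.
// The `agent` option on CosmosClient is an assumption; check your
// @azure/cosmos version before relying on it.
const { HttpsAgent } = require("agentkeepalive");
const { CosmosClient } = require("@azure/cosmos");

const agent = new HttpsAgent({
  keepAlive: true,
  maxSockets: 25,           // per host
  maxFreeSockets: 10,       // per host
  timeout: 60000,
  freeSocketTimeout: 30000,
  socketActiveTTL: 110000,  // recycle sockets before the 120 s cutoff
});

const client = new CosmosClient({
  endpoint: process.env.COSMOS_ENDPOINT,
  key: process.env.COSMOS_KEY,
  agent,
});
```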
@southpolesteve
Great info; we'll give this a shot to see if the tuning helps. I think it will help in our dev/cert env, but in prod we're going to see orders of magnitude more spiky requests to the API.
What's interesting is that the majority of the challenges stem not from cross-partition queries but from the sheer number of requests coming into the App Service; we're already batching the POSTs to the API but still see errors on the API.
I'll try these suggestions and bump the App Service up to the now lower priced P1V2 and post the results.
@tony-gutierrez Can you share some of the latency #s you are seeing and at what kind of load? If point reads are slow, I would like to dig into that more. I will try to repro on app service.
It would also be helpful to know whether requests are being queued. You can grab that off of the agent. Some code I have used before to log this:

```js
const { Agent } = require("http");

const agent = new Agent({
  keepAlive: true
});

// Log how many requests are waiting for a free socket.
setInterval(() => {
  const queued = Object.values(agent.requests).flat().length;
  console.log(`queued requests: ${queued}`);
}, 1000);
```
Approaching half a year later... Any updates on direct mode?
Hi Team - any update on the direct mode feature?
@antempus What were your results?
@jay-most you can mark my comments as no longer valid; I've moved on to different work/a different company and I can't recall why we needed, or thought we needed, this feature.
Hi @jay-most, can you confirm when direct mode will be available? With Gateway mode we are seeing 4 to 5 second latency during load spikes in Azure metrics. Alternatively, can someone suggest an optimal configuration to reduce latency?
Container configuration:
Max RU/s: 25,000 with autoscaling enabled.
Query configuration:
We are querying on the basis of the partition key and an indexed field.
```js
export const MAX_DEGREE_OF_PARALLELISM = -1;
export const MAX_ITEM_COUNT = 1000;
export const BUFFER_ITEMS = true;
export const FORCE_QUERY_PLAN = true;
```
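For context, constants like these are typically passed as the options argument to `items.query` in `@azure/cosmos` v3. This is a sketch; the option names come from the SDK's `FeedOptions` and should be checked against the installed version:

```js
// Sketch: query options assembled from constants like the ones above.
const queryOptions = {
  maxDegreeOfParallelism: -1, // fan out across all partitions in parallel
  maxItemCount: 1000,         // page size per round trip
  bufferItems: true,          // prefetch pages ahead of the consumer
  forceQueryPlan: true,       // fetch the query plan up front
};

// Usage (assumes an initialized `container`):
// const { resources } = await container.items
//   .query("SELECT * FROM c WHERE c.pk = @pk", queryOptions)
//   .fetchAll();
```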
Hi @janis91, we deeply appreciate your input into this project. Regrettably, this issue has remained inactive for over 2 years, leading us to the decision to close it. We've implemented this policy to maintain the relevance of our issue queue and facilitate easier navigation for new contributors. If you still believe this topic requires attention, please feel free to create a new issue, referencing this one. Thank you for your understanding and ongoing support.
As already raised in the "old" repository (https://github.com/Azure/azure-documentdb-node/issues/78), it would be a big improvement if the new library supported direct mode / TCP connections. The performance of HTTP calls is currently quite poor. In the linked issue this has been planned since 2015.
Maybe this can be integrated more easily into the new code base.