Azure / azure-sdk-for-js

This repository is for active development of the Azure SDK for JavaScript (NodeJS & Browser). For consumers of the SDK we recommend visiting our public developer docs at https://docs.microsoft.com/javascript/azure/ or our versioned developer docs at https://azure.github.io/azure-sdk-for-js.

Direct mode / tcp support #4807

Open janis91 opened 6 years ago

janis91 commented 6 years ago

As already raised in the "old" repository (https://github.com/Azure/azure-documentdb-node/issues/78), it would be a big improvement if the new library supported direct mode / TCP connections. The performance of HTTP calls is actually quite poor. In the linked issue this has been planned since 2015.

Maybe this can be integrated more easily into the new code base.

christopheranderson commented 6 years ago

Hey @janis91 - thanks for moving this here. This is something that @southpolesteve and I would really like to do. It will likely be done in stages (non-session writes are really easy, session consistency reads are pretty complex). We also won't work on this until we've gotten the new SDK released to GA, since we cannot deprecate the old SDK until this one has GA'd, which doubles the cost for us.

Short version: We really want to do this, but it likely won't start until after the new version GA's.

janis91 commented 6 years ago

Hi @christopheranderson, you're welcome. Actually, I already expected that this wouldn't make it into the initial version here. I like the new approach with async/await-style, promise-based methods, so I think this will already be a big step forward. And after that I am hoping for further improvements :-)

christopheranderson commented 6 years ago

Absolutely. We're also moving all(*) our development onto GitHub, so you should be able to track the improvements as they are in the works. :)

FYI - we're doing some user studies on the new model. If you're interested, email me at chrande (at) microsoft (dot) com. We have a "final" round before our first preview release starting the 23rd.

(*) - There might be "surprise" features that get developed on a private feature branch, but they'll be merged into the main dev/master branches. We'll try to do this as little as possible.

janis91 commented 6 years ago

Unfortunately, we are not able to do this at the moment. But we are still looking forward to the release :-)

tony-gutierrez commented 6 years ago

Ah crap, the only reason I started implementing this SDK was because I thought it supported direct?!

christopheranderson commented 6 years ago

This SDK will support Direct, but it wasn't a requirement for GA. It's the next feature on my queue, though, if that makes you feel better (but it's a heavy feature, so still a bit out).

tony-gutierrez commented 5 years ago

Any progress?

hassellof commented 5 years ago

Also waiting for this. Getting much better performance with the MongoDB driver at the moment, but would prefer to use this SDK to get support for multi-master. However, current performance is such that it's faster to run a single master in a region on the other side of the planet with the MongoDB driver than to use this SDK against the same region.

tony-gutierrez commented 5 years ago

@hassellof, can you elaborate on the stack? Which Node.js Mongo library do you use?

tony-gutierrez commented 5 years ago

@christopheranderson ?

tony-gutierrez commented 5 years ago

@southpolesteve ?

christopheranderson commented 5 years ago

Sorry for the delay in response here.

No news on Direct Mode support for JS. We're in the final phases of Direct Mode for Java (HTTP is out and live in 2.4.0 and TCP is feature complete but not at the quality bar we need). We'll evaluate how we want to approach the next Direct Mode implementation once we've gotten Java to the place we need it to be.

Re: performance - I would appreciate any numbers you're willing to share (what you need/expect vs. what you're currently observing). I've worked with a few customers to help make Gateway mode work fine performance-wise.

tony-gutierrez commented 5 years ago

Any update?

tony-gutierrez commented 5 years ago

This issue isn't on the V3 feature list?

antempus commented 5 years ago

@christopheranderson what kind of information/metrics would be valuable to get more movement behind direct mode?

We've already started to batch requests to the API in lieu of singletons due to the SNAT port limitations on Azure App Service; we'd really benefit from direct mode as we move forward due to the throughput we need.

southpolesteve commented 5 years ago

@tony-gutierrez @antempus As Chris mentioned in the v3 issue, implementing Direct Mode is a big undertaking for us: it took 8 months and 3 devs when we did it for Java.

When I work with other customers on perf problems, we often find Direct Mode is not the solution. Direct mode is primarily an optimization for latency, not throughput.

In the case of port exhaustion, it's likely Direct Mode will make it worse. Consider an API exposing a cross-partition query that hits 50 partitions with maxDegreeOfParallelism = -1. At 20 concurrent API requests, Node will attempt to open 1000 concurrent connections. The Node default maxFreeSockets for an agent with keepAlive: true is 256, so you'll be port constrained by the agent. The queries will get slow (which surfaces as latency), but the real problem is throughput.

In gateway mode, all connections go to the same host, so they are much easier for the agent to reuse. Direct mode will tax the agent even more, since each connection can only be reused for the exact same partition. You'll see a lot more stale connections pruned from the server side and pay more cost for reconnection/TLS/etc.

Some ideas on how to get more perf out of the current SDK (a rough sketch follows the list):

  1. Increase agent maxFreeSockets. I worked with a customer this week who did it and saw huge improvements. I understand Azure App Service doesn't have great limits here; you'll have to bring that up with their team.
  2. Turn off keepAlive. If your traffic is spiky, it might not be benefiting you much. Node's default maxSockets without keepAlive is infinite. This likely increases your p50 query latency but may help throughput.
  3. Tune maxItemCount. If omitted, the backend will pick its own page size, which may not be the best for your scenario.
  4. Tune maxDegreeOfParallelism. "-1" means full parallelism and is probably not the best option; it can easily exhaust ports or CPU.
  5. Page buffering. This is on our end. We are working on it: https://github.com/Azure/azure-cosmos-js/pull/397
  6. Queue writes on your end. Maybe we could add SDK features to help here? I am open to ideas.
  7. Avoid cross-partition queries altogether, or at least avoid scenarios where they are executed concurrently on the same box. Caveat: if you are seeing high latency for a single-partition query, please share. I would like to investigate more.
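
To make 1-4 and 7 concrete, here is a minimal sketch against the v3 @azure/cosmos API. The endpoint/key environment variables, database/container names, partition key value, and the specific limits are placeholders to illustrate the knobs, not recommendations for any particular workload:

const https = require("https");
const { CosmosClient } = require("@azure/cosmos");

// 1 & 2: a keep-alive agent with explicit socket limits instead of the defaults.
const agent = new https.Agent({
  keepAlive: true,
  maxSockets: 50,     // cap concurrent sockets per host
  maxFreeSockets: 32  // keep more idle sockets around for reuse
});

const client = new CosmosClient({
  endpoint: process.env.COSMOS_ENDPOINT, // placeholder
  key: process.env.COSMOS_KEY,           // placeholder
  agent
});
const container = client.database("mydb").container("mycontainer"); // placeholder names

async function readItems(pk) {
  // 3, 4 & 7: a single-partition query with an explicit page size and bounded parallelism.
  const { resources } = await container.items
    .query(
      {
        query: "SELECT * FROM c WHERE c.pk = @pk",
        parameters: [{ name: "@pk", value: pk }]
      },
      {
        partitionKey: pk,           // scope the query to one partition
        maxItemCount: 100,          // 3: pick a page size instead of letting the backend choose
        maxDegreeOfParallelism: 4   // 4: bounded, rather than -1 (full) parallelism
      }
    )
    .fetchAll();
  return resources;
}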

I would be happy to discuss your specific workloads, help debug the perf issues, and chat about alternative partitioning strategies. Drop me an email at stfaul@microsoft.com

tony-gutierrez commented 5 years ago

Steve, while I appreciate the suggestions, I feel like most people who are clamoring for direct mode have probably already tuned to the point of latency being the limiting factor.

Node is not a limiting factor. The ports on App Service are already a limiting factor, so why not have each of them do its work faster with one less network hop?

We have pretty constant, high DB traffic. We have been all over the map with keep-alive, custom agents, and port limits. Our current, mostly stable configuration uses agentkeepalive (because MS products close all connections after 120 seconds no matter what, and it's the only agent with a socket TTL setting, which we set to 110 seconds) with the following config:

// attempt to keep global socket usage under 160 for app service.
const agentConfig = {
    keepAlive: true,
    maxSockets: 25,           // this is per host
    maxFreeSockets: 10,       // per host
    timeout: 60000,
    freeSocketTimeout: 30000, // not used if using the normal agent
    socketActiveTTL: 110000
};
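
For completeness, this is roughly how a config like that gets wired up, assuming the client accepts a custom agent option (the endpoint/key values are placeholders):

const HttpsAgent = require("agentkeepalive").HttpsAgent; // HTTPS variant, since Cosmos endpoints are HTTPS
const { CosmosClient } = require("@azure/cosmos");

const client = new CosmosClient({
  endpoint: process.env.COSMOS_ENDPOINT, // placeholder
  key: process.env.COSMOS_KEY,           // placeholder
  agent: new HttpsAgent(agentConfig)     // assumes the SDK's custom-agent option
});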

With this setup we will usually have 3 Cosmos hosts, with the sockets maxed. I would much rather have 10 times the hosts (although Cosmos always seems to fit our data into 5 partitions) with fewer sockets allowed and faster execution of queries by ID (95% of our DB traffic). But even better would be TCP direct mode, eliminating the overhead of the HTTP agent altogether, along with the overhead of all the headers and query string parameters that caused the recent overflow issue.

We stay fast by spreading the load over many more servers than we really need, just due to the bottleneck of Cosmos connections and the latency of the additional network hop.

antempus commented 5 years ago

@southpolesteve

Great info, and we will give this a shot to see if the tuning helps; I think it will help in our dev/cert env, but in prod we're going to see orders of magnitude more spiky requests to the API.

What's interesting is that the majority of the challenges stem not from cross-partition queries, but rather from the sheer number of requests coming into the App Service; we're already batching the POSTs to the API but still see errors on the API.

I'll try these suggestions, bump the App Service up to the now lower-priced P1v2, and post the results.

southpolesteve commented 5 years ago

@tony-gutierrez Can you share some of the latency #s you are seeing and at what kind of load? If point reads are slow, I would like to dig into that more. I will try to repro on app service.

It would also be helpful to know if requests are being queued. You can grab that off of the agent. Some code I have used before to log this:

const { Agent } = require("https"); // or whichever agent your client is using

const agent = new Agent({
  keepAlive: true
});

setInterval(() => {
  // agent.requests maps each host to the queue of requests still waiting for a socket
  console.log(Object.values(agent.requests).flat().length);
}, 1000);

tony-gutierrez commented 4 years ago

Approaching half a year later... Any updates on direct mode?

MuhamedSalihSeyedIbrahim commented 2 years ago

Hi Team - any update on the direct mode feature?

jay-most commented 2 years ago

@antempus What were your results?

antempus commented 1 year ago

@jay-most you can mark my comments as no longer valid; I've moved on to different work and a different company, and I cannot recall why we needed, or thought we needed, this feature.

sabharwal-garv commented 1 year ago

Hi @jay-most, can you confirm when direct mode will be available? With Gateway mode we are seeing 4 to 5 second latency in Azure metrics when we have a spike in load. Alternatively, can someone please suggest an optimal configuration to reduce latency?

Container Configuration:

Max RUs: 25,000 with autoscaling enabled.

Query configuration:

We are querying on the basis of partitionKey and an indexed field.

export const MAX_DEGREE_OF_PARALLELISM = -1;
export const MAX_ITEM_COUNT = 1000;
export const BUFFER_ITEMS = true;
export const FORCE_QUERY_PLAN = true;
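
Roughly how we apply these constants in a query (simplified; the container reference, field names, and values here are illustrative, not our exact code):

// `container` is an @azure/cosmos Container obtained from the client elsewhere.
async function queryByPartitionAndIndexedField(container, pk, value) {
  const { resources } = await container.items
    .query(
      {
        query: "SELECT * FROM c WHERE c.partitionKey = @pk AND c.indexedField = @value",
        parameters: [
          { name: "@pk", value: pk },
          { name: "@value", value }
        ]
      },
      {
        partitionKey: pk,                                  // scope to a single partition
        maxDegreeOfParallelism: MAX_DEGREE_OF_PARALLELISM, // -1 in the config above
        maxItemCount: MAX_ITEM_COUNT,
        bufferItems: BUFFER_ITEMS,
        forceQueryPlan: FORCE_QUERY_PLAN
      }
    )
    .fetchAll();
  return resources;
}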

github-actions[bot] commented 7 months ago

Hi @janis91, we deeply appreciate your input into this project. Regrettably, this issue has remained inactive for over 2 years, leading us to the decision to close it. We've implemented this policy to maintain the relevance of our issue queue and facilitate easier navigation for new contributors. If you still believe this topic requires attention, please feel free to create a new issue, referencing this one. Thank you for your understanding and ongoing support.