dart-lang / sdk

need a way to have visibility into the amount of resources that the AS is consuming #24020

Closed devoncarew closed 9 years ago

devoncarew commented 9 years ago

Related to https://github.com/dart-atom/dartlang/issues/189. We have situations where the AS process consumes large amounts of RAM and/or shows persistently high CPU usage. We'd like a way to be aware of this on the client end. It may be difficult to get precise resource usage (RAM / CPU), but hopefully there are some suitable proxies. Number of contexts, size of each context, ... This could be exposed as a specific API that clients call, or perhaps as a periodic notification.

@sethladd @bwilkerson @stereotype441

sethladd commented 9 years ago

In the meantime, would the editor's integration with AS be able to get info like size of process? I assume it controls the process, because it spawns it?

bwilkerson commented 9 years ago

> It may be difficult to get precise resource usage ...

Actually, it's impossible. There is no way to get any information like this from the VM within the running process. I'd really like to be able to know how much memory we're using in order to better manage our memory usage, but I can't.

> ... but hopefully there are some suitable proxies.

I'm sure we could come up with something, but I'm not convinced that Seth's suggestion of just asking the OS for the information wouldn't be better. Or alternatively, use the Observatory API to query the VM.

> This could be exposed as a specific API that clients call, or perhaps as a periodic notification.

I don't know what triggers we'd use to know to send a notification. And given the amount of time required to iterate over the data to compute the values to send, I'd be concerned about the performance implications.

stereotype441 commented 9 years ago

I can think of some possible proxy data that would be useful to collect and quick to compute, such as:

(This would be similar to the hidden analysis view we had in the old Java-based dart editor; collecting and showing that data didn't seem to be a performance problem.)

As far as when to send the notification, I imagine we could send it periodically while analysis is active (once per second perhaps) and once whenever analysis goes idle.

Paul
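
A minimal sketch of the schedule suggested above: at most one notification per interval while analysis is active, plus one when analysis goes idle. It is written in Python for brevity, and the class and method names are hypothetical; nothing here is part of the analysis server API.

```python
class NotificationSchedule:
    """Decides when a status notification should be sent (hypothetical helper)."""

    def __init__(self, interval: float = 1.0) -> None:
        self.interval = interval      # seconds between notifications while active
        self._last_sent = None        # time of the last notification, if any
        self._was_active = False      # whether analysis was active on the last check

    def should_send(self, now: float, analyzing: bool) -> bool:
        # One final notification on the transition from active to idle.
        went_idle = self._was_active and not analyzing
        self._was_active = analyzing

        # While analysis is active, send at most once per interval.
        due = analyzing and (
            self._last_sent is None or now - self._last_sent >= self.interval
        )

        if due or went_idle:
            self._last_sent = now
            return True
        return False
```

The caller passes in the current time, so the same logic can be driven by a timer in the server or by a fake clock in a test.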

devoncarew commented 9 years ago

@sethladd, the API I'm using does not expose the pid unfortunately.

And I think using the observatory is a bit problematic for this use case. I want to use this API to be aware proactively that there are issues. The observatory is better for diagnosing issues after the user knows that there's a problem.

@bwilkerson, totally agree that we should avoid any performance measurement that could itself cause performance issues :) Something that was light-weight and easy to compute, but that was still a good measure of the resource consumption of the analysis server, would be ideal.

bwilkerson commented 9 years ago

It occurs to me, though, to wonder what value that data has unless you have some way to respond to the data. If the user opens a number of large projects and server uses lots of memory, what will you do? Restarting it won't necessarily help. (Perhaps another way of looking at the same thing is to ask: what data could we provide that would be actionable?)

devoncarew commented 9 years ago

This is to help be aware of issues like https://github.com/dart-atom/dartlang/issues/189 earlier. The data we need from the server is information that we could surface to the user in some fashion, in order to make them aware of extreme resource usage situations. Many of the ideas listed earlier in this bug report would work for that.

danrubel commented 9 years ago

Seems like a good idea to provide users some high-level status about the analysis server.

@devoncarew This will likely be platform dependent. Perhaps provide status on Linux/Mac for now? Something like what's returned by ps -p <pid> -o pid,state,cpu,%cpu,%mem,vsz ?
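
As a rough illustration of the `ps` approach on Linux/Mac, the sketch below asks `ps` for the CPU percentage and resident set size of one process and parses the result. It is Python rather than client code, the function names are hypothetical, and it assumes a `ps` that accepts `-p` and repeated `-o name=` options (the trailing `=` suppresses the header line).

```python
import subprocess
from typing import List, NamedTuple


class ProcessUsage(NamedTuple):
    cpu_percent: float  # value of the %cpu column
    rss_kib: int        # resident set size, in KiB


def ps_command(pid: int) -> List[str]:
    # Separate -o options avoid ambiguity in how "a=,b=" is parsed by some ps builds.
    return ["ps", "-p", str(pid), "-o", "%cpu=", "-o", "rss="]


def parse_ps_output(output: str) -> ProcessUsage:
    fields = output.split()
    if len(fields) != 2:
        raise ValueError(f"unexpected ps output: {output!r}")
    # Some locales print a decimal comma in the %cpu column.
    return ProcessUsage(float(fields[0].replace(",", ".")), int(fields[1]))


def sample_usage(pid: int) -> ProcessUsage:
    # Raises CalledProcessError if the process no longer exists.
    result = subprocess.run(
        ps_command(pid), capture_output=True, text=True, check=True
    )
    return parse_ps_output(result.stdout)
```

A client that knows the server's pid could call `sample_usage(pid)` on a timer; Windows would need a different command.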

@bwilkerson Would adding this information to the analysis server status page sound reasonable?

bwilkerson commented 9 years ago

> This will likely be platform dependent.

The information that Devon is asking for would not be platform dependent.

> Would adding this information to the analysis server status page sound reasonable?

I don't understand what good that would do for either clients or users. Neither has access to the status pages under normal conditions. (They are only intended to be used for debugging.)

stereotype441 commented 9 years ago

I think we may have had some miscommunication about this request. The request came from a conversation Devon and I had, and I believe the information we were trying to ask for is platform independent. Let's try to resolve it in person this morning.

danrubel commented 9 years ago

> > This will likely be platform dependent.
>
> The information that Devon is asking for would not be platform dependent.

@bwilkerson Sorry... let me clarify. The information is platform independent, but will probably need to be obtained in a platform dependent manner by calling out to the OS. I don't know of a way to get CPU and RAM usage via dart:io other than calling out to the OS.

> > Would adding this information to the analysis server status page sound reasonable?
>
> I don't understand what good that would do for either clients or users. Neither has access to the status pages under normal conditions. (They are only intended to be used for debugging.)

Good point. Perhaps a new API or additional information in the server.status notification. This seems like useful information for the client to display to the user... overall server health similar to the "analyzer is running" progress bar that shows up in IJ, DE, DPfE, DPfA, etc. Some clients could overlay %CPU and RAM usage on top of the analyzing progress bar as one possible idea.

bwilkerson commented 9 years ago

> I don't know of a way to get CPU and RAM ... other than calling out to the OS.

Note that Devon is willing to take proxies for these values as long as they provide the information he needs, and that these proxies would not require platform-specific code.

danrubel commented 9 years ago

Sgtm

bwilkerson commented 9 years ago

> This is to help be aware of issues like dart-atom/dartlang#189 earlier. The data we need from the server is information that we could surface to the user in some fashion, in order to make them aware of extreme resource usage situations. Many of the ideas listed earlier in this bug report would work for that.

I still don't understand what value this information has for the user. In the referenced issue, the user clearly had information about how much memory was being used (better info than any of our proxies would have given), but didn't know what to do about it. I don't understand the value of providing this data if the data isn't actionable.

For example, what would/should the user do if we told them that they had 14 analysis contexts (whatever those are) that were analyzing an average of 217 files each? Is that ok, or is that a problem? Should they close some projects? Should they restart their editor? I think the data we're proposing to give them is meaningless to them.

To be clear, I agree that there is a real problem here and that we should fix it. I'm just not convinced that we've hit on an actual solution.

sethladd commented 9 years ago

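There are two objectives:

1. Be able to report resource utilization up through analytics to get a sense of AS in the wild. Do we see trends in latency, memory, crashes, cpu spikes?
2. Clients must be proactive in restarting AS if AS begins to get into a state where it degrades performance for the user.
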
For the second case, what are good signals for the client to look at? Latency in response times? Keepalive pings?

bwilkerson commented 9 years ago

> Be able to report resource utilization up through analytics to get a sense of AS in the wild. Do we see trends in latency, memory, crashes, cpu spikes?

I assume we want the server to send as much of this data to analytics as possible (rather than the client). Of course, only clients can report all server crashes (server tries to be safe and send a notification to the client when it's shutting down unexpectedly, but we can't be 100% safe).

As discussed, we can't get actual memory or cpu usage, so we'll have to make do with some kind of proxy value. It's easier to imagine a proxy for memory usage than for cpu usage; I have no idea how to get the latter. Whatever we come up with, we can send it directly to analytics rather than to the client.

As for latency, on the client side we can measure how long it takes from the time a request is sent until the time the client notices that the response has arrived. But clients can't know when the request actually arrived, only when they finally noticed that it was there, which can skew the numbers.

We have a similar problem on the server side. We can know how long it took from the time we noticed a request until we sent a response, but we don't know how long the request was sitting on the event queue before we got it. (It would be nice if events on the queue were timestamped.) That skews the numbers as measured on the server as well.

The latency as measured on the client is the most important from the user's perspective, but is somewhat out of our control. It might be nice to have both values, but I don't know how many clients are willing to add the support for sending us analytics.
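
For illustration, client-side latency as described above could be recorded roughly like this. The sketch is Python with hypothetical names; it measures from the moment the client sends a request to the moment the client notices the response, so it includes the queueing skew just mentioned.

```python
import time


class LatencyTracker:
    """Records request-to-response latency as seen by the client (hypothetical)."""

    def __init__(self, clock=time.monotonic) -> None:
        self._clock = clock   # injectable so tests can use a fake clock
        self._pending = {}    # request id -> time the request was sent
        self.samples = []     # completed latencies, in seconds

    def request_sent(self, request_id: str) -> None:
        self._pending[request_id] = self._clock()

    def response_noticed(self, request_id: str):
        start = self._pending.pop(request_id, None)
        if start is None:
            return None       # response without a matching request
        elapsed = self._clock() - start
        self.samples.append(elapsed)
        return elapsed
```

The same shape would work on the server side, timing from when a request is noticed to when the response is sent.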

> Clients must be proactive in restarting AS if AS begins to get into a state where it degrades performance for the user.

That's fair. (Although I'll point out that the information is therefore for the client, not the user.)

While I agree it's worth doing, I think it's going to be hard to define good measures that give accurate information for this purpose.

> For the second case, what are good signals for the client to look at? Latency in response times? Keepalive pings?

Latency is probably the most important because it impacts things like how long it takes to get code completions displayed. Another might be how long after a change before analysis is complete because that impacts things like how long before errors and warnings are updated and when a search or refactoring can be performed.

pq commented 9 years ago

Paging @lukechurch. Thinking he may have pointers based on prior art.

devoncarew commented 9 years ago

I'd like to track the analytics aspects in a separate issue. For this issue, we want to have an easy way to bring potential resource consumption issues to the attention of the user. Additionally, we would surface a more detailed view for power users that were interested in helping to diagnose the issue. This would likely be surfaced here:

[Image: screen shot 2015-09-30 at 12 49 38 pm]

sethladd commented 9 years ago

> to the attention of the user.

What would the user do with that info?

I want my client to just restart AS when AS hits a bug. Perhaps the client should inform the user "Restarting Dart plugin, stand by..." and then send that as an analytics event with details of resource consumption.

I definitely want resource info to be part of any bug report.

devoncarew commented 9 years ago

Auto-restart might have issues if the analysis server will just get into the same state as before. I'd lean towards making the user aware of the issue and give them tools to self-serve, in terms of being able to bounce the analysis server and report issues with good information.

bwilkerson commented 9 years ago

> Additionally, we would surface a more detailed view for power users that were interested in helping to diagnose the issue.

It depends on what you mean by "power users", of course, but if they really want to help diagnose the issue (which would be fantastic!) perhaps just add an "Open Status Page" button that would launch our status pages. The status pages show all kinds of information about the inner workings of analysis server and don't need to worry about being backward compatible from one release to the next. Much easier than adding a new API to populate this view.

These pages already display most of the proxy measures that have been mentioned here.

But as I said before when Dan suggested this, I don't think this is appropriate for most users, and I'm skeptical it's even appropriate for power users.

devoncarew commented 9 years ago

> perhaps just add an "Open Status Page" button that would launch our status pages. The status pages show all kinds of information about the inner workings of analysis server and don't need to worry about being backward compatible from one release to the next. Much easier than adding a new API to populate this view.

We have this already.

bwilkerson commented 9 years ago

> Auto-restart might have issues if the analysis server will just get into the same state as before. I'd lean towards making the user aware of the issue and give them tools to self-serve, in terms of being able to bounce the analysis server and report issues with good information.

"bounce" sounds like "manual restart", which won't be any more effective than an auto-restart (with the exception that the user will eventually stop restarting server when they realize that it isn't doing any good :-).

Again, what information do you want to display to the user that they could make sense of? Could you provide me with a concrete message that you would display that you think would be useful to the user?

sethladd commented 9 years ago

After a discussion in person, we're going to do a little more work trying to see if we can get the PID for the analysis server.

danrubel commented 9 years ago

If you can't, then the analysis server can return its own pid via dart:io pid and a new API or augmenting an existing API.

sethladd commented 9 years ago

> If you can't, then the analysis server can return its own pid

Ah, thanks very much! We might just take you up on that offer :)

lukechurch commented 9 years ago

Let me try and help out here.

People develop a 'spider sense' for 'this is a well behaving tool', 'this isn't'. In the physical world we use all sorts of signals to build this 'feeling'. E.g. the vibration in a car steering wheel as you're driving it.

Digital things are on the whole polished glass boxes, which lack a lot of these signals. So people need other ones. When the system gets really reliable people can start to use its behaviour as a signal. E.g. we saw people using code completion stopping working as a signal that they'd got something wrong with their code.

On the whole our tooling isn't there yet. So when people get strange behaviour they don't know whether it's their code being broken or the system misbehaving. They will often assume the former and waste a lot of time. Anything we can do to add signals that people can start using to build a 'this is normal, this isn't' feeling will help a lot in the user knowing when to keep pushing on or when to repair.

So e.g. RAM usage. If the user builds an instinct that 'my project usually takes AS about 1 GB to run', the system starts behaving funny and they notice it's using 4 GB, then they might correctly shortcut to the realisation that it was AS, and click the 'Something is dodgy, send a perf report and restart' button.
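
The same "this is normal, this isn't" judgement could also be approximated mechanically. The sketch below is a hypothetical Python helper, not anything the thread settled on: it compares the current memory reading against the median of earlier readings. The factor of 3 and the minimum of 5 samples are arbitrary assumptions.

```python
from statistics import median
from typing import Sequence


def looks_abnormal(
    history_mib: Sequence[float],
    current_mib: float,
    factor: float = 3.0,
    min_samples: int = 5,
) -> bool:
    """Returns True when current usage is far above the usual level."""
    # With too little history there is no baseline to compare against.
    if len(history_mib) < min_samples:
        return False
    return current_mib > factor * median(history_mib)
```

With a history hovering around 1 GB, a 4 GB reading is flagged and a 1.1 GB reading is not.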

My instinct for what 3 useful metrics would be are:

I think putting these three numbers somewhere continuously on the UI at low intensity might well help build the 'spider sense'.

clayberg commented 9 years ago

I think Luke is dead on here. Let's find a way to provide one or more of these metrics soon. These would be very useful in both Atom and IntelliJ.

bwilkerson commented 9 years ago

I have no problem having clients display this information, but server does not have this information so it cannot be provided via a server API. The clients need to get this information from somewhere else.

What server can do is provide the PID of the server process (for which Seth just added a new issue).

sethladd commented 9 years ago

> My instinct for what 3 useful metrics would be are:

Are these things the AS should return, or the client should be able to determine?

I hesitate to request the AS send me those three metrics, because if the AS is acting wonky, then I might not get answers to those questions. I feel like I need to get those answers "out of band".

I did file https://github.com/dart-lang/sdk/issues/24477 to request that AS make it easier for clients to get that info.

Therefore, perhaps we should close this specific issue, and open two more issues, one for Atom and one for IntelliJ. Maybe we even write a Dart library to get this diagnostic info, given a PID :)

devoncarew commented 9 years ago

So, to follow up from yesterday:

I am able to get the pid for the process we launch for the analysis server. I was able to find it by digging through the node.js code. I'm not sure yet whether this will work on Windows, but on Mac and Linux we can periodically shell out to some OS commands (ps, ...) in order to scrape cpu and memory usage info.

It would still be nice for clients to be able to opt in to a periodic heartbeat from the analysis server. The payload could be metrics that are easy for the AS to calculate and that would act as good proxies for the resource consumption of the AS. I would see this API as generally useful to clients. It would mean that the monitoring code could be written in one place instead of each client writing its own monitoring code.

sethladd commented 9 years ago

> I am able to get the pid for the process we launch for the analysis server.

Unless we can reliably get this across OSes, I'm in favor of the AS sending this. That makes it easy for all clients to get. It also sounds easy for the AS to send, and then you don't need to scrape the output of OS commands :)

> It would still be nice for clients to be able to opt in to a periodic heartbeat from the analysis server.

Can you open a new issue that specs this out?

I'm going to close this issue as it sounds like we've reached a few concrete things we can do. Thanks for all the discussion, glad to see we're making progress!