clojure-lsp / clojure-lsp

Clojure & ClojureScript Language Server (LSP) implementation
https://clojure-lsp.io
MIT License

Infinite analyze loop after changing lots of files outside editor #796

Closed: ericdallo closed this 2 years ago

ericdallo commented 2 years ago

@borkdude reported with more details here: it seems clojure-lsp gets stuck in an infinite loop scanning watched file changes, easily causing high CPU usage.

We should try to make a repro; probably two different branches with lots of changed files would cause this.

ericdallo commented 2 years ago

For now, restarting with lsp-workspace-restart should reset this state

ericdallo commented 2 years ago

@borkdude I couldn't repro the infinite loop exactly, even on the clj-kondo repo switching between branches with multiple changed files, but I improved the code a lot to avoid possible infinite loops/performance issues: before, we were calling clj-kondo to analyze each new file; now we call it once, passing all the new files.

I'm not sure this will solve your issue though; could you test the latest master? You can find a nightly build on #clojure-lsp-builds.

borkdude commented 2 years ago

It happened to me several times in the last few days on different projects (clj-kondo, SCI). Every time it was triggered by a git change (reset or branch switch).

borkdude commented 2 years ago

Will test.

ericdallo commented 2 years ago

@borkdude mentioned he still faced the issue on git branch switch. It seems unrelated to didChangeWatchedFiles at all, but related to some loop where we analyze with clj-kondo, which could be:

mainej commented 2 years ago

Here's the research I've been working on today. Sorry for taking so long to put it together. As I've been writing I've been testing things and my understanding keeps evolving. Here's where I am...

You can get into a bad state in file-management.analyze-changes. Here's a simplification of that function, with some numbers added to make it easier to talk about.

(defn analyze-changes [text db]
  (loop [old-db @db]                                          ;; 1
    (let [kondo-result (lsp.kondo/run-kondo-on-text! text db) ;; 2
          new-db (merge-kondo-result old-db kondo-result)]    ;; 3
      (if (compare-and-set! db old-db new-db)                 ;; 4
        (follow-up)                                           ;; 5
        (recur @db)))))                                       ;; 6

That function calls clj-kondo (2). It tries to merge the kondo results into the db with compare-and-set! (3 and 4). It uses compare-and-set! because other threads might have modified the db while the analysis was running, and it doesn't want to overwrite those changes. If compare-and-set! succeeds, it does some follow-up work and returns from the function (5). If it fails, it tries again with the latest version of the db (6), hoping that this time no other thread will touch the db.

This looks like it should work, but sometimes it doesn't. With a bunch of logging, I discovered that when analyze-changes gets stuck in a loop, compare-and-set! always fails because old-db is not identical? to (i.e., is not the same object as) what is currently in the db atom. It is, however, = (i.e., the same value). compare-and-set! is built around object identity, so if the objects are = but not identical? it will fail.
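A standalone REPL-style example (not clojure-lsp code) shows the identity-based semantics; `hash-map` is used so each map is definitely a fresh object:

```clojure
;; compare-and-set! compares with identical?, not =
(def a (atom (hash-map :findings [])))
(def old-val @a)

(reset! a (hash-map :findings []))      ; equal value, but a brand-new map

(= old-val @a)                          ;; => true
(identical? old-val @a)                 ;; => false
(compare-and-set! a old-val {:done true}) ;; => false, though nothing "changed"
```

So any writer that replaces the atom's contents with an equal-but-new value will make a pending compare-and-set! fail.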

But how is that possible? Line 6 derefs the db atom, so old-db should be identical? to what's in db.

The culprit is line 2. Did you notice that it takes the db atom as an argument? (I didn't for a very long time.) What happens inside clojure-lsp.kondo/run-kondo-on-text!? A few levels down, in single-file-custom-lint!, it asynchronously swaps lint findings into the db! This is really bad, though why it's bad is subtle...

First, it's important to note that lint findings are sometimes = but never identical? to what was in the db before. I'm not exactly sure why they're new objects every time (perhaps some sort of deserialization somewhere), but that's beside the point.

So... there are two cases. If the async swap! happens after the compare-and-set! everything is fine. But if it happens before the compare-and-set!, it taints the db with a non-identical value. Then compare-and-set! fails, and we invoke clojure-lsp.kondo/run-kondo-on-text! again. If the linting keeps finishing first, we get stuck in an infinite loop.

In the last half hour or so I've finally figured out how to reliably reproduce the infinite loop. And I think this method hints at why it sometimes happens when changing lots of files.

To reproduce it, you have to do two things. Together, these reliably make the swap! happen before the compare-and-set!, starting the infinite loop. First, in analyze-changes, somewhere between lsp.kondo/run-kondo-on-text! and compare-and-set!, you need a small sleep—(Thread/sleep 100) is enough. Second, in lsp.kondo/single-file-custom-lint! you need to replace (if (:processing-changes @db) ... with (if false....

So, why do you have to hard-code :processing-changes to false? The compare-and-set! sets it to false, which is supposed to ensure the swap! follows the compare-and-set!. So how would it ever be false before compare-and-set! finishes? Well, I'm a little hazy on that, but here's what I think. Even though :processing-changes is a single flag for the entire system, each file takes the liberty of turning it on and off for itself. If two files start processing changes at the same time, the first to finish will set :processing-changes back to false before the other is done. That sets up the conditions for the infinite loop. And I think it suggests why it happens when switching branches: you're more likely to have several files change simultaneously.

@ericdallo it may be possible to make this less frequent by batching up multiple files, but I think that's the wrong approach. We can never guarantee that some sequence of events won't set :processing-changes to false prematurely. I think a better solution would be to make :processing-changes a set which stores URIs of files that are currently being processed. Then individual files could sequence their changes without affecting other files.
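A sketch of that idea, with hypothetical helper names (not clojure-lsp's actual API), assuming :processing-changes becomes a set of URIs:

```clojure
;; hypothetical helpers around a db atom; not clojure-lsp code
(defn start-processing! [db uri]
  (swap! db update :processing-changes (fnil conj #{}) uri))

(defn finish-processing! [db uri]
  (swap! db update :processing-changes disj uri))

(defn processing? [db uri]
  (contains? (:processing-changes @db) uri))

;; one file finishing no longer clears the flag for another:
(def db (atom {}))
(start-processing! db "file:///a.clj")
(start-processing! db "file:///b.clj")
(finish-processing! db "file:///a.clj")
(processing? db "file:///b.clj") ;; => true
```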

That's a simple solution anyway. I guess another way would be to remove the async-custom-lint feature. Is it really so slow that it needs to be async? It looks like most or maybe all of what it does is in-memory manipulation.

OK, I'm exhausted. I'll push a branch that has the debugging code in it, although I don't think we want to keep it. And tomorrow I'll try turning :processing-changes into a set. I've had several hypotheses about what's happening, so I'm still not 100% sure I'm on the right track.

Oh, and one last thing... the debugging would have been easier if the db atom was only passed to functions that actually need to change it. That would make it much easier to read the code and reason about what might be causing updates to the db. Would you be interested in a PR for that?

mainej commented 2 years ago

Here's the debugging code: https://github.com/mainej/clojure-lsp/compare/line-numbers-in-logs...debug-infinite-loop?expand=1

ericdallo commented 2 years ago

Thank you for the research @mainej! I was expecting something like that indeed; we already had 1 or 2 issues like that because of that atom being swapped in async-custom-lint-fn, but I couldn't find a better way:

I think a better solution would be to make :processing-changes a set which stores URIs of files that are currently being processed. Then individual files could sequence their changes without affecting other files.

Sounds good to me, that should be enough to make that flag only aware of specific files.

I guess another way would be to remove the async-custom-lint feature. Is it really so slow that it needs to be async? It looks like most or maybe all of what it does is in-memory manipulation.

Yes, it's a function I already spent some time improving the performance of; its slowness grows with the project size and how many vars/var-usages it has. Making it async is required, otherwise huge projects would take even minutes to initialize. The logs help a lot in understanding how much time it takes, and IMO it's the feature that needs the most performance improvements in the whole of clojure-lsp.

Oh, and one last thing... the debugging would have been easier if the db atom was only passed to functions that actually need to change it. That would make it much easier to read the code and reason about what might be causing updates to the db. Would you be interested in a PR for that?

Usually that's expected; we should pass db only if some function will need it. The idea of pushing db around is inversion of control, so we can change db to some other component in the future or componentize the things clojure-lsp needs. I agree it's not ideal ATM, but it's better than before, where you would find swaps/derefs in pure functions.

ericdallo commented 2 years ago

I can reproduce the issue following your steps @mainej. I can't make clojure-lsp max out the CPU though; it reaches a max of 15% for me, but I know this is relative to hardware. Even so, the idea of making :processing-changes a set sounds like a good option.

borkdude commented 2 years ago

FWIW I have an 8-core machine and it consumes about 250% for me when this problem arises.

mainej commented 2 years ago

it reaches a max of 15% for me -- ericdallo

I have a 8 core machine and it consumes about 250% for me -- borkdude

I also have plenty of cores, and when it gets stuck in the loop it completely saturates 1 or sometimes 2 cores. So I think borkdude and I are seeing the same thing, and you haven't reproduced it yet @ericdallo. Maybe a longer sleep? If you add some logs before the swap! and before the compare-and-set! you should see that when swap! comes first it gets stuck in the loop.

we already had 1 or 2 issues like that because of that atom being swaped on async-custom-lint-fn but I coudln't find a better way

I think making :processing-changes a set will help, but I agree, there may be an even better way. Perhaps a per-file agent that receives and applies changes for that file? Or an async channel that receives all the db updates and applies them in sequence? I don't know... I don't have much experience designing large async apps.
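The per-file agent idea could look something like this (a sketch with hypothetical names, not clojure-lsp code); sends to a given agent are applied in order, so each file's updates are serialized without a global flag:

```clojure
;; one agent per uri; not clojure-lsp code
(defonce file-agents (atom {}))

(defn file-agent [uri]
  ;; create the agent for this uri on first use
  (get (swap! file-agents update uri #(or % (agent []))) uri))

(defn queue-change! [uri change]
  ;; conj just to make the ordering observable;
  ;; real code would apply the change to that file's state
  (send (file-agent uri) conj change))
```

For example, after `(queue-change! "file:///a.clj" :analyze)` and `(queue-change! "file:///a.clj" :lint)`, an `(await (file-agent "file:///a.clj"))` would leave the agent holding `[:analyze :lint]`.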

it's better than before where you would find swaps/derefs on pure functions

I guess that's what I was trying to say... there are still many functions that could be pure, but aren't. All they do is receive and deref the atom, making them impure. If they received the value, not the atom, they'd work the same and be pure. Some of these functions are pretty low level, so they "infect" the whole app. Anything that calls them, directly or transitively, has to take the atom as an argument.

Some of the worst offenders are settings/get and settings/all. Since so many things use the settings, we have to pass the atom around everywhere. Currently they need the atom because they memoize on the identity of the db (but not its current value). But, I think there are three ways around that.

  1. We could calculate the settings much higher in the call hierarchy, then pass them down to the functions that need them.
  2. We could memoize just on the things in the atom that are used to calculate the current settings: project-root-uri, settings, and force-settings. Those don't change often, so they'd make a good memoization cache key.
  3. We could observe that the db is a defonce atom, so its identity never changes. Using it as a cache key is the same as using any other constant. Therefore currently the memoization is essentially a global timer on how often to call config/resolve-for-root. The memoization could be moved down to the config layer, meaning we'd cache on :project-root-uri (plus a ttl), not on the whole db atom.
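Option 2 could be sketched with a small hand-rolled TTL cache (hypothetical; a library like clojure.core.memoize would also work). The point is that the cache is keyed on the values that determine the settings, not on the db atom:

```clojure
(defn ttl-memoize
  "Memoize f with a time-to-live, keyed on f's arguments."
  [f ttl-ms]
  (let [cache (atom {})]
    (fn [& args]
      (let [now (System/currentTimeMillis)
            {:keys [at val]} (get @cache args)]
        (if (and at (< (- now at) ttl-ms))
          val
          (let [val (apply f args)]
            (swap! cache assoc args {:at now :val val})
            val))))))

;; settings would then be resolved from the values that determine them
;; (resolve-settings* is a hypothetical name):
;; (def resolve-settings (ttl-memoize resolve-settings* 1000))
```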

its slowness grows depending on the project size and how may vars/var-usages it has, making async is required

That sounds like another thing that would be much faster if we had an index on bucket!

ericdallo commented 2 years ago

I guess that's what I was trying to say... there are still many functions that could be pure, but aren't. All they do is receive and deref the atom, making them impure. If they received the value, not the atom, they'd work the same and be pure. Some of these functions are pretty low level, so they "infect" the whole app. Anything that calls them, directly or transitively, has to take the atom as an argument.

I see us using more components in clojure-lsp in the future, like moving settings out of db and having it as a separate component; we could receive those components from the top and leave inner functions pure.

The only memoization we need is for settings, as users can change settings on the fly and we want to pick up updates, so we have a memoization of 1 second for settings. I don't see the need for memoization of anything besides that right now; IMO it only makes things harder if not really needed.

That sounds like another thing that would be much faster if we had an index on bucket!

I'm not sure it would help that much, as the part that has poor performance is the need to find the references of each var-definition.

ericdallo commented 2 years ago

@borkdude We merged #821 which is another attempt to fix this issue, could you test the latest nightly build please? (generated in a couple of minutes from now)

borkdude commented 2 years ago

Will do!

mainej commented 2 years ago

the part that has poor performance is the need to find references of each var-definition

That makes sense. The limitation of indexes on filename and bucket is that they give you quick access to elements that have a particular attribute, but not to elements with a particular attribute and value. Something to think about. Do you have an example project where the custom lint performance is bad @ericdallo?

move settings out of db and have it as a separated component

Yep, that makes sense. One way to think about it is that the memoized function is its own tiny time-varying database already. So, decoupling it from the db makes sense.

we have memorization of 1 second for settings. I don't see the need of memorization for anything besides that right

I think we're in agreement, but I'd clarify... We need the 1 second ttl to read the user's config. And with the 1 second ttl, settings/get will never be pure. But, settings/get also never changes the db atom, so it doesn't need the atom, only its value. That is, with some care, we could have both: the settings would update every second, but settings/get would not need the atom. When I get some time I'll put together code to show how that would work. Eventually settings could be refactored to be a separate component, but that would be a different project.

ericdallo commented 2 years ago

Do you have an example project where the custom lint performance is bad @ericdallo?

Pretty much any huge project; it scales with project size. You can notice it with projects the size of clj-kondo or bigger.

I think we're in agreement, but I'd clarify... We need the 1 second ttl to read the user's config. And with the 1 second ttl, settings/get will never be pure. But, settings/get also never changes the db atom, so it doesn't need the atom, only its value. That is, with some care, we could have both: the settings would update every second, but settings/get would not need the atom. When I get some time I'll put together code to show how that would work. Eventually settings could be refactored to be a separate component, but that would be a different project.

Sounds good

snoe commented 2 years ago

@mainej great detective work here.

That's a simple solution anyway. I guess another way would be to remove the async-custom-lint feature. Is it really so slow that it needs to be async? It looks like most or maybe all of what it does is in-memory manipulation.

There are a number of levels of asynchronicity.

For buffer changes, originally, I had worked hard to fit analysis into a fixed time window (under 16ms for clojure/core.clj edits) and used a synchronous request. In reality that's probably not practical, because some clients will lock the buffer until lsp's on-change returns, and if you miss that window it feels awful. So there has to be a level of asynchronicity between typing and our state being in line with the editor and file system. Nonetheless, faster analysis should be strived for, because it does a good job of hiding all these other issues.

However, I also wonder if the rest of the events should be processed through a queue with blocking operations.

For analysis, there are a few use cases to keep in mind. I use rename as an example of a user request that affects multiple files; find def and find references are similar but (slightly) less destructive if we get it wrong.

Analysis can quickly become outdated, and knowing which results to keep or throw out is difficult. Here are some of the use cases I think are important to consider, with a possibility of how a queue could be updated just before processing.

mainej commented 2 years ago

@snoe thanks for the additional context. As I've played with this more, I'm coming to understand the importance of the different layers of asynchronicity.

I like your idea of a prioritized event queue with an algebra for commuting and combining events. That sounds challenging to specify, as your use-cases demonstrate, and challenging to implement. If you can get it right though, I imagine it would be easier to reason about. It sounds more elegant and more correct than :processing-changes which was essentially a global mutex on changes and is now a per-file mutex.

I use rename as an example

I don't see any code that tries to avoid calculating a rename, or any other edit, while changes are being processed. Am I missing something or is that a bug?

faster analysis should be strived for because it does a good job of hiding all these other issues

It's true, faster analysis (and linting) reduces the frequency of potential bad interactions between these async processes. But there's a tradeoff. If we took more time on post-kondo cleanup (normalizing and indexing the analysis data) we could speed up the queries, making many other things faster.

As a user, I know that sometimes when clojure-lsp isn't showing me what I expect I save the file to "nudge" it. (99% of the time clojure-lsp is correct and my expectations are wrong, but I'm always so sure the tool must be mistaken! :) So, as I'm waiting, I want the analysis to be fast. But when I'm waiting for a symbol to highlight, or a completion to pop up, I want the queries to be fast.

The project leans very heavily toward keeping analysis fast (i.e. we just dump the kondo analysis in the db, without optimizing it much for future queries). Having worked on a few performance problems in the last few releases, I know that means the queries can be slow. I'd like to find a way to speed up the queries, but it's good to remember that keeping analysis fast is a priority too.

Reading back through this issue, it strikes me that making :processing-changes a set of uris was probably helpful, and will make the infinite loop less frequent. But I suspect we'll continue to see it happen occasionally. The problem is that even with debouncing it's possible (I suspect) to end up with two or more running analysis jobs for the same file. If that happens, when the first one finishes it will remove itself from :processing-changes, making it possible for the later ones to get into the infinite loop. It might be possible to reproduce that by adding a little sleep between the kondo call and the compare-and-set!, then spamming the current-changes-chan. I'll try that when I have a chance. Some sort of smarter queue-of-changes, like you suggest, is definitely a better answer than :processing-changes.

It also occurs to me that if the new lint data is = but not identical?, there's no reason to swap! it in. That's another easy way to avoid the infinite loop. I'll plan to make a PR for that.
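That guard could be as small as this sketch (update-findings! and the db shape are assumptions, not clojure-lsp's actual code). By returning the old state unchanged when the findings are =, the atom keeps holding the identical object, so a pending compare-and-set! still succeeds:

```clojure
;; skip the write when the new findings equal what's already stored,
;; so the db value stays the same object (hypothetical db shape)
(defn update-findings! [db uri findings]
  (swap! db (fn [state]
              (if (= findings (get-in state [:findings uri]))
                state                                    ; unchanged: same object
                (assoc-in state [:findings uri] findings)))))

(def db (atom {:findings {"file:///a.clj" [:warning-1]}}))
(let [before @db]
  (update-findings! db "file:///a.clj" [:warning-1])
  (identical? before @db)) ;; => true
```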

ericdallo commented 2 years ago

:processing-changes

This was totally a workaround to avoid the async changes :( I'd love to improve that somehow

I agree with everything! Making analysis faster is a priority indeed.

borkdude commented 2 years ago

Just for my info,

What’s the status of the infinite loop issue now?

ericdallo commented 2 years ago

@borkdude our idea for the fix was merged on https://github.com/clojure-lsp/clojure-lsp/pull/821, you can try a nightly release

borkdude commented 2 years ago

Then I've already started doing that, thanks.

snoe commented 2 years ago

I use rename as an example

I don't see any code that tries to avoid calculating a rename, or any other edit, while changes are being processed. Am I missing something or is that a bug?

I suspect it is, but my familiarity with the particulars of the code base is pretty out of date.

As a user, I know that sometimes when clojure-lsp isn't showing me what I expect I save the file to "nudge" it.

I do this too, and I chalk it up to problems with asynchronicity discussed here.

Having worked on a few performance problems in the last few releases, I know that means the queries can be slow. I'd like to find a way to speed up the queries, but it's good to remember that keeping analysis fast is a priority too.

My original plan was to put the database in a normalized form in a sqlite database for querying. But the challenge was always to get analysis to a point where it felt stable and fast enough to do so (switching to kondo was maybe the last hurdle). As well, time spent querying was far overshadowed by the time spent analyzing. So if queries are becoming the bottleneck, it would make sense to speed them up now.

I think it would be useful to profile a few operations and figure out which ones are "too slow" and why. I would expect hover to be faster than highlight to be faster than find references to be faster than call hierarchy. Working with clojure/core.clj as a very large file will very quickly surface problems. Also working in a large project with 1000s of namespaces and 100s of dependencies would do the same (metabase looks like it'd fit).

The tradeoff to me was always: can we make the dumb Clojure queries faster and re-architect to avoid some queries altogether, or do we switch to indexes, which will be much more complicated to maintain, or do a lot of work to move to a different query "engine" that might not be faster? I spent no time optimizing the existing queries pre-kondo, and only a bit of time afterwards. Maybe it's time to bite the bullet, maybe not; I'll let y'all decide.

I bet there are still some low hanging fruit in there:

As well as more difficult questions:

If indexes are still required, my initial thinking was to create indexes in sql and basically run queries like select * from analysis where namespace = 'x' and name = 'y'. That might still be the best thing with datalevin, I'm not sure. It's also completely possible to manually create cached indexes in clojure from the current db as analysis comes back, I usually work in only 2 or 3 files but reference dozens, they should have stable analysis.
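The "cached indexes in Clojure" idea could be sketched like this (the analysis element shape is simplified); it's the in-memory analogue of `select * from analysis where namespace = 'x' and name = 'y'`:

```clojure
;; index analysis elements by [namespace name] for direct lookup
;; (simplified element shape; not clojure-lsp's actual analysis format)
(defn index-by-ns-and-name [elements]
  (group-by (juxt :ns :name) elements))

(def idx
  (index-by-ns-and-name
    [{:bucket :var-definitions :ns 'x :name 'y}
     {:bucket :var-usages      :ns 'x :name 'y}
     {:bucket :var-definitions :ns 'x :name 'z}]))

(count (get idx '[x y])) ;; => 2
```

The index would need to be rebuilt (or incrementally patched) per file as analysis comes back, which is the maintenance cost being weighed here.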

ericdallo commented 2 years ago

Agree, IMO we should check the tradeoff of using an index. Thank you for the questions @snoe, really good ones; I'll try to answer them:

Are we using transducers everywhere and composing them when possible?

I changed almost all the places that make sense and need performance to use transducers; we could probably improve the composability, but it would not affect performance that much, from my tests.

Are we mapping over analysis before filtering?

IIRC I already spent some time checking that as well, trying to always filter first where it makes sense; it may be worth double-checking.

Are we looking at local analysis for functions or namespaces that don't apply?

AFAIK no, unless I missed some place too

Are we creating cross products and exponential big-O iterations?

Probably only unused-public-vars, because of the need to find definitions and then the references of each one. I already spent a long time improving the performance of that, but we could probably re-think it in some way to improve it.

Are we trading off speed in any cases for DRY?

Could you elaborate?

Are we spending time calculating things for files in the project that should only be done if they are open?

AFAIK no; if yes, we should probably change it. The features I recall that need to check documents that aren't open are:

If we took out feature "x" would things be significantly simpler or faster? Is there a middle ground?

Probably the only one: unused-public-vars, but that is already optional for users.

When new features get merged how do we know we're maintaining benchmark speeds?

We are not :( I usually check manually, but there is a chance I miss something for sure; maybe we should have performance tests like clj-kondo has.

Is the queries namespace too hard to follow?

IMO no, it's one of the namespaces with the clearest responsibility, it has good tests, and it's quite straightforward to improve.

If indexes are still required, my initial thinking was to create indexes in sql and basically run queries like select * from analysis where namespace = 'x' and name = 'y'. That might still be the best thing with datalevin, I'm not sure. It's also completely possible to manually create cached indexes in clojure from the current db as analysis comes back, I usually work in only 2 or 3 files but reference dozens, they should have stable analysis.

Yeah, we dropped SQL and datalevin 😂 It turns out transit (the actual db ATM) works very fast for reads and writes; it's the same approach clj-kondo uses. If we went with the indices idea somehow, IMO we should just change the way we keep analysis in memory. The db impl is not the issue here; it's pretty fast even for big projects with transit ATM.

snoe commented 2 years ago

Thanks for the answers @ericdallo. I was also hoping that whoever looks at performance or builds new queries can keep these in mind.

Are we trading off speed in any cases for DRY?

Could you elaborate?

For example, and I don't know if it's actually a slowdown: find-element-under-cursor. It's a helpful function, but hover calls it in a loop; maybe a more bespoke function there could be faster, even if it means 90% of the code is copied from find-element-under-cursor.

Yeah, we dropped SQL and datalevin

haha clearly my specific knowledge of where the project is these days is pretty useless :)

ericdallo commented 2 years ago

For example, and I don't know if it's actually a slow down, find-element-under-cursor; it's a helpful function but hover calls it in a loop, maybe a function that is more bespoke there could be faster even if it means 90% of the code is copied from find-element-under-cursor.

I understand your point, but this is probably a real corner case (when the user is on an empty space, for example); otherwise the function will be O(1) most of the time. Still, your point made me notice that we can actually remove that loop, just like we did recently for completion. What do you think @mainej?

haha clearly my specific knowledge of where the project is these days is pretty useless :)

Not at all, the core idea is still the same, the project base was always pretty stable and well done!

mainej commented 2 years ago

remove that loop just like we did recently for completion

I don't think we can remove that one. Reading back through the issue, it's there because you really do want hover to return an element that's not the one under the cursor.

mainej commented 2 years ago

@borkdude please report here or in Slack if you keep having the problem. Since the latest fix I've thought of a way that it could still happen. I have another easy short-term fix to make the problem less likely. The rest of this discussion is about deeper, longer-term fixes.

mainej commented 2 years ago

My original plan was to put the database in a normalized form in a sqlite database for querying. But the challenge was always to get analysis to a point where it felt stable and fast enough to do so (switching to kondo was maybe the last hurdle). As well, time spent querying was far overshadowed by the time spent analyzing

I've been experimenting with importing the analysis into a Datascript DB, as an easy way to index it. I'll report my findings in a separate discussion, but the summary is: so far the experiment has failed. It fails for the reason you describe @snoe. The queries are extremely fast (and quite pretty) once you get the database populated, but when a file changes it takes too long to re-populate. I haven't given up totally—several pieces of the process could be faster—but I'd say it's not looking good.

@ericdallo in that experiment I was mostly benchmarking unused-public-var calculation. That is, I was paying attention to the total time from didChange -> clj-kondo analysis -> db update -> re-calculate unused public vars. So, I've been looking at the diagnostics namespace a lot. I've noticed a few ways we could try to make it faster. I'll report about that too.

ericdallo commented 2 years ago

 The queries are extremely fast (and quite pretty) once you get the database populated, but when a file changes it takes too long to re-populate. I haven't given up totally—several pieces of the process could be faster—but I'd say it's not looking good.

Yeah, I think we should focus on memory analysis improvement, like having it indexed by bucket or something.

mainej commented 2 years ago

we should focus on memory analysis improvement, like having it indexed by bucket

I should have clarified... I wasn't trying to replace the transit DB. The Datascript DB is an in-memory index, on bucket and every other attribute. Unlike Datalevin, it doesn't touch the disk. If the experiment had worked, I would have planned to store it in the transit db, since it can be serialized to EDN.

ericdallo commented 2 years ago

Closing this as it's probably fixed, more info here. Please let us know if this happens again, thank you all for the help!

mainej commented 2 years ago

Here's how GitHub has been calculating and caching symbol definitions and references.

The interesting part of the article is the information about how they pipeline work. It touches a bit on @snoe's ideas above about a task queue, embodied by what GitHub calls the "indexer". The indexer debounces, throttles and deduplicates work.

Less interesting is how they handle definitions and references. Though there are some relevant ideas, their constraints are a bit different.