ProjectRegistryManager reads projects twice (but not consitently)

laeubi commented 2 years ago

With the recent refactoring and changes regarding container I noticed while implementing the MavenLifeCycleListener support, that we actually read the maven project twice:

in ProjectRegistryManager#readMavenProjectFacades
In ProjectRegistryManager#readProjectsWithDependencies

even worse, ProjectRegistryManager#create bypass the first and directly calls the later ... and then probably call the second two times...

The most notable difference seems to be that the first only wants to fetch basic information and the second to fetch all the dependencies (most likely for class-path setting and mojo execution).

To make this more consistent (and reduce the confusion between IMavenProjectFacade#getMavenProject() IMavenProjectFacade#getMavenProject(IProgressMonitor monitor)) it seems more suitable to have facades created only with the model and the project is loaded only once when required for the first time.

laeubi commented 2 years ago

ProjectRegistryManager.create(IFile, boolean, IProgressMonitor) even claims to look at the cache but if not found do not add it to the cache... seems strange. Because calling IMavenProjectFacade#getMavenProject on the facade then effectively adds it to the cache.. so actually there could be different facades around sharing the same maven model depending on what methods are called.

laeubi commented 2 years ago

Just some observations:

If i update the camel project this results in nothing cached at all and reading 631 poms into facades has taken 40sek. (that is about 60 ms / project) While updating the facades, the Maven projects are made available through the session context (and are cached there).
after that then m2e detects that something has changed (obviously) and asks for the maven project an then each project is loaded one by one and put into the cache (this takes between approximate 50 - 200 ms per project)
The first works quite well, but already publishes 5 Maven Projects while only one facade is in the cache
The things start to getting worse, we now have three facades in the cache (even though only 2 projects where read and there are two unique ones) and already twelve MavenProjects where only seven are unique.

I'll capture some more measures on the total time and the number of duplicates in the session context before processing further.

mickaelistria commented 2 years ago

Project import/update doesn't populate cache as the project loading is done incompletely IIRC and the goal is really and only to regenerate project facade and a build graph; and the MavenProject are then discarded after that. It's very likely that in many cases, we could re-inject the MavenProject in the cache, but I believe it's not something we can always do.

after that then m2e detects that something has changed (obviously) and asks for the maven project an then each project is loaded one by one and put into the cache (this takes between approximate 50 - 200 ms per project)

Indeed, loading 1 by 1 is suboptimal. It would probably be worth grouping those that have same project build configuration (same profiles, same updates settings...) so they're built together, and then pick a few good ones to populate the cache.

The first works quite well, but already publishes 5 Maven Projects while only one facade is in the cache

Yes, loading a MavenProject leads to parent also being loaded. Currently the project registry ignores that and will re-request the loading of a parent project if even if the cache inderictly contained it, but it could instead look at all MavenProject availables, including parents.

The things start to getting worse, we now have three facades in the cache (even though only 2 projects where read and there are two unique ones) and already twelve MavenProjects where only seven are unique.

Yes.

Keep in mind there are several distinct possible improvements, both being IMO extremely valuable.

Improve how we query cache: do not look only at root projects, also lookup the MavenProject from the hierarchy to make them available without reloading them.
Improve how multiple projects are built together to get their parents factorized and minimize duplication
Improve how we populate the cache: pick a set of projects that do cover the most MavenFacade.
Maybe others...

laeubi commented 2 years ago

Just one observation about the facades:

Currently we have a cache Cache<MavenProjectFacade, MavenProject> but MavenProjectFacade do not implements equals/hashcode so we have Object identity. There is only one method flushCaches that actually removes facades from the cache, so this might explain that we see duplicates here...

laeubi commented 2 years ago

Project import/update doesn't populate cache as the project loading is done incompletely IIRC and the goal is really and only to regenerate project facade and a build graph; and the MavenProject are then discarded after that. It's very likely that in many cases, we could re-inject the MavenProject in the cache, but I believe it's not something we can always do.

I have now done the measures (with apache-camel project) reading 573 poms without dependencies takes roughly 50 seconds while reading it with all dependencies takes roughly 55 second. So not really much of a benefit regarding time, compared to that we need the same (if not even more!) time to load it afterwards again, this seems not valuable.

If we really want to take it serious (only read the basics and forget about it afterward), I think it would be much more suitable to only read the model instead of the project into a facade, and for the execution using a full project (what might be cached then) and there are only a few properties that are not already from the model. Especially for the use-case "we need to know what artifacts are in the workspace", even the very basic GAV would be suitable.

laeubi commented 2 years ago

Some more observations

In the bulk read there seem only a very few duplicate items, so the parent reading is cached there already by maven
in "phase 2: resolve project dependencies" a new facade is created put to the cache but I can't see removal of the old one, so probably this is the cause for multiple facades in the cache, also when there is a new facade, later on then this facade is used to execute what then triggers a load of the maven project of the old facade, because it is looked up by the pom (createExecutionContext(newState, pom) but this of course is a different cache and thus triggers the loading of a maven project again. The code looks a bit strange, because why one should lookup the facade by the pom if the facade itself is already known?

mickaelistria commented 2 years ago

MavenProjectFacade do not implements equals/hashcode so we have Object identity.

That can be a good issue to chase and easy enough to fix.

I have now done the measures (with apache-camel project) reading 573 poms without dependencies takes roughly 50 seconds while reading it with all dependencies takes roughly 55 second. So not really much of a benefit regarding time, compared to that we need the same (if not even more!) time to load it afterwards again, this seems not valuable.

Doesn't it mostly depend on the content of your Maven repo? What if most deps are missing in the Maven repository; wouldn't that make import much longer (and IDE is unusable during import)? But it's not really all: some project information are unfortunately not accessible directly to Maven, there are some m2e project settings at play that influence the outcome of a project build/resolution and those need to be taken into account when doing project resolution in the workspace, while those can be ignored during import.

If we really want to take it serious (only read the basics and forget about it afterward), I think it would be much more suitable to only read the model instead of the project into a facade,

We need more than the GAV from the MavenProjectFacade. Some settings in MavenProjectFacade require a full resolution (eg things like resource or class directories) as they can be inherited.

laeubi commented 2 years ago

That can be a good issue to chase and easy enough to fix.

I'm not quite sure it would work that easy, the code seem to rely on creating a new facade is reloading "something", thats all quite confusing and sometimes I wonder it is working at all :-)

Doesn't it mostly depend on the content of your Maven repo? What if most deps are missing in the Maven repository; wouldn't that make import much longer (and IDE is unusable during import)?

The point is that if you have a project and import it, to resolve the classpath sooner or later you will need them anyways.... at the moment it is just after import triggering the download anyways, but I'm still playing around with it.

Sad enough I now again run into an endless loop cycle after it has worked half a day ... so first try to debug how to recover from that :-(

mickaelistria commented 2 years ago

The point is that if you have a project and import it, to resolve the classpath sooner or later you will need them anyways.... at the moment it is just after import triggering the download anyways, but I'm still playing around with it.

Yes, but choosing when to cause a delay to user is the key decision here. Blocking the import for a long time is more annoying to users than having projects progressively configured in background with good progress reporting while still already allowing users to read and edit files in the meantime.

laeubi commented 2 years ago

The point is that if you have a project and import it, to resolve the classpath sooner or later you will need them anyways.... at the moment it is just after import triggering the download anyways, but I'm still playing around with it.

Yes, but choosing when to cause a delay to user is the key decision here. Blocking the import for a long time is more annoying to users than having projects progressively configured in background with good progress reporting while still already allowing users to read and edit files in the meantime.

As far as I can see all of this already happens in the background. beside I myself wont mind a little longer delay when importing the project if it then works faster when working with the project.

mickaelistria commented 2 years ago

As far as I can see all of this already happens in the background. beside I myself wont mind a little longer delay when importing the project if it then works faster when working with the project.

Please read from former bug link I pasted. People want import to be fast even if incomplete; more than long and complete. So keeping the import fast is more or less a functional requirement from actual users.

laeubi commented 2 years ago

Please read from former bug link I pasted.

Here or in another discussion? I can't find one so I probably missed it.

People want import to be fast even if incomplete; more than long and complete

The import is (and was) always fast because it only reads the model what is slow (and annoying) is what happens afterwards, because it takes literally ages for large projects until they are usable, e.g. you don't have background jobs running, not compile errors all over the place and scary errors/warnings in the pom and project.

Also I open my IDE to work with it so there it more counts that my changes are applied fast and I get immediate feedback of real problems., if I want to take a quick look at a file I usually can use a text editor much better.

mickaelistria commented 2 years ago

The reference issue about the import/refresh story blocking the IDE is https://bugs.eclipse.org/bugs/show_bug.cgi?id=515668 . Indeed, it's the 2nd step of import, but still the issue was really preventing people from even loading the project in the IDE. They had time to download another IDE and import the project in it and read a tutorial before m2e had the IDE unlocked. This is what we strongly need to avoid.

if I want to take a quick look at a file I usually can use a text editor much better.

The story is people start importing a Maven project, a huge one, and start to look at it; browsing code, reading it.. They may not need full IDE features for a few minutes until they identified where to code (and what). Full and expensive project resolution is not immediately useful to end-user. m2e leverages these few minutes to do the heavy work while users can already just navigate the project to get a bit familiar with it. No one just imports a project and codes immediately; most of the time in an IDE is spent browsing and reading code and this important task can be achieved with a mininal import; while heavier work happens in background.

laeubi commented 2 years ago

Alright, but that's exactly the case I'm currently try to improve here :-)

UI freeze reported there for example is already fixed (was a bug in eclipse job API) and the deduplication of projects already has reduced memory footprint a lot. Still with such large projects the default cache size with 20 projects is much to low, so currently I work on having a more dynamic cache here but this will require loading dependencies, or we will get wrong results, but I'm already thinking about if we might can use unresolved projects and do the "real" resolve (that is downloading the artifact and maybe discover even more dependencies) later on.

laeubi commented 2 years ago

I just made an experiment (with my optimized project cache and maxcache size of 1500 projects):

Empty Workspace
Selecting camel root with "Import existing Maven project" --> just a few second and the full tree to import is shown
Select all projects and press finish
Then a background job is started saying "importing maven projects but I instantly see a project tree I can browse the tree and some projects are incrementally appear
After a few minutes (maybe 2 -3) everything is settled and I see the projects getting build and errors/warnings are show up, this is probably the most time consuming part

So I think what might be optimized here is that we create the project with a very very limited set of data first (so that all projects show up at once) and then the user might not notice much about the background work if one just want to "see" the files...

3.5 Gb of ram are used, mostly with byte and char and string ... strange enough I see a lot of org.eclipse.wst.xml.core.internal.text.XMLStructuredDocumentRegion with 6 million (!) instance which are seem to be hold by the lsp4e connectdocument. Please note that not a single editor is open! These alone occupies more than 800MB of heap memory and the org.eclipse.lsp4e.ConnectDocumentToLanguageServerSetupParticipant$2 retains a heap size of 1.8 GB!

Compared to that, all Maven project (1126 Instances most probably due to my dynamic cache and the duplication of facades I have not fixed yet, should actually be around 815) retain a size of 500 MB ...

mickaelistria commented 2 years ago

3.5 Gb of ram are used, mostly with byte and char and string ...

Are the 3.5GB retained or are they temporary consumption. If the former then I think it's too much: coupled with other expensive IDE operations, it can easily reach Heap limit and trigger a lot of GC and drastically slow down the whole IDE. That's why the cache size is to take seriously. Currently with 20 projects and biggest projects retaining 3MB of heap, it means we allow storing ~60MB in MavenProjects; it's indeed low. This could IMO be grown to ~512MB relatively safely (that would be about 200 projects) without being harmful; but more introduce a risk of being too expensive. I think this should ideally be configurable via some preference (not necessarily shown in UI but at least something embedders of m2e could configure in their product_customization.ini).

I see a lot of org.eclipse.wst.xml.core.internal.text.XMLStructuredDocumentRegion with 6 million (!) instance which are seem to be hold by the lsp4e connectdocument. Please note that not a single editor is open! These alone occupies more than 800MB of heap memory and the org.eclipse.lsp4e.ConnectDocumentToLanguageServerSetupParticipant$2 retains a heap size of 1.8 GB!

That's indeed a bug. Can you please report it in a dedicated ticket?

Compared to that, all Maven project (1126 Instances most probably due to my dynamic cache and the duplication of facades I have not fixed yet, should actually be around 815) retain a size of 500 MB ...

Great!

laeubi commented 2 years ago

That's indeed a bug. Can you please report it in a dedicated ticket?

I not really have a clue what this actually does and why all the objects are actually created at all...

laeubi commented 2 years ago

Actually there is already AbstractMavenDependencyResolver.resolveProjectDependencies(IMavenProjectFacade, Set<Capability>, Set<RequiredCapability>, IProgressMonitor) that is called in "phase 2" with the existing maven project, so question is why we not always read project without dependencies and only resolve them on demand then?

laeubi commented 2 years ago

Just another round of observations:

I was wondering why I always had one maven project in the cache more than pom modules in the workspace. The reason seems obvious but I like to note it here because there is of course also a potential caching opportunity: If a pom references a parent that is not in the workspace (e.g. read from the local maven repository) it is still loaded and thus cached indirectly.
I have now written a more advanced cache that do not hold references to the maven facades and there are no longer duplicates, but now maybe a bit "too good" as I can't see that m2e flush cached MavenProjects from the project manager but relies a bit on "phase 2" creating a new facade but I would have expected that the cache is flushed explicitly as the context project delegate to the cache itself again...

laeubi commented 2 years ago

I have now also created https://issues.apache.org/jira/browse/MNG-7591

laeubi commented 2 years ago

Just for the record, I enabled string deduplication as suggested by @HannesWell and it cuts down retained size of the full set of camel projects (575) from 250 MB > 200 MB. But even 250MB is fir enough that is about 500kb / project now, I have extracted from my experiments now an improved project cache in this PR:

https://github.com/eclipse-m2e/m2e-core/pull/1045

eclipse-m2e / m2e-core

ProjectRegistryManager reads projects twice (but not consitently) #1030