eclipse-aspectj / aspectj

Other
272 stars 82 forks source link

Fix LTW with a parallel-capable class loader and generated classes #278

Closed urisimchoni closed 4 months ago

urisimchoni commented 5 months ago

This PR fixes the situation where AspectJ LTW Java agent is invoked by the instrumentation interface simultaneously in two threads, for the same class in the same class loader, and the weaving generates an inner class.

Parallel-capable class loaders have been around since Java 7, and they might be loading the same class twice in two threads.

AspectJ LTW weaver sometimes generates classes to handle the closure of around advices. With the same class being loaded twice, AspectJ wants to avoid generating the class twice. It consults a generateClasses map to see if a class by the same name with associated generated classes has already been woven by this weaver.

However, if a match is found in the map, AspectJ would return the original bytes, and that causes the thread that "lost the race" to use an unwoven class. This was discovered with WildFly when defining multiple DataSources with the same driver - it looks like WildFly is using some parallel executor to create the DataSource instances, and when trying to weave the DataSource it would sometimes not work.

The fix here returns the woven bytes of the previous weaving operation.

The PR also includes a simplification of the code, of not storing the names of generated classes in the generateClasses map, only names of woven classes with associated generated classes. This was done when I realized that this map is never consulted for the generated class key, only for the woven class name.

kriegaex commented 5 months ago

Hi @urisimchoni. Thanks for this PR. Your explanation sounds reasonable. I hope I can look into this in more detail soon. It would be nice to have one or more regression tests reproducing the problem without this patch. A best effort reproducer would also help already. I can take care of incorporating it into the AspectJ test suite with its strange XML configuration mechanics, if that is what stops you from providing a reproducer.

urisimchoni commented 5 months ago

Thanks for looking into this @kriegaex. I compiled a reproducer here - https://gitlab.com/urisimchoni/aspectj-parallel-issue

If you can point me to a close-enough test I can have a go at it with adapting it for AspectJ test suite. Because this is fundamentally a race condition, the reproduce doesn't reproduce 100% of the time, but I've seen it happen after a few runs (less than 10), whereas with the fix it does not reproduce in 1000 runs.

One thing I discovered while putting this together was that the Java documentation WRT parallel class loading (https://docs.oracle.com/javase/8/docs/technotes/guides/lang/cl-mt.html) is a bit vague about whether it's possible to be loading the same class into the same class loader in two threads simultaneously. One paragraph says "...a class by a parallel capable class loader now synchronizes on the pair consisting of the class loader and the class name", Whereas another paragraph is saying that when constructing a parallel class loader one should "Decide upon an internal locking scheme", without specifying what this scheme is.

Obviously, if parallel loading of same class was forbidden, there would not be an issue, but the WildFly loader seems to avoid locking altogether (instead it defines classes and catches the LinkageError exception that occurs if the class already exists), which brings us to this issue.

urisimchoni commented 4 months ago

I opened issue #279 so that we can associate the test code with an issue.

kriegaex commented 4 months ago

@urisimchoni, sorry for the inconvenience, but practically simultaneously with your PR, another issue came in, #277, which I just fixed with #280. Unfortunately, I touched the same classes your PR does. Even if it merges without conflicts, I kindly request you to rebase this one and verify that your change still does what it is supposed to in the new configuration.

FYI, my change was about returning null from the weaver when it is sure that there were no changes. This is according to the contract of java.lang.instrument.ClassFileTransformer::transform and, while not strictly necessary, helps to avoid certain CDS problems and might result in an instrumentation speed-up in general (untested), possibly avoiding unnecessary redefinitions (or retransformations, if we use the latter in the future).

Thanks in advance. Your change is next in line, if I have free cycles next time.

urisimchoni commented 4 months ago

I've rebased the PR. According to my analysis, the recent change is not affected-by and does not affect this PR:

I also re-tested with the reproducer.

Regarding turning this reproducer to a test - maybe I'll have cycles for this in a few days, but a few pointers might come in handy:

kriegaex commented 4 months ago

Thanks for the reproducer. I already converted a modified version of it with renamed classes and a loop doing the parallel weaving several times into an integration test which reliably reproduces the problem before your fix.

I then rebased your (slightly reworded) commits on the commit with the test and verified, that the problem is really fixed by them. I also reviewed your code, but did not fully immerse myself into it due to missing cycles. I hope not to regret that later, committing something I do not fully understand, but the build is green, and so I am confident that nothing important is broken.

As soon as the CI build passes, I will merge.

kriegaex commented 4 months ago

Thank you so much for your exemplary PR and the very nice test template, which I could convert into an IT with reasonable effort. Thanks for helping to make AspectJ better.

urisimchoni commented 4 months ago

Thank you for landing this so quickly and for turning the reproducer into a test - realizing that you can loop instead of forking a new JVM per iteration because you can create a new class loader instance per iteration was brilliant!

kriegaex commented 2 months ago

@urisimchoni, OSGI weaving in Eclipse is broken in AJDT due to this PR. I was looking for the root cause for days and thought it was to be found elsewhere in other changes I made in the last few months. Just now, I experimentally reverted this PR, and suddenly a weaving test in AJDT that was failing before is passing again. I have yet to analyse why that is, but maybe you can enlighten me. Are you an AJDT user, or do you work with an IDE other than Eclipse?

Correction: For some reason, the revert fixes the test, but not the whole problem when using OSGi weaving in isolation. Maybe, it was a mix of your changes and mine, or maybe it is about changes in Eclipse JDT or Equinox. It is quite difficult to debug, basically I am using trial and error, because I know next to nothing about Eclipse.

urisimchoni commented 2 months ago

@kriegaex Alas, I'm not an AJDT user. If there's some way for me to reproduce I can have a look - I didn't spot any recent issue. Perhaps my change broke some reflective access or generated-code access to the generateClasses map? I did change what's stored in that map, based on the premise that current code (before change) is only concerned with keys of that map. I recall that this struck me as odd that none is looking at the map values (why not use a set), but I figured this must be some left-over or preparation for something that never materialized. Still, it's hard to believe that this map is accessed directly from generated code without some accessor function.

kriegaex commented 2 months ago

@urisimchoni, it is accessed from child class OSGiWeavingAdaptor, hence the protected field. I inherited all that stuff, I did not implement it myself. See also https://github.com/eclipse-aspectj/ajdt/issues/57.