Closed glassfishrobot closed 13 years ago
@glassfishrobot Commented kcavanaugh said: I think you mean SynchronizedHolder.content. Is making the data member _content volatile significantly faster than a lock in this case? That probably still gives adequate semantics in this case. Is this fix needed for 3.1, or can it be delayed until 3.2?
@glassfishrobot Commented sdo said: I did mean SynchronizedHolder.content().
So I assume from the class type of the content variable that this is doing method-level tracing or monitoring? Even though we haven't turned that on? Part of the reason behind my question is that I'm not sure that the content() method itself is the issue or if the JDK has inlined enough things so that content().enter() is part of the bottleneck. But I assume not; I assume that "enter" refers to method tracing rather than monitor (meaning lock) acquisition.
I am not sure about volatile in this context; in general when locking around a getter method is a problem, the issue is the flushing of registers and such that are required by the Java memory model, and those semantics are the same for volatile. Though one difference might be that synchronized must make all references consistent and volatile only that particular reference, so there is room for improvement with volatile; I just don't know how smart the JVM is in that case. Barring that, could the monitor object be gotten once at the beginning of the request rather than for each little operation down the line? If the monitoring object is changed mid-request, you'd lose some info for that request, I guess, but it seems like that wouldn't be a big deal.
If I could turn of the monitoring somehow, I could give you a better answer as to the timing of the fix. Right now, this is the only difference in 3.0.1 and 3.1 I see on the server side to explain our trade2 regressions, so I expect it will be deemed significant enough to require a fix in 3.1. On the client side it is definitely affecting specj as well, but until there is a fix for 15391 we are unlikely to notice it there.
@glassfishrobot Commented kcavanaugh said: Monitoring is off. The instrumented code is just doing a SynchronizedHolder.content(), storing the value (normally null) in a local, and checking the local value. However, it seems that the synchronization is more expensive than I expected in this. It is not possible to do this per-request, because the tracing code works per-class, and many classes are involved in a request.
Is the problem lock contention, or simply the number of lock calls? If it is lock contention, I could try a read/write lock here, which would at least reduce contention, but may not help if the problem is just the number of lock calls.
One complete fix for this would be to instrument the code at runtime. I've designed the tracing facility with that in mind, so that part of the bytecode transformation (changing signatures of methods, creating or extending the static initializer) must be done at build time, but the other part (which includes the SynchronizedHolder call) could be done only when tracing is enabled. I'm reluctant to introduce that change this late in 3.1, but it may be a good candidate for 3.2.
Another possibility is to simply remove the synchronization completely for 3.1, since normally we only set tracing by -Dcom.sun.corba.ee.Debug=XXX, and so we don't rely on dynamically updating very much, although it is possible through an ORB MBean.
So, we could try any of three approaches:
I'm leaning in the direction of removing synchronization completely, and moving to the runtime instrumentation solution for 3.2. Let me know which patch you want to try, and I'll put it together on a b019 ORB on 1/3/11.
@glassfishrobot Commented sdo said: I think we should test a patch with the synchronization removed and see the effect. Then we can evaluate the better course once we know the impact, but if the monitor isn't going to be used dynamically, then that sounds like a good choice.
@glassfishrobot Commented kcavanaugh said: Performance regression should be fixed in ORB b021, integrated in GF rev 44286.
@glassfishrobot Commented Issue-Links: blocks GLASSFISH-13573 GLASSFISH-13944
@glassfishrobot Commented Was assigned to kcavanaugh
@glassfishrobot Commented This issue was imported from java.net JIRA GLASSFISH-15392
@glassfishrobot Commented Reported by sdo
@glassfishrobot Commented Marked as fixed on Thursday, January 6th 2011, 9:31:36 am
The performance of CDRInputStream has significantly regressed, affected client-side and server-side performance. It is more visible on the client-side, where there isn't so much other logic going on to mask the issues.
Reading 5K simple requests takes 2X longer on 3.1 than 3.0.1 in CDRInputStream_1_0.readOctet(). This appears to be due to new locking semantics; in 3.0.1 the major component of reading is simply getting the data (HeapByteBuffer.get()), while in 3.1 we spend a large amount of time (half the total time, or essentially all the regression) waiting for locks around SynchronizedContent.content(). This only gets worse as we scale to more users on larger hardware.