Closed XmiliaH closed 1 month ago
Some notes for the current implementation.
The generated bytecode is not stable due to the use of lua tables. I do not know how important this point is.
It would also be possible to move the lookup table initialization into the chunk. This would remove about two instructions from the switch for the initialization test. However, this would initialize all switches, even the ones never used.
Rebased this branch because we treat Git as a tree, not a directed graph.
The `switchimpl` function is now incredibly long, so I'm assuming you're taking ownership and I can ping you if there are any issues with it? :P
> The generated bytecode is not stable due to the use of lua tables. I do not know how important this point is.
Could you elaborate on this some more? Would this PR cause issues with the upgrade to Lua 5.5?
You can ping me when there are issues with the switch implementation.
By "not stable" I mean that `string.dump` will give different results on different runs for functions with switch statements, due to the differing order of the table initialization. I use a Lua table during compilation to gather all the constants, but Lua table traversal order is not defined, so this can result in different orderings.
I see. That does sound a bit ugly, but I also can't imagine a better solution than putting a std::stack<std::map<...>> on LexState.
I could use a `std::vector` in the `switchimpl` function, but that would just be there for the ordering. So if a stable output is important, this should be simple to add.
You cannot stack-allocate such a thing because RAII is not guaranteed here.
You are right, then I need to look into `case_pc` too, as it currently does this.
Just tested this extensively against our integrated Pluto environment, and no obvious crashes or bugs at least.
Tho, it seems to actually decrease performance in a contrived test like this:
```lua
local start = os.clock()
local cc = 0
for i = 1, 100000000 do
    local value = 3
    switch value do
        case 1:
        case 2:
        case 3:
            cc = cc + 1
            break
        case 4:
        case 5:
            break
    end
end
print(os.clock() - start)
return print(cc)
```
Yes, with the table lookup, check for initialization, and other overhead the binary search will only be used from around 16 cases upwards. Try with the switch speed test:
```lua
local N = 2000
local code = "return function(val)\nswitch val do"
for i=1, N do
    code = code .. "\ncase " .. i .. ".5: return " .. i
end
code = code .. "\ndefault: return -1\nend\nend"
local f = load(code)()
-- Initialize, this might take more time
assert(f(0) == -1)
for i=1, N do
    assert(f(i+0.5) == i)
end
assert(f(0) == -1)
```
The problem is that I think >95% of switch blocks are gonna have <16 cases.
And are the other stack-allocated `std::vector` instances in the parser then not also buggy?
That depends. If there is a chance a Lua error could be thrown in their scope, then they would not be deallocated when a longjmp is used.
> The problem is that I think >95% of switch blocks are gonna have <16 cases.
Yes, but those are hard to make better. When multiple case values map to the same case, the binary search can pay off earlier.
```cpp
int average_time_linear = num_consts/2;
int average_time_table = luaO_ceillog2((unsigned int)case_pc.size()) + 4; /* Load and check cache, load value, check nil case, then do the binary search */
if (average_time_linear <= average_time_table) {
  /* Not worth the effort, just do a linear scan */
```
> That depends. If there is a chance a Lua error could be thrown in their scope, then it would not be deallocated when a longjump is used.
Yes, that could happen in almost all the cases I saw, so I assumed that Pluto is using exceptions and not longjmp.
> Yes, but they are hard to get better. When multiple case values map to the same case then it can happen earlier.
I'm not asking for you to pull a miracle and make them better, but it's bad when this change makes them worse.
> Yes, that could happen in almost all the cases I saw, so I assumed that Pluto is using exceptions and not longjump.
We do use exceptions instead of longjmp by default, but longjmp is still an option, and the goal is not to leak memory in that case either. It's possible some older code was unaware of this pitfall and unfortunately still has this issue.
I fixed the memory leak issues with `std::vector` in the switch statement; I still need to check `std::function`.
Will also look in the speed decrease of the example you mentioned.
I would advise just using a `void*` for user-data and casting it as before instead of `std::function`.
Also nothing wrong with `std::vector`, as long as it's somewhere on LexState, because that does actually get cleaned up in all cases.
When putting the vector on LexState, it would need to handle reentrancy. And the Lua tables also use the correct allocation function from the Lua state.
I guess the speed decrease for the simple case is due to the switch always starting with a jump and not with a conditional jump like before. This results in one more instruction per loop. I will look into emitting the first case in place and jumping on false to the other logic.
Still a pessimisation, which really shouldn't happen since you seem to have logic to "not bother" for small switches? Maybe this patch would've been better if it focused on trying to add the new functionality without effectively rewriting the whole thing.
When just looking at
```lua
local value = 3
switch value do
    case 1:
    case 2:
    case 3:
        cc = cc + 1
        break
    case 4:
    case 5:
        break
end
```
I get the same code from main & this branch now. So I have no clue where the slowdown should come from.
You are able to reproduce it, tho, right?
No, I am not able to reproduce the slowdown. Over multiple runs sometimes the main branch is faster, sometimes this one.
Well, it seems to only be on MSVC, and I guess "within margin of error," but now I've tried a benchmark where your implementation should undoubtedly be faster, and again it's the same speed at best:
```lua
local start = os.clock()
local cc = 0
for i = 1, 100000000 do
    local value = 20
    switch value do
        case 1: break
        case 2: break
        case 3: break
        case 4: break
        case 5: break
        case 6: break
        case 7: break
        case 8: break
        case 9: break
        case 10: break
        case 11: break
        case 12: break
        case 13: break
        case 14: break
        case 15: break
        case 16: break
        case 17: break
        case 18: break
        case 19: break
        case 20: cc = cc + 1 break
    end
end
print(os.clock() - start)
return print(cc)
```
(benchmark timing screenshots attached for `main` and `improve-switch`)
Or am I misunderstanding what your optimisation is supposed to do?
With the 20 cases, both `average_time_linear` and `average_time_table` are 9, so it will still use the linear lookup. Either add a case 21 or remove the `break` from all cases but 19.
Okay, now we have ~3.5 on main, and ~2.4 on this branch (using Clang). That is cool, but...
I don't know why the same bytecode and changes only to the parser would make the interpreter slower on MSVC. Maybe something gets aligned differently, which reduces the speed. But that could happen with any change.
As for complexity vs. benefit, that is something you need to decide. I have no problem with closing this. I had my fun implementing it, and I have no problem if I do not need to maintain it any further.
This is a very impressive optimization, however I share some of the concerns already raised in this thread. The most concerning point goes to maintainability of the switch statement. The complexity increase is great enough to say that the lessons learned over time with our switch statement — the bugs fixed, the time-tested factor, the overall polish applied — would be reset. While the performance increase is quite good, the scope of when that performance increase will apply is too limited to justify the aforementioned costs.
First version of an improvement to the switch statement to change the time complexity from O(n) to O(log n) (see https://github.com/PlutoLang/Pluto/issues/860).
This currently does not pass the debug tests because of the extra code generated in the main chunk.