PlutoLang / Pluto

A superset of Lua 5.4 with a focus on general-purpose programming.
https://pluto-lang.org/docs/Introduction
MIT License

Improve switch time complexity for constant cases #862

Closed XmiliaH closed 1 month ago

XmiliaH commented 1 month ago

First version of an improvement to the switch statement to change the time complexity from O(n) to O(log n) (see https://github.com/PlutoLang/Pluto/issues/860).

This currently does not pass the debug tests because of the extra code generated in the main chunk.

XmiliaH commented 1 month ago

Some notes for the current implementation.

The generated bytecode is not stable due to the use of Lua tables. I do not know how important this point is.

It would also be possible to move the lookup table initialization into the chunk. This would remove about two instructions from the switch for the initialization test. However, this would initialize all switches, even the ones never used.

Sainan commented 1 month ago

Rebased this branch because we treat Git as a tree, not a directed graph.


The switchimpl function is now incredibly long, so I'm assuming you're taking ownership and I can ping you if there are any issues with it? :P


The generated bytecode is not stable due to the use of Lua tables. I do not know how important this point is.

Could you elaborate on this some more? Would this PR cause issues with the upgrade to Lua 5.5?

XmiliaH commented 1 month ago

You can ping me when there are issues with the switch implementation.

By "not stable" I mean that string.dump will give different results on different runs for functions with switch statements, due to the order of the table initialization differing. I use a Lua table during compilation to gather all the constants, but Lua table traversal order is not defined, so it can result in different orderings.

Sainan commented 1 month ago

I see. That does sound a bit ugly, but I also can't imagine a better solution than putting a std::stack<std::map<...>> on LexState.

XmiliaH commented 1 month ago

I could use a std::vector in the switchimpl function, but that would just be there for the ordering. So if a stable output is important, this should be simple to add.

Sainan commented 1 month ago

You cannot stack-allocate such a thing because RAII is not guaranteed here.

XmiliaH commented 1 month ago

You are right, then I need to look into case_pc too, as it currently does this.

Sainan commented 1 month ago

Just tested this extensively against our integrated Pluto environment, and no obvious crashes or bugs at least.

Tho, it seems to actually decrease performance in a contrived test like this:

local start = os.clock()
local cc = 0
for i = 1, 100000000 do
  local value = 3
  switch value do
    case 1:
    case 2:
    case 3:
    cc = cc + 1
    break
    case 4:
    case 5:
    break
  end
end
print(os.clock() - start)
return print(cc)

XmiliaH commented 1 month ago

Yes, with the table lookup, check for initialization, and other overhead, the binary search will only be used from around 16 cases upwards. Try with the switch speed test:

local N = 2000
local code = "return function(val)\nswitch val do"
for i=1, N do
    code = code .. "\ncase " .. i .. ".5: return " .. i
end
code = code .. "\ndefault: return -1\nend\nend"
local f = load(code)()
-- Initialize, this might take more time
assert(f(0) == -1)
for i=1, N do
    assert(f(i+0.5) == i)
end
assert(f(0) == -1)

Sainan commented 1 month ago

The problem is that I think >95% of switch blocks are gonna have <16 cases.

XmiliaH commented 1 month ago

And are the other stack-allocated std::vectors in the parser then not also buggy?

Sainan commented 1 month ago

That depends. If there is a chance a Lua error could be thrown in their scope, then they would not be deallocated when a longjmp is used.

XmiliaH commented 1 month ago

The problem is that I think >95% of switch blocks are gonna have <16 cases.

Yes, but those cases are hard to make better. When multiple case values map to the same case, the crossover can happen earlier.

int average_time_linear = num_consts/2;
int average_time_table = luaO_ceillog2((unsigned int)case_pc.size()) + 4; /* Load and check cache, load value, check nil case, then do the binary search */
if (average_time_linear <= average_time_table) {
   /* Not worth the effort, just do a linear scan */

That depends. If there is a chance a Lua error could be thrown in their scope, then it would not be deallocated when a longjump is used.

Yes, that could happen in almost all the cases I saw, so I assumed that Pluto is using exceptions and not longjmp.

Sainan commented 1 month ago

Yes, but those cases are hard to make better. When multiple case values map to the same case, the crossover can happen earlier.

I'm not asking for you to pull a miracle and make them better, but it's bad when this change makes them worse.

Yes, that could happen in almost all the cases I saw, so I assumed that Pluto is using exceptions and not longjmp.

We do use exceptions instead of longjmp by default, but it is still an option, and the goal is not to leak memory in this case. It's possible some older code was unaware of this pitfall and unfortunately will still have this issue.

XmiliaH commented 1 month ago

I fixed the memory leak issues with std::vector in the switch statement; I need to check std::function. Will also look into the speed decrease of the example you mentioned.

Sainan commented 1 month ago

I would advise just using a void* for user-data and casting it as before instead of std::function.

Also nothing wrong with std::vector, as long as it's somewhere on LexState, because that does actually get cleaned up in all cases.

XmiliaH commented 1 month ago

When putting the vector on LexState, it would need to handle reentrancy. And the tables also use the correct allocation function from the Lua state.

XmiliaH commented 1 month ago

I guess the speed decrease for the simple case is due to the switch always starting with a jump rather than a conditional jump like before. This results in one extra instruction per loop. I will look into emitting the first case in place and jumping on false to the other logic.

Sainan commented 1 month ago

Still a pessimisation, which really shouldn't happen since you seem to have logic to "not bother" for small switches? Maybe this patch would've been better if it focused on trying to add the new functionality without effectively rewriting the whole thing.

XmiliaH commented 1 month ago

When just looking at

local value = 3
switch value do
  case 1:
  case 2:
  case 3:
  cc = cc + 1
  break
  case 4:
  case 5:
  break
end

I get the same code from main & this branch now. So I have no clue where the slowdown should come from.

Sainan commented 1 month ago

You are able to reproduce it, tho, right?

XmiliaH commented 1 month ago

No, I am not able to reproduce the slowdown. Over multiple runs sometimes the main branch is faster, sometimes this one.

Sainan commented 1 month ago

Well, it seems to only be on MSVC, and I guess "within margin of error," but now I've tried a benchmark where your implementation should undoubtedly be faster, and again it's the same speed at best:

local start = os.clock()
local cc = 0
for i = 1, 100000000 do
  local value = 20
  switch value do
    case 1: break
    case 2: break
    case 3: break
    case 4: break
    case 5: break
    case 6: break
    case 7: break
    case 8: break
    case 9: break
    case 10: break
    case 11: break
    case 12: break
    case 13: break
    case 14: break
    case 15: break
    case 16: break
    case 17: break
    case 18: break
    case 19: break
    case 20: cc = cc + 1 break
  end
end
print(os.clock() - start)
return print(cc)

Or am I misunderstanding what your optimisation is supposed to do?

XmiliaH commented 1 month ago

With the 20 cases, average_time_linear and average_time_table are both 9, so it will still use the linear lookup. Either add a case 21 or remove the break from all cases but 19.

Sainan commented 1 month ago

Okay, now we have ~3.5 on main, and ~2.4 on this branch (using Clang). That is cool, but...

XmiliaH commented 1 month ago

I don't know why the same bytecode, with only changes to the parser, makes the interpreter slower on MSVC. Maybe something gets aligned differently, which reduces the speed. But that could happen with any change.

The complexity vs. benefit trade-off is something you need to decide. I have no problem with closing this. I had my fun implementing it and have no problem if I do not need to maintain it any further.

well-in-that-case commented 1 month ago

This is a very impressive optimization; however, I share some of the concerns already raised in this thread. The most concerning point is the maintainability of the switch statement. The complexity increase is great enough that the lessons learned over time with our switch statement (the bugs fixed, the time-tested factor, the overall polish applied) would be reset. While the performance increase is quite good, the scope in which that performance increase applies is too limited to justify the aforementioned costs.