SafeteeWoW / LibDeflate

Pure Lua compressor and decompressor with high compression ratio using DEFLATE/zlib format.
https://safeteewow.github.io/LibDeflate/
zlib License
85 stars 17 forks source link

Fastest way to detect a difference between 2 LibDeflate:EncodeForPrint strings? #7

Closed jmaldon1 closed 3 years ago

jmaldon1 commented 3 years ago

I have to check the difference between a bunch of encoded for print strings because I want to see if any of them have updated. I do this at the load of my addon, so preferably I'd like this to be as fast as possible to not make my addon a burden when people load up the game.

This is what I have so far. Based on what we know about the encoded string is there anything faster we can do here?

So far if nothing has updated, I am doing an O(N) comparison on every string for what feels like no reason.

local function isEncodedStringDifferent(stringA, stringB)
    if #stringA ~= #stringB then
        -- O(1) check if lengths are different.
        return true
    end

    -- Compare the characters of each string in reverse because the back of the encoded strings change the most.
    for i = #stringA, 1, -1 do
        local charA = string.sub(stringA, i, i)
        local charB = string.sub(stringB, i, i)
        if charA ~= charB then
            return true
        end
    end
    return false
end

Are we able to safely assume that no matter how small the change that was made to the data, the last character of the encoded string will definitely be different?

SafeteeWoW commented 3 years ago

I don't think there exists algorithm faster than O(N) EncodeForPrint maps every 3 bytes of original data into 4 printable characters. Or to be precise, the encoded byte length = ceil( 8* len(input) / 6)

The generation of encoded table is just a linear traversal of the original input. The last character of the encoded data is not special, and has no relationship to the earlier portion of the input.

SafeteeWoW commented 3 years ago

I would recommended to also store a hash value for the encoded string, in your addon's configuration file. You need to define a hash function.

When the encoded string is exported, also export its hash value along with it

jmaldon1 commented 3 years ago

Ok interesting, and what will the hash value provide for me?

And do you recommend any hashing algo?

SafeteeWoW commented 3 years ago

Also, compare Lua String simply by a == b

Lua uses the following C code to compare them, which is significantly faster than Lua calls

int luaS_eqlngstr (TString *a, TString *b) {
  size_t len = a->u.lnglen;
  lua_assert(a->tt == LUA_VLNGSTR && b->tt == LUA_VLNGSTR);
  return (a == b) ||  /* same instance or... */
    ((len == b->u.lnglen) &&  /* equal length and ... */
     (memcmp(getstr(a), getstr(b), len) == 0));  /* equal contents */
}
SafeteeWoW commented 3 years ago

Copy from Lua source code. You can use this hash function

unsigned int luaS_hash (const char *str, size_t l, unsigned int seed) {
  unsigned int h = seed ^ cast_uint(l);
  for (; l > 0; l--)
    h ^= ((h<<5) + (h>>2) + cast_byte(str[l - 1]));
  return h;
}

unsigned int luaS_hashlongstr (TString *ts) {
  lua_assert(ts->tt == LUA_VLNGSTR);
  if (ts->extra == 0) {  /* no hash? */
    size_t len = ts->u.lnglen;
    ts->hash = luaS_hash(getstr(ts), len, ts->hash);
    ts->extra = 1;  /* now it has its hash */
  }
  return ts->hash;
}

Make sure hash function is not called when initializing your addon.

SafeteeWoW commented 3 years ago

Or use LibDeflate:Adler32() as a hash function

jmaldon1 commented 3 years ago

Are you saying to hash the encoded string on export and then instead of comparing the encoded strings, we just compare the hashes? And if the hash function returns a fixed length output checking for changes would just technically be O(1)?

jmaldon1 commented 3 years ago

Attempting to format the result of the hash results in an overflow error: integer overflow attempting to store 246806894

with the following code:

local hash = LibDeflate:Adler32(printableEncode)
local d = string.format("0x%08X", hash)
SafeteeWoW commented 3 years ago

Adler32 may cause false positive due to its collision rate.

https://github.com/philanc/plc/ has some pure Lua implementation of sha1, md5, etc, which is better.

But its implementation is for Lua 5.3. You may need to modify it to make it work for Lua 5.1

jmaldon1 commented 3 years ago

I tried this CRC32 algo (https://github.com/openresty/lua-nginx-module/blob/master/t/lib/CRC32.lua) and this md5 algo (https://github.com/kikito/md5.lua/blob/master/md5.lua) but I get the same overflow error.

I am not too familiar with how WoW works with Lua but I assume there is something with WoW that is preventing this. Or maybe I am doing something wrong. But so far its not looking like printing the hex string of a hash is happening :/

SafeteeWoW commented 3 years ago

It seems to be a WoW specific issue. But I don't think the code string.format("0x%08X", hash) is required for you.

jmaldon1 commented 3 years ago

How do we get the result of these hash functions to be a fixed length hex so we can print it?

SafeteeWoW commented 3 years ago

Assume string.format does not exist. Write one for 32bit number from scratch.

jmaldon1 commented 3 years ago

https://stackoverflow.com/a/66819728/9506134

Found this code that converts numbers to hex, just needs to be modified to pad the hex to a fixed length. Otherwise, it seems to work great!

Thanks for the help @SafeteeWoW. I am not familiar with Lua at all and the fact that it acts differently under WoW is something to get used too.

If I end up modifying the code to pad the hex, I'll add it here later, otherwise we can close this.

EDIT: Turns out all you need in order to pad the result of that function is this: string.format("0x%08s", hex)