explosion / preshed

💥 Cython hash tables that assume keys are pre-hashed

PreshMap hangs on small numbers of paired insertions and deletions #21

Closed adrianeboyd closed 4 years ago

adrianeboyd commented 4 years ago

The first test hangs on the __getitem__ in the assert; the second test is fine. The first test starts hanging as soon as the range is at least 10. (I don't know whether other small numbers of paired insertions and deletions trigger this, e.g. adding and removing two items at a time, but adding one item and then removing that same item repeatedly is exactly what the tokenizer cache has to do when special cases are added.)

My first thought was that it might be a resizing bug, but a quick test (which might have been flawed) suggested it wasn't due to resizing, so I'm not sure what's going on.

Test cases written as new pytest tests:

from ..maps import PreshMap

def test_one_and_empty():
    table = PreshMap()
    for i in range(10):
        table[i] = i
        del table[i]
    assert table[0] == None

def test_many_and_empty():
    table = PreshMap()
    for i in range(10):
        table[i] = i
    for i in range(10):
        del table[i]
    assert table[0] == None
honnibal commented 4 years ago

I bet this is a bug with 0 or 1 as a key or value. @svlandeg did you observe something like that in your usage?

Edit: Hmm, this seems wrong? https://github.com/explosion/preshed/blob/master/preshed/maps.pyx#L112

I think we're missing an else here. If the key is one of the special values (0 or 1), we should be setting it into the reserved slots, and not continuing with the rest of the function. I think this is right:

cdef void map_set(Pool mem, MapStruct* map_, key_t key, void* value) except *:
    cdef Cell* cell
    if key == EMPTY_KEY:
        map_.value_for_empty_key = value
        map_.is_empty_key_set = True
    elif key == DELETED_KEY:
        map_.value_for_del_key = value
        map_.is_del_key_set = True
    else:
        cell = _find_cell(map_.cells, map_.length, key)
        if cell.key == EMPTY_KEY:
            cell.key = key
            map_.filled += 1
        cell.value = value
        if (map_.filled + 1) * 5 >= (map_.length * 3):
            _resize(mem, map_)
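
For reference, the lookup side has to mirror this split: keys 0 and 1 get answered from the reserved slots and never reach _find_cell. Roughly like this (a sketch based on the struct fields above, not a verbatim copy of maps.pyx, and the exact signature may differ):

cdef void* map_get(MapStruct* map_, key_t key):
    # Sketch: reserved keys (0 and 1) are answered from their dedicated
    # slots and never touch the open-addressed cell array.
    cdef Cell* cell
    if key == EMPTY_KEY:
        if map_.is_empty_key_set:
            return map_.value_for_empty_key
        return NULL
    elif key == DELETED_KEY:
        if map_.is_del_key_set:
            return map_.value_for_del_key
        return NULL
    cell = _find_cell(map_.cells, map_.length, key)
    return cell.value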
honnibal commented 4 years ago

Working on this.

svlandeg commented 4 years ago

I haven't run into this specific issue, no, because in my use-cases I don't often delete stuff from the map. But I certainly did run into issues when forgetting I can't index at position 0 ;-)

honnibal commented 4 years ago

You should be able to index at position 0, though; I was always confused by that, so yeah, that was definitely a long-standing bug. Fixed now.

svlandeg commented 4 years ago

Sorry, I didn't write that correctly: I was thinking about the fact that 0 couldn't be a value (I use the PreshMap values as indices elsewhere, which meant I couldn't index at position 0 in the other object), i.e. issue #17. So is that fixed now?

honnibal commented 4 years ago

@svlandeg Hmm no we still don't have a solution to 0 as a value. It's a bit tricky to do in a backwards compatible way. Here are the options as I see them:

  1. Allow users to designate a "missing" value. Currently 0 is used as the missing value, but we could make it some arbitrary other uint64_t value.

  2. Have map_get return a struct that indicates whether the key was found, and the value. This would break backwards compatibility.

  3. Write an alternative to map_get that does return a struct. Use that within PreshMap.get()

I think 3 seems like the best solution? I'll give it a try.
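
Something along these lines, maybe (the names here are just illustrative for the sketch, not a final API):

cdef struct GetResult:
    # Hypothetical result type for the alternative getter: carries both
    # the value and whether the key was actually present.
    bint found
    void* value

cdef GetResult map_get_checked(MapStruct* map_, key_t key):
    # Hypothetical alternative to map_get that distinguishes "missing"
    # from "stored value happens to be 0/NULL".
    cdef GetResult result
    cdef Cell* cell
    result.found = False
    result.value = NULL
    if key == EMPTY_KEY:
        if map_.is_empty_key_set:
            result.found = True
            result.value = map_.value_for_empty_key
    elif key == DELETED_KEY:
        if map_.is_del_key_set:
            result.found = True
            result.value = map_.value_for_del_key
    else:
        cell = _find_cell(map_.cells, map_.length, key)
        if cell.key == key:
            result.found = True
            result.value = cell.value
    return result

PreshMap.get() and __getitem__ could then return None whenever result.found is false, which would let 0 be stored as a normal value without changing the existing map_get signature.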

adrianeboyd commented 4 years ago

I probably should have written a test that didn't involve the 0 issue. In the tokenizer everything is long hashes/pointers, so the 0 issue doesn't come up. Here are the tests rewritten to avoid key 0:

from ..maps import PreshMap

def test_one_and_empty():
    table = PreshMap()
    for i in range(100, 110):
        table[i] = i
        del table[i]
    assert table[100] == None

def test_many_and_empty():
    table = PreshMap()
    for i in range(100, 110):
        table[i] = i
    for i in range(100, 110):
        del table[i]
    assert table[100] == None
honnibal commented 4 years ago

Okay, I think I understand this better now.

When we delete a value, we decrement the map_.filled counter, which determines when we should resize the table. However, when we go to insert new values, we don't actually reclaim the deleted slots: a deleted cell keeps its DELETED_KEY marker and is never reused. This means we can end up in a state where the table is unexpectedly full: every cell is either live or a tombstone, the low filled count never triggers a resize, and _find_cell spins forever looking for an EMPTY cell or the requested key.

The quick fix is to just not decrement the map_.filled counter on deletion. Not decrementing is probably what I had in mind when I originally added the deletion code: I basically never deleted values when I was using this table before, so reclaiming the space didn't matter. The change to decrement the filled counter was added in d545448e880.
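
To make the quick fix concrete, the deletion path in this scheme ends up looking roughly like this (a sketch; the actual function in maps.pyx may be named and structured differently, and the reserved keys 0/1 are elided):

cdef void map_clear_key(MapStruct* map_, key_t key):
    # Sketch of deletion under the quick fix: leave a tombstone and keep
    # map_.filled unchanged.
    cdef Cell* cell = _find_cell(map_.cells, map_.length, key)
    if cell.key == key:
        cell.key = DELETED_KEY   # tombstone, so later probe chains stay intact
        cell.value = NULL
        # Deliberately no `map_.filled -= 1` here: tombstones still occupy
        # slots, so they must keep counting toward the load factor. If they
        # don't, the table can become entirely non-EMPTY while the counter
        # says there's plenty of room, and _find_cell loops forever.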

The better solution would be to insert values in a way that correctly accounts for deletions. This is a bit tricky: we have to handle the case where the probe chain contains a deleted cell but the key is also present further along, so we can't simply reuse the first tombstone we find without first checking whether the key already exists.
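
Roughly, tombstone-aware insertion would look something like this (an illustrative sketch, not the committed fix; it assumes linear probing like _find_cell and elides the reserved keys 0/1):

cdef void map_set_reusing_tombstones(Pool mem, MapStruct* map_, key_t key,
                                     void* value) except *:
    # Sketch: probe until we hit EMPTY or the key itself, remembering the
    # first DELETED slot we pass. Only reuse that tombstone once we know
    # the key isn't present further along the chain.
    cdef Cell* cell
    cdef Cell* first_deleted = NULL
    cdef key_t i = key % map_.length
    while True:
        cell = &map_.cells[i]
        if cell.key == key:
            # Key already present later in the chain: just update it.
            cell.value = value
            return
        elif cell.key == DELETED_KEY:
            if first_deleted == NULL:
                first_deleted = cell
        elif cell.key == EMPTY_KEY:
            # Key is definitely absent: reuse the remembered tombstone if
            # we saw one, otherwise claim this empty cell.
            if first_deleted != NULL:
                cell = first_deleted
            else:
                map_.filled += 1   # filled keeps counting non-EMPTY cells
            cell.key = key
            cell.value = value
            if (map_.filled + 1) * 5 >= (map_.length * 3):
                _resize(mem, map_)
            return
        i = (i + 1) % map_.length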

honnibal commented 4 years ago

03953fe is working for me now?