ariovistus / pyd

Interoperability between Python and D
MIT License
158 stars 32 forks source link

Assert `node.index!N.prev is null` fails, seemingly because of BucketHackery #108

Closed evertheylen closed 5 years ago

evertheylen commented 5 years ago

I'm having a rather curious bug. Let me start with saying that I never programmed in D before, but I'm working with the PSI Solver in Python (through psipy). Sadly, I can't share the code that actually triggers this bug (it would be very complex to set up anyway).

Either way, I tried my best to debug the issue by myself. Here's the interesting part of the stacktrace:

<python stacktrace>
    RuntimeError: thrown D Object: core.exception.AssertError: core.exception.AssertError@/home/new_evert/pypath/pyd/infrastructure/util/multi_index.d(195): node.prev = b0
    ----------------
    ??:? _d_assert_msg [0xebf4cc76]
    ??:? pure nothrow @safe void util.multi_index.MNode!(util.multi_index.MultiIndexContainer!(pyd.references.DClass_Py_Mapping, util.multi_index.IndexedBy!(util.multi_index.Hashed!(false, "a.d", "typeid(a).getHash(&a)", "a==b"), "d", util.multi_index.Hashed!(false, "a.py", "typeid(a).getHash(&a)", "a==b"), "python").IndexedBy, util.multi_index.MallocAllocator, util.multi_index.MutableView).MultiIndexContainer, util.multi_index.IndexedBy!(util.multi_index.Hashed!(false, "a.d", "typeid(a).getHash(&a)", "a==b"), "d", util.multi_index.Hashed!(false, "a.py", "typeid(a).getHash(&a)", "a==b"), "python").IndexedBy, util.multi_index.MallocAllocator, util.multi_index.ValueChangedSlots!().ValueChangedSlots.Inner!(util.multi_index.IndexedBy!(util.multi_index.Hashed!(false, "a.d", "typeid(a).getHash(&a)", "a==b"), "d", util.multi_index.Hashed!(false, "a.py", "typeid(a).getHash(&a)", "a==b"), "python").IndexedBy).Inner, pyd.references.DClass_Py_Mapping, pyd.references.DClass_Py_Mapping).MNode.index1.insertNext(util.multi_index.MNode!(util.multi_index.MultiIndexContainer!(pyd.references.DClass_Py_Mapping, util.multi_index.IndexedBy!(util.multi_index.Hashed!(false, "a.d", "typeid(a).getHash(&a)", "a==b"), "d", util.multi_index.Hashed!(false, "a.py", "typeid(a).getHash(&a)", "a==b"), "python").IndexedBy, util.multi_index.MallocAllocator, util.multi_index.MutableView).MultiIndexContainer, util.multi_index.IndexedBy!(util.multi_index.Hashed!(false, "a.d", "typeid(a).getHash(&a)", "a==b"), "d", util.multi_index.Hashed!(false, "a.py"

    Program received signal SIGSEGV, Segmentation fault.
    0x00007fffec429af0 in ... (this=...) at /pyd/infrastructure/util/multi_index.d:230
    230                         assert(next);
    #0  0x00007fffec429af0 in ... (this=...) at /pyd/infrastructure/util/multi_index.d:230
    #1  0x00007fffec428a77 in ... (this=0x55555579c610, n=0x55555582c490) at /pyd/infrastructure/util/multi_index.d:3713
    #2  0x00007fffec42b7d0 in ... (this=0x55555579c610, node=0x55555582c490) at /pyd/infrastructure/util/multi_index.d-mixin-5085:5088
    #3  0x00007fffec42b789 in ... (this=0x55555579c610, node=0x55555582c490) at /pyd/infrastructure/util/multi_index.d:5094
    #4  0x00007fffec42b738 in ... (this=0x55555579c610, __HID70=0x7fffffffd4f8, r=...) at /pyd/infrastructure/util/multi_index.d:3885
    #5  0x00007fffec42b6c0 in ... (this=..., __HID69=0x7fffffffd5d0, _param_0=...) at /pyd/infrastructure/util/multi_index.d-mixin-4977-mixin-5079:5079
    #6  0x00007fffec421153 in ... (self=0x7ffff72f65b0) at /pyd/infrastructure/pyd/references.d:298
    #7  0x00007fffec401f38 in ... (this=...) at /pyd/infrastructure/pyd/class_wrap.d:102
    #8  0x00007fffec407e4f in ... (dg=...) at /pyd/infrastructure/pyd/exception.d:101
    #9  0x00007fffec401f1c in ... (self=0x7ffff72f65b0) at /pyd/infrastructure/pyd/class_wrap.d:108
    #10 0x00007ffff7b08848 in ?? () from /usr/lib/libpython3.7m.so.1.0
    ... repeated ...
    #25 0x00007ffff7b0905d in ?? () from /usr/lib/libpython3.7m.so.1.0
    #26 0x00007ffff7b5fff7 in PyDict_SetItem () from /usr/lib/libpython3.7m.so.1.0
    #27 0x00007ffff7b6057f in PyDict_SetItemString () from /usr/lib/libpython3.7m.so.1.0
    #28 0x00007ffff7bdb696 in PyImport_Cleanup () from /usr/lib/libpython3.7m.so.1.0
    #29 0x00007ffff7c4854b in Py_FinalizeEx () from /usr/lib/libpython3.7m.so.1.0
    #30 0x00007ffff7c4ace4 in ?? () from /usr/lib/libpython3.7m.so.1.0
    #31 0x00007ffff7c4b14c in _Py_UnixMain () from /usr/lib/libpython3.7m.so.1.0
    #32 0x00007ffff7da9223 in __libc_start_main () from /usr/lib/libc.so.6
    #33 0x000055555555505e in _start ()

Somehow, the stacktrace doesn't show the actual frame at which the assertion fails. Either way, this code is the culprit:

void insertNext(typeof(this)* node) nothrow
    in{
        assert(node !is null);
        assert(node.index!N.prev is null,    // <-- this one
            format("node.prev = %x",node.index!N.prev));
        assert(node.index!N.next is null,
            format("node.next = %x",node.index!N.next));
    }body{
        typeof(this)* n = next;
        next = node;
        node.index!N.prev = &this;
        if(n !is null) n.index!N.prev = node;
        node.index!N.next = n;
    }

Commenting out this assert makes my program work. The value of node.index!N.prev that makes the assert fail, is simply 176 or 0xb0 (remains the same between different runs). That's definitly not a valid pointer, so I started checking each assignment to '.prev' in the code. It turns out the value is set by some "BucketHackery" code:

version(BucketHackery){
    node.index!N.prev = cast(ThisNode*)index;  // <-- right here
}else{
    node.index!N.prev = null;
}

I don't know much about the concept of 'versions' in D, but this static assert prevents me from commenting out version = BucketHackery.

So my hypothesis is as follows: BucketHackery, as the name implies, is some hackery code to wrangle out some extra performance in exchange for some complex tricks. The values set by BucketHackery are not supposed to ever 'leak' to insertNext, but in this case they do. Luckily a pointer of value 0xb0 can still kind of be considered null, so commenting out the assert does not actually harm anything.

I'm not sure how to proceed. I could use some help at creating a minimal verifiable example, I don't even really know what multi_index is used for. I can of course provide more details or run tests if needed.

Some details about my environment:

ariovistus commented 5 years ago

let's see here, your stack trace says it is entering the hashed index remove at references.d#L298, which goes into _RemoveAll at multi_index.d#L3885, which eventually eventually gets to _Remove, which should be calling removeFirst, which looks like it is failing to set the prev pointer to null when the next pointer is null, but otherwise isn't calling insertNext or insertPrev.

I don't remember exactly what magic is going on wrt signals, but perhaps remove is calling _NotifyChanges at some point, which is calling _Remove on a first node and failing to clear prev, and then calling insertNext with that node

Could you determine if this is happening, eg with if(isFirst(node)) writeln("Tada!"); just before line 3624 or just after 3611?

evertheylen commented 5 years ago

I don't think it's happening, at least I didn't get the output I would expect. I added a bunch more debug prints, maybe they are useful to you? They also refer to line numbers, which do not line up anymore, so see this fork. I made a log (with the assert commented out) and a full stacktrace (with the failing assert).

ariovistus commented 5 years ago

your stack trace has assert(next); in removeNext originating the exception, but the python exception message suggests it's the line in insertNext. That's weird. so insertNext could be called from _NotifyChanges, _Insert, insert, or reserve. Actually, it looks like the reserve call would trigger the assertion. since it's passing a first node to insertNext. I'll try and worry out a repro

ariovistus commented 5 years ago

does master fix this issue for you?

evertheylen commented 5 years ago

does master fix this issue for you?

Yes! Thanks a lot :)