explosion / spaCy

💫 Industrial-strength Natural Language Processing (NLP) in Python
https://spacy.io
MIT License
30.31k stars 4.41k forks source link

Fatal Python error: Aborted when deleting span from SpanGroup #13638

Open NixBiks opened 1 month ago

NixBiks commented 1 month ago

How to reproduce the behaviour

Unfortunately I haven't been able to make a minimal reproducible example. The error happens sometimes when I run my tests - the more tests I run the higher probability of running into the following critical error

Python(69256,0x1fa11cf40) malloc: Region cookie corrupted for region 0x12f800000 (value is c)[0x12f80407c]
Python(69256,0x1fa11cf40) malloc: *** set a breakpoint in malloc_error_break to debug
Fatal Python error: Aborted

I've found out that I don't encounter the error if I remove the following line.

del span.doc.spans["my_spans"][index]

The documentation states this is the way to do it though. Note there is a typo in the code in the docs though.

I realise you most likely can't help with this little info but maybe you can give me some ways to investigate further.

Info about spaCy

NixBiks commented 1 month ago

My current workaround is to create a custom span attribute and filter on that.

Span.set_extension("is_removed", default=False)

Instead of removing the object with del I simply change that custom attribute span.doc.spans["my_spans"][index]._.is_removed = True

honnibal commented 1 month ago

Thanks for the report. I'll have a look at this.

honnibal commented 1 month ago

Is it possible you're removing an element during iteration? From the implementation this wouldn't be expected to work: https://github.com/explosion/spaCy/blob/master/spacy/tokens/span_group.pyx#L113

NixBiks commented 1 month ago

It's inside some iteration but I'm not iterating over the spans in the SpanGroup no. I've iterated earlier to find the index that needs to be removed though. I hope it makes sense.

honnibal commented 1 month ago

I think it makes sense yeah. The pattern you want to avoid is something like:

for span in span_group:
    if check_remove(span):
        del span_group[span]

Deleting during iteration like that won't work, and of course precalculating an index and not adjusting it after you delete will cause problems too. But the latter one shouldn't cause an out-of-bounds access. Unless there's a bug in how we check the indexing, it should only be possible to get the out-of-bounds access from within the class's C methods.

These are the tests we have for the __del__ method, which look okay to me: https://github.com/explosion/spaCy/blob/master/spacy/tests/doc/test_span_group.py#L191

Usually if we have a memory access problem in the tests we do get a crash, because we've got a lot of tests that run in various orders. So it must be some sort of access pattern that's not working. Are you able to share any code?