clips / pattern

Web mining module for Python, with tools for scraping, natural language processing, machine learning, network analysis and visualization.
https://github.com/clips/pattern/wiki
BSD 3-Clause "New" or "Revised" License

Surprising behavior: Word and Chunk objects do not implement value comparison #210

Open lemontheme opened 6 years ago

lemontheme commented 6 years ago

While computing deltas between sequences of Word or Chunk objects, I noticed that these objects do not implement value comparison (equality testing) in their __eq__() methods.

Here's an example of what I mean by 'equality testing', using the built-in container type list:

>>> l1, l2 = ["hello"], ["hello"]
>>> l1 is l2  # identity testing
False
>>> l1 == l2  # equality testing
True

And here's Pattern's Sentence object behaving as expected under value and identity comparison:

>>> from copy import copy
>>> from pattern.en import parse, Sentence
>>> sent = Sentence(parse("The elephant sits on the chair"))
>>> sent
Sentence('The/DT/B-NP/O elephant/NN/I-NP/O ... chair/NN/I-NP/I-PNP')
>>> sent is copy(sent)  # object identity testing
False
>>> sent == copy(sent)  # object value testing
True

Contrast the above with the comparison behavior of Pattern's Word and Chunk objects:

>>> sent  # Reusing `sent` from the example above
Sentence('The/DT/B-NP/O elephant/NN/I-NP/O ... chair/NN/I-NP/I-PNP')
>>> word = sent.words[1]  # Looking at Word object
>>> word
Word('elephant/NN')
>>> word is copy(word)  # identity testing
False  # good
>>> word == copy(word)  # value testing
False  # !!!!! unexpected
>>> chunk = sent.chunks[0]
>>> chunk
Chunk('The elephant/NP')
>>> chunk is copy(chunk)  # identity testing
False  # good
>>> chunk == copy(chunk)  # value testing
False  # !!!!! unexpected

This behavior is surprising: in both the Word and the Chunk example, the two objects hold the same values, and that is what Python's == operator is expected to report (as opposed to the is operator, which tests identity).
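
To illustrate why this matters for the delta use case mentioned at the top, here is a minimal sketch that does not use Pattern at all. `IdentityToken` and `ValueToken` are hypothetical stand-ins I made up for this example; only `difflib.SequenceMatcher` and `copy` are real standard-library calls. The identity-based `__hash__` is my own addition so that instances stay hashable, which `SequenceMatcher` requires.

```python
from copy import copy
from difflib import SequenceMatcher


class IdentityToken:
    """Hypothetical stand-in for Word/Chunk: equal only if it is the same object."""

    def __init__(self, string, tag):
        self.string, self.tag = string, tag

    def __eq__(self, other):
        return id(self) == id(other)

    def __hash__(self):
        # Assumption for this sketch: keep instances hashable for difflib.
        return id(self)


class ValueToken(IdentityToken):
    """Same data, but equality and hashing are based on the contents."""

    def __eq__(self, other):
        if not isinstance(other, ValueToken):
            return NotImplemented
        return (self.string, self.tag) == (other.string, other.tag)

    def __hash__(self):
        return hash((self.string, self.tag))


def similarity(token_cls):
    """Compare a token sequence against shallow copies of its own elements."""
    tokens = [token_cls("The", "DT"), token_cls("elephant", "NN")]
    copies = [copy(token) for token in tokens]
    return SequenceMatcher(None, tokens, copies).ratio()


identity_ratio = similarity(IdentityToken)  # copies never match the originals
value_ratio = similarity(ValueToken)        # copies match element by element
print(identity_ratio, value_ratio)
```

With identity-based equality, the two sequences look completely different to the diff even though they hold the same words and tags.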

I can see that the __eq__() method of both Word and Chunk implements value comparison as identity comparison. Here's the code:

    def __eq__(self, other):  # `other` is named `word` in Word and `chunk` in Chunk
        return id(self) == id(other)

By contrast, Sentence implements it as:

    def __eq__(self, other):
        if not isinstance(other, Sentence):
            return False
        return len(self) == len(other) and repr(self) == repr(other)

I'm a big fan of the Pattern object model, but it might be worth extending Sentence's value-based comparison to Word and Chunk.
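
As a rough sketch of what that could look like, the snippet below mirrors the repr-based approach of Sentence.__eq__() on a toy class. `ReprEqualityMixin` and `DemoWord` are hypothetical names for illustration only, not Pattern's actual classes, and the `__hash__` method is my own assumption: in Python 3, a class that defines `__eq__` without `__hash__` becomes unhashable.

```python
from copy import copy


class ReprEqualityMixin:
    """Hypothetical mixin: equal when the types are compatible and repr() matches."""

    def __eq__(self, other):
        if not isinstance(other, type(self)):
            return NotImplemented
        return repr(self) == repr(other)

    def __hash__(self):
        # Keep hashing consistent with the repr-based equality above.
        return hash(repr(self))


class DemoWord(ReprEqualityMixin):
    """Toy stand-in for a tagged word; not Pattern's Word class."""

    def __init__(self, string, tag):
        self.string, self.tag = string, tag

    def __repr__(self):
        return "Word(%r)" % ("%s/%s" % (self.string, self.tag))


word = DemoWord("elephant", "NN")
duplicate = copy(word)

same_object = word is duplicate   # identity: a copy is a different object
same_value = word == duplicate    # value: both reprs are Word('elephant/NN')
differs = word != DemoWord("chair", "NN")
print(same_object, same_value, differs)
```

One design question this leaves open: a repr-based comparison ignores where a word sits in its sentence, so two occurrences of the same word and tag would compare equal. On Python 2, a matching __ne__() would also have to be defined explicitly.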

jburb commented 6 years ago

I would heartily upvote this! Hear, hear!