biolab / orange3

🍊 :bar_chart: :bulb: Orange: Interactive data analysis
https://orangedatamining.com
Other
4.78k stars 996 forks source link

Table: copy method #1052

Closed s-alexey closed 8 years ago

s-alexey commented 8 years ago

Current implementation looks quite strange:

    def copy(self):
        """
        Return a copy of the table
        """
        t = self.__class__(self)
        t.ensure_copy()
        return t

    def ensure_copy(self):
        """
        Ensure that the table owns its data; copy arrays when necessary
        """
        if self.X.base is not None:
            self.X = self.X.copy()
        if self._Y.base is not None:
            self._Y = self._Y.copy()
        if self.metas.base is not None:
            self.metas = self.metas.copy()
        if self.W.base is not None:
            self.W = self.W.copy()

self.__class__(self) call runs init method with self as a second (typicaly X) argument. This isn't desired behavior.

Possible solutions: use copy module

import copy 

...
    def copy(self):
        """
        Return a copy of the table
        """
        return copy.deepcopy(self)

or just change to self.__class__.__init__(self)

I can make PL, but I don't know what is the best solution.

kernc commented 8 years ago

Current implementation looks quite strange

I agree.

I don't know what is the best solution.

I agree.

Does anything break?

s-alexey commented 8 years ago

Current Text-Mining add-on implementation use it in Corpus class and it leads to some problems there.

kernc commented 8 years ago

Indeed. Do Orange's unit tests (python ./setup.py test) still pass? If they do, please submit a pull request.

janezd commented 8 years ago

Although it sounds reasonable and obvious, I'd be careful with this. @astaric and @lanzagar have to confirm that this change has no side effects. I may be the original author of this code (or perhaps it's @astaric), but I can't remember why it's written like this. It can be a blunder, or it may be intentional.

Do numpy array implement deep copy efficiently? (I just quickly checked - it seems that they have a __deepcopy__ method.)

s-alexey commented 8 years ago

I'm not sure whether deepcopy is the best solution.

t = self.__class__(self)

here we create a new object with self as a second argument. It's dangerous.

Here a simple example of problems that may occur:

class Foo(object):
    def __init__(self, a, b):
        self.a = a

        if isinstance(a, Foo):
            raise ValueError("a can't be Foo inctance")

        self.b = b

        # in additional it may take a lot of time
        self.c = hard_computation_problem(a, b)

    def copy(self):
        # here a = self, b is not provided
        # so it raises a TypeError
        c = self.__class__(self)

        c.a = copy(self.a)
        c.b = copy(self.b)
        return c

class Bar(Foo):
    def __init__(self, a, b=2):
        # here b is present, but isinstance(a, Foo) == True
        super(Bar, self).__init__(a, b)

and this code raises TypeError

>>> obj = Foo(1, 2)
>>> c = obj.copy()

or ValueError:

>>> obj = Bar(1)
>>> c = obj.copy()
janezd commented 8 years ago

You're right that self.__class__(self) can trigger errors if __init__ does not accept self as an argument. But Table.__init__ does.

I thought about it some more. :) When an object copies itself - with either deep or shallow copy - it will have to call the constructor in some way. copy.deepcopy(self) does the same thing as copy does now: it calls the constructor (probably through pickling?) and then deepcopies the attributes (I hope). Or lets the pickling do the job. Replacing the current code with deepcopy just makes it less efficient.

The method copy is essentially __deepcopy__, just without the memo. We can implement __deepcopy__ (it will look like the current copy) and __copy__ (similar, but without ensure_copy), and then copy can call __deepcopy__.

If the current implementation is broken in derived classes, derived classes have to be fixed by improving the constructor and/or overloading __copy__ and __deepcopy__.

s-alexey commented 8 years ago

Thank you, @janezd. I was a little confused, but I have figured it out now.

copy.deepcopy(self) does the same thing as copy does now

there is a difference: Table.copy doesn't copy domain attribute.

Finally, lets c = t.copy(). Whether it is right that c.ids == t.ids?

janezd commented 8 years ago

Oh, yes, and this is really important: the domain shouldn't be deepcopied. Copying it wouldn't break anything, but it would imply that the new table does not come from the same domain and decrease the efficiency by adding unnecessary conversions.

I'm not sure about ids, but I believe it's OK to have c.ids is t.ids because we can consider them to be immutable.

s-alexey commented 8 years ago

I'm not sure about ids, but I believe it's OK to have c.ids is t.ids because we can consider them to be immutable.

I used to think so, but ...

>>> from Orange.data import Table
>>> d = Table("iris")
>>> x = d.copy()
>>> x.ids is d.ids
False
>>> (x.ids == d.ids)[:5]
array([ True,  True,  True,  True,  True], dtype=bool)
>>> d.extend(x)
>>> d.ids[:5], d.ids[len(x):len(x)+5]
(array([0, 1, 2, 3, 4]), array([0, 1, 2, 3, 4]))
>>> 
>>> import numpy as np
>>> d = Table(np.zeros((5, 2)))
>>> x = d.copy()
>>> d.extend(x)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/alexey/orange/orange3/Orange/data/table.py", line 869, in extend
    self._resize_all(old_length + len(instances))
  File "/home/alexey/orange/orange3/Orange/data/table.py", line 630, in _resize_all
    self.X.resize(new_length, self.X.shape[1])
ValueError: cannot resize this array: it does not own its data
janezd commented 8 years ago

You're right, ids can't be shared, and copy creates new ids (through constructor).

The exception that you got is correct: Table's constructor does not copy the data matrix and so it can't change it.

This reminds me: @astaric, @kernc, is there any place where we change Table (and can't avoid it) or could we make it immutable?

kernc commented 8 years ago

We make tables immutable! I love you!We use Pandas. :+1:

ajdapretnar commented 8 years ago

@s-alexey Is the issue still pertinent? From what I know we decided to keep the current implementation and wait for pandas. If that's the case, I kindly invite someone with the authority to close the issue. :smile: If not, keep the magic going!