datajoint / datajoint-python

Relational data pipelines for the science lab
https://datajoint.com/docs
GNU Lesser General Public License v2.1

Define default pop_rel #59

Closed dimitri-yatsenko closed 9 years ago

dimitri-yatsenko commented 9 years ago

In 90% of cases, the pop_rel is the unrestricted join of the primary dependencies of the table. Should we set that as the default value and let users override that?
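
Roughly, the default could look something like this (just a sketch; primary_dependencies() stands in for whatever helper ends up listing the relations referenced by the primary foreign keys):

@property
def populate_relation(self):
    # default: unrestricted join of all primary dependencies;
    # subclasses override this property to restrict or replace it
    relation = None
    for parent in self.primary_dependencies():  # hypothetical helper
        relation = parent if relation is None else relation * parent
    return relation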

eywalker commented 9 years ago

I guess it's fine for the pop_rel to default to that. I'll implement it.

dimitri-yatsenko commented 9 years ago

Since we are all over PEP8 now, is pop_rel the right name? What about populate_relation?

fabiansinz commented 9 years ago

I think we should name it populate_relation, but have a function pop_rel that wraps it for the time being.

eywalker commented 9 years ago

I think populate_relation makes sense and looks cleaner. I'm not too keen on having pop_rel unless we intend to support both names for the foreseeable future; once a name is introduced, it's hard for us to get rid of it without breaking someone else's code.

fabiansinz commented 9 years ago

I generally agree, but it is just one line and it makes it easier for people who switch from MATLAB to Python. At some point we can throw a deprecation warning.
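
Something like this would be enough (just a sketch; the warning could of course be added later):

import warnings

@property
def pop_rel(self):
    # temporary alias for populate_relation, for people coming from MATLAB;
    # the warning lets us remove it eventually
    warnings.warn("pop_rel is deprecated, use populate_relation instead",
                  DeprecationWarning, stacklevel=2)
    return self.populate_relation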

eywalker commented 9 years ago

Hmm.. I just don't like the idea of implementing a feature that is already known to be headed for deprecation... Given the already existing name differences between MATLAB and Python DataJoint in classes and other methods, can we not just ask Python users to learn to use populate_relation right off the bat?

fabiansinz commented 9 years ago

I think it depends on how close we want to keep both implementations. If they don't need to be too close, then I completely agree with you, because I definitely like the name populate_relation better.

Maybe you are right; we already created batch_insert, for example. Also, makeTuples will probably become make_tuples to properly worship PEP8. So if we do that anyway, we can just call it populate_relation right away. I just wanted to make sure that this is a conscious design decision.

FlorianFranzen commented 9 years ago

Due to the changes introduced by #37, AutoPopulate is not purely about populating tables anymore. I therefore think it would even be legitimate to rename AutoPopulate, populate(...), and makeTuples(...) completely, as long as we come up with more descriptive names.

One could keep an AutoPopulate class around (one that maybe even inherits from Base) that acts as a pure alias class, to avoid confusion and code breaks.

eywalker commented 9 years ago

It is timely that AutoPopulate got mentioned here because it is subject to change, as described in issue #60. Let's think through the implementation of the auto-populating mechanism carefully before deciding what to call things.

In general though, because I still consider Python DataJoint pre-launch and I don't feel a huge obligation (yet) to make Python DataJoint a mirror of MATLAB DataJoint (as @fabiansinz and @FloFra noted, the Python implementation is already starting to deviate from MATLAB a bit), I would rather not place any just-for-now features into Python DataJoint. This includes any alias methods and properties that are there just to make it look or feel like MATLAB. I think it would make sense for us to decide exactly how similar we want these two versions of DataJoint to be.

fabiansinz commented 9 years ago

I think we should implement it the way we think it should be done. Compatibility with MATLAB should exist at the database level, but apart from that I don't feel an obligation to make it a MATLAB clone. This means I agree with @eywalker; I just want to make sure that we agree on that. @eywalker: shall we discuss this issue on Monday and distribute implementation tasks?

eywalker commented 9 years ago

Sounds good to me ;)

dimitri-yatsenko commented 9 years ago

make_tuples is perhaps not the best name. Sometimes it only makes one tuple, for example. How about just make or compute?

dimitri-yatsenko commented 9 years ago

make_tuples should probably be protected. It's not meant to be called from outside the object.

fabiansinz commented 9 years ago

I think make_tuples is a good name if it produces data tuples (even if it is just one). I agree that it should be protected if it shouldn't be called by users.

eywalker commented 9 years ago

I agree that make_tuples is a good name even if it can sometimes be wrong grammatically - just like we can be sometimes ;)

By making it protected, are we suggesting that we should name it _make_tuples and assume that the user will respect the convention of not invoking this method from outside? As far as I know, there is really no good way to prevent a user from invoking a method that exists (and whatever mechanism there may be tends to be too complicated, e.g. __getattribute__).

eywalker commented 9 years ago

Actually, assuming that we are going to move insert into populate (as currently suggested in #60), the only things make_tuples has to do are:

  1. receive a tuple (a key from populate_relation) and
  2. return a list of tuples to be inserted into the target table

Since it is not going to cause an insert and thus will not manipulate the table, I'm not sure we have to do anything to prevent it from being called outside of populate. Even if someone did, no harm should come of it, and I have always thought that an interface method (such as make_tuples) ought to be public.

That being said, I really have no strong opinion about this, so we could easily make the method protected in the Python sense.
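
The way I picture it, populate would then do something like this (just a sketch, ignoring transactions and error handling):

def populate(self):
    # keys that exist in populate_relation but not yet in this table
    unpopulated = self.populate_relation - self
    for key in unpopulated.fetch():
        tuples = self._make_tuples(key)  # computes tuples, does not insert
        for tup in tuples:
            self.insert(tup)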

dimitri-yatsenko commented 9 years ago

In this case, _make_tuples should be a generator and yield tuples as it goes rather than accumulate its result.

eywalker commented 9 years ago

I like the idea of generators, but wouldn't that make the user's implementation of _make_tuples more complex than it needs to be? I was thinking that since _make_tuples is called once per key, a simple implementation like

def _make_tuples(key):
    key['experiment_id'] = 5
    return [key] # make it into a list

ought to work. I guess they can just use yield key instead in the above case, but whether they give back a generator or a simple list shouldn't matter...?

fabiansinz commented 9 years ago

I am realizing I know too little about the actual implementation. After looking at the populate function, I think @eywalker is right. If we call make_tuples once per key, then there is no need for a generator. If anything should be turned into a generator, it should be

for key in unpopulated.fetch():
    ...

What I couldn't figure out is what n = self(key).count does. self is an instance of Relation, but I couldn't find a meaningful __call__ in it or any of the superclasses.

Maybe @dimitri-yatsenko was thinking about something like

self.iter_insert(self._make_tuples(unpopulated))

My problem is that I don't understand how much of the checking around make_tuples in populate would remain if populate calls the insert itself.

Apart from that, I agree that it would be more complicated for the user to write a generator. I like them, but maybe it is just easier to return a single tuple, in particular because each tuple is individually inserted anyway (batch_insert and iter_insert just call insert).

eywalker commented 9 years ago

Some clarification: although make_tuples gets called with one key at a time, the function can return multiple tuples. Also, the call operator (__call__) is actually defined in the RelationalOperand class and is equivalent to fetch.

dimitri-yatsenko commented 9 years ago

yes, inserting multiple tuples is very common. I like the idea of a generator:

def _make_tuples(key):
    info = (Parent() & key).fetch()
    for i in range(4):
        yield dict(key, item_number=i, item=compute_item(info, i))
fabiansinz commented 9 years ago

Concerning __call__: That's what I remembered as well, but when you look at the relational_operand module, you'll find:

    def __call__(self, *args, **kwargs):
        return self(*args, **kwargs)

That somehow doesn't seem right to me.

fabiansinz commented 9 years ago

Concerning make_tuples: I guess the question is, can make_tuples return such a huge amount of data that a generator would be more memory efficient? If yes, then we should go with the generator; if not, I think we should go with just returning a batch of tuples that gets inserted. As @eywalker already mentioned, this is easier for users to implement. Maybe I am missing your logic here, @dimitri-yatsenko, but what would be the advantage of having a generator instead of a list?

If it is memory efficiency, and if we go through all the items returned by _make_tuples anyway, then _make_tuples actually just needs to return an iterable object. That could be a list or a generator. In that case you could go with the easy-to-implement version of returning a list, or the memory-efficient version of returning an iterator itself.

def _make_tuples(key):
    info = (Parent() & key).fetch()
    def ret():
        for i in range(n):
            yield dict(key, item_number=i, item=compute_item(info, i))
    return ret()
eywalker commented 9 years ago

Ok, clearly the implementation changed...

eywalker commented 9 years ago

Also, the current implementation of populate is not up to date in the sense that I'm still supposed to go over and reimplement it.

fabiansinz commented 9 years ago

I think the cursor in fetch would support an iterator. If we can get that to work, this would be super memory-efficient. What about moving the logic of fetch to __iter__, implementing fetching from the database with an iterator, and then making fetch basically list(self.__iter__())? Sorry, that's just another change. I think we should discuss this on Monday and populate ourselves with tasks.
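
Roughly (just a sketch; how the cursor is obtained is an implementation detail):

def __iter__(self):
    # stream tuples from the database one at a time
    cursor = self.connection.query(self.sql)  # hypothetical call
    for row in cursor:
        yield row

def fetch(self):
    # materialize everything at once; convenient but memory-hungry
    return list(self)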

dimitri-yatsenko commented 9 years ago

yield has the advantage that _make_tuples does not need to construct an iterable of tuples but can return them as they are computed. It also gives the user greater control over when the insert is done. This may be helpful for populating subtables and supertables.

To review an example of populating a subtable, see reso.Segment, for example in https://github.com/atlab/commons/blob/master/schemas/%2Breso/Segment.m

fabiansinz commented 9 years ago

I see. So the problem is that makeTuples triggers makeTuples in other tables. Is the set of tables for which that can happen constrained in some way?

In general, I think that this function is a little odd. You basically tell it to populate one table, but then, as a side effect, it also populates others.

fabiansinz commented 9 years ago

Concerning yield: I see that yield would be nice in that way, but it is also more awkward for people how don't use yield in daily programming. If you return an iterable, then you have both options.

I don't understand how this would give use more control about when the insert is done. Could you explain this in more detail?

fabiansinz commented 9 years ago

Ok, after thinking about it ... please correct me if I am wrong.

The idea about population is:

What I don't like about this is that it is not very transparent what's happening:

From what I see, we want to be able to trigger a cascade of populating tables by only triggering one population command. On a more abstract level, I think the process should be:

  1. determine what is to be computed (we have that with pop_rel)
  2. compute it (makeTuples)
  3. trigger other computations (we don't have that and emulate it with side effects)

This whole process is carried out by populate which could also do some parallelization.

If we look at it like this, wouldn't it be reasonable to introduce a field called populate_dependents or similar that basically tells the relation which populate to call once new data is entered? I think the most logical place to hook that in would be populate: after inserting the result of makeTuples, populate calls the populate methods of its dependents.
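
A rough sketch of what I mean (populate_dependents is a hypothetical attribute):

def populate(self):
    for key in (self.populate_relation - self).fetch():
        for tup in self._make_tuples(key):
            self.insert(tup)
    # cascade: once our own tuples are inserted, trigger the dependents
    for Dependent in self.populate_dependents:  # hypothetical attribute
        Dependent().populate()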

Advantages:

dimitri-yatsenko commented 9 years ago

The key idea to understand is what we call subtables. When we populate a set of related items, we often need a subtable. For example, when we do image segmentation, we populate a set of image segments. We create the Segmentation table to represent the segmentation and Segment to contain the individual segments. It is important to populate both tables in one transaction. The approach is to make Segmentation an AutoPopulate class and Segment just a Relation class. Then Segmentation.make_tuples calls Segment.make_tuples. Segment does not have its own populate. DataJoint knows that this is a subtable: subtables are computed or imported tables that are not subclassed from AutoPopulate. On the ERD, subtables are marked with an asterisk.
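
In Python, the pattern could look roughly like this (just a sketch; load_image and segment_image are placeholders, and the exact base classes are still up for discussion):

class Segmentation(Relation, AutoPopulate):
    def make_tuples(self, key):
        image = load_image(key)              # placeholder helper
        self.insert(key)                     # master tuple first (it is the foreign-key target)
        Segment().make_tuples(key, image)    # subtable filled within the same transaction

class Segment(Relation):  # not an AutoPopulate subclass: this is a subtable
    def make_tuples(self, key, image):
        for i, mask in enumerate(segment_image(image)):  # placeholder helper
            self.insert(dict(key, segment_id=i, mask=mask))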

The cascading option you just described is available on the MATLAB side and, yes, we may implement it on the Python side as well, as a flag in populate. In MATLAB it's an option in dj.set, and it works similarly to how you described it. Generally, cascading is rarely desired because usually we don't compute everything that's in the database but only what is needed. I am not aware of anyone using the cascade option.

fabiansinz commented 9 years ago

I think we should discuss this on Monday. At the moment I cannot imagine a clean solution that allows the construction above while forbidding the user to use insert in computed tables.

FlorianFranzen commented 9 years ago

Especially when working with large files, subtables allow you to open and search the file only once while still writing to multiple tables.

If you have to reopen, search, and read the same large data file for each unit on each tetrode, populating your tables becomes really slow really quickly and you end up with a lot of non-DRY code.

aecker commented 9 years ago

I didn't follow the whole thread in detail, but here is one comment to keep in mind, guys:

@fabiansinz (three comments up): In your list of things that are done, (2) and (3) are correct in principle, but exactly when the associated inserts into the parent and child tables happen can depend hugely on the situation. The problem is this: sometimes it's useful to insert the parent table's tuple right away and then compute the child table's tuples (this is usually the most convenient), but other times you would rather compute everything and insert at the end. The reason is that with the former solution you often run into locking issues when computing the child table's tuples takes a long time.

So while the current implementation may not be the cleanest, it certainly provides all the flexibility necessary, and I would strongly caution against making things too restrictive.

fabiansinz commented 9 years ago

Let me try to summarize the insights from this thread and the thread of issue #60 so we can come to a resolution.

  1. The exact order in which tuples are inserted into computed tables is crucial for efficiency. Because it seems hard to come up with a general way to transparently specify the processing order, it is implemented implicitly via side effects in makeTuples.
  2. To this end, it is essential that makeTuples still inserts the new tuples. This means we cannot prohibit the user from using insert in computed tables. We cannot even raise a warning, because insert will definitely be called from inside an AutoPopulate object. For that reason we should make it clear in the documentation that insert should not be called on an AutoPopulate object except by the function makeTuples.

This means basically that everything stays as it is. Glad we talked about it.

What I (and apparently @eywalker) am still unclear about is the exact definition of a subtable or supertable. According to @dimitri-yatsenko, they are "... computed or imported tables that are not subclassed from AutoPopulate". To me, that is true for all tables that are not children of AutoPopulate, including FreeRelations. It would be good if someone could clarify.

In terms of what is to be implemented to resolve issues #60 and #59:

  1. pop_rel will be renamed to populate_relation and will default to the join of the primary dependencies of the table.
  2. makeTuples will become make_tuples to conform to PEP8.
  3. AutoPopulate will remain a mixin, and we will stress in insert's documentation that it is not to be used directly on computed tables.
dimitri-yatsenko commented 9 years ago

I will start a new issue discussing the support of subtables, which I will begin by defining subtables.

Let's not talk about "side effects" of _make_tuples because there should not be any.

The order of processing tables is specified very clearly by dependencies (foreign keys). The order of populating a single relation does not matter that much.

Let's call it _make_tuples rather than make_tuples to make it clear that it's a callback invoked by a public method of the object.

fabiansinz commented 9 years ago

If the order of processing (i.e. computing and inserting tuples) were clearly specified by the foreign keys, we could outsource the insertion to populate: first generate the tuples in the tables of the foreign keys, then generate our own tuples. Since we apparently cannot do that, it is less clear to me. However, I might be missing some of the logic here.

Concerning side effects: if _make_tuples generates not only the tuples for its own table but also causes other tables to generate tuples, that's pretty much the definition of a side effect. I think it is ok to have that for the sake of efficiency and flexibility, but we can still call it what it is.

eywalker commented 9 years ago

Looks like I missed some part of the discussion: why are we falling back to using a mixin rather than subclassing from AutoRelation? I think we ought to talk about this on Monday to see whether a clean solution exists that still lets _make_tuples generate tuples while supporting the population of computed subtables. I agree with @aecker that we should not invest too much effort in restricting which methods the user can call, but I thought it still made sense to make make_tuples really about what it sounds like (i.e. making and returning tuples, rather than inserting them itself).

eywalker commented 9 years ago

Just to clarify things for myself... One key to subtables has been the fact that the function signature of make_tuples in the subtables is not actually fixed. For example, in some subtables' make_tuples you might pass in a key tuple (just like the make_tuples used by populate) but also extra arguments, such as image data loaded inside the parent table's make_tuples. The order of tuple insertion between the parent and its children (subtables) becomes important because the subtable usually has a foreign key to the parent table, necessitating that the parent table's tuple already exist when the subtable gets populated with the tuple(s) corresponding to that parent entry (otherwise MySQL complains about the foreign key not being satisfied!).

aecker commented 9 years ago

Edgar, you got it, but there are still two ways of inserting:

(a) compute and insert the tuple into the parent table, then call the subtable's makeTuples, which does the same.

(b) compute the tuples for parent and subtables, then insert into parent and subtables at the end (often in this case one doesn't even implement makeTuples in the subtable). I use this solution often when I have time-consuming populates, since inserting into the parent table creates a row-level lock until you exit the transaction, which often blocks other parallel inserts, leads to timeout errors, and slows things down a lot.
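
For concreteness, the two variants look roughly like this (just a sketch; Segment and compute_segments are illustrative):

# (a) insert the parent tuple right away, then let the subtable compute and insert its own tuples
def makeTuples(self, key):
    self.insert(key)
    Segment().makeTuples(key)  # may take a long time while the row-level lock is held

# (b) compute everything first and insert at the very end, keeping the lock short
def makeTuples(self, key):
    segments = compute_segments(key)  # illustrative helper
    self.insert(key)
    for i, seg in enumerate(segments):
        Segment().insert(dict(key, segment_id=i, segment=seg))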

eywalker commented 9 years ago

@aecker Ah, I hadn't thought about case (b)... I think we are concluding that it is easiest to just keep the current MATLAB approach and not impose too much structure on users.

dimitri-yatsenko commented 9 years ago

In our meeting today, we agreed to keep using insert in _make_tuples.

With regard to subtables, we have agreed to stick with the same liberal solution as in MATLAB. DataJoint will not prohibit inserts, deletes, and drops in subtables, but will issue a warning.

The only major enhancement is that dj.Relation will provide a default populate_relation, which is what this issue was about to begin with.