I guess it's fine for the pop_rel to default to that. I'll implement it.
On Fri, May 8, 2015 at 9:11 PM, Dimitri Yatsenko notifications@github.com wrote:
In 90% of cases, the pop_rel is the join of the primary dependencies of the table. Should we set that as the default value and let users override that?
Since we are all over PEP8 now, is pop_rel the right name? What about populate_relation?
I think we should name it populate_relation, but have a function pop_rel that wraps it for the time being.
I think populate_relation makes sense and looks cleaner. I'm not too keen on having pop_rel unless we intend to support both names for the foreseeable future - once a name is introduced, it's hard for us to get rid of it without breaking someone else's code.
I generally agree, but it is just one line and makes it easier for people who switch from Matlab to Python. At some point we can throw a deprecation warning.
Hmm.. I just don't like the idea of implementing a feature that is already known to be deprecated... Given the already existing name differences between the Matlab and Python DataJoint classes and other methods, can we not just ask Python users to learn to use populate_relation right off the bat?
I think it depends on how close we want to keep both implementations. If they don't need to be too close, then I completely agree with you, because I definitely like the name populate_relation better.
Maybe you are right, we already created batch_insert for example. Also makeTuples will probably become make_tuples to properly worship PEP8. So if we do that anyway, we can just call it populate_relation right away. I just wanted to make sure that this is a conscious design decision.
Due to the changes introduced by #37, AutoPopulate is not purely about populating tables anymore. I therefore think it would even be legitimate to rename AutoPopulate, populate(...) and makeTuples(...) completely, as long as we come up with more descriptive names.
One could keep an AutoPopulate class around (that maybe even inherits from Base) that acts as a pure alias class, to avoid confusion and code breaks.
It is timely that AutoPopulate got mentioned here because it is subject to change as described in issue #60. Let's think through the implementation of the auto-populating mechanism carefully before deciding what to call things.
In general though, because I still consider the Python DataJoint pre-launch and I don't feel a huge obligation (yet) to make Python DataJoint a mirror of Matlab DataJoint (as @fabiansinz and @FloFra noted, the Python implementation is already starting to deviate from Matlab a bit), I would rather not place any just-for-now features into Python DataJoint. This includes any alias methods and properties that are there just to make it look or feel like Matlab. I think it would make sense for us to decide exactly how similar we want these two versions of DataJoint to be.
I think we should implement it the way we think it should be done. Compatibility with Matlab should be there at the database level, but apart from that I don't feel an obligation to make it a Matlab clone. This means I agree with @eywalker. I just want to make sure that we agree on that. @eywalker: shall we discuss this issue on Monday and distribute implementation tasks?
Sounds good to me ;)
make_tuples is perhaps not the best name. Sometimes it only makes one tuple, for example. How about just make or compute?
make_tuples should probably be protected. It's not meant to be called from outside the object.
I think make_tuples is a good name if it produces data tuples (even if it is just one). I agree that it should be protected if it shouldn't be called by users.
I agree that make_tuples is a good name even if it can sometimes be grammatically wrong - just like we can be sometimes ;)
By making it protected, are we suggesting that we should name it _make_tuples and assume that the user will respect the convention of not invoking this method from outside? As far as I know, there is really no good way to prevent a user from invoking a method that exists (and whatever mechanism there may be tends to be too complicated, e.g. __getattribute__).
Actually, assuming that we are going to move insert into populate (as currently suggested in #60), the only thing make_tuples has to do is compute and return the tuples for a given key (drawn from populate_relation). Since it is not going to cause an insert and thus not manipulate the table, I'm not sure we have to do anything to prevent it from being called outside of populate. Even if someone did, no harm should come of it, and I always think that an interface method (such as make_tuples) ought to be a public method.
That being said, I really have no strong opinion about this, so we could easily make the method protected in Python sense.
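To make the proposed division of labor concrete, here is a rough sketch (the restriction of populate_relation by already-populated keys, and the names iter_insert/_make_tuples, follow the thread; the exact signatures are assumptions, not the actual implementation):
def populate(self):
    # hypothetical: keys present in populate_relation but not yet in this table
    unpopulated = self.populate_relation - self
    for key in unpopulated.fetch():
        # populate owns the insert; _make_tuples only computes and returns/yields tuples
        self.iter_insert(self._make_tuples(key))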
In this case, _make_tuples should be a generator and yield tuples as it goes rather than accumulate its result.
I like the idea of generators, but wouldn't that make the user's implementation of _make_tuples more complex than it needs to be? I was thinking that since _make_tuples is called once per key, a simple implementation like
def _make_tuples(key):
    key['experiment_id'] = 5
    return [key]  # make it into a list
ought to work. I guess they can just use yield key instead in the above case, but whether they give back a generator or a simple list shouldn't matter...?
I am realizing I know too little about the actual implementation. After looking at the populate function I think @eywalker is right. If we call make_tuples once per key, then there is no need for the generator. If anything should be turned into a generator, it should be
for key in unpopulated.fetch():
    ...
What I couldn't figure out is what n = self(key).count does. self is an instance of Relation, but I couldn't find a meaningful __call__ in it or any of its superclasses.
Maybe @dimitri-yatsenko was thinking about something like
self.iter_insert(self._make_tuples(unpopulated))
My problem is that I don't understand how much of the checking around make_tuples in populate would remain if populate calls the insert itself.
Apart from that, I agree that it would be more complicated for the user to write a generator. I like them, but maybe it is just easier to return a single tuple. In particular, because each tuple is inserted individually anyway (batch_insert and iter_insert just call insert).
Some clarification: although make_tuples gets called with one key at a time, the function can return multiple tuples. Also, the call (__call__) is actually defined in the RelationalOperand class and is equivalent to fetch.
Yes, inserting multiple tuples is very common. I like the idea of a generator:
def _make_tuples(key):
    info = (Parent() & key).fetch()
    for i in range(4):
        yield dict(key, item_number=i, item=compute_item(info, i))
Concerning __call__: That's what I remembered as well, but when you look at the relational_operand module, you'll find:
def __call__(self, *args, **kwargs):
    return self(*args, **kwargs)
That somehow doesn't seem right to me - it would just call itself recursively.
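If __call__ is indeed meant to be equivalent to fetch, as mentioned above, presumably it should delegate to fetch rather than to itself; a minimal sketch of the likely intent (not the actual code):
def __call__(self, *args, **kwargs):
    # delegate to fetch instead of recursing into __call__
    return self.fetch(*args, **kwargs)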
Concerning make_tuples: I guess the question is, can make_tuples return such a huge amount of data that a generator would be more memory-efficient? If yes, then we should go with the generator; if no, I think we should go with just returning a batch of tuples that get inserted. As @eywalker already mentioned, this is easier to implement for users. Maybe I am missing your logic here @dimitri-yatsenko, but what would be the advantages of having a generator instead of a list?
If it is memory efficiency and if we go through all the items returned by _make_tuples, then _make_tuples actually just needs to return an iterable object to work. That could be a list or a generator. In that case you could go with the easy-to-implement version of returning a list, or the memory-efficient version of returning an iterator itself.
def _make_tuples(key):
    info = (Parent() & key).fetch()
    def ret():
        for i in range(n):
            yield dict(key, item_number=i, item=compute_item(info, i))
    return ret()
Ok, clearly the implementation changed...
Also, the current implementation of populate is not up to date in the sense that I'm still supposed to go over and reimplement it.
I think the cursor in fetch would support an iterator. If we can get that to work, this would be super memory-efficient. What about moving the logic of fetch to __iter__, implementing fetching from the database with an iterator, and then replacing fetch with basically list(self.__iter__())? Sorry, that's just another change. I think we should discuss this on Monday and populate ourselves with tasks.
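A minimal sketch of that idea, assuming the relation can hand back an iterable database cursor (the connection.query call and the sql attribute are made-up names, not the actual API):
class Relation:
    def __iter__(self):
        # assumed: the connection returns an iterable cursor for this relation's query
        cursor = self.connection.query(self.sql)
        for row in cursor:
            yield row

    def fetch(self):
        # materialize the lazy iterator only when the full result set is wanted
        return list(self)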
yield has the advantage that _make_tuples does not need to construct an iterable of tuples but can return them as they are computed. It also gives the user greater control over when the insert is done. This may be helpful for populating subtables and supertables.
To review an example of populating a subtable, see reso.Segment, for example in https://github.com/atlab/commons/blob/master/schemas/%2Breso/Segment.m
I see. So the problem is that makeTuples triggers makeTuples in other tables. Is the set of tables for which that can happen constrained in some way?
In general, I think that this function is a little odd. You basically tell it to populate one table, but then, as a side effect, it also populates others.
Concerning yield: I see that yield would be nice in that way, but it is also more awkward for people who don't use yield in daily programming. If you return an iterable, then you have both options.
I don't understand how this would give us more control over when the insert is done. Could you explain this in more detail?
Ok, after thinking about it ... please correct me if I am wrong.
The idea about population is: you call the populate method of a computed table, which determines what new entries to compute using pop_rel and calls makeTuples with those keys to compute its own entries. makeTuples might call other tables' makeTuples to trigger the computation of new entries in those.
What I don't like about this is that it is not very transparent what's happening: makeTuples has side effects that are only apparent to the person who wrote it, and to call another table's makeTuples you have to know the appropriate format of that table's key. What if that changes?
From what I see, we want to be able to trigger a cascade of populating tables by only triggering one population command. On a more abstract level, I think the process should be:
(1) determine what new entries to compute (via pop_rel),
(2) compute the tuples for those entries (via makeTuples), and
(3) insert them into the table.
This whole process is carried out by populate, which could also do some parallelization.
If we look at it like this, wouldn't it be reasonable to introduce a field called populate_dependents or so that basically tells the relation which populate to call once new data is entered? I think the most logical place to hook that in would be in populate: after inserting the result of makeTuples, populate calls the populate methods of its dependents.
Advantages: makeTuples is clearly defined (compute new tuples and return them), and the user does not need to call insert on computed tables, because we call _insert ourselves.
The idea to understand is what we call subtables. When we populate a set of related items, we often need a subtable. For example, when we do image segmentation, we populate a set of image segments. Then we create the Segmentation table to represent the segmentation and Segment to contain the individual segments. It is important to populate both tables in one transaction. The approach is to make Segmentation an AutoPopulate class and Segment just a Relation class. Then Segmentation.make_tuples calls Segment.make_tuples. Segment does not have its own populate. DataJoint knows that this is a subtable: these are computed or imported tables that are not subclassed from AutoPopulate. On the ERD, subtables are marked with an asterisk.
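For readers unfamiliar with the pattern, a rough Python sketch of what the master/subtable arrangement described above could look like. Class and method names (Relation, AutoPopulate, _make_tuples, insert) follow the thread; ScanImage and run_segmentation are hypothetical, and the signatures are assumptions rather than the actual API:
class Segment(Relation):                      # subtable: no populate of its own
    def make_tuples(self, key, masks):
        for i, mask in enumerate(masks):
            self.insert(dict(key, segment_id=i, mask=mask))

class Segmentation(Relation, AutoPopulate):   # master table
    def _make_tuples(self, key):
        image = (ScanImage() & key).fetch()   # hypothetical upstream relation
        masks = run_segmentation(image)       # hypothetical segmentation routine
        self.insert(key)                      # master tuple first, to satisfy the foreign key
        Segment().make_tuples(key, masks)     # subtable filled within the same transaction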
The cascading option you just described is available on the MATLAB side and, yes, we may implement it on the Python side as well, as a flag in populate. In MATLAB, it's an option in dj.set and it works similarly to how you described it. Generally, cascading is rarely desired because usually we don't compute everything that's in the database but only what is needed. I am not aware of anyone using the cascade option.
I think we should discuss this on Monday. Currently I cannot imagine a clean solution that allows the construction above while forbidding the user to use insert in computed tables.
Especially when working with large files, subtables allowed you to open and search the file only once while still writing to multiple tables. If you have to reopen, search and read the same large data file for each unit on each tetrode, populating your tables becomes really slow really quickly and you end up with a lot of non-DRY code.
I didn't follow the whole thread in detail, but one comment to keep in mind, guys:
@fabiansinz (three comments up): In your list of things that are done, (2) and (3) are correct in principle, but the timing of exactly when the associated inserts into the parent and child tables happen can depend hugely on the situation. The problem is this: sometimes it's useful to insert the tuple of the parent table right away and then compute the child table's tuples (this is usually the most convenient), but other times you'd rather compute everything and then insert at the end. The reason is that with the former solution you often run into locking issues when computing the child table's tuples takes a long time.
So while the current implementation may not be the cleanest, it certainly provides all the flexibility necessary, and I would strongly caution against making things too restrictive.
Let me try to summarize the insights from this thread and the thread of issue #60 so we can come to a resolution.
makeTuples still inserts the new tuples. This means we cannot prohibit the user from using insert in computed tables. We cannot even raise a warning, because insert will definitely be called from inside an AutoPopulate object. For that reason we make it clear in the documentation that insert should not be called on an AutoPopulate object except by the function makeTuples. This means basically that everything stays as it is. Glad we talked about it.
What I (and apparently @eywalker) am still unclear about is the exact definition of a subtable or supertable. According to @dimitri-yatsenko, they are "... computed or imported tables that are not subclassed from AutoPopulate". To me, that is true for all tables that are not children of AutoPopulate, including FreeRelations. It would be good if someone could clarify.
In terms of what's to be implemented to resolve issues #60 and #59: pop_rel will be renamed to populate_relation and will default to the join of the primary dependencies of the table. makeTuples will become make_tuples to conform to PEP8. AutoPopulate will remain a mixin, and we stress in the documentation of insert that it is not to be used directly in computed tables. I will start a new issue discussing the support of subtables, which I will begin by defining subtables.
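A sketch of what such a default populate_relation could look like. The primary_parents attribute and the use of * as the join operator are assumptions about the eventual API, not the actual implementation:
@property
def populate_relation(self):
    # default: the join of the table's primary dependencies;
    # subclasses can still override this property as needed
    rel = None
    for parent in self.primary_parents:   # assumed: relations referenced by primary foreign keys
        rel = parent if rel is None else rel * parent
    return rel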
Let's not talk about "side effects" of _make_tuples
because there should not be any.
The order of processing tables is specified very clearly by dependencies (foreign keys). The order of populating a single relation does not matter that much.
Let's call it _make_tuples rather than make_tuples to make it clear that it's a callback invoked by a public method of the object.
If the order of processing (i.e. computing and inserting tuples) were clearly specified by the foreign keys, we could outsource the insertion to populate: first generate the tuples in the tables of the foreign keys, then generate our own tuples. Since we apparently cannot do that, it is less clear to me. However, I might be missing some of the logic here.
Concerning side effects: if _make_tuples generates not only the tuples for its own table but also causes other tables to generate tuples, that's pretty much the definition of a side effect. I think it is ok to have that for the sake of efficiency and flexibility, but we can still call it what it is.
Looks like I missed some part of the discussion - why are we falling back to using a mixin rather than subclassing from AutoRelation? I think we ought to talk about this on Monday to see whether a clean solution exists that still lets _make_tuples generate tuples while supporting the population of computed subtables. I agree with @aecker that we should not invest too much effort in restricting what methods the user can call, but I thought it still made sense to make make_tuples really about what it sounds like (i.e. making and returning tuples, rather than inserting them itself).
Just to clarify things for myself... One key to subtables has been the fact that the function signature of make_tuples in the subtables is not actually fixed. For example, in some subtables' make_tuples, you might pass in a key tuple (just like the make_tuples used by populate) but also extra arguments such as image data loaded inside the parent table in its make_tuples. The order of tuple insertion between parent and children (subtables) becomes important because the subtable usually has a foreign key to the parent table, necessitating that the parent table tuple already exists when the subtable gets populated with tuple(s) corresponding to that parent table entry (otherwise MySQL complains about an unsatisfied foreign key!).
Edgar, you got it, but there are still two ways of inserting: (a) compute and insert the tuple into the parent table, then call the subtable's makeTuples, which does the same; (b) compute the tuples for the parent and subtables, then insert into the parent and subtables (often in this case one doesn't even implement makeTuples in the subtable). I often use solution (b) when I have time-consuming populates, since inserting into the parent table creates a row-level lock until you exit the transaction, which often blocks other parallel inserts, leads to timeout errors and slows things down a lot.
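A sketch of option (b), computing everything up front and inserting only at the end so the row-level lock stays short. load_large_file, summarize, extract_units and the Unit subtable are placeholders; batch_insert is the method mentioned earlier in the thread, with an assumed signature:
def _make_tuples(self, key):
    data = load_large_file(key)              # hypothetical: open and search the file only once
    parent_tuple = summarize(data, key)      # compute the master tuple
    child_tuples = [dict(key, unit_id=i, trace=t)
                    for i, t in enumerate(extract_units(data))]
    # insert only after all computation is done, keeping the transaction short
    self.insert(parent_tuple)
    Unit().batch_insert(child_tuples)        # hypothetical subtable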
@aecker Ah, I hadn't thought about case (b)... I think we are concluding that it is easiest to just keep the current Matlab approach and not impose too much structure on users.
In our meeting today, we have agreed to stay with using insert in _make_tuples.
With regard to subtables, we have agreed to stick with the same liberal solution as in MATLAB. DataJoint will not prohibit inserts, deletes, and drops in subtables, but will issue a warning.
The only major enhancement is that dj.Relation will provide a default populate_relation, which is what this issue was about to begin with.
In 90% of cases, the pop_rel is the unrestricted join of the primary dependencies of the table. Should we set that as the default value and let users override that?