datajoint / datajoint-python

Relational data pipelines for the science lab
https://datajoint.com/docs
GNU Lesser General Public License v2.1
169 stars 84 forks source link

AutoPopulate should prohibit inserts from outside a populate() call #60

Closed dimitri-yatsenko closed 9 years ago

dimitri-yatsenko commented 9 years ago

automatically populated tables should never be inserted into from outside a populate call. Calling populate should enable insert. Exiting populate should prohibit insert.

fabiansinz commented 9 years ago

I think probably the cleanest way to do this is to override insert in autopopulated classes and call insert from the superclass in populate. insert itself could throw and error that it doesn't like to be called. Autopopulate would define an insert that just throws an error and populate would call the insert from Relation.

eywalker commented 9 years ago

Problem with this is that, overriding of insert has to happen somewhere in the AutoPopulate mixin. However, if the deriving class uses the signature as shown:

class MyTable(dj.Relation, dj.AutoPopulate):

Then the insert method derived from dj.Relation (which of course is getting its insert method from the dj.FreeRelation) will mask whatever the insert method we would implement in dj.AutoPopulate, completing missing the point. In order for the insert in dj.AutoPopulate to take precedence, the inheritance signature has to be:

class MyTable(dj.AutoPopulate, dj.Relation):

I don't like the fact inheritance order will be so critical here. Also I'm not sure if I like the idea of making the mixin class dj.AutoPopulate alter the behavior of the class it got mixed with (in this case the insert method of dj.Relation).

If we really want auto-populated relation to not have public insert method, I think the cleanest way to do this is to make AutoPopulate a subclass of Relation and ask user to inherit from that instead (perhaps rename the class to AutoRelation or something, in that case).

I personally think asking the user to remember to inherit AutoRelation is not asking too much more than to remember to inherit from both AutoPopulate and Relation.

eywalker commented 9 years ago

Also you have to remember the fact that insert has to be called within make_tuples which is ultimately written by the user. So to ask them to explicitly call superclass's insert (via super().insert) is weird to me. This problem will not be resolved even if we let AutoPopulate (or AutoRelation or w/e) inherit from Relation instead.

fabiansinz commented 9 years ago

Ok, then let's maybe take a step back.

I think prohibiting the use of a function that already exists is hard to do in Python without creating a mess (despite the fact that it is pretty unpythonic). Therefore, the function insert should not be there if it is not to be called.

What about the following solution: We create classes ManualRelation and Autorelation that both inherit from relation. Relation.insert is turned private into Relation._insert. ManualRelation wraps _insert with insert. make_tuples in AutoRelation actually returns data (as the name suggests anyway) and AutoRelation.populate calls _insert from populate.

The burden on the user is merely:

dimitri-yatsenko commented 9 years ago

What if insert has a check inside whether it's in an autopopulate class and show a warning if it's used outside a populate class? That's what I had in mind. Just a protected internal _insert_caution_flag.

fabiansinz commented 9 years ago

I think that is somehow messy.

We cannot simply make insert check whether it is in an instance of AutoRelation and deny entering data if it is, because we do need insert in makeTuples (I still think it should be in populate).

I am not a big fan of having flags, but even if we do introduce a _insert_caution_flag, this needs to exist in Relation in order to be checked by ìnsert just for the case that someone mixes in AutoRelation. My point is, we need a class member that exists in all classes but it is only meaningful in some classes. There ought to be a better solution than that.

eywalker commented 9 years ago

I like @fabiansinz's solution of basically removing the need for the user to call insert from within the make_tuples. After all, I think it is more natural for the user to have to only worry about taking care of making (and returning) an appropriate tuple(s) corresponding to a key, but not to have to worry about actually inserting this into table - this should be handled by populate which calls the make_tuples anyway.

This change is rather big deviation from current Matlab implementation and we will have to carefully step through the logic (e.g. how computed subtables can be handled). The nice thing about this implementation is that we can completely remove need for call of insert by user when using AutoRelation - which I would say makes sense be a subclass to Relation. Within AutoRelation, it will simply override the insert method making it unavailable to the user (internally the populate method would call the _insert).

I think we should keep it simple and make regular Relation with a functional insert (which just calls _insert) method and AutoRelation with populate method but no insert, and let the user decide on which to use. I'm not sure how tightly bound we should make the choice of Relation to the type of table.

fabiansinz commented 9 years ago

I like the solution by @eywalker. It's clearer and easier to handle. Plus: There having the tiers in the definition is still justified.

Do you really think it would be a big problem to move insert to populate? After all, if I know how to insert stuff, I should know how to generate it. So I don't think that it will become more difficult for the user. Or do you think that having insert in populate will cause trouble?

eywalker commented 9 years ago

I personally don't think moving the insert into the populate should be difficult at all, but I'm just concerned about the popular use case found in Matlab where makeTuples invokes additional makeTuples of "computed subtables",within which the subtable inserts tuples for itself. I just haven't had enough time to think whether and if so, how such mechanism will be affected by moving insert into populate. I think it will be worth revisiting this idea of computed tables and computed subtables, and see if their relationship can be cleaned up.

dimitri-yatsenko commented 9 years ago

_make_tuples could become a generator that yields tuples and populate insert them. To deal with subtables or supertables, we could call their populates with the specific key. Then populate need to understand that it's inside a transaction and not start a new one.

How about this: if the object has a method called _make_tuples, it should display a warning messages when insert is called? But insert is still inherited just in case.

eywalker commented 9 years ago

I like the sound of the solution - will have to think about exactly how we should implement it.

@dimitri-yatsenko : As for the suggestion about insert, you are suggesting this check because you still think AutoPopulate should be a mixin? If we let there be AutoRelation, then it is very easy to either make

  1. insert unavailable in the sense we hide access to _insert
  2. add a warning message in the insert to remind users that they ought to be using populate

If we are going to stick with AutoPopulate mixin, then Relation's insert method has to do the weird check of seeing if the derived class implements _make_tuples and give warning accordingly. Anytime a class method has to be implemented with an anticipation of a behavior in its derivative, I feel that something is not right. Please disregard this part if I totally misread your suggestion.

fabiansinz commented 9 years ago

I also think that the inheritance solution is better. You will have to explain the problem with super- and sub-tables to me. At the moment, I don't think I am getting it.

eywalker commented 9 years ago

I cannot say that I fully grasp what's needed to deal with super- and sub-tables as their definitions and how they ought to fit to the entire DataJoint concepts has been unclear to me as well. It will be great if we could talk about this on Monday to clarify, so we can rapidly implement them.

FlorianFranzen commented 9 years ago

If you are going with the AutoRelation approach, it would be nice if you could still offer the "visit each pop_real - target" mentioned in #37 as a separate mixin just under a different name.

eywalker commented 9 years ago

I agree that such a visitor-like mixin will be useful. Perhaps AutoRelation can inherit from Relation and such visitor mixin - that way I don't have to write similar code twice :)

On May 9, 2015 5:43 PM, "Florian Franzen" notifications@github.com wrote:

If you are going with the AutoRelation approach, it would be nice if you could still offer the "visit each pop_real - target" mentioned in #37 https://github.com/datajoint/datajoint-python/issues/37 as a separate mixin just under a different name.

— Reply to this email directly or view it on GitHub https://github.com/datajoint/datajoint-python/issues/60#issuecomment-100558074 .

fabiansinz commented 9 years ago

Ok, let's discuss in on Monday and implement it rapidly.

aecker commented 9 years ago

Again, I didn't follow the thread in detail, but I don't think the suggestion of preventing calls to insert() makes sense. This kind of thing is solved by documentation and specification. We can never prevent users from abusing the code. If we try to get smart like that, users will get smarter and come up with even crazier solutions.

dimitri-yatsenko commented 9 years ago

okay, sorry for the confusion. Yes, let's keep the public insert in automatic tables.