IBMStreams / administration

Umbrella project for the IBMStreams organization. This project will be used for the management of the individual projects within the IBMStreams organization.

Proposal: FunctorM operator and repository streamsx.relational for it. #65

Closed hildrum closed 9 years ago

hildrum commented 9 years ago

The standard toolkit's Functor operator creates a new tuple for each input tuple. For large tuples, particularly when multiple Functors are used in processing, the cost of generating a new tuple can be noticeable.

I am proposing a new operator FunctorM that changes the input tuple rather than creating a new tuple. Unlike Functor, its input port would be mutating. This operator would be restricted to the case where:

Depending on the tuple size and the number of Functors, using FunctorM can be 2x faster.
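A minimal Python sketch of the difference described above, assuming a tuple can be modeled as a dictionary. The function names `functor_style` and `mutating_style` are hypothetical and are not part of any Streams API; the sketch only illustrates why the per-tuple cost of the first approach grows with tuple size.

```python
import copy


def functor_style(tup, updates):
    """Build a fresh tuple for every input tuple (the behavior attributed to Functor)."""
    out = copy.deepcopy(tup)          # cost grows with the size of the tuple
    for name, expr in updates.items():
        out[name] = expr(tup)         # expressions read the untouched input
    return out


def mutating_style(tup, updates):
    """Overwrite attributes on the input tuple itself (the FunctorM idea)."""
    for name, expr in updates.items():
        tup[name] = expr(tup)         # no new tuple is allocated
    return tup


# Hypothetical large tuple: one small attribute and one bulky payload.
big = {"id": 1, "payload": list(range(1000))}
updates = {"id": lambda t: t["id"] + 1}

copied = functor_style(big, updates)   # big is left unchanged
same = mutating_style(big, updates)    # big itself is changed
```

With several such steps in a row, the first style copies the payload at every step, while the second touches only the attribute that changes.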

I picked the name FunctorM after the standard toolkit's insert/insertM and so on.
There doesn't seem to be a repository in which this would fit, so we'd probably need a new one. Functor is in spl.relational, so it would make sense to put this in the namespace com.ibm.streamsx.relational. I think the repository name should be streamsx.relational.

leongor commented 9 years ago

+1

mikespicer commented 9 years ago

Is there a way that we could achieve this behavior without a new operator? Could we automatically determine when a copy is needed and use the mutable update by default unless a copy is needed?

hildrum commented 9 years ago

@mikespicer I don't believe so. You could generate code to do as you say, but that requires the input port to be mutating in the operator model (Functor's port is non-mutating), and that could cause extra tuple copying in the runtime.

For example, let's say operator A sends its tuples to both F1 and F2. Consider the case where A, F1, and F2 are fused together: if both F1 and F2 have mutating input ports, then the runtime makes a tuple copy. But if both F1 and F2 have non-mutating ports, the runtime won't make a copy. That's true regardless of whatever fancy code generation happens inside F1 and F2.

If you want to avoid this runtime copy when it's not necessary, you need to have a Functor with a non-mutating input port. If you want to avoid creating a new tuple when it's not necessary, you need to have a mutating input port. So I think there need to be two operators.
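The A, F1, F2 example can be sketched as a toy decision rule in Python. This is an assumption-laden model, not the actual Streams runtime logic: `InputPort` and `runtime_must_copy` are hypothetical names, and the rule only encodes what is said in this thread (a copy is driven by the declared port mutability, not by what the operator does internally). The mixed case, where only one of two consumers is mutating, is not covered by the thread and is treated here as requiring a copy purely as an assumption.

```python
from dataclasses import dataclass


@dataclass
class InputPort:
    name: str
    mutating: bool  # as declared in the operator model


def runtime_must_copy(consumers, upstream_allows_mutation=True):
    """Toy rule: does a fused producer have to copy the tuple before delivery?

    Assumption: only the declared mutability of the ports matters.
    """
    if not any(port.mutating for port in consumers):
        return False  # nobody may change the tuple, so it can be shared
    # A mutating consumer needs a private tuple when the tuple is shared
    # with another consumer, or when the producer forbids mutation.
    return len(consumers) > 1 or not upstream_allows_mutation


both_mutating = [InputPort("F1", True), InputPort("F2", True)]
both_non_mutating = [InputPort("F1", False), InputPort("F2", False)]

print(runtime_must_copy(both_mutating))      # True
print(runtime_must_copy(both_non_mutating))  # False
```

Under this model, declaring Functor's port as mutating would force the copy in the first case even when the generated code never changes the tuple, which is the argument for keeping two operators.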

leongor commented 9 years ago

On second thought, FunctorM is not the best option: the name suggests the same functionality as Functor, only in a mutating version, but that is not the case. So let's consider a different name. My suggestions: Update, Touch.

ddebrunner commented 9 years ago

Since these operators would have the same api, when would an SPL developer know which one to use?

chanskw commented 9 years ago

I am confused by your explanation of why we cannot have one operator that does both, and of how the new operator benefits us.

You said:

As I see this, there is at least one copy happening regardless of which operator we use. How do we achieve the performance boost with FunctorM? Am I missing something?

hildrum commented 9 years ago

@leongor I view this as a direct replacement for Functor, which is why I used "Functor" in the name. I thought of MutatingFunctor, but FunctorM is less typing and follows the convention used in the standard toolkit for functions on collections (i.e., the mutating version ends in "M"). Of your two alternatives, I'm okay with Update, but I don't like Touch, as it doesn't make it clear it's changing the tuple. What about "Modify"?

@ddebrunner I was hoping the "M" would tell the developer that this operator mutates the input tuple. I think the general idea would be to use FunctorM (or whatever we name it) when the input type and output type are the same, and use Functor otherwise.

@chanskw Whether the runtime makes a copy is a bit complicated and depends on the operator graph.

An input port that is mutating according to the operator model causes the runtime to make a copy in certain circumstances, whatever the operator does internally. If the operator doesn't actually change the tuple, this copy is unnecessary. Giving the current Functor a mutating input port will cause the runtime to make unnecessary copies.

But a mutating input port doesn't always mean a copy occurs--for example, in the case of a chain of operators, no copying will occur (as long as mutation is allowed on the output ports, which it is for this operator). This page from our documentation describes such an example, but to avoid the tuple creation in Functor, it suggests that users switch Functor to Custom rather than using an operator that works like a mutating version of Functor.

hildrum commented 9 years ago

I'm leaning towards Modify as the operator name.

Users should use Modify when input and output types are the same, and Functor otherwise. (This will never result in more copies than Functor and sometimes will result in fewer.)

Modify will give a compile-time error if the input type and output type don't match.

There is also one subtle scenario where Modify will give a compile warning. If a modified attribute is used as part of another attribute's expression, we need to give a warning because it's not precisely clear what the user intended:

output O:
   a = a + 1,  // this is fine alone
   b = a + 1;  // this is fine alone, too, but with the previous expression it's ambiguous
              // if a was 5, is b supposed to be 6 or 7?
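The two readings of that output clause can be made concrete with a small Python sketch, again modeling a tuple as a dictionary. Both function names are hypothetical; the thread does not say which reading Modify would implement, only that the case would produce a warning.

```python
def eval_against_input(tup, assignments):
    """Every expression reads the values the tuple had on arrival."""
    snapshot = dict(tup)              # shallow snapshot of the input values
    for name, expr in assignments:
        tup[name] = expr(snapshot)
    return tup


def eval_in_place(tup, assignments):
    """Each expression sees the effect of the assignments before it."""
    for name, expr in assignments:
        tup[name] = expr(tup)
    return tup


# Mirrors: a = a + 1, b = a + 1
assignments = [
    ("a", lambda t: t["a"] + 1),
    ("b", lambda t: t["a"] + 1),
]

print(eval_against_input({"a": 5, "b": 0}, assignments))  # {'a': 6, 'b': 6}
print(eval_in_place({"a": 5, "b": 0}, assignments))       # {'a': 6, 'b': 7}
```

The first reading matches an operator that builds a new tuple from the input; the second is what a naive in-place update would produce, which is why the combination is flagged.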

ddebrunner commented 9 years ago

I like Modify as the name, it also matches the Java Application API.

leongor commented 9 years ago

I agree - Modify is better.

mikespicer commented 9 years ago

Modify works for me.

chanskw commented 9 years ago

Now that we have agreed on a name for the operator, if you agree that we should create a repository for this operator, please add your +1 vote.

We currently do not have enough +1 votes to create the repository. I feel that this should be part of the standard toolkit and not a separate toolkit.

Also, are we in agreement that we should have a com.ibm.streamsx.relational toolkit for the Modify operator? Is this scoped too small? Would we want a toolkit name that allows us to expand to include other functions?

ddebrunner commented 9 years ago

+1 to a new repository. It seems it's analogous to streamsx.plumbing but for operators that allow modification of tuples or can encompass application logic. I could see it having composites that are common use cases for Join, Aggregate, etc., as well as new primitive operators.

Not sure if relational is the correct name. Are there operators that might be in this toolkit that would not fall into a relational calculus category?

gabijs commented 9 years ago

If relational is really the best name, I believe the operator is best put in the standard toolkit. We can then update the port mutability example to use the Modify operator instead of Custom.

ddebrunner commented 9 years ago

I think there needs to be a repository for operators that people would like to contribute that fit into this category. Modify seems to be one.

scotts commented 9 years ago

Putting an operator on github does not preclude us from also including it, or a version of it, in the standard library. See, for example, ElasticLoadBalance. I would like github to be a staging area for new functionality that may be included in the standard library, similar to how boost has libraries that have eventually made it into the C++ standard library. Toolkits on github allow us to put up prototypes and get feedback before including them in the standard library.

With that said, I think Modify should eventually go into the SPL standard library, but I also think we should put it up on github now. The name change changed my mind; the new name makes it clear what it does, and it is a basic building block for streaming applications. But the new release is months away, and it would be useful to get feedback on it now.

I don't think the name streamsx.relational is appropriate, as it is too narrow. I like @ddebrunner's suggestion to have a toolkit that is complementary to streamsx.plumbing. I can also see contributing experimental Aggregate and Join operators. Perhaps streamsx.core?

ddebrunner commented 9 years ago

streamsx.core seems too generic; I could make the case that plumbing is core. We need some name specific to data modification or application logic. @scotts could you put into words what category of "standard" operators should not be put into plumbing? Maybe that will help define a name.

scotts commented 9 years ago

Agreed with @ddebrunner that "core" is very broad - in fact, some languages use "core" instead of "standard".

Plumbing operators affect the flow of tuples exclusively; they do not inspect or manipulate the data inside the tuple. The opposite of that would be operators that do inspect or manipulate the data inside tuples, and may affect the flow of tuples. That means a new Join operator would belong in our yet-unnamed toolkit, because while it affects the flow of tuples, it has to look at the data in those tuples to do so. ElasticLoadBalance is in plumbing because it changes the flow without looking at the data in the tuples.

I thought of "logic", but I dismissed it since it's too close to "relational", and could imply mathematical logic, not application logic. @hildrum suggested "transform", which I thought was too opaque, but I'm starting to think may be a better option than our current alternatives.

Looking again at the SPL standard library, there are really only three subcategories: relational, adapter, and utility. Just about everything in utility would be a plumbing operator. Maybe we should just call this new toolkit streamsx.relational, and add streamsx.adapter if we (or others) want to add any new kinds of adapters.

mikespicer commented 9 years ago

@scotts Unfortunately I don't think everything in the utility subcategory would fit into the definition of plumbing, as it includes operators like DynamicFilter and Parse/Format. I agree that transform is starting to sound better. Another suggestion would be "processing", but that seems more generic. I'd suggest that we go with transform (by the end of the week, or sooner if we all agree) unless a consensus forms around another name or someone strongly disagrees.

hildrum commented 9 years ago

+100000000 on figuring this out by the end of the week. streamsx.transform sounds fine to me.

Assuming my vote doesn't count because I'm the proposer, there are currently two explicit votes for creating the repository and two implicit votes (from @mikespicer and @scotts, who didn't type "+1" but who seem to think the repository should be created).

mikespicer commented 9 years ago

+1 on creating the repository. Transform works for me as a name.

leongor commented 9 years ago

+1 on creating the repository. +1 for Transform.

chanskw commented 9 years ago

Thanks everyone. Will create repository streamsx.transform next week. Operator name will be FunctorM.

Initial committer will be Kris Hildrum.

scotts commented 9 years ago

I believe the consensus is that the operator name will be Modify.

chanskw commented 9 years ago

Ok... missed that... THANK YOU!

chanskw commented 9 years ago

created repository, closing