coala / coala-bears

Bears for coala
https://coala.io/
GNU Affero General Public License v3.0

Bear proposal: Condition expression reordering with short-circuiting #1482

Open Makman2 opened 7 years ago

Makman2 commented 7 years ago

Languages like Python use short-circuiting, meaning condition evaluation stops as soon as the remaining operands can no longer change the result:

true = lambda: True
false = lambda: False

false() and true()  # only evaluates false(), as evaluation of true() is not necessary
true() or false()  # only evaluates true()
true() and false()  # evaluates both, true() and false()
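The short-circuit claims in the comments above can be checked by recording which operands actually run (a minimal sketch, not from the original snippet; the recording functions are illustrative):

```python
# Wrap each operand so it records when it is actually evaluated.
calls = []

def true():
    calls.append('true')
    return True

def false():
    calls.append('false')
    return False

false() and true()   # stops after false(): the 'and' result is already False
first = list(calls)

calls.clear()
true() or false()    # stops after true(): the 'or' result is already True
second = list(calls)

calls.clear()
true() and false()   # must evaluate both operands
third = list(calls)

print(first, second, third)  # ['false'] ['true'] ['true', 'false']
```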

See https://www.python.org/dev/peps/pep-0308/

Some evaluations are more costly than others, so an optimized ordering of expressions could yield better overall performance.

from time import sleep

def conditionA(x, y):
    sleep(1)
    return x == y

def conditionB(z, w):
    sleep(2)
    return z != w

# Let's take this original condition in code, and assume a 50:50
# chance for each of these condition functions to return True or False:
conditionB(z, w) and conditionA(x, y)

# 4 cases and their respective costs:
# True and True -> 3 seconds
# True and False -> 3 seconds
# False and True -> 2 seconds
# False and False -> 2 seconds
# Average: 2.5 seconds

So we get a discrete optimization problem to solve. The easiest approach is to try every ordering; in this case there are only two. Let's try the other one:

conditionA(x, y) and conditionB(z, w)

# True and True -> 3 seconds
# True and False -> 3 seconds
# False and True -> 1 second
# False and False -> 1 second
# Average: 2 seconds

Nice, this ordering yields a better overall performance.

So depending on the distribution of the input values (here we assumed each function returns False in 50% of the cases and True in the other 50%), we get different overall performance.
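The averages in the two tables above can be computed directly instead of by enumerating cases: in an `and` chain, the first operand always runs, and each later operand runs only if all earlier ones returned True. A minimal sketch of that formula (the function name is mine; the costs and probabilities match the example):

```python
def expected_and_cost(conditions):
    """Expected evaluation time of an and-chain, where each condition
    is a (cost_in_seconds, probability_of_True) pair."""
    total = 0.0
    p_reach = 1.0  # probability that evaluation reaches this operand
    for cost, p_true in conditions:
        total += p_reach * cost
        p_reach *= p_true  # later operands run only if this one was True
    return total

cond_a = (1.0, 0.5)  # conditionA: 1 second, True half the time
cond_b = (2.0, 0.5)  # conditionB: 2 seconds, True half the time

print(expected_and_cost([cond_b, cond_a]))  # 2.5, as in the first table
print(expected_and_cost([cond_a, cond_b]))  # 2.0, as in the second table
```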

If a user wants to run this analysis, the bear:

  1. tries to calculate a performance value for each condition evaluation (depending on the input values),
  2. from that calculates an overall performance estimate for each condition and each input value,
  3. and then optimizes the condition ordering using these performance estimates.
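The optimization step could, for the short chains typical in real code, simply be brute-forced over all orderings. A sketch under the assumption that each condition has been reduced to a (cost, probability-of-True) pair with independent operands (all names are illustrative):

```python
from itertools import permutations

def expected_and_cost(conditions):
    """Expected time of an and-chain of (cost, probability_of_True) pairs."""
    total, p_reach = 0.0, 1.0
    for cost, p_true in conditions:
        total += p_reach * cost   # this operand runs with probability p_reach
        p_reach *= p_true         # later operands run only if it was True
    return total

def best_and_ordering(conditions):
    # Try every permutation; fine for short chains.
    return min(permutations(conditions), key=expected_and_cost)

conds = [(2.0, 0.5), (1.0, 0.5), (3.0, 0.9)]
print(best_and_ordering(conds))  # ((1.0, 0.5), (2.0, 0.5), (3.0, 0.9))
```

For independent operands there is also a classic exchange-argument result that sorting by cost / (1 - probability_of_True) is optimal for `and` chains, which would avoid the factorial blow-up for longer chains.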

This is in fact a transformation from code statistics into a stochastic optimization problem.

This definitely requires coAST, and I know this is quite fancy, but I still wanted to write down the idea :+1:

meetmangukiya commented 7 years ago

(Participating)

parvmor commented 7 years ago

Hey, I would like to participate in establishing this bear.

I had the following idea in mind:

  1. Firstly, we would use a LocalBear, as analyzing multiple files at the same time should not make any difference.
  2. We will start by finding all the lines containing ' or ', ' and ', ' not ' and their symbols.
  3. Then we can separate out the operands (we could use the pyparse lib) and evaluate the time taken by each operand against the input distribution provided by the user. (Actually, we would preprocess the operands first, else we would have to run the program again for every conditional operator.)
  4. With that data in hand we can arrange the operands in the way that minimizes the average time taken.
  5. For finding the best ordering we most probably won't need an algorithm, assuming that in real life we write short conditional expressions; otherwise we can always add a proper algorithm.

Tell me if there are any mistakes or any ways to improve it. And what does coAST mean?

Makman2 commented 7 years ago

Firstly we would be using a LocalBear as analyzing multiple files at same time should not make any difference.

It does. Actually we have to analyze each function or, more precisely, each conditional expression. They may use performance information from functions not defined in the same module, maybe not even inside the same package.

and evaluate the time taken by each operand

Looks like you want to run the snippets. That is not always a good idea; there's code that must not run outside the normal program flow, like file_io_result = delete_file() or retry_delete()

(Actually we would preprocess the operands first else we would have to run program again for every conditional operator)

How do you want to preprocess them?

For finding the best way we most probably won't be needing an algo assuming that in real life we write short conditional operations, else we can always have a nice algo.

I mean we need some kind of functionality for that, don't we? :D

And what does coAST mean?

https://github.com/coala/coAST. We plan to implement a generic AST parser for all languages. So it's in principle an all-language (and maybe feature) replacement for pyparse. We have a gsoc project for that; I hope it gets accepted and we can finally start working on coAST, as this idea has been around for quite a while^^

parvmor commented 7 years ago

I was thinking of taking each operand, evaluating it first, and then using its result a second time in the actual conditional place. So what I wanted to preprocess was to first go through every file and parse out the operands from every conditional operator. This way we could get the timings without disturbing the flow of the program. But sometimes we use short-circuiting to our advantage; we would need to handle that carefully.

As for the fifth point, we can build a nested boolean tree and select the best combination while climbing up the tree.
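The nested-tree idea could look roughly like this: each leaf is a (cost, probability-of-True) pair, each inner node is an `and`/`or` over children, and a bottom-up pass reorders the children of every node while propagating each subtree's expected cost and truth probability upward. A greedy sketch assuming independent operands (all names and the tree encoding are mine, and treating a subtree's expected cost as a single number at its parent is a heuristic):

```python
def evaluate(node):
    """Return (expected_cost, p_true) for a subtree, reordering children."""
    if isinstance(node, tuple) and node[0] in ('and', 'or'):
        op, children = node
        stats = [evaluate(c) for c in children]
        # 'and' continues past a child only if it was True,
        # 'or' only if it was False.
        p_cont = (lambda p: p) if op == 'and' else (lambda p: 1 - p)
        # Exchange argument: sort by cost / probability-of-stopping.
        stats.sort(key=lambda cp: cp[0] / max(1 - p_cont(cp[1]), 1e-12))
        total, reach = 0.0, 1.0
        for cost, p_true in stats:
            total += reach * cost      # child runs with probability reach
            reach *= p_cont(p_true)    # probability the chain continues
        p_node = reach if op == 'and' else 1 - reach
        return total, p_node
    return node  # leaf: already a (cost, p_true) pair

# (2s cond AND 1s cond) OR a 3-second condition that is True 20% of the time
tree = ('or', [('and', [(2.0, 0.5), (1.0, 0.5)]), (3.0, 0.2)])
cost, p_true = evaluate(tree)
print(cost, p_true)  # 4.25 0.4
```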

And yes, I agree using a LocalBear would be a mistake. We need another bear that follows the flow of the program. For the input distribution we could then get the timings by running it only once.

aptrishu commented 7 years ago

As for the fifth point we can make a nested boolean tree and select the best combination while climbing up the tree.

How do you propose to do that ?

aptrishu commented 7 years ago

@Makman2 We can easily retrieve the conditional statements and methods used in them using AST. So, if I have code like

if a() and b():
    pass
if b() or a():
    pass

You are expecting users to input the cost of both functions, and the probability that they will short circuit ? then we can evaluate each arrangement.
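Retrieving those expressions really is straightforward with Python's own ast module (ast.unparse needs Python 3.9+); a minimal sketch over the snippet above:

```python
import ast

source = """
if a() and b():
    pass
if b() or a():
    pass
"""

# Walk the module and collect every boolean expression together with
# the source text of the operands it combines.
found = []
for node in ast.walk(ast.parse(source)):
    if isinstance(node, ast.BoolOp):
        op = 'and' if isinstance(node.op, ast.And) else 'or'
        found.append((op, [ast.unparse(v) for v in node.values]))

print(found)  # [('and', ['a()', 'b()']), ('or', ['b()', 'a()'])]
```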

aptrishu commented 7 years ago

Also, there are several other optimization opportunities using the AST, like:

 [x*2 for x in "abc" if False] => []
 [x*2 for x in []] => []

Makman2 commented 7 years ago

that's a different case though ;)

You are expecting users to input the cost of both functions

Don't know, ideally we don't require that and calculate it ourselves.

abishekvashok commented 7 years ago

@Makman2 a suggestion, maybe: check for the conditional statements, run the functions with these conditional statements, record the times and pipe them out to a file (as sometimes the log will be large). Then interchange the positions and pipe the result to the same file after a line. Then print the file and calculate the average for each function. It would output a log like the one below (this is for a single condition):

For stmt in line <line number>, current:
True and True -> 3 seconds
True and False -> 3 seconds
False and True -> 2 seconds
False and False -> 2 seconds
Average: 2.5 seconds
After interchanging the operands:
True and True  -> 3 seconds
True and False -> 3 seconds
False and True -> 1 second
False and False -> 1 second
Average: 2 seconds

Or pipe out only the average:

For stmt in line <line number>, current average: 2.5s
After interchanging operands it becomes: 2.0s

After the file's changed, it's reverted back. To revert the file we can either make use of git revert and commit the change, or make a temp copy and delete the file. Piping out only the average seems best! Your guesses?

Makman2 commented 7 years ago

piping it out to a file

why do you want to pipe it to a file? We can directly use the values returned by timeit or similar modules.
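For reference, timeit accepts a zero-argument callable directly and returns the elapsed seconds as a float, so no file is needed. A sketch (conditionA mirrors the example from the issue, with the sleep scaled down):

```python
import timeit
from time import sleep

def conditionA(x, y):
    sleep(0.01)  # scaled-down stand-in for the 1-second example above
    return x == y

# number=1 runs the callable once and returns the elapsed time in seconds.
elapsed = timeit.timeit(lambda: conditionA(1, 1), number=1)
print(elapsed)
```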

running the functions with these conditional stmts

That's what I want to avoid, as I stated earlier:

and evaluate the time taken by each operand

Looks like you want to run the snippets. That is not always a good idea, there's code that shall not run when the overall program doesn't run. Like file_io_result = delete_file() or retry_delete()

Another note:

Interchanging the positions and piping it out to the same file after a line.

We just have to execute each condition once :) If we know the execution times of each single condition, we can optimize the execution time without running them again (neglecting side effects).

abishekvashok commented 7 years ago

We should depend on tests to execute the condition. Then we would not need to interchange the position.

Makman2 commented 7 years ago

We should depend on tests to execute the condition.

Not a bad idea. Though the performance then relies on the test conditions; we need to take care of that somehow^^

Then we would not need to interchange the position.

I don't understand that; the whole issue is about doing so :)