codewars / codewars.com

Issue tracker for Codewars
https://www.codewars.com

Solution Execution Time Comparison #959

Open phil-the-frog opened 7 years ago

phil-the-frog commented 7 years ago

I suggest adding a small statistic indicating the solutions' average execution speed in the "Solutions" menu. And when you click the "Compare with your solution" button, it would compare the execution speeds of your solution and the other solutions. I feel this would give people room to improve their ideas about optimization.


10XL commented 7 years ago

You can fork a solution (which creates a kumite) and compare the performance of arbitrary code/solutions. Here's an example in Ruby (the output isn't displaying correctly, but you can fork and run the tests yourself).

nais86 commented 7 years ago

@10XL, you can do that, yes, but maybe the suggestion is about showing the execution time, and maybe also the memory used, on every posted solution; a Performance Rating, if you will. Clever solutions, or one-line solutions, would then have a clear Performance Rating that could be compared to other solutions.

This would add a new dimension to the Codewars platform. Experienced users could then submit solutions aimed at improving certain performance metrics (execution time, memory used, etc.), which could lead to more in-depth discussions, learning, and so on.

phil-the-frog commented 7 years ago

@esoNais Yes, you hit the nail right on the head. Aggregate performance statistics would be useful too: documenting the average execution time/memory used for a kata and then comparing your solution against that average.

Voileexperiments commented 7 years ago

Of course, the problem then is that you need to run the code a lot of times to get good results, which often takes more than 12 seconds.

(Statistics 101: To get a ±0.1% uncertainty, you need to run the code 10^6 times)
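
(Concretely, and assuming the run-to-run noise is on the order of the measurement itself, the relative standard error of the mean is roughly (σ/μ)/√n ≈ 1/√n, which only drops to about 0.1% around n ≈ 10^6 runs.)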

Also, how is runtime defined? Best case? Average case? Worst case? Mean, median, or quartiles? And how do you define an "average case"? What does a "random input" even mean for some problems?

phil-the-frog commented 7 years ago

@Voileexperiments Is 0.1% uncertainty really necessary? I'd be perfectly comfortable with even ±10%, simply because it's impractical to implement the means to reach 0.1%. I personally only want the mean of the solution times. If anyone wants anything else, they will have to consider the issues of implementing such a solution.

kevin-lindsay-1 commented 5 years ago

I second this idea; I think that "scoring" could be implemented similarly to how the Zachtronics games are scored. In the Zachtronics games, generally there are 3 scores that you shoot for, those being execution time/cycles, SLOC, and "resources" (in the games there is typically a resource management system of some kind).

Their programming games are TIS-100, Shenzhen I/O, and EXAPUNKS.

These games are great because the scoring system pushes you to compete against yourself, and I've had a few instances where the scores I reached for certain problems weren't even on the chart because so few people had optimized as heavily, which was exhilarating.

Olangu commented 5 years ago

Just want to second this.

It's horrible to try to weigh your own solutions against other people's solutions based only on how "beautiful" the code is.

In most cases you need to optimize for footprint, memory usage, or speed. We need those statistics on the solutions! Especially when learning a new language, you can't really have a gut feeling for the performance of different ways to solve a problem.

BTW, Is there any way to check the full tests that run on the problems?

Steffan153 commented 5 years ago

@Olangu

BTW, Is there any way to check the full tests that run on the problems?

Yes. You can click fork and view the tests, open a new translation and check the tests, or click "view test cases" on the discourse page.

heribertolugo commented 5 years ago

I found this page because I was searching for this feature. @Olangu is exactly correct. Sometimes clever code, a Perl-looking one-liner, might perform best, but it may also perform worst. Using built-in library tools (such as .NET LINQ) may hurt or improve performance as well.

A performance (speed) rating and a memory usage rating should be calculated automatically for each solution, and should be completely separate from users clicking like on "clever" or "best practice". Many users may not even have a solid understanding of what "best practice" or "clever" would actually mean in real-world scenarios; these two are relative to the audience, rather than to actual standards or veterans.

Having a performance (speed) rating and a memory usage rating would be a much more accurate way to gauge one's own code and see others that ranked higher, actually giving users a tool for improvement in both coding and understanding.

I created my own performance test locally against 10 other solutions, and I was actually surprised at how well some did. I used a sample data set of 25 and ran the whole set in each iteration. I found that 500 iterations or fewer usually gave one set of results, 1,000 to 10,000 seemed to produce a similar second set, and anything over that, up to 500,000, seemed to produce a third set of results. The solutions which performed great above 1,000 iterations also performed well below that, but there were some that outperformed the rest by orders of magnitude at low counts yet slowed to a crawl above 1,000.

So I would think the median from a similar approach would be reliable, or even just a flat number like 100,000 iterations over different sample data sets.

Nobody is expecting physics-equation accuracy for performance, just some kind of benchmark for assessing a fair performance ranking.
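
A minimal sketch of the kind of local harness described above (purely illustrative: the placeholder solutions, the sample data, and the iteration counts are made up, not taken from any kata):

#include <chrono>
#include <cstdio>
#include <string>
#include <vector>

// Placeholder "solutions"; in practice these would be the solutions copied
// from the kata's Solutions page.
static std::string solution_a(const std::string& s) { return std::string(s.rbegin(), s.rend()); }

static std::string solution_b(const std::string& s) {
  std::string r;
  for (auto it = s.rbegin(); it != s.rend(); ++it) r += *it;
  return r;
}

// Run one solution over the whole sample set `iterations` times and return
// the elapsed time in milliseconds.
template <typename F>
static double time_solution(F f, const std::vector<std::string>& samples, int iterations) {
  volatile std::size_t sink = 0;  // volatile so the calls aren't optimized away
  auto start = std::chrono::steady_clock::now();
  for (int i = 0; i < iterations; ++i)
    for (const auto& s : samples) sink = sink + f(s).size();
  auto end = std::chrono::steady_clock::now();
  (void)sink;
  return std::chrono::duration<double, std::milli>(end - start).count();
}

int main() {
  const std::vector<std::string> samples(25, "some representative kata input");
  for (int iterations : {500, 10000, 500000}) {
    std::printf("A x%d: %10.1f ms\n", iterations, time_solution(solution_a, samples, iterations));
    std::printf("B x%d: %10.1f ms\n", iterations, time_solution(solution_b, samples, iterations));
  }
}

Comparing the numbers across the three iteration counts is roughly what the comment above describes: some solutions only fall behind once the iteration count gets large.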

barakplasma commented 4 years ago

In 2016, this was Codewars's ( @jhoffner ) response to a similar request: https://github.com/Codewars/codewars.com/issues/651#issuecomment-257320006

(TL;DR: "The results are just too inconsistent to rely on, and if it's not reliable, there is no point in having it.")

I agree with @philsop and @heribertolugo that performance results don't need to be very accurate to be useful as a way of sorting unrated solutions.

FArekkusu commented 4 years ago

performance results don't need to be very accurate to be useful as a way of sorting unrated solutions

@barakplasma the execution time is not just "inaccurate" or "inconsistent"; it's completely unrepresentative of the real solution efficiency. A more or less accurate way to tell which one is faster is to check how far each can go without timing out, which would mean manually editing the test cases to run the function thousands of times or to generate huge inputs (whichever approach makes sense for that particular case). Remaining fair when the inputs for different solutions are different (due to randomness) is impossible; moreover, the runtime will sometimes jump wildly between runs, because a solution can be optimized to handle one sort of data efficiently but lose by a big margin on a different sort of data.

Everybody wants to have "execution time measurement" but nobody knows what an impossible feat it is to implement it in a remotely sensible way.

NathanHazout commented 4 years ago

Hi, I can't claim to know the inner workings of the current architecture of CodeWars; however, I just want to point out that the described behavior is something LeetCode manages to do pretty well.
In LeetCode as well problems and tests are user-provided, just like in CodeWars.

Below is a screenshot of what it looks like once you've submitted the solution.
Notice the "You are here!", and you can also click on any of the bars in the chart to see a sample solution that received that score.

(Two screenshots of LeetCode submission results, showing the runtime distribution chart with a "You are here!" marker.)

kazk commented 4 years ago

I can't claim to know the inner workings of the current architecture of CodeWars

Challenges on Codewars are based on unit tests, unlike most of the similar platforms, which take inputs from stdin. As with most things, both approaches have strengths and weaknesses. For example, our approach is more natural, but their approach has more control over how the solution is executed.

In LeetCode as well problems and tests are user-provided, just like in CodeWars.

I don't think so. As far as I know, their inputs are static (please correct me if I'm wrong). Their user-editable input looks like the following:

[2,7,11,15]
9

When you submit, your solution is tested against multiple inputs in the same format (they probably have some code that parses stdin, passes the values to the user's solution, and checks the output).
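
Presumably the harness on their side does something roughly like the following. This is only a sketch; the function name twoSum, the parsing, and the output format are illustrative, since the real LeetCode driver code isn't public:

#include <iostream>
#include <string>
#include <vector>

// Stand-in for the user's submitted solution (two-sum style signature).
std::vector<int> twoSum(const std::vector<int>& nums, int target) {
  for (std::size_t i = 0; i < nums.size(); ++i)
    for (std::size_t j = i + 1; j < nums.size(); ++j)
      if (nums[i] + nums[j] == target) return {static_cast<int>(i), static_cast<int>(j)};
  return {};
}

// Parse a line like "[2,7,11,15]" into a vector of ints.
static std::vector<int> parseArray(const std::string& line) {
  std::vector<int> out;
  std::string num;
  for (char c : line) {
    if (c == '-' || (c >= '0' && c <= '9')) num += c;
    else if (!num.empty()) { out.push_back(std::stoi(num)); num.clear(); }
  }
  if (!num.empty()) out.push_back(std::stoi(num));
  return out;
}

int main() {
  std::string arrayLine, targetLine;
  // Each test case is two lines of stdin: the array and the target.
  while (std::getline(std::cin, arrayLine) && std::getline(std::cin, targetLine)) {
    std::vector<int> result = twoSum(parseArray(arrayLine), std::stoi(targetLine));
    if (result.size() == 2)
      std::cout << '[' << result[0] << ',' << result[1] << "]\n";  // judged against the expected output
  }
}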

So on LeetCode, and most of the other similar platforms, testing the solution works like the following:

On Codewars, the following happens:

Even though LeetCode has significantly tighter control over how a solution is executed, you'll see different execution times when you submit the same solution multiple times. And they can differ enough that it sometimes tells you your solution is faster than 99% of submissions and sometimes 65%, for the same solution.

On LeetCode, this is not a huge deal because it's not too obvious, since you can't directly compare with other users (they have a "sample"). But imagine how you'd feel if you saw someone else at 99% while you're at 65% with the same approach. Also, I don't know how they deal with changes in their environment that affect performance (e.g., language version, VM performance, changes in inputs, etc.), but all of the past submissions would need to be reevaluated when that happens.

The only way to minimize the divergence is to do proper benchmarking and this can be extremely expensive. I don't think LeetCode does it and I don't think any other platform does it for their free services.

Showing the total test time of the solution at the time of the submission is not difficult, but it's not a really useful number and it can be extremely misleading.

So we'll need at least the following:

kazk commented 4 years ago

If we ever do something to compare performance, I'd probably try something that estimates the time complexity and classifies solutions based on that instead of the actual time. This should be relative to the current environment, so we wouldn't need to redo it whenever we make changes. I don't know exactly how to do it yet (maybe comparing the performance against fixed inputs of varying sizes will get us far enough).

Codility seems to take this approach in their reports, for example. (Screenshot of a Codility report.)
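
One crude way to get at that, purely as an illustration of the "fixed inputs of varying sizes" idea (the sorting stand-in and the sizes below are arbitrary): time the solution on inputs of doubling size and look at the ratio between successive timings.

#include <algorithm>
#include <chrono>
#include <cstdio>
#include <random>
#include <vector>

// Stand-in solution under test: sorting, which should show roughly O(n log n) growth.
static void solution(std::vector<int>& v) { std::sort(v.begin(), v.end()); }

static double timeAtSize(std::size_t n) {
  std::mt19937 rng(42);
  std::vector<int> v(n);
  for (auto& x : v) x = static_cast<int>(rng());
  auto start = std::chrono::steady_clock::now();
  solution(v);
  auto end = std::chrono::steady_clock::now();
  return std::chrono::duration<double>(end - start).count();
}

int main() {
  // If the time roughly doubles when n doubles, the solution behaves like O(n);
  // a ratio near 4 suggests O(n^2), and so on.
  double prev = timeAtSize(std::size_t{1} << 16);
  for (std::size_t n = std::size_t{1} << 17; n <= (std::size_t{1} << 20); n <<= 1) {
    double t = timeAtSize(n);
    std::printf("n=%zu  t=%.4f s  ratio=%.2f\n", n, t, t / prev);
    prev = t;
  }
}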

Voileexperiments commented 4 years ago

One thing I don't understand in this frequently recurring topic is why lots of the people involved have a clear obsession with performance (especially runtime) while displaying a uniform, indiscriminate disgust for "one-liner, ''''''''clever'''''''' solutions".

Ranking everything blindly is easy. Making it meaningful and informative is really, really hard. So far I only see all these people making suggestions about how to do it (as if to imply "look, making a chart is easy, WHY AREN'T YOU LAZY BUMS DOING IT?"), but not why to do it, or whether it should be done:

  1. Running code on a platform is not like your typical video game speedrunning, where if you make the same inputs it'll always end up on the same frame. All modern CPUs have scheduling. The only chance you're going to get that is with an MCU like an Arduino, but good luck with that; it's literally a potato.
  2. Random tests essentially mean the runtime will be different every run, so the statistics argument still applies. 0.1% is barely enough when you're talking about the top end, where many solutions are piled up together very closely. And this brings us to the next point:
  3. Benchmarking requires clearly defined, objective test cases (or "categories" in speedruns). How are you defining test cases for every kata? The nature of the test cases will decide what is fastest. A single "time" does not mean anything if you don't bring up the test cases used in measuring. And who's deciding the test cases? What counts as "sufficient test cases"? What should be tested? What about edge cases? So many questions, and none addressed by these people!
  4. And even in the end, I struggle to see how ranking tons of very, very close solutions together brings any meaning. Nobody gives a damn about micro-optimization compared to obvious complexity pitfalls and failures. If you don't like one-line ''''''''clever'''''''' solutions because they have poor time complexity, a runtime ranking chart isn't what you want. You should request something actually useful for this purpose, like "detected time complexity".
  5. The tone of some of these people is honestly giving the vibe of "we should get rid of one-line ''''''''clever'''''''' solutions because I don't like them". That's really, really toxic. There are a million ways of writing code to solve the same task, and they will have compromises and relative strengths and weaknesses.

So until the stances of the above-mentioned people are resolved, I propose we ignore whatever they say, for the sole reason that they have a clear bias and are not aimed at making things helpful compared to something like a "detected time complexity" feature. TBH it's just a nuisance at this point.

barakplasma commented 4 years ago

Just to add some background information:

When a user pays for Codewars Red, the user receives a head-to-head comparison feature. Part of the head-to-head comparison is a runtime speed comparison.

The following image is a screenshot from my view of a head-to-head comparison, since I pay for Codewars Red. Look below the solution, where it compares runtime for this kata. (Screenshot of the head-to-head page; the link is only accessible to those with Codewars Red.)

So this information is already collected, just not visible to everyone.

NathanHazout commented 4 years ago

I was not aware of the randomness of the tests; that certainly makes it more complex.

@Voileexperiments Regarding the "why", I'll explain my experience. I used LeetCode, and other such websites, to train for technical interviews. And there, performance is VERY VERY important, in most algorithms questions at least.

Now you are right that I don't really care about the exact number of milliseconds, or if I'm faster than some other guy, but time complexity yes! I just assumed measuring time is easier than detecting the time complexity. Maybe this assumption was wrong. I'd be really happy if I could get complexity feedback compared to other solutions; that's even better than exact timing.

I don't think ignoring is the right approach; understanding the users' personas and requirements is the goal. Ask your Product Management ;)

Voileexperiments commented 4 years ago

@nasht00 Yes, I know all of this comes from LeetCode, and no, it's not commonplace at all, nor is it representative of its actual usage. If you ask people on typical OJ platforms, or some OI/ACM participants, they'll tell you its main use is to compare very large time differences (e.g. 8 seconds vs. sub-second), not the tiny ones due to micro-optimization.

I don't think ignoring is the right approach

Not when the same dead horse has been beaten 1000 times in the past with all kinds of loaded requests of "please do this so one-liner, ''''''''clever'''''''' solutions would stop appearing among the solutions". It's extremely obnoxious and unhelpful.

hobovsky commented 4 years ago

I can see another potential problem with this feature:

User1 submits their solution, and the execution time gets measured as 100 ms. User2 submits an identical solution, and the execution time gets measured as 150 ms. They wonder what's going on, resubmit, and now they get 200 ms. Now I can see legions of newbies running around totally butthurt and complaining "booo hoooo LeetH4xX0r's solution is ranked at position 53 and mine is 376 and they are identical, whyyy!!1!!!1" I would not like this feature.

What I would like, though, is to somehow enforce a particular performance constraint on kata, even the easy ones. Make "Sum of numbers 1...n" fail with a non-O(1) solution, or "Sum of multiples of 3 or 5" fail with O(n). However, the problem is that this would make a lot of people complain about easy kata being too difficult (see "Is a number prime?"), or it would require creating another version of the kata, effectively creating almost-a-duplicate. Tiered kata would be a nice solution to this problem, I think.
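
For the "Sum of numbers 1...n" example, the enforcement would presumably work by making the inputs so large that only a closed-form solution finishes inside the usual time limit. Roughly, as a sketch (the function name and the input sizes are hypothetical, not from an actual kata):

#include <cassert>
#include <cstdint>

// Closed-form O(1) solution: 1 + 2 + ... + n = n * (n + 1) / 2.
// Doing the division first avoids overflowing uint64_t for large n.
std::uint64_t sum_to(std::uint64_t n) {
  return (n % 2 == 0) ? (n / 2) * (n + 1) : n * ((n + 1) / 2);
}

int main() {
  assert(sum_to(10) == 55);
  assert(sum_to(1000000) == 500000500000ULL);
  // Inputs this large are what would push a naive O(n) summation loop past the
  // time limit, especially when repeated over many random tests:
  assert(sum_to(4000000000ULL) == 8000000002000000000ULL);
  return 0;
}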

kazk commented 4 years ago

So this information is already collected, just not visible to everyone.

Yes, it's recorded at the time of the submission, but like I wrote above, it's not a useful number. I wouldn't have included it in the head-to-head compare either, but I wasn't involved back then.

And there, performance is VERY VERY important, in most algorithms questions at least.

There are some kata that have performance requirements. Not all kata need them.

Now you are right that I don't really care about the exact number of milliseconds, or if I'm faster than some other guy, but time complexity yes! I just assumed measuring time is easier than detecting the time complexity. Maybe this assumption was wrong.

It is very easy to show some number (I bet many won't notice even if I just did randint(100, 900)). It is difficult and resource-intensive to show a meaningful number. Detecting time complexity is not as easy, but it'll be a more useful and stable metric.

However, the problem is that this would make a lot of people complain about easy kata being too difficult (see "Is a number prime?"), or it would require creating another version of the kata, effectively creating almost-a-duplicate. Tiered kata would be a nice solution to this problem, I think.

Yeah, tiered/sequential kata can be nice to have. At Qualified, we have something like that. A candidate is required to solve a challenge and the subsequent challenge adds more requirements using the previous solution as a starting point.

Another issue with having easy and hard versions is spoilers. Sequential kata would prevent this too, by requiring one to solve the easy one first to proceed and not allowing them to see the solutions until all of the kata in the sequence are completed.

NathanHazout commented 4 years ago

Fair enough about exact timing.
Is there a separate issue tracking the idea of time complexity? I'd upvote that one instead.

Again, I don't care about "please do this so one liner, ''''''''clever'''''''' solution would stop appearing among the solutions" - not my case here.

Neither is "why am I ranked better than this guy" - I don't care either.

I only want to categorize the solutions, to know if my solution meets the expectations or if I'm very far off.

Steffan153 commented 4 years ago

Detecting time complexity is not as easy, but it'll be a more useful and stable metric.

That would indeed be useful, I'm terrible at detecting that :)

But that's a different issue, right?

If you do implement this, however, you can't do it for all the languages, tho, right? Just maybe JS and Python?

kazk commented 4 years ago

I only want to categorize the solutions, to know if my solution meets the expectations or if I'm very far off.

If you passed the tests in time, then your solution met the expectation. Try more difficult kata that require performance.

But that's a different issue, right?

I'm just saying that's probably what I'd do if we're going to do something to classify solutions by performance. We can either close this one and open a new issue for it or change this issue to a more generic request.

If you do implement this, however, you can't do it for all the languages, tho, right? Just maybe JS and Python?

Why do you think it's only JS/Python?

Anyways, it's not on our roadmap at the moment and it's not going to happen anytime soon.

Steffan153 commented 4 years ago

It depends on what you use for detecting time complexity; if you find some library, do you really think it's going to support 40 languages?

kazk commented 4 years ago

No? I don't know why we need a single library that supports 40 languages. That's like saying we can't test different languages because they all have different test frameworks. We don't need to add it to every language for every kata at once.

Like I wrote above, I don't know exactly how to do it yet. But each language can have its own ways as long as it can collect information and report back. It might be similar to how we handle the tests, but we get some data back instead of the test result. Again, I don't know about the details yet, but I don't think we need one library that supports many languages.

ssnyder-intrinio commented 2 years ago

I think finding and classifying the big O for an answer would be a better pursuit than actual execution times, because of the variability. This could involve scaling the inputs to different sizes and doing a best-fit curve. It wouldn't necessarily need to rank off that (but it could); just showing the big O under the answer would be informative.
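
A rough sketch of that best-fit idea (the linear-scan stand-in and the sizes are placeholders): time the solution at several input sizes and estimate the exponent k in t ≈ c·n^k from the least-squares slope of log(t) against log(n). A k near 1 suggests O(n), near 2 suggests O(n^2), and so on; log factors and constant overheads blur the estimate, so it's only a coarse classification.

#include <chrono>
#include <cmath>
#include <cstdio>
#include <numeric>
#include <vector>

// Stand-in solution under test: a linear scan, so the estimate should land near k = 1.
static long long solution(const std::vector<int>& v) {
  return std::accumulate(v.begin(), v.end(), 0LL);
}

int main() {
  std::vector<double> logn, logt;
  for (std::size_t n = std::size_t{1} << 18; n <= (std::size_t{1} << 23); n <<= 1) {
    std::vector<int> v(n, 1);
    auto start = std::chrono::steady_clock::now();
    volatile long long sink = solution(v);  // volatile so the call isn't optimized away
    auto end = std::chrono::steady_clock::now();
    (void)sink;
    logn.push_back(std::log(static_cast<double>(n)));
    logt.push_back(std::log(std::chrono::duration<double>(end - start).count()));
  }

  // Least-squares slope of log(t) vs. log(n) estimates the exponent k.
  double mx = 0, my = 0;
  for (std::size_t i = 0; i < logn.size(); ++i) { mx += logn[i]; my += logt[i]; }
  mx /= logn.size();
  my /= logt.size();
  double num = 0, den = 0;
  for (std::size_t i = 0; i < logn.size(); ++i) {
    num += (logn[i] - mx) * (logt[i] - my);
    den += (logn[i] - mx) * (logn[i] - mx);
  }
  std::printf("estimated exponent k ~ %.2f\n", num / den);
}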

arjuna-dev commented 9 months ago

I'd also like to see this feature! It is interesting how on LeetCode you can sometimes exceed the time limit if your algorithm is not performant enough. I also agree that the way to go is to classify answers by big O and not by execution time. Would be great!

Blind4Basics commented 9 months ago

It is interesting how on LeetCode you can sometimes exceed the time limit if your algorithm is not performant enough.

The time limit in CW is the very same thing...

arjuna-dev commented 9 months ago

It is interesting how on LeetCode you can sometimes exceed the time limit if your algorithm is not performant enough.

The time limit in CW is the very same thing...

Oh, sorry, I didn't know it also existed here; I hadn't come across it yet.

arjuna-dev commented 9 months ago

Maybe this is a good use case for the ChatGPT API or another LLM implementation. I tried it using the API playground with gpt-4-1106-preview:

I asked:

Can you give the big O for the given solutions of a cpp programming problem? This is the problem:

"The goal of this exercise is to convert a string to a new string where each character in the new string is "(" if that character appears only once in the original string, or ")" if that character appears more than once in the original string. Ignore capitalization when determining if a character is a duplicate.

Examples
"din"      =>  "((("
"recede"   =>  "()()()"
"Success"  =>  ")())())"
"(( @"     =>  "))((" 
Notes
Assertion messages may be unclear about what they display in some languages. If you read "...It Should encode XXX", the "XXX" is the expected result, not the input!"

This is solution 1:

#include <string>
#include <cctype>
#include <unordered_map>
using namespace std;

string duplicate_encoder(const string& word){
  unordered_map <char, int> char_map;
  string result;
  char state;
  for (size_t i=0; i<word.size(); i++) {
    char letter = tolower(word[i]);
    if (char_map.find(letter) != char_map.end()){
      int first_app_index = char_map[letter];
      // Check if it is second appearance (only one before)
      if (first_app_index != -1){
        result[first_app_index] = ')';
        char_map[letter] = -1;
      }
      // Repeated
      state = ')';
    } else {
      // Single
      state = '(';
    }
    char_map[letter] = i;
    result += state;
  }
  return result;
}

This is solution 2:

#include <string>
#include <cctype>
#include <map>

std::string duplicate_encoder(const std::string& word){

    std::map<char, int> table;
    for(auto x : word) table[std::tolower(x)]++;

    std::string result;
    for(auto x: word) result += (table[std::tolower(x)]==1)? "(":")";

    return result;
}

This is solution 3:

#include <string>
#include <cctype>
#include <iterator>
#include <algorithm>

std::string duplicate_encoder(const std::string& word){
    std::string encodedWord, comparisonWord;
    std::transform(begin(word), end(word), back_inserter(comparisonWord), ::tolower);

    std::transform(begin(comparisonWord), end(comparisonWord), back_inserter(encodedWord),
            [comparisonWord](char a) {
                int numDuplicates = count_if(begin(comparisonWord), end(comparisonWord), [a](auto elem) {return elem == a;});
                if (numDuplicates > 1) return ')';
                else return '(';
    });
    return encodedWord;
}

This is solution 4:

#include <string>
#include <cctype>
#include <limits>
#include <array>
#include <numeric>

std::string duplicate_encoder(const std::string& word) {
  std::array<int, std::numeric_limits<char>::max() + 1> charCounters{0};
  for(const auto& c : word) charCounters[std::tolower(c)]++;

  return std::accumulate(word.begin(), word.end(), std::string(), 
    [&charCounters] (const auto& sum, auto c)
      { return std::move(sum) + (charCounters[std::tolower(c)] > 1 ? ')' : '('); });
}

Also explain any other performance traits that you might find in the different solutions.

The answer was:

Let's analyze the time complexity for the provided C++ solutions.

Solution 1

string duplicate_encoder(const string& word){
  unordered_map <char, int> char_map;
  string result;
  char state;
  for (size_t i=0; i<word.size(); i++) {
    char letter = tolower(word[i]);
    // ... Rest of the code
  }
  return result;
}

Big O Analysis:

  • Time Complexity: O(n), where n is the length of the input word string. The algorithm loops once through all characters, and each individual operation inside the loop (character lookup, insertion, and transformation) has an average time complexity of O(1) using std::unordered_map.
  • Space Complexity: O(k), where k is the size of the character set involved in the input string. It is common to limit k to a constant for character encoding (such as ASCII or extended ASCII).

Performance Traits:

  • The unordered_map uses hashing, which provides average constant time complexity for insertion and lookup.
  • The solution maintains the indices of characters in the map and uses them to update the result string accordingly.

Solution 2

std::string duplicate_encoder(const std::string& word){
    std::map<char, int> table;
    // ... Rest of the code
}

Big O Analysis:

  • Time Complexity: O(n log n), where n is the length of the input word string. The std::map is typically implemented as a red-black tree, which has O(log n) complexity for insertions and lookups. The algorithm has two loops over the word, each taking O(n log n) time.
  • Space Complexity: O(k), where k is again the size of the character set involved in the input string, though typically not larger than n.

Performance Traits:

  • The std::map is ordered, which results in higher complexity for insertion and lookup compared to std::unordered_map.
  • The solution uses two passes: one to count occurrences, and another to assemble the resulting string.

Solution 3

std::string duplicate_encoder(const std::string& word){
    // ... Rest of the code
}

Big O Analysis:

  • Time Complexity: O(n^2), as std::transform is called twice with O(n) complexity, and inside the second transform's lambda function, std::count_if is called which also has O(n) complexity and it's invoked for each of the n characters.
  • Space Complexity: O(n), as it stores an additional lowercased copy of the string (assuming worst case where all characters are unique).

Performance Traits:

  • Not efficient because of the nested loop effect (std::count_if inside the std::transform), resulting in quadratic time complexity.
  • Constructs a lowercased copy of the word which occupies additional space.

Solution 4

std::string duplicate_encoder(const std::string& word) {
  // ... Rest of the code
}

Big O Analysis:

  • Time Complexity: O(n), where n is the length of the input word string. The array charCounters allows O(1) time access to character counts. The std::accumulate function processes every character exactly once.
  • Space Complexity: O(1), as the size of std::array is constant and based on the character range, not affected by the size of the input.

Performance Traits:

  • Uses an array to store counts which is very efficient in terms of access time and memory locality.
  • The std::accumulate function is used creatively to build the output string in a functional style.

From the analysis, solutions 1 and 4 are the most efficient in terms of time complexity with Solution 4 likely being efficient in terms of both time (due to the constant size of the array and therefore better cache locality) and space complexity. Solution 2, while more inefficient, will work fine for short strings. Solution 3 has the highest time complexity and should be avoided for large strings.

AmirTallap commented 6 months ago

How many languages do we have on Codewars? 43? If we can make this a separate service, and have someone test all problems with all working solutions to produce the real big O notation for the space and time used to execute, then that's a plan. I don't know how bad it would be on the current architecture, but doing this as a separate service is doable, right? What do you think?

AmirTallap commented 6 months ago

OK, after reading the entire thread and going through every nitty-gritty detail, I will share why I need this feature. For some time in the thread I thought this was a sales strategy to get people to subscribe to Codewars Red, yet the main problem remains!! Head-to-head comparison doesn't give me accurate info on where my solution stands relative to others.

And replying to @Voileexperiments: I've been involved with projects where not considering space and time complexity, or the big O notation, would cost megawatt-hours of losses!! I'm not here to compare or scream "hey, I'm better than someone else!!" No, every problem is unique; heck, I only learned one day that the two-pointers approach or the sliding window beats brute force because of LeetCode! There are other things some people, including me personally, want to learn and understand. It doesn't work that way, and people will whine whether you like it or not; this is human nature!!!

As for microsecond differences, I guess hitting re-run will explain to a newbie that resources might be busy?! Even these differences can be narrowed down using guaranteed resources and a FIFO execution model! As for random test cases, we can take the average, even if the input is very, very big! We can classify them. Heck, we can even rewrite the entire kata; more people are willing to contribute to this, including me.

I think shutting down discussion on this is not the brightest decision. And thank you @NathanHazout for this great comment.