Opinions/Help for Code Execution

tlyi commented 1 year ago

Hi guys! Have been wrecking my brain for this so would appreciate any insights/thoughts T__T

Current code execution flow: Step 1 Input arguments are parsed from the sample examples given in every question, and the "correct" types are deduced from this these. E.G: This will be parsed (purely from analysing the string value) and initialised by language:

Python: input = ['h','e','l','l','o'] Javascript: const input = ['h','e','l','l','o']; C++: int input[] = {'h','e','l','l','o'}; Java: String[] input = ['h','e','l','l','o'];

Step 2 User writes whatever code is needed to define their function. They then call the function assuming the variable input, get the return value, and print it to stdout. Why I do not automate the calling and printing process for them: For stricter languages like Java and C++, I must know the exact input and return types to call the function...

Step 3 We will take the user code, consider each test case, repeat the parsing process assuming the "correct" input type from (1), and behind the scenes, insert the code required to initialise the variable input into the user's code, so that the variable input is available for them to use. Finally, we send this complete code to the code judge API (judge0).

Note that in the template code, the comments on the top will be customised according to how we are parsing the input of the Example 1 test case in the question so that the user knows what variable is available for their use.

E.G: Testcase string: "['H','a','n','n','a','h']" User's code:

"""
This is how the code executor will process the input from testcases:
E.G: Example 1
input = ['h','e','l','l','o']
Please reserve these variables and 
use them in your functions where necessary.
"""

#TODO: change the function arguments below 
def reverseAString(arr):
    return arr[::-1]

#TODO: call your function and print its output below
print(reverseAString(input))

What we send to the API:

input = ['H','a','n','n','a','h']

"""
This is how the code executor will process the input from testcases:
E.G: Example 1
input = ['h','e','l','l','o']
Please reserve these variables and 
use them in your functions where necessary.
"""

#TODO: change the function arguments below 
def reverseAString(arr):
    return arr[::-1]

#TODO: call your function and print its output below
print(reverseAString(input))

Step 4 The code execution API will then run this code as a script, and (amongst other outputs like status and errors), return the stdout.

Step 5 Finally, for test cases that we know the correct ans for (the ones that we inserted ourselves), we compare their answers to the expected answers to see if it is correct:

For the test cases that are custom, we simply tell them whether the code has been executed successfully, and return the console output:

While this seems like it works fine, I realised that there are some problems for determining if an answer is "correct": A1. Regardless of the actual output type, we only have 2 strings to work with - 1 from the expected output, and 1 from stdout. This means that the types are not taken into account, and just because the strings are equal doesn't necessarily mean the answer is correct. A2. String comparison fails once the correct answer actually allows for some flexibility. Eg for the question "Repeated DNA Sequences", the question states that "You may return the answer in any order.". While it is possible to do some post-processing on the stdout string to account for this, it is quite impossible for the parser to deduce whether an array can be in any order since there is limited information available from the question database.

More importantly, I realised that there are BIG problems once we work with languages other than Python/Javascript: B1. Printing of arrays is tedious for Java, and super tedious for C++. In Java, you need to call .toString(), and in C++, you need to loop through it and call cout. Not sure if this is the most user friendly way to do it... B2. Even after the user prints the array in C++, the information available from stdout is vague -- "123" will be printed the same way as {"1", "2", "3"}

Basically, I find that it is quite difficult to implement checking of "correctness", because without an actual solution code to compare the expected return types and values, there are cases that string comparison cannot do the job.

My suggestion is to claim that we can only check for correctness for Python and Javascript (even though there are cases where even this approach will fail with these languages). For other languages, the code execution feature is simply a script runner... Not sure if you guys are ok with this??

Anyways, I might be approaching this completely wrongly, so let me know if you guys have any suggestions... Thank you :"D

c0j0s commented 1 year ago

Seems like code execution is more complicated than we anticipated. Generally, I think your approach seems logical and well thought out, solid stuff so far 👍

The major roadblock right now is the argument passing for static-type languages. not sure if you are aware, cpp and java seem to support type inferencing in their newer versions. I haven't tried them out before, so uncertain whether it suits our scenario. Another way could be adding code execution-specific metadata for questions, like arg type or size, etc. I'm pretty sure Leetcode does that in the background.

As for verifying the correctness of the results, we could modify our method in the default boilerplate to return specific types? so instead of asking the user to cout the value, their solution should return a value with the correct type. Again, I'm sure Leetcode has question-specific test driver codes to verify the correctness, plus they have the solution too.

Considering the unanticipated complexity, I recommend that we reconsider the implementation of code execution. Given the substantial effort and time remaining, I suggest we move on to other closure activities like code polishing, testing, and documentation instead. Plus, we already have enough nice-to-have features.

If we are committed to it, I guess we'll need to involve additional members; it's definitely not a one-person job.

What do you guys think?

tryyang2001 commented 1 year ago

Thank you so much for working on the complex logics for code execution service! Consider the time remaining, we might not be able to achieve full functionality for the code execution. But I think it will be better to have back-up nice-to-have features and this code execution is very cool!

So to simplify the problem, I have a few suggestions:

For stricter languages like Java and C++, I must know the exact input and return types to call the function...

For this, I suggest we only provide detailed support for Python and Javascript, then we don't need to worry about whether we need to follow those stricter languages execution code flow... As the project only asked for a working MVP, so it is fine to only support Python and Javascript, and we can justify this design choice because we think Python is more popularly used in solving technical assessments and easier to deliver considered the time constraint 😢 .

just because the strings are equal doesn't necessarily mean the answer is correct.

This is fine for me. As we don't have a test case database, we can't really justify if a user code is working. We just need to verify if the actual output matches the expected output. What do you think?

3.it is quite impossible for the parser to deduce whether an array can be in any order

Ah I see... This is tough. I think, how about we just don't justify if the user output is correct or wrong? As in we just provide the functionality to execute the code and that's all?

My suggestion is to claim that we can only check for correctness for Python and Javascript (even though there are cases where even this approach will fail with these languages). For other languages, the code execution feature is simply a script runner... Not sure if you guys are ok with this??

Ah just saw this after going through the above questions xD. OK then I totally agree with you that let's don't check the correctness, or perhaps just check for those that can be done easily. So if you agree with the former, then we just provide code execution functionality and that's it, whether the code can compile, is correct or not, is totally up to the user (though we probably need to handle some errors). If we go with the latter, we need to carefully filter which questions can be checked, which will be more time-consuming.

Once again, thanks so much for going through all the thought process and implementation! Respect!!!

tryyang2001 commented 1 year ago

I suggest we move on to other closure activities like code polishing, testing, and documentation instead

Yes, I agree with you that we should start working on all these. But I think we can let leyi completed her code execution feature while others start working first. Since she has been spending 2 weeks working on the feature, and I think it is almost done if we simplify our requirements.

tlyi commented 1 year ago

not sure if you are aware, cpp and java seem to support type inferencing in their newer versions

Oh damn I actually didn't know about this HAHA thanks!! That seems to make things easier...

Another way could be adding code execution-specific metadata for questions, like arg type or size, etc. I'm pretty sure Leetcode does that in the background.

Again, I'm sure Leetcode has question-specific test driver codes to verify the correctness, plus they have the solution too.

Yup! I think Leetcode does this for every question and language. I wanted to implement code execution without making major changes to our question database, especially because it will be difficult to update manually once there is a big stream of new questions coming in from the web scraping feature. Plus, I personally do not think the additional effort is worth it for this feature :P

Considering the unanticipated complexity, I recommend that we reconsider the implementation of code execution.

Me too, but I would suggest we keep it the way it is now. I just took another look at the definition for the feature, and it seems to only ask for code compilation. I think it is good enough that we have compilation + limited support for test case automation :P

tlyi commented 1 year ago

OK then I totally agree with you that let's don't check the correctness, or perhaps just check for those that can be done easily.

I think i'm leaning towards the latter option, just for Python and Javascript. As for questions with additional flexibility, I guess the users can just double check their answers manually?!?! HAHA :P

Just wanted to know if you guys are okay with this looser definition for code execution!

weidak commented 1 year ago

thanks for all the hard work though so far though 😓! I was really not sure how complex this was when u brought this up to me on thursday but wow now that u typed it out... there are endless possibilities of how this checking of correctness may go wrong :O I mainly align with Rui yang that we should simplify the requirements and work towards that direction as this nice to have looks way too huge to be implemented + tested + deployed in 2 weeks.

My suggestion is to claim that we can only check for correctness for Python and Javascript (even though there are cases where even this approach will fail with these languages).

A potentially large issue is how this correctness will be integrated together with the assignment 6, where we are able to populate our database with the serverless function. A use case of our website can be:

Scrape new question
Match with partner in Python/JS with new question
Code execution

and if any of these parts fail (especially since we claim python/js works), we may be penalized for this.

For other languages, the code execution feature is simply a script runner... Not sure if you guys are ok with this??

Personally, I am all for this as thats the requirements of the nice to have...

tryyang2001 commented 1 year ago

If we go with:

Scrape new question
Allow code execution with any language that we supported, but only as a simple script runner.

Then we don't need to care about if there is any failure, or whether the input/output matches the expectation, as it is up to the user to let the code works correctly as intended. Then we won't get any penalisation if there are any bugs.

@weidak actually makes a very good point, if we allow auto question population without human intervention to check, then there might be quite a number of unseen bugs for checking correctness. If you really want to check code correctness for Python and Javascript, we might need to carefully design our logic and able to deal with any circumstances. Not sure if this is hard or not, FYI @tlyi .

tryyang2001 commented 1 year ago

Thanks for the great work. Closing this as we have merged code execution to master!

CS3219-AY2324S1 / ay2324s1-course-assessment-g05

Opinions/Help for Code Execution #130