LAION-AI / Open-Assistant

OpenAssistant is a chat-based assistant that understands tasks, can interact with third-party systems, and retrieve information dynamically to do so.
https://open-assistant.io
Apache License 2.0
36.99k stars 3.23k forks

Multi-language code competition / challenge dataset #391

Closed extreme4all closed 1 year ago

extreme4all commented 1 year ago

similar to #330

Applications like Codewars and LeetCode give you prompts with a use case, sample data, and expected results. They offer these for many languages, let you see working solutions, and track each function's runtime and memory usage.
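As a rough illustration of the runtime and memory tracking mentioned above, here is a minimal sketch using only the Python standard library. The `measure` helper and the `two_sum` sample solution are hypothetical names introduced for this example; they are not part of Codewars, LeetCode, or Open-Assistant, and real judges measure these values differently.

```python
import time
import tracemalloc
from typing import Any, Callable


def measure(func: Callable[..., Any], *args: Any) -> dict:
    """Run func once and report its result, wall-clock time, and peak memory."""
    tracemalloc.start()
    start = time.perf_counter()
    try:
        result = func(*args)
        elapsed = time.perf_counter() - start
        # get_traced_memory() returns (current, peak) in bytes.
        _, peak = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    return {"result": result, "seconds": elapsed, "peak_bytes": peak}


def two_sum(nums: list[int], target: int) -> list[int]:
    """Hypothetical sample solution: indices of two numbers that add up to target."""
    seen: dict[int, int] = {}
    for i, n in enumerate(nums):
        if target - n in seen:
            return [seen[target - n], i]
        seen[n] = i
    return []


report = measure(two_sum, [2, 7, 11, 15], 9)
print(report["result"], report["seconds"], report["peak_bytes"])
```

A single run like this is noisy; a dataset builder would probably repeat the measurement and store a median.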

What would be required:

What would be optional to measure:

Why is this useful?

Language models are used to create and explain code. By giving the model prompts with many working implementations, we can let the assistant generate:

extreme4all commented 1 year ago

semi related to #279

huu4ontocord commented 1 year ago

Hi, are you proposing to actually hold a competition, or to scrape data from a competition?

theblackcat102 commented 1 year ago

@extreme4all we will just seed our model from a pretrained language model from codefor this and see the result. Do you have any suggestions for existing public datasets?

extreme4all commented 1 year ago

> Hi, are you proposing to actually hold a competition, or to scrape data from a competition?

Both. Maintaining our own dataset would probably give the highest quality for the purposes of the project. Initially, questions and solutions could be scraped from competitions, but do they allow us to do this?

> @extreme4all we will just seed our model from a pretrained language model from codefor this and see the result. Do you have any suggestions for existing public datasets?

I don't know of any public dataset. What is codefor?

extreme4all commented 1 year ago

@theblackcat102 I found this dataset: https://github.com/codereport/LeetCode

momegas commented 1 year ago

Also, this one looks promising: https://huggingface.co/datasets/bigscience/xP3/tree/main/code

huu4ontocord commented 1 year ago

@extreme4all would you want to add this dataset to ours in instruction -> answer form? We would need to make sure the code is executable.
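To make "instruction -> answer form" concrete, here is a minimal sketch of what one record could look like when serialized as JSON Lines. The `CodeExample` class and its field names (`instruction`, `answer`, `language`) are assumptions made for illustration; the thread does not specify the actual schema used by the project.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class CodeExample:
    """One hypothetical instruction -> answer pair for a coding problem."""
    instruction: str  # the problem statement shown to the model
    answer: str       # a working solution, as source code
    language: str     # programming language of the answer


def to_jsonl(examples: list[CodeExample]) -> str:
    """Serialize the examples as one JSON object per line."""
    return "\n".join(json.dumps(asdict(e), ensure_ascii=False) for e in examples)


examples = [
    CodeExample(
        instruction="Write a function add(a, b) that returns the sum of two integers.",
        answer="def add(a, b):\n    return a + b\n",
        language="python",
    )
]
jsonl = to_jsonl(examples)
print(jsonl)
```

Keeping the language as its own field would let the same problem appear once per language, which fits the multi-language goal of this issue.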

huu4ontocord commented 1 year ago

@momegas there is another issue about dealing with xP3, but no one has been assigned to it. Would you be interested in working on it?

momegas commented 1 year ago

I'm not sure what the expected outcome of this issue is, though. A well-curated dataset uploaded to a Hugging Face repo?

extreme4all commented 1 year ago

> @extreme4all would you want to add this dataset to ours in instruction -> answer form? We would need to make sure the code is executable.

Indeed, I think it would be valuable to have such a dataset in instruction -> answer form.

And indeed, we need to validate that the code is executable and passes some tests.
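One possible way to check that a solution is executable and passes its tests is to run both in a separate Python process and look at the exit status. This is only a sketch under stated assumptions: `passes_tests` is a hypothetical helper, the tests are plain `assert` statements, and a subprocess is not a security sandbox, so untrusted scraped code would need real isolation.

```python
import subprocess
import sys
import tempfile
from pathlib import Path


def passes_tests(solution_code: str, test_code: str, timeout: float = 10.0) -> bool:
    """Return True if the solution followed by its tests runs and exits with status 0."""
    with tempfile.TemporaryDirectory() as tmp:
        script = Path(tmp) / "candidate.py"
        script.write_text(solution_code + "\n\n" + test_code, encoding="utf-8")
        try:
            proc = subprocess.run(
                [sys.executable, str(script)],
                capture_output=True,
                timeout=timeout,
            )
        except subprocess.TimeoutExpired:
            # Treat solutions that hang as failures.
            return False
    # Syntax errors, uncaught exceptions, and failed asserts all give a non-zero status.
    return proc.returncode == 0


solution = "def add(a, b):\n    return a + b\n"
broken = "def add(a, b):\n    return a - b\n"
tests = "assert add(2, 3) == 5\n"

print(passes_tests(solution, tests))
print(passes_tests(broken, tests))
```

Only records for which the check succeeds would then be kept in the dataset.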

zirui commented 1 year ago

I have two ideas:

olliestanley commented 1 year ago

We are now using code contest and leetcode data