CarperAI / Code-Pile

This repository contains all the code for collecting large scale amounts of code from GitHub.
MIT License
105 stars, 29 forks

Programming Contest Sites #7

Closed PhungVanDuy closed 1 year ago

PhungVanDuy commented 1 year ago

Title

Dataset URL - Collect Dataset from Programming Contest Sites

Does the dataset exist in a scraped format? Some resources are already available, such as CodeContest from DeepMind, APPS, and LeetCode.

Description

Code data from competitive programming sites is a high-quality resource for code generation. Websites such as Codeforces and AtCoder provide good resources on competitive programming problems and solution code.

Procedure

Dahoas commented 1 year ago

Seems like CodeContest is missing TopCoder, so perhaps we should scrape TopCoder ourselves. Also HackerRank.

PhungVanDuy commented 1 year ago

Yes, I will try to crawl data if needed. Thank you, that's a good point!

Dahoas commented 1 year ago

No worries, I have a TopCoder crawler!

PhungVanDuy commented 1 year ago

Great! Any PR?

PhungVanDuy commented 1 year ago

I just checked HackerRank and LeetCode: we can scrape the programming problems, but the solution code does not seem to be available.

PhungVanDuy commented 1 year ago

@Dahoas any update on crawling Topcoder?

PhungVanDuy commented 1 year ago

The DeepMind CodeContest dataset is available under EleutherAI/lm_dataformat. Each document includes the problem description, test cases, solutions, and incorrect solution code.

Please check this link.

faraday commented 1 year ago

@PhungVanDuy For LeetCode, maybe we can scrape solutions from the discussions, filtering by votes. This issue is set up to track multiple contest sites. It may be a better idea to track them separately if we are going to do post-processing.
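
To make the vote-filtering idea concrete, here is a minimal sketch. The post structure (`votes`, `code`) and the threshold are assumptions for illustration only, not the actual shape of LeetCode discussion data.

```python
def select_top_solutions(posts, min_votes=10, limit=None):
    """Keep discussion posts with at least `min_votes`, highest-voted first.

    `posts` is a list of dicts with hypothetical keys "votes" and "code".
    """
    kept = [p for p in posts if p["votes"] >= min_votes]
    kept.sort(key=lambda p: p["votes"], reverse=True)
    return kept if limit is None else kept[:limit]


# Made-up sample posts, only to show the filter in action.
sample_posts = [
    {"votes": 3, "code": "solution a"},
    {"votes": 120, "code": "solution b"},
    {"votes": 45, "code": "solution c"},
]
top_posts = select_top_solutions(sample_posts, min_votes=10)
```

The threshold would need tuning per problem, since popular problems attract far more votes than obscure ones.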

PhungVanDuy commented 1 year ago

That's a good idea. Let me check it. Thank you so much!

PhungVanDuy commented 1 year ago

I just figured out a way to crawl data from Topcoder:

Notes below:

  1. Access this link to get all problem statements: https://www.topcoder.com/tc?module=ProblemArchive
  2. For each problem, scroll down to find the link to the contest that the problem belongs to (see image 1). [Screenshot: Screen Shot 2022-09-23 at 14 44 16]
  3. Follow that link to get the source code of the solutions. [Screenshot: Screen Shot 2022-09-23 at 14 52 29]

The source code looks like this:

[Screenshot: Screen Shot 2022-09-23 at 15 05 35]
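
The link-following steps above could be sketched roughly as below, using only the standard library. This is not the scraper from the PR: the sample HTML and the `/hypothetical/problem/` path marker are invented for illustration, and the real Topcoder pages would need their own markers.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

ARCHIVE_URL = "https://www.topcoder.com/tc?module=ProblemArchive"


class LinkCollector(HTMLParser):
    """Collect absolute URLs of <a> links whose href contains a marker substring."""

    def __init__(self, base_url, marker):
        super().__init__()
        self.base_url = base_url
        self.marker = marker
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if href and self.marker in href:
            # Resolve relative links against the page they came from.
            self.links.append(urljoin(self.base_url, href))


def extract_links(html, base_url, marker):
    parser = LinkCollector(base_url, marker)
    parser.feed(html)
    return parser.links


# Made-up page content; the real archive page has a different structure.
SAMPLE_HTML = """
<a href="/hypothetical/problem/1">Problem A</a>
<a href="/about">About</a>
<a href="/hypothetical/problem/2">Problem B</a>
"""
problem_links = extract_links(SAMPLE_HTML, ARCHIVE_URL, "/hypothetical/problem/")
```

The same helper could be reused at each hop (archive page to problem page, problem page to contest page, contest page to solutions) by changing only the marker.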

DeepMind CodeContest does not include problems from Topcoder, so if we can get them, they would be a good contribution to the Programming Contest dataset.

By my estimate (from counting contests), Topcoder has more than 5000 problems; my only worry is the license. The Topcoder statement reads: "This problem statement is the exclusive and proprietary property of TopCoder, Inc. Any unauthorized use or reproduction of this information without the prior written consent of TopCoder, Inc. is strictly prohibited. (c)2022, TopCoder, Inc. All rights reserved."

Any comments on this, @ncoop57?

ncoop57 commented 1 year ago

It looks like we need to email them to see their full policy. @PhungVanDuy Lemme contact someone to see how we can reach out to them about this.

PhungVanDuy commented 1 year ago

Great! Meanwhile, I will keep crawling this one. Louis suggested that, in the worst case, we can index the data but keep it to ourselves, so that we can finish the first version of the Competitive Programming Dataset. Thanks!

faraday commented 1 year ago

I started working on LeetCode.

PhungVanDuy commented 1 year ago

@ncoop57 Here is the PR for the TopCoder scraped dataset: https://github.com/CarperAI/Code-Pile/pull/24. I still need to run it in batches to get the full data, and then integrate it with CodeContest.

ncoop57 commented 1 year ago

@PhungVanDuy @faraday @Dahoas please add the following information to the issue description:

Tests

Include a dummy_dataset.parquet file to test your code against. This dummy_dataset should include the columns for the data and metadata associated with the dataset, which will then be converted into the final format for language model consumption, along with an example row or rows that you can verify your code correctly collects. In addition to this file, include the unit test that evaluates your code against this dummy_dataset.

Give an example of the columns and data:

col1 col2 ....
row1 row1 ....
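
As a rough illustration of such a test, the sketch below checks dummy rows against an expected column list. The column names and the sample row are hypothetical, and the rows are written inline so the example stays standard-library only; in practice they would be loaded from dummy_dataset.parquet.

```python
import unittest

# Hypothetical column names; the real ones depend on the dataset.
EXPECTED_COLUMNS = ["title", "source", "problem", "solutions"]

# Stand-in for rows loaded from dummy_dataset.parquet, e.g. with
# pandas.read_parquet("dummy_dataset.parquet").to_dict("records").
DUMMY_ROWS = [
    {
        "title": "A+B",
        "source": "example",
        "problem": "Add two integers.",
        "solutions": ["print(sum(map(int, input().split())))"],
    },
]


def validate_rows(rows, expected_columns):
    """Return a list of problems found; an empty list means the rows are valid."""
    problems = []
    for i, row in enumerate(rows):
        missing = [c for c in expected_columns if c not in row]
        if missing:
            problems.append(f"row {i}: missing columns {missing}")
    return problems


class DummyDatasetTest(unittest.TestCase):
    def test_dummy_rows_are_valid(self):
        self.assertEqual(validate_rows(DUMMY_ROWS, EXPECTED_COLUMNS), [])

    def test_missing_columns_are_reported(self):
        self.assertEqual(len(validate_rows([{"title": "A+B"}], EXPECTED_COLUMNS)), 1)
```

Saved as a module, this can be run with `python -m unittest`.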

PhungVanDuy commented 1 year ago

@ncoop57 Happy to share that the first version of the Competitive Programming Dataset for Code Pile is finalized. It includes CodeContest (train split only, to keep later benchmarks fair) and TopCoder, for a total of 18267 problems with solution code; for CodeContest we also have incorrect solutions. I believe this is currently the largest CP dataset. I pushed the parquet file here. Please check it and let me know if you have any concerns about this dataset.

ncoop57 commented 1 year ago

@PhungVanDuy awesome job Duy! One thing though, when you store the data in parquet files, please store the metadata as separate columns instead of in the HTML-style tags as the parquet is an intermediate storage for now until it gets converted into the final jsonl format to be fed to the large language model. So, for example, the parquet should have a format like this:

title  source  tags  difficulty  problem  hint  solutions  ...
row1   row1    row1  row1        row1     row1  row1       ....

Similar to https://huggingface.co/datasets/deepmind/code_contests#data-instances

This parquet will later be converted into the format you currently have with the HTML-style tags. Sorry for the confusion 😅
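
A minimal sketch of that later conversion step, assuming each parquet row is available as a dict keyed by the columns in the example header above. The exact tag layout and the `text` key in the JSONL output are assumptions, since the final format had not been decided at this point.

```python
import json

# Column order taken from the example header above.
FIELDS = ["title", "source", "tags", "difficulty", "problem", "hint", "solutions"]


def row_to_tagged_text(row):
    """Render one row as HTML-style tagged text (hypothetical layout)."""
    parts = []
    for field in FIELDS:
        value = row.get(field)
        if value is None:
            continue
        # List-valued columns (e.g. several solutions) get one tag per item.
        values = value if isinstance(value, list) else [value]
        for item in values:
            parts.append(f"<{field}>{item}</{field}>")
    return "\n".join(parts)


def rows_to_jsonl(rows):
    """Serialize rows as JSON Lines, one document per line."""
    return "\n".join(json.dumps({"text": row_to_tagged_text(r)}) for r in rows)


# Made-up row for illustration.
example_row = {"title": "A+B", "source": "example", "solutions": ["s1", "s2"]}
example_text = row_to_tagged_text(example_row)
example_jsonl = rows_to_jsonl([example_row])
```

Keeping the metadata in columns means a change to the tag layout only touches `row_to_tagged_text`, not the stored parquet.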

PhungVanDuy commented 1 year ago

@ncoop57 Got it. Do you also want me to produce the lm_dataformat output with the HTML-style tags, in addition to the parquet file described above?

ncoop57 commented 1 year ago

Not quite yet @PhungVanDuy, as we are going to finalize the format during our weekly meeting, so it might change slightly.

PhungVanDuy commented 1 year ago

Got it, no worries. I have a function in the code that generates the HTML, so it is easy to change.

P.S. Hint is a new field that I added from my earlier research (I collected tutorials from Codeforces and wrote them in LaTeX form). We can keep this field to explore later how tutorials affect models.

PhungVanDuy commented 1 year ago

@ncoop57 I updated the files at this link. I also refactored the CP dataset code (this file does no scraping, it just fetches the dataset from Google Drive); only Topcoder has scraper code. I also added a unittests.py file and dummy data.

PhungVanDuy commented 1 year ago

@faraday You can post updates about LeetCode here to keep track of progress.

faraday commented 1 year ago

Here are some statistics about the LeetCode data:

I have used asyncio and aiohttp in the scraping code, but server-side rate limits are the bottleneck when downloading topic data. I think it's best to take the snapshot (time-tagged 20220928) and shape the scraper as a delta update from there.
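
For illustration, a delta update with bounded concurrency might look like the sketch below. It is not the actual scraper: `fetch_topic` is a placeholder for a real HTTP request (for example with aiohttp), and the topic records, the `updated` field, and the concurrency limit are assumptions.

```python
import asyncio
from datetime import date

SNAPSHOT_DATE = date(2022, 9, 28)  # the 20220928 snapshot mentioned above


def topics_needing_update(topics, snapshot_date=SNAPSHOT_DATE):
    """Keep only topics changed after the snapshot (the delta)."""
    return [t for t in topics if t["updated"] > snapshot_date]


async def fetch_topic(topic_id):
    # Placeholder for a real HTTP request; returns a fake payload.
    await asyncio.sleep(0)
    return {"id": topic_id, "body": f"topic {topic_id}"}


async def fetch_all(topic_ids, max_concurrent=2, delay_seconds=0.0):
    """Fetch topics with at most `max_concurrent` requests in flight."""
    semaphore = asyncio.Semaphore(max_concurrent)

    async def fetch_one(topic_id):
        async with semaphore:
            result = await fetch_topic(topic_id)
            # Optional pause before releasing the slot, to stay under rate limits.
            await asyncio.sleep(delay_seconds)
            return result

    # gather returns results in the same order as the inputs.
    return await asyncio.gather(*(fetch_one(t) for t in topic_ids))


# Made-up topic index for illustration.
topics = [
    {"id": 1, "updated": date(2022, 9, 1)},
    {"id": 2, "updated": date(2022, 10, 3)},
    {"id": 3, "updated": date(2022, 11, 15)},
]
stale = topics_needing_update(topics)
results = asyncio.run(fetch_all([t["id"] for t in stale]))
```

Only topics 2 and 3 are fetched here, since topic 1 was last updated before the snapshot.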

faraday commented 1 year ago

As an example, Two Sum has around 11k related topics, many of them from people posting their solutions in different languages (probably covering most programming languages). If we represent topics as a list of solutions stored together with the question, I don't think databases could cope well with that. We'll discuss this more; I'm noting it here. We should probably aim to represent it as Forum, as you previously suggested.