Closed (faraday closed this 1 year ago)
I'll add a commit to represent the data in lm_dataformat.
@faraday can you add me to your repo? I will change the URLs to S3 URLs.
@PhungVanDuy I just added you. Thanks
Thank you so much. Please update the code if you have any new changes. We are going to review the PRs and merge soon. Thank you so much :)
@faraday can you enable me to make changes by checking this off please? https://github.blog/2016-09-07-improving-collaboration-with-forks/
So far, it looks great!
@ncoop57 I just added you to my fork (I read the article and thought you needed me to add you).
Thanks so much @faraday !
First download leetcode.tar.bz2 (502 MB), which includes these files when unpacked:
questions.jsonl (8.1 MB)
topics.jsonl (3.47 GB)
comments.jsonl (450 MB)
comment_replies.jsonl (233 MB)
Group and collapse replies to their comments, comments to their topics.
Join topics and their questions.
Save this final table in Parquet format.
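A minimal sketch of the group, join, and save steps listed above, in plain Python. The field names used here (`id`, `comment_id`, `topic_id`, `question_id`) and the helper names (`read_jsonl`, `nest`, `build_table`, `save_parquet`) are assumptions made for illustration; the actual schema of the JSONL files and the code in this PR may differ.

```python
import json
from collections import defaultdict


def read_jsonl(path):
    """Yield one dict per non-empty line of a JSONL file."""
    with open(path, encoding="utf-8") as fh:
        for line in fh:
            if line.strip():
                yield json.loads(line)


def nest(children, parents, parent_key, field):
    """Attach children to parents under `field`.

    A child belongs to the parent whose "id" equals child[parent_key].
    Both key names are hypothetical, not taken from the real data.
    """
    grouped = defaultdict(list)
    for child in children:
        grouped[child[parent_key]].append(child)
    return [{**parent, field: grouped.get(parent["id"], [])} for parent in parents]


def build_table(questions, topics, comments, replies):
    # Collapse replies into their comments, then comments into their topics.
    comments = nest(replies, comments, "comment_id", "replies")
    topics = nest(comments, topics, "topic_id", "comments")
    # Join each topic with its question (inner join: topics whose
    # question is unknown are dropped).
    by_id = {q["id"]: q for q in questions}
    return [
        {**topic, "question": by_id[topic["question_id"]]}
        for topic in topics
        if topic["question_id"] in by_id
    ]


def save_parquet(rows, path):
    # Needs pandas plus a Parquet engine such as pyarrow installed.
    import pandas as pd

    pd.DataFrame(rows).to_parquet(path, index=False)
```

Usage would look like `save_parquet(build_table(list(read_jsonl("questions.jsonl")), ...), "leetcode.parquet")`, where the output file name is also hypothetical. This version holds everything in memory, which is likely too heavy for the multi-gigabyte topics file; a real run would probably process it in chunks.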
Addressing: https://github.com/CarperAI/Code-Pile/issues/7 https://github.com/CarperAI/Code-Pile/issues/8
Statistics about the LeetCode data:
Snapshot date (until): 2022-09-26
Questions: 2421 records (7.8 MB as JSONL, 1.2 MB bzipped)
Discussion topics (pseudo-solutions): 2351568 records (3.2 GB as JSONL, 405 MB bzipped)
Comments: 525802 records (430 MB as JSONL, 46 MB bzipped)
Comment replies: 293361 records (222 MB as JSONL, 26 MB bzipped)