CarperAI / Code-Pile

This repository contains all the code for collecting large scale amounts of code from GitHub.
MIT License

Reddit #11

Open taisazero opened 1 year ago

taisazero commented 1 year ago

Programming & Computing Sub-Reddits

Dataset URL: awesome list of programming subreddits, Code Pile Spreadsheet, and another list of programming subreddits. Thanks to @ncoop57!

Does the dataset exist in a scraped format?

No, we need to format them into a dialogue format.

Description

Obtain Pushshift Reddit data covering 2009-2022 using wget/HTTP requests, then filter it for programming-related subreddits.
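
A minimal sketch of the filtering step, assuming the dump has already been downloaded and decompressed into newline-delimited JSON in which each record carries a `subreddit` field. The allow-list `PROGRAMMING_SUBREDDITS` and the function name `filter_programming` are hypothetical; the real allow-list would come from the subreddit lists linked under Dataset URL.

```python
import json
from typing import Iterable, Iterator

# Hypothetical allow-list for illustration only.
PROGRAMMING_SUBREDDITS = {"programming", "learnpython", "machinelearning"}


def filter_programming(
    lines: Iterable[str],
    allowed: Iterable[str] = PROGRAMMING_SUBREDDITS,
) -> Iterator[dict]:
    """Yield records whose subreddit is in the allow-list (case-insensitive)."""
    allowed_lower = {name.lower() for name in allowed}
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            # Skip malformed lines instead of aborting a long run.
            continue
        if not isinstance(record, dict):
            continue
        if str(record.get("subreddit", "")).lower() in allowed_lower:
            yield record
```

Because the function takes any iterable of lines, it can be fed an open file handle or a streaming decompressor without loading a whole monthly dump into memory.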

Procedure

Final Data Format inside text

[Context]:
    "Learning to learn", using deep learning to design the architecture of another deep network: https://arxiv.org/abs/1606.04474
[Response]:
    using deep learning with SGD to design the learning algorithms of another deep network   *

Extra Contexts:
    [context/2]:
        Could someone there post a summary of the insightful moments.
    [context/1]:
        Basically L2L is the new deep learning.
    [context/0]:
        What's "L2L" mean?

Other features:
    [context_author]:
        goodside
    [response_author]:
        NetOrBrain
    [subreddit]:
        MachineLearning
    [thread_id]:
        5h6yvl
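
The layout above could be produced by a small formatter along these lines. This is only a sketch: the function name `format_example` and its arguments are hypothetical, and it assumes the extra contexts are printed from the highest index down to `context/0`, as in the example, since the issue does not say which end of the thread index 0 refers to.

```python
import textwrap


def format_example(context, response, extra_contexts, features):
    """Render one context/response pair in the dialogue layout shown above.

    extra_contexts: list of strings, where extra_contexts[i] is printed
    under [context/i]. features: mapping of metadata name to value.
    """
    lines = [
        "[Context]:",
        textwrap.indent(context, "    "),
        "[Response]:",
        textwrap.indent(response, "    "),
        "",
        "Extra Contexts:",
    ]
    # Emit the highest index first, matching the example in the issue.
    for i in range(len(extra_contexts) - 1, -1, -1):
        lines.append(f"    [context/{i}]:")
        lines.append(textwrap.indent(extra_contexts[i], "        "))
    lines += ["", "Other features:"]
    for key, value in features.items():
        lines.append(f"    [{key}]:")
        lines.append(textwrap.indent(str(value), "        "))
    return "\n".join(lines)
```

Passing `features` as an ordinary dict keeps the metadata in insertion order, so `context_author`, `response_author`, `subreddit`, and `thread_id` appear in whatever order the caller supplies them.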

taisazero commented 1 year ago

Cheesy#0202 (me) is working with jesse#7865 and Eleuther to obtain the Pushshift Reddit data.

ncoop57 commented 1 year ago

To find relevant reddit communities, we can look at awesome lists: https://github.com/learn-anything/reddit#linux

taisazero commented 1 year ago

Currently obtaining data pogchamp!

ncoop57 commented 1 year ago

@taisazero please add this information to the issue description:

Tests

Include a dummy_dataset.parquet file to test your code against. This dummy_dataset should include the columns for the data and metadata associated with the dataset, which will then be converted into the final format for language model consumption, along with an example row or rows that you can verify your code correctly collects. In addition to this file, include the unit test that evaluates your code against this dummy_dataset.

Give an example of the columns and data:

col1 col2 ....
row1 row1 ....

taisazero commented 1 year ago

Filtering Resources

Toxic Content

Toxic Subreddits from (Gehman et al., 2020)

Also looked into DialoGPT's excluded subreddits, but the list was empty: DialoGPT Subreddit blocklist

Low-Quality Content

  1. The author is a known bot.
  2. It comes from a known non-English subreddit.
  3. The comment is marked as removed/deleted.
  4. It is longer than 2048 characters and does not contain spaces.
  5. It is longer than 128 BPE tokens.
  6. It is shorter than 5 characters.
  7. It contains a URL.
  8. It starts with a non-ASCII character.
  9. It is further than depth 7 in the thread.

From (Roller et al., 2021)
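
The nine checks listed above could be combined into a single predicate, sketched below under several assumptions. The `author`, `subreddit`, and `body` keys are assumed to be present on each comment record. `depth` is assumed to be computed separately from the reply chain. `known_bots`, `non_english_subreddits`, and `count_tokens` are hypothetical inputs that the caller must supply, and the BPE check is skipped when no tokenizer is given.

```python
import re
from typing import Callable, Collection, Optional

URL_RE = re.compile(r"https?://|www\.", re.IGNORECASE)
REMOVED_MARKERS = {"[removed]", "[deleted]"}


def is_low_quality(
    comment: dict,
    depth: int,
    known_bots: Collection[str] = frozenset(),
    non_english_subreddits: Collection[str] = frozenset(),
    count_tokens: Optional[Callable[[str], int]] = None,
) -> bool:
    """Return True if the comment matches any of the nine heuristics."""
    body = comment.get("body") or ""
    if comment.get("author") in known_bots:                 # 1. known bot
        return True
    if comment.get("subreddit") in non_english_subreddits:  # 2. non-English
        return True
    if body.strip() in REMOVED_MARKERS:                     # 3. removed/deleted
        return True
    if len(body) > 2048 and " " not in body:                # 4. long, no spaces
        return True
    if count_tokens is not None and count_tokens(body) > 128:  # 5. BPE length
        return True
    if len(body) < 5:                                       # 6. too short
        return True
    if URL_RE.search(body):                                 # 7. contains a URL
        return True
    if body and ord(body[0]) > 127:                         # 8. non-ASCII start
        return True
    if depth > 7:                                           # 9. too deep
        return True
    return False
```

The URL pattern is deliberately crude and only looks for `http://`, `https://`, or `www.`; a stricter matcher may be preferable once real data is available.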