Open taisazero opened 2 years ago
Cheesy#0202 (Me) Working with jesse#7865 and Eleuther to obtain Pushshift Reddit data.
To find relevant reddit communities, we can look at awesome lists: https://github.com/learn-anything/reddit#linux
Currently obtaining data, pogchamp!
@taisazero please add this information to the issue description:
Include a dummy_dataset.parquet file to test your code against. This dummy_dataset should include the columns for the data and metadata associated with the dataset, which will then be converted into the final format for language model consumption, along with an example row or rows that you can verify your code correctly collects. In addition to this file, include the unit test that evaluates your code against this dummy_dataset.
Give an example of the columns and data:
col1 | col2 | .... |
---|---|---|
row1 | row1 | .... |
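A minimal sketch of what such a unit test could check, assuming the rows have already been loaded from `dummy_dataset.parquet` (for example with `pandas.read_parquet("dummy_dataset.parquet").to_dict("records")`). The column names in `REQUIRED_COLUMNS` and the helper `check_dummy_rows` are hypothetical placeholders, not an agreed schema; here the rows are stood in by plain dicts so the example runs with the standard library only.

```python
# Hypothetical column set; replace with the columns actually agreed on for this dataset.
REQUIRED_COLUMNS = {"text", "subreddit", "author", "created_utc"}


def check_dummy_rows(rows):
    """Return a list of human-readable problems found in the dummy rows."""
    problems = []
    for i, row in enumerate(rows):
        missing = REQUIRED_COLUMNS - row.keys()
        if missing:
            problems.append(f"row {i}: missing {sorted(missing)}")
        elif not isinstance(row["text"], str) or not row["text"]:
            problems.append(f"row {i}: empty text")
    return problems


# Stand-in for the rows that would be read from dummy_dataset.parquet.
dummy_rows = [
    {
        "text": "alice: How do I list files?\nbob: Use ls.",
        "subreddit": "linux",
        "author": "alice",
        "created_utc": 1262304000,
    },
]

problems = check_dummy_rows(dummy_rows)
```

An empty `problems` list means every dummy row carries the expected columns and a non-empty `text` value.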
Toxic Subreddits from (Gehman et al., 2020)
Also looked into DialoGPT's excluded subreddits, but the list was empty: DialoGPT Subreddit blocklist
Programming & Computing Sub-Reddits
Dataset URL:
- awesome list of programming subreddits
- Code Pile Spreadsheet
- Another list of programming subreddits

Thanks to @ncoop57!
Does the dataset exist in a scraped format?
No, we need to format them into a dialogue format.
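One possible way to turn a comment chain into a dialogue, sketched under assumptions: the `author: body` layout is just an illustration (the target dialogue format is not fixed in this issue), and `build_dialogue` is a hypothetical helper. It relies on the `id`, `parent_id`, `author`, and `body` fields of Reddit comment objects, where a `t1_` prefix on `parent_id` points at another comment and `t3_` points at the submission.

```python
def build_dialogue(leaf_id, comments_by_id):
    """Walk parent_id links from a leaf comment up to the top-level comment."""
    turns = []
    current = comments_by_id.get(leaf_id)
    while current is not None:
        turns.append(f"{current['author']}: {current['body']}")
        parent = current["parent_id"]
        # "t1_" marks a parent comment; anything else (e.g. "t3_") ends the chain.
        if parent.startswith("t1_"):
            current = comments_by_id.get(parent[3:])
        else:
            current = None
    # Turns were collected leaf-first, so reverse them into reading order.
    return "\n".join(reversed(turns))


comments = [
    {"id": "c1", "parent_id": "t3_s1", "author": "alice", "body": "How do I list files?"},
    {"id": "c2", "parent_id": "t1_c1", "author": "bob", "body": "Use ls."},
]
by_id = {c["id"]: c for c in comments}
dialogue = build_dialogue("c2", by_id)
```

If a parent comment is missing from the dump, the chain simply stops at the earliest comment that is available.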
Description
Obtain data from Pushshift Reddit covering 2009-2022 using wget/HTTP requests, and filter for programming-related subreddits.
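A rough sketch of the filtering step, assuming the dump has already been downloaded and decompressed into newline-delimited JSON with a `subreddit` field on each record. The allowlist `PROGRAMMING_SUBREDDITS` and the function `filter_programming` are hypothetical; the real allowlist would come from the subreddit lists linked in this issue.

```python
import json

# Hypothetical allowlist (lowercased); the real one would be built from the linked lists.
PROGRAMMING_SUBREDDITS = {"python", "linux", "learnprogramming"}


def filter_programming(lines, allowed=PROGRAMMING_SUBREDDITS):
    """Yield parsed records whose subreddit is in the allowlist (case-insensitive)."""
    for line in lines:
        line = line.strip()
        if not line:
            continue
        record = json.loads(line)
        # Some records may lack a subreddit or carry null; treat both as no match.
        if (record.get("subreddit") or "").lower() in allowed:
            yield record


sample = [
    '{"id": "c1", "subreddit": "Python", "body": "Use a venv."}',
    '{"id": "c2", "subreddit": "aww", "body": "Cute!"}',
    '{"id": "c3", "subreddit": null, "body": "orphan"}',
]
kept = list(filter_programming(sample))
```

Because the function takes any iterable of lines, it can be fed from an open file handle or a streaming decompressor without loading a whole monthly dump into memory.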
Procedure
{"text": string, "meta": obj}
lm_format
script
Final Data Format inside
text
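To make the `{"text": string, "meta": obj}` shape concrete, here is a small sketch that wraps a dialogue string into such a record and writes it as one JSON object per line. It uses only the standard library rather than any particular archive-writing package, and the helper names `to_record` and `write_jsonl` as well as the metadata keys are assumptions for illustration.

```python
import io
import json


def to_record(text, **meta):
    """Wrap a dialogue string and its metadata in the {"text", "meta"} shape."""
    return {"text": text, "meta": meta}


def write_jsonl(records, fh):
    """Write one JSON object per line to an open text file handle."""
    for record in records:
        fh.write(json.dumps(record, ensure_ascii=False) + "\n")


# An in-memory buffer stands in for the output file.
buf = io.StringIO()
write_jsonl(
    [to_record("alice: hi\nbob: hello", subreddit="python", thread_id="s1")],
    buf,
)
first = json.loads(buf.getvalue().splitlines()[0])
```

Newlines inside `text` are escaped by `json.dumps`, so each record still occupies exactly one line of output.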