CarperAI / Code-Pile

This repository contains the code for collecting large amounts of code from GitHub.
MIT License

LinusTechTip Programming Forum #32

Open PhungVanDuy opened 1 year ago

PhungVanDuy commented 1 year ago

Title

Dataset URL - LinusTechTip

Does the dataset exist in a scraped format? No

Description

This is a well-known programming forum. A quick scan shows it has more than 10,000 topics, dating back to 2013.

Procedure

Tests

Include a dummy_dataset.parquet file to test your code against. This dummy_dataset should include the columns for the data and metadata associated with the dataset, which will then be converted into the final format for language model consumption, along with an example row or rows that you can verify your code correctly collects. In addition to this file, include the unit test that evaluates your code against this dummy_dataset.

Give an example of the columns and data:

col1 col2 ....
row1 row1 ....
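
A minimal sketch of what the dummy dataset and its unit test could look like is given below. The column names (`text`, `url`, `title`) and the helper names are assumptions made for illustration; the issue template does not fix a schema. Writing the parquet file assumes pandas with a parquet engine such as pyarrow is installed.

```python
# Hypothetical schema: the issue does not specify column names.
EXPECTED_COLUMNS = ["text", "url", "title"]


def make_dummy_rows():
    """Return one hand-written example row to test the collection code against."""
    return [
        {
            "text": "How do I reverse a list in Python?",
            "url": "https://example.com/topic/1",  # placeholder URL, not a real thread
            "title": "Reverse a list",
        }
    ]


def validate_rows(rows, expected_columns=EXPECTED_COLUMNS):
    """Return True if every row has exactly the expected columns with non-empty string values."""
    for row in rows:
        if sorted(row) != sorted(expected_columns):
            return False
        if not all(isinstance(value, str) and value for value in row.values()):
            return False
    return True


def write_dummy_dataset(path="dummy_dataset.parquet"):
    """Write the dummy rows to a parquet file (needs pandas plus pyarrow or fastparquet)."""
    import pandas as pd

    frame = pd.DataFrame(make_dummy_rows(), columns=EXPECTED_COLUMNS)
    frame.to_parquet(path, index=False)


def test_dummy_rows_match_schema():
    """Unit test in pytest style: the dummy rows must satisfy the schema check."""
    assert validate_rows(make_dummy_rows())
```

Running `pytest` on a file containing this code would execute `test_dummy_rows_match_schema`; the same `validate_rows` check could then be applied to rows produced by the real collection code.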

bentrevett commented 1 year ago

Wrote a quick scraper for this. I'm unsure of the format required, but it writes each page of a thread as one JSON object per line.

https://gist.github.com/bentrevett/274db7de0258bab8adf235045344bed7
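
For readers unfamiliar with the one-JSON-object-per-line layout, here is a rough sketch of it. This is not the code from the gist; the function names and the page fields (`thread_id`, `page`, `posts`) are hypothetical and only illustrate the output format.

```python
import json


def pages_to_jsonl(pages):
    """Serialize a list of page dicts as JSON Lines: one JSON object per line."""
    return "".join(json.dumps(page, ensure_ascii=False) + "\n" for page in pages)


def jsonl_to_pages(text):
    """Parse JSON Lines text back into a list of page dicts, skipping blank lines."""
    return [json.loads(line) for line in text.splitlines() if line.strip()]


def write_pages(pages, path):
    """Write the pages to a file in JSON Lines format."""
    with open(path, "w", encoding="utf-8") as handle:
        handle.write(pages_to_jsonl(pages))


# Hypothetical page records; the field names are assumptions, not the gist's schema.
example_pages = [
    {"thread_id": 1, "page": 1, "posts": ["first post", "a reply"]},
    {"thread_id": 1, "page": 2, "posts": ["another reply"]},
]
```

Because each line is an independent JSON object, a file in this format can be processed one page at a time without loading the whole scrape into memory.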

There are two types of threads: