CarperAI / Code-Pile

This repository contains all the code for collecting code from GitHub at large scale.
MIT License

Discourse Forums #21

Open ncoop57 opened 1 year ago

ncoop57 commented 1 year ago

Discourse Forums

Dataset URL - here

Does the dataset exist in a scraped format? No

Description

Discourse is a self-hosted platform that communities use to hold discussions around a particular topic. Each site consists of threads of posts and a surrounding ecosystem for discussing that topic.

Procedure

ncoop57 commented 1 year ago

One thing we can do is look in Common Crawl for *.discourse.group, since that is a commonly used URL pattern for these communities. There are a lot of other ones though, so getting those will be a challenge. The dataset URL link I added has a thread on where we might find a large index of all sites using Discourse.
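A minimal sketch of how that Common Crawl lookup could start, assuming the public Common Crawl index server (`index.commoncrawl.org`) and its JSON-lines output. The function names are hypothetical, the crawl ID is only an example, and no network request is made here: the code only builds the query URL and reduces response lines to unique hostnames.

```python
import json
from urllib.parse import urlencode, urlsplit


def cc_index_query_url(crawl_id, pattern="*.discourse.group"):
    """Build an index query URL for one crawl (crawl_id is an example value)."""
    params = {"url": pattern, "output": "json"}
    return f"https://index.commoncrawl.org/{crawl_id}-index?{urlencode(params)}"


def hosts_from_index_lines(lines):
    """Collect unique hostnames from JSON-lines index records.

    Assumes each non-empty line is a JSON object with a "url" field.
    """
    hosts = set()
    for line in lines:
        line = line.strip()
        if not line:
            continue
        record = json.loads(line)
        host = urlsplit(record["url"]).hostname
        if host:
            hosts.add(host)
    return sorted(hosts)
```

Fetching the URL returned by `cc_index_query_url("CC-MAIN-2023-50")` and passing the response lines to `hosts_from_index_lines` would give a candidate list of Discourse hosts. Communities on custom domains would still be missed.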

ncoop57 commented 1 year ago

Some more communities: https://www.communitystack.com/communities/

jbaicoianu commented 1 year ago

I've started writing a crawler for Discourse servers, using Scrapy to fetch the JSON API and save all of the post data to the filesystem. It takes a list of Discourse server URLs (the longer the better) and uses each server's "latest posts", "top posts" and "categories" lists as starting points for walking the full database of topics. I'm currently not doing any filtering, but we can add whatever filtering rules we need at crawl time to decide which topics to save.
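The seeding step described above could look roughly like the sketch below. This is not the crawler from this comment: it uses only the standard library rather than Scrapy, the function names are hypothetical, and the endpoint paths (`/latest.json`, `/top.json`, `/categories.json`, `/t/<id>.json`) and the `topic_list.topics` response shape are assumptions based on the public Discourse JSON API.

```python
from urllib.parse import urljoin

# Listing endpoints used as crawl entry points (assumed Discourse API paths).
SEED_PATHS = ("/latest.json", "/top.json", "/categories.json")


def seed_urls(base_url):
    """Return the listing URLs to start from for one Discourse server."""
    base = base_url.rstrip("/") + "/"
    return [urljoin(base, path.lstrip("/")) for path in SEED_PATHS]


def topic_urls(base_url, listing):
    """Turn a parsed listing response into per-topic JSON URLs.

    Assumes the listing looks like {"topic_list": {"topics": [{"id": ...}]}};
    entries without an "id" are skipped.
    """
    base = base_url.rstrip("/") + "/"
    topics = listing.get("topic_list", {}).get("topics", [])
    return [urljoin(base, f"t/{topic['id']}.json") for topic in topics if "id" in topic]
```

A crawl-time filter, once we decide on rules, could be applied to the topic entries inside `topic_urls` before the per-topic URLs are queued.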


Now that I've got the crawler working for individual Discourse servers, I'll start working on building a list of relevant communities to scrape. It seems there are a few different sites that offer categorized lists of open Discourse servers, so I can probably just throw them into the crawler too.

Will submit a pull request once I've got the crawler better integrated into the Code-Pile framework, based on conversations I had with other developers.

jbaicoianu commented 1 year ago

This query returns around 8400 results for various communities that run Discourse. It definitely needs filtering to narrow the list down to just the code-related ones.

https://www.google.com/search?q=%22Moderators+have+special+authority%3B+they+are+responsible+for+this+forum.+But+so+are+you.+With+your+help%2C+moderators+can+be+community+facilitators%2C+not+just+janitors+or+police.%22