CarperAI / Code-Pile

This repository contains all the code for collecting large scale amounts of code from GitHub.
MIT License

UseNet #16

Closed: Jehoshaph closed this issue 1 year ago

Jehoshaph commented 1 year ago

UseNet

Dataset URL - UsenetArchives and InternetArchive

Does the dataset exist in a scraped format? No

Description

Procedure

Jehoshaph commented 1 year ago

Working on the cleanup and de-duplication scripts.
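As a rough illustration of what a de-duplication pass could look like (a sketch only, not the script from this issue), the snippet below drops messages whose bodies are identical after whitespace and case normalization. The `dedupe` and `normalize` helpers and the `{"id", "body"}` message shape are assumptions made for the example.

```python
import hashlib


def normalize(body):
    # Collapse runs of whitespace and lowercase the text so that
    # trivially different copies of a message produce the same hash.
    return " ".join(body.split()).lower()


def dedupe(messages):
    """Keep the first message seen for each distinct normalized body.

    `messages` is assumed to be an iterable of dicts with a "body" key;
    this shape is hypothetical and chosen only for the example.
    """
    seen = set()
    unique = []
    for msg in messages:
        digest = hashlib.sha256(normalize(msg["body"]).encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(msg)
    return unique
```

Exact-match hashing only catches verbatim reposts; near-duplicates such as quoted replies would need a fuzzier technique.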

ncoop57 commented 1 year ago

@Jehoshaph It might be better to directly use the existing Usenet dump that you linked, or the uncleaned version they mention: https://aws.amazon.com/datasets/the-westburylab-usenet-corpus/

Are there any other dedup/cleanup steps you want to run?

Jehoshaph commented 1 year ago

There are a couple of potential issues with that dataset, and they still haven't responded to my request for access to it.

  1. Their dataset only covers 2005 - 2010. The archives on Internet Archive go back to 1997.
  2. "Documents that contained less than 90% English words were omitted. (English words were defined as words that are contained in a 100,000 words dictionary of english)." - I don't know whether this excluded the coding archives. I'd need access to the dataset to check whether archives like Java programming survived the filter.
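To make the concern in point 2 concrete, here is a minimal sketch of a "90% English words" filter. How the Westbury Lab corpus actually tokenizes text and which dictionary it uses are not known here, so the regex tokenizer, the `english_ratio` and `passes_filter` helpers, and the tiny word list are all assumptions for illustration.

```python
import re


def english_ratio(text, dictionary):
    # Assumed tokenization: runs of ASCII letters and apostrophes, lowercased.
    words = re.findall(r"[A-Za-z']+", text.lower())
    if not words:
        return 0.0
    return sum(w in dictionary for w in words) / len(words)


def passes_filter(text, dictionary, threshold=0.9):
    # Documents with less than `threshold` dictionary words are omitted.
    return english_ratio(text, dictionary) >= threshold


# Tiny stand-in for a real 100,000-word dictionary (hypothetical).
DICTIONARY = {"how", "do", "i", "sort", "a", "list", "in", "java"}

prose = "How do I sort a list in Java"
code = "List<String> xs = new ArrayList<>(); Collections.sort(xs);"

print(passes_filter(prose, DICTIONARY))  # plain-English question
print(passes_filter(code, DICTIONARY))   # post that is mostly source code
```

Under these assumptions, identifiers such as `ArrayList` or `xs` count as non-English words, so a post that is mostly source code would likely fall below the threshold. That is why the programming groups may be under-represented in a corpus filtered this way.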

> Are there any other dedup/cleanup steps you want to run?

The only additional step is excluding messages with no replies.
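One possible reading of this step, sketched below, is to drop posts that neither reply to anything nor receive a reply. The sketch relies on the standard `Message-ID` and `References` headers and Python's `email` module; the `drop_unanswered` helper is hypothetical, and the real archives may need more tolerant header parsing.

```python
from email import message_from_string


def drop_unanswered(raw_messages):
    """Drop posts that are not part of any reply chain.

    A message is kept if it cites an earlier message in its References
    header, or if some other message cites its Message-ID. This is an
    assumed interpretation of "messages with no replies".
    """
    parsed = [message_from_string(raw) for raw in raw_messages]

    # Every Message-ID that appears in some message's References header.
    replied_to = set()
    for msg in parsed:
        replied_to.update((msg.get("References") or "").split())

    kept = []
    for msg in parsed:
        is_reply = bool((msg.get("References") or "").strip())
        has_reply = (msg.get("Message-ID") or "").strip() in replied_to
        if is_reply or has_reply:
            kept.append(msg)
    return kept
```

If the intent is instead to drop every message without a direct reply, including the last post in a thread, the `is_reply` check would be removed.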

ncoop57 commented 1 year ago

@Jehoshaph

Ah okay, thanks for clearing that up. Yeah, in that case it makes sense to do it ourselves.