CarperAI / Code-Pile

This repository contains all the code for collecting large scale amounts of code from GitHub.
MIT License

Programming and Computing Books #12

Closed taisazero closed 1 year ago

taisazero commented 2 years ago

Open Access and Free Programming and Computing Books

Dataset URL - Computing Wikibooks: we can download the dump and filter it for computing wikibooks. Free Computing Books: not sure whether the books listed there are safe to use; we need to check.
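A minimal sketch of the "download the dump and filter" idea, assuming the dump is a standard MediaWiki XML export with `<page>`, `<title>` and `<text>` elements. The keyword list and the function names are hypothetical and only illustrate the filtering step; a real filter would more likely rely on Wikibooks categories or a curated list of book titles.

```python
import xml.etree.ElementTree as ET

# Hypothetical filter terms; not taken from the issue.
KEYWORDS = ("programming", "computing")


def local_name(tag):
    """Drop the XML namespace prefix, e.g. '{ns}title' -> 'title'."""
    return tag.rsplit("}", 1)[-1]


def iter_pages(source):
    """Stream (title, text) pairs from a MediaWiki XML export.

    `source` is a filename or a binary file object. Pages are parsed
    one at a time so a large dump does not need to fit in memory.
    """
    title, text = "", ""
    for _, elem in ET.iterparse(source, events=("end",)):
        name = local_name(elem.tag)
        if name == "title":
            title = elem.text or ""
        elif name == "text":
            text = elem.text or ""
        elif name == "page":
            yield title, text
            title, text = "", ""
            elem.clear()  # free the finished page


def is_computing(title, text):
    """Keep a page if its title or body mentions any keyword."""
    lowered = (title + "\n" + text).lower()
    return any(keyword in lowered for keyword in KEYWORDS)


def filter_dump(source):
    """Return the (title, text) pairs that pass the keyword filter."""
    return [(t, x) for t, x in iter_pages(source) if is_computing(t, x)]
```

Keyword matching is deliberately crude and will produce both false positives and misses; it is only meant to show where the filter plugs into the streaming loop.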

Does the dataset exist in a scraped format? Yes if the book is an HTML page or website; no if the book is a PDF.

Description

Books contain rich information and present accumulations of knowledge on specific topics. They can also contain exercises and solutions. Pretraining a model on them could perhaps enhance its chain-of-thought capabilities.

Procedure

taisazero commented 2 years ago

Who wants to work on this with me? We need help from scrapers.

PhungVanDuy commented 2 years ago

@taisazero let me work on this one. Can you update your current status?

taisazero commented 2 years ago

Perfect! We can work on it together, perhaps. I haven't started, to be honest. We can coordinate on Discord. Tag @Erfan in the CarperAI Discord #code-pile. @PhungVanDuy

PhungVanDuy commented 2 years ago

Books dev branch: https://github.com/CarperAI/Code-Pile/tree/books_dataset

PhungVanDuy commented 2 years ago

@taisazero New dev fork here: https://github.com/PhungVanDuy/Code-Pile/tree/books_dataset

The updated dataset keeps code blocks, like this.

As a next step, I will work on free-programming-books. Is that okay?

taisazero commented 2 years ago

Yes, that's perfect! Whenever you're ready, make a PR for Wikibooks though! Do you want to work on PDFs or websites first?

ncoop57 commented 2 years ago

Wow this is freaking fire!!!!! I'm so excited for this subset 🤓

PhungVanDuy commented 2 years ago

PR: https://github.com/CarperAI/Code-Pile/pull/22

@taisazero @ncoop57 please review and leave me a comment; maybe I missed something about the code structure for our CodePile project. It will help me improve my later PRs.

PhungVanDuy commented 2 years ago

@taisazero Do you have a priority list to work on? I have planned to do the PDFs from free-programming-books first, but if you have a priority list, that would be great.

PhungVanDuy commented 2 years ago

Besides the resources from the GitHub link above, I will try to get books from BookSC. If there is any problem with the license, we just index them and do not release them.

ncoop57 commented 2 years ago

@PhungVanDuy and @taisazero please add the following information to the issue description:

Tests

Include a dummy_dataset.parquet file to test your code against. This dummy_dataset should include the columns for the data and metadata associated with the dataset, which will then be converted into the final format for language model consumption, along with an example row or rows that you can verify your code correctly collects. In addition to this file, include the unit test that evaluates your code against this dummy_dataset.
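As a rough sketch of what such a test could look like: the column names `title`, `url` and `text` below are hypothetical (the issue does not name the real schema), and `load_rows` assumes pandas with a parquet engine such as pyarrow is installed. The checks themselves work on plain dictionaries, so they can be exercised without the parquet file.

```python
import unittest

# Hypothetical schema; the real dataset columns are not given in the issue.
EXPECTED_COLUMNS = {"title", "url", "text"}


def load_rows(path):
    """Read a parquet file into a list of dicts.

    Needs pandas plus a parquet engine (e.g. pyarrow); imported lazily
    so the checks below can run without them.
    """
    import pandas as pd

    return pd.read_parquet(path).to_dict(orient="records")


def check_rows(rows):
    """Return a list of problems; an empty list means the rows pass."""
    problems = []
    if not rows:
        problems.append("dataset is empty")
    for i, row in enumerate(rows):
        missing = EXPECTED_COLUMNS - set(row)
        if missing:
            problems.append(f"row {i}: missing columns {sorted(missing)}")
        elif not str(row["text"]).strip():
            problems.append(f"row {i}: empty text")
    return problems


class DummyDatasetTest(unittest.TestCase):
    def test_example_row_passes(self):
        rows = [{"title": "Example Book",
                 "url": "https://example.org/book",
                 "text": "print('hello')"}]
        self.assertEqual(check_rows(rows), [])

    def test_missing_column_is_reported(self):
        self.assertEqual(check_rows([{"title": "x"}]),
                         ["row 0: missing columns ['text', 'url']"])
```

In a real test, `check_rows(load_rows("dummy_dataset.parquet"))` would be asserted to return an empty list, with the file path adjusted to wherever the dummy dataset lives in the repository.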

Give an example of the columns and data:

col1 col2 ....
row1 row1 ....

ncoop57 commented 1 year ago

Gonna close this, as it seems it is marked done in our spreadsheet.