Closed taisazero closed 1 year ago
Who wants to work on this with me? We need help from scarpers
@taisazero let me work on this one. Can you update your current status?
Perfect! We can work on it together perhaps. I haven't started to be honest. We can coordinate on Discord. Tag @Erfan in the CarperAI discord #code-pile. @PhungVanDuy
Books dev branch: https://github.com/CarperAI/Code-Pile/tree/books_dataset
@taisazero New dev fork here: https://github.com/PhungVanDuy/Code-Pile/tree/books_dataset
Updated the dataset so that code blocks are kept.
As a next step I will work on free-programming-books, is that okay?
Yes, that's perfect! Whenever you're ready make a PR for Wikibooks though! Do you want to work on PDFs or websites first?
Wow this is freaking fire!!!!! I'm so excited for this subset 🤓
PR: https://github.com/CarperAI/Code-Pile/pull/22
@taisazero @ncoop57 please review and leave me a comment, maybe I missed something about the code structure for our CodePile project. It will help me improve my later PRs.
@taisazero Do you have a priority list to work on? I have planned to start with the PDFs in free-programming-books, but if you have a priority list that would be great.
Besides the resources from the GitHub link above, I will try to get books from BookSC. If there is any problem with the license, we just index them and do not release them.
@PhungVanDuy and @taisazero please add the following information to the issue description:
Include a dummy_dataset.parquet file to test your code against. This dummy_dataset should include the columns for the data and metadata associated with the dataset, which will then be converted into the final format for language model consumption, along with an example row or rows that you can verify your code correctly collects. In addition to this file, include the unit test that evaluates your code against this dummy_dataset.
Give an example of the columns and data:
col1 | col2 | .... |
---|---|---|
row1 | row1 | .... |
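A rough sketch of what such a test could look like, in case it helps. The column names (`title`, `text`, `source_url`, `license`), the example row, and the helpers `load_rows` and `find_schema_problems` are all hypothetical and not taken from the Code-Pile repo. The check itself only needs the standard library; `load_rows` assumes pandas with a parquet engine is installed and is only there to show where `dummy_dataset.parquet` would be read.

```python
# Hypothetical column names for a books subset; adjust to the real schema.
REQUIRED_COLUMNS = ("title", "text", "source_url", "license")


def load_rows(path):
    """Read a parquet file into a list of dicts (assumes pandas + a parquet engine)."""
    import pandas as pd  # imported lazily so the checks below run without pandas

    return pd.read_parquet(path).to_dict(orient="records")


def find_schema_problems(rows, required=REQUIRED_COLUMNS):
    """Return (row_index, column) pairs where a required value is missing or empty."""
    problems = []
    for index, row in enumerate(rows):
        for column in required:
            value = row.get(column)
            if not isinstance(value, str) or not value.strip():
                problems.append((index, column))
    return problems


# Made-up example row, standing in for the content of dummy_dataset.parquet.
DUMMY_ROWS = [
    {
        "title": "Example Book/Chapter 1",
        "text": "Intro text with a code block:\nprint('hello')",
        "source_url": "https://example.org/book/chapter-1",
        "license": "CC-BY-SA",
    }
]


def test_dummy_rows_are_valid():
    assert find_schema_problems(DUMMY_ROWS) == []


def test_missing_text_is_reported():
    broken = [dict(DUMMY_ROWS[0], text="")]
    assert find_schema_problems(broken) == [(0, "text")]
```

In a real test, `DUMMY_ROWS` would be replaced by `load_rows("dummy_dataset.parquet")`, and the test functions can be picked up by pytest as they are.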
gonna close this as it seems it is done in our spreadsheet.
Open Access and Free Programming and Computing Books
Dataset URL - Computing Wikibooks. We can download the dump here and filter for computing wikibooks. Free Computing Books -- not sure if the books on here are safe to use, we need to check the licenses.
Does the dataset exist in a scraped format? Yes if HTML/website, No if the book is in PDF.
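One possible way to do the filtering step, as a minimal sketch only. It assumes the dump is a MediaWiki XML export with `<page>`, `<title>` and `<text>` elements, and that matching a keyword such as a category tag is a good enough first pass to find computing books. The function name `iter_matching_pages`, the keyword list, and the sample XML are made up for illustration.

```python
import io
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip the XML namespace, e.g. '{http://...}page' -> 'page'."""
    return tag.rsplit("}", 1)[-1]


def iter_matching_pages(source, keywords):
    """Stream pages from a MediaWiki XML export and yield those mentioning a keyword.

    `source` is a filename or a binary file object. Matching is a plain
    case-insensitive substring search over title and wikitext (an assumption,
    not necessarily how the final pipeline should select books).
    """
    lowered = [keyword.lower() for keyword in keywords]
    title, text = "", ""
    for _event, elem in ET.iterparse(source, events=("end",)):
        name = local_name(elem.tag)
        if name == "title":
            title = elem.text or ""
        elif name == "text":
            text = elem.text or ""
        elif name == "page":
            haystack = f"{title}\n{text}".lower()
            if any(keyword in haystack for keyword in lowered):
                yield {"title": title, "text": text}
            title, text = "", ""
            elem.clear()  # free memory, dumps can be large


# Tiny made-up sample in the shape of a MediaWiki export.
SAMPLE = b"""<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.10/">
<page><title>Python Programming/Loops</title><revision><text>for loops [[Category:Computing]]</text></revision></page>
<page><title>Cookbook:Bread</title><revision><text>Flour and water</text></revision></page>
</mediawiki>"""

pages = list(iter_matching_pages(io.BytesIO(SAMPLE), ["category:computing"]))
```

Streaming with `iterparse` avoids loading the whole dump into memory; the keyword list would need tuning against the real category names in the dump.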
Description
Books contain rich information and present accumulated knowledge on specific topics. They can also contain exercises and solutions. If a model is pretrained on them, it could perhaps enhance its chain-of-thought capabilities.
Procedure