CarperAI / Code-Pile

This repository contains the code for collecting large amounts of source code from GitHub.
MIT License

Google AI4Code – Kaggle #28

Closed by PhungVanDuy 1 year ago

PhungVanDuy commented 1 year ago

Title

Google AI4Code – Understand Code in Python Notebooks

Dataset URL - here

Does the dataset exist in a scraped format?
URL if Yes - here

Description

The dataset comprises about 160,000 Jupyter notebooks published by the Kaggle community. Jupyter notebooks are the tool of choice for many data scientists for their ability to tell a narrative with both code and natural language. These two types of discourse are contained within cells of the notebook, and we refer to these cells as either code cells or markdown cells (markdown being the text formatting language used by Jupyter).
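To make the code/markdown split concrete, here is a minimal sketch that separates the two cell types. It assumes the standard Jupyter `.ipynb` layout (nbformat 4: a top-level `cells` list whose entries carry `cell_type` and `source`); the competition files may be laid out differently, and `split_cells` is a hypothetical helper name, not something from this repository.

```python
import json


def split_cells(notebook_json: str) -> dict:
    """Group cell sources by type, assuming the nbformat 4 layout."""
    notebook = json.loads(notebook_json)
    result = {"code": [], "markdown": []}
    for cell in notebook.get("cells", []):
        kind = cell.get("cell_type")
        if kind not in result:
            continue  # skip other cell types such as "raw"
        source = cell.get("source", "")
        # nbformat allows source to be a single string or a list of lines
        if isinstance(source, list):
            source = "".join(source)
        result[kind].append(source)
    return result


# A tiny hand-written notebook used only for illustration
example = json.dumps({
    "nbformat": 4,
    "nbformat_minor": 5,
    "metadata": {},
    "cells": [
        {"cell_type": "markdown", "metadata": {},
         "source": ["# Load data\n", "Read the CSV."]},
        {"cell_type": "code", "metadata": {}, "execution_count": None,
         "outputs": [], "source": "import csv"},
        {"cell_type": "raw", "metadata": {}, "source": "ignored"},
    ],
})

cells = split_cells(example)
```

Keeping the two lists separate preserves the order of cells within each type, which is likely to matter if the final format needs to interleave narrative and code.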

Procedure

Tests

Include a dummy_dataset.parquet file to test your code against. This dummy_dataset should include the columns for the data and metadata associated with the dataset, which will then be converted into the final format for language model consumption, along with an example row or rows that you can verify your code correctly collects. In addition to this file, include the unit test that evaluates your code against this dummy_dataset.

Give an example of the columns and data:

col1 col2 ....
row1 row1 ....
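
One possible shape for the requested unit test is sketched below. The column names `text` and `meta` are hypothetical placeholders, since the template above does not fix them, and in-memory rows stand in for `dummy_dataset.parquet` so that the sketch runs with the standard library only; a real test would load the file first, for example with `pandas.read_parquet`.

```python
import unittest

# Hypothetical column names; replace with the dataset's real schema
EXPECTED_COLUMNS = {"text", "meta"}

# Stand-in for the rows that would be read from dummy_dataset.parquet
DUMMY_ROWS = [
    {"text": "print('hello')", "meta": {"source": "dummy"}},
]


def validate_rows(rows, expected_columns):
    """Return a list of human-readable problems; empty means the rows pass."""
    problems = []
    for index, row in enumerate(rows):
        missing = expected_columns - row.keys()
        if missing:
            problems.append(f"row {index}: missing columns {sorted(missing)}")
    return problems


class DummyDatasetTest(unittest.TestCase):
    def test_rows_have_expected_columns(self):
        self.assertEqual(validate_rows(DUMMY_ROWS, EXPECTED_COLUMNS), [])
```

Saved as a test module, this can be run with `python -m unittest`. Returning a list of problems rather than raising on the first one makes a failing run report every malformed row at once.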
PhungVanDuy commented 1 year ago

PR: https://github.com/CarperAI/Code-Pile/pull/29

cc @ncoop57 @reshinthadithyan

ncoop57 commented 1 year ago

Resolved in #29