CarperAI / Code-Pile

This repository contains all the code for collecting large scale amounts of code from GitHub.
MIT License
105 stars, 29 forks

Usenet #30

Closed. Jehoshaph closed this 1 year ago

Jehoshaph commented 1 year ago

Adds Usenet download and processing scripts. Refer to issue #16.

reshinthadithyan commented 1 year ago

Hello, thanks for the PR. I had a brief look. We want the dataset classes to inherit from this base class: https://github.com/CarperAI/Code-Pile/blob/working/codepile/dataset.py Please have a look.
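The inheritance pattern being requested can be sketched as follows. This is only an illustration: the actual interface in codepile/dataset.py is not shown in this thread, so `BaseDataset`, `UsenetDataset`, `download`, and `process` are hypothetical names, not the project's API.

```python
from abc import ABC, abstractmethod


class BaseDataset(ABC):
    """Hypothetical stand-in for the project's base class (not the real API)."""

    @abstractmethod
    def download(self):
        """Fetch the raw data."""

    @abstractmethod
    def process(self):
        """Convert the raw data into the shared output format."""


class UsenetDataset(BaseDataset):
    """Hypothetical Usenet dataset that plugs into the shared interface."""

    def __init__(self):
        self.steps = []

    def download(self):
        # A real implementation would fetch the Usenet archives here.
        self.steps.append("download")

    def process(self):
        # A real implementation would parse the archives and write parquet here.
        self.steps.append("process")


def run(dataset):
    # Any subclass of the base class can be driven the same way.
    dataset.download()
    dataset.process()
    return dataset.steps
```

With a shared base class, a pipeline can call `run(UsenetDataset())` without knowing anything specific about Usenet.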

StellaAthena commented 1 year ago

Do we have an estimate for how big this dataset is?

Jehoshaph commented 1 year ago

> Do we have an estimate for how big this dataset is?

Usenet Comp (link) is around 30 GB

ncoop57 commented 1 year ago

Hey @Jehoshaph, what is the status of this PR?

Jehoshaph commented 1 year ago

> Hey @Jehoshaph, what is the status of this PR?

It still needs some work, I'll be updating the PR soon.

Jehoshaph commented 1 year ago

> Hey @Jehoshaph, what is the status of this PR?

@ncoop57 Updated! Ready for review.

ncoop57 commented 1 year ago

@Jehoshaph could you fix the test/usenet-comp.parquet dummy data? I tried running the tests, but they failed because it is a folder rather than a parquet file. Also, could you merge the changes from working into your local branch?

Jehoshaph commented 1 year ago

> @Jehoshaph could you fix the test/usenet-comp.parquet dummy data? I tried running the tests, but they failed because it is a folder rather than a parquet file. Also, could you merge the changes from working into your local branch?

@ncoop57 That is strange; the test works for me. I am using pyarrow to add to the parquet output incrementally, since my machine is very memory constrained. pyarrow's parquet module works on directories, so calling pyarrow.parquet.read_table('dir_name') on a directory works the same as on a single file. I've moved the test file and made some updates. Let me know if it still does not work, in which case I will need to investigate further.

Merged working into my local branch.

ncoop57 commented 1 year ago

Works like a charm, thanks @Jehoshaph!