CarperAI / Code-Pile

This repository contains all the code for collecting code from GitHub at large scale.
MIT License

change class construction by introducing a dataset config #15

Closed flowpoint closed 1 year ago

flowpoint commented 1 year ago

I wasn't happy with passing the same args to every implementation, so I moved those into a config where we can easily store the options for constructing the datasets.
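To illustrate the idea, here is a minimal sketch of what a config-based constructor could look like. It is not the code from this PR: `DatasetConfig`, its fields (`raw_data_dir`, `output_dir`, `num_workers`), the `Dataset` base class, and `DummyDataset` are all hypothetical names chosen for the example.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass
from pathlib import Path


@dataclass(frozen=True)
class DatasetConfig:
    """Hypothetical container for the options every dataset shares."""

    raw_data_dir: Path    # where downloaded archives would be stored
    output_dir: Path      # where processed documents would be written
    num_workers: int = 1  # assumed option, only here to show a default


class Dataset(ABC):
    """Assumed abstract base: takes one config instead of many args."""

    def __init__(self, config: DatasetConfig) -> None:
        self.config = config

    @abstractmethod
    def download(self) -> None:
        """Fetch the raw data into config.raw_data_dir."""

    @abstractmethod
    def process(self) -> None:
        """Write processed data into config.output_dir."""


class DummyDataset(Dataset):
    """Toy implementation that only records the paths it would use."""

    def download(self) -> None:
        self.downloaded_to = self.config.raw_data_dir / "dummy"

    def process(self) -> None:
        self.processed_to = self.config.output_dir / "dummy"


config = DatasetConfig(raw_data_dir=Path("raw"), output_dir=Path("out"))
dataset = DummyDataset(config)
dataset.download()
dataset.process()
```

With this shape, adding a new shared option means touching only the config class, not the constructor of every implementation.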

This also cleaned up the CLI a little. Additionally, I added the extraction code for StackExchange.
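As a rough sketch of how a CLI can get simpler once options live in a config, the parser below builds a single config object from the command line. The flag names, defaults, and the `CliDatasetConfig` class are assumptions made for illustration, not the actual CLI of this repository.

```python
import argparse
from dataclasses import dataclass
from pathlib import Path
from typing import Optional, Sequence


@dataclass(frozen=True)
class CliDatasetConfig:
    """Hypothetical config assembled from command-line arguments."""

    dataset: str
    raw_data_dir: Path
    output_dir: Path


def parse_config(argv: Optional[Sequence[str]] = None) -> CliDatasetConfig:
    """Parse argv (or sys.argv when None) into one config object."""
    parser = argparse.ArgumentParser(description="Build a dataset (sketch).")
    parser.add_argument("dataset", help="name of the dataset to build")
    parser.add_argument("--raw-data-dir", type=Path, default=Path("data/raw"))
    parser.add_argument("--output-dir", type=Path, default=Path("data/processed"))
    args = parser.parse_args(argv)
    # argparse maps "--raw-data-dir" to the attribute "raw_data_dir".
    return CliDatasetConfig(
        dataset=args.dataset,
        raw_data_dir=args.raw_data_dir,
        output_dir=args.output_dir,
    )


cli_config = parse_config(["stackexchange", "--output-dir", "out"])
```

The CLI then only has to hand `cli_config` to whichever dataset class is selected, instead of forwarding each argument separately.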

Not tested yet, but it is too early for serious testing anyway. These implementations should still be considered spikes.

commit msg: …onfig, add Stackexchange comment, user, post StackExchangeDoc, add 7zip extraction to StackExchangeProcessor

ncoop57 commented 1 year ago

Might be better to split the PR into two: one for StackExchange (since that one needs more work) and one for the CLI and dataset config. Unless you wanna hold this PR as a draft until the StackExchange part is better tested.

flowpoint commented 1 year ago

Sure, this is a draft. I am going to clean up and split the PR.

ncoop57 commented 1 year ago

@flowpoint do you think it is okay to go ahead and merge this in so ppl can start using the interfaces and abstract classes?

flowpoint commented 1 year ago

@ncoop57 I hope this PR is now limited to updating the core classes; the StackExchange script will be its own PR.

reshinthadithyan commented 1 year ago

Looks good to me. Merging the PR.