EleutherAI / the-pile

MIT License

Reddit comment data #2

Closed by StellaAthena 3 years ago

StellaAthena commented 4 years ago

Priority: medium. This shouldn't be too hard, since we can use their code.

leogao2 commented 4 years ago

I think we decided not to use the DialoGPT script, since it does a bunch of processing we don't need.

StellaAthena commented 4 years ago

I must have missed that memo, thanks for the correction.

thoppe commented 4 years ago

If interested: massive data dumps are at https://pushshift.io/api-parameters/. The guy who runs it is really into cool research projects, so he'd be able to put together any kind of subset we might want.

StellaAthena commented 4 years ago

I believe we talked about using Pushshift, but doesn't it have severe throttling? @leogao2 @Mistobaan what was the deal with that?

Mistobaan commented 4 years ago

I haven't been involved in it.

StellaAthena commented 3 years ago

@anishthite has uploaded the processing code here: https://github.com/EleutherAI/reddit_comment_processing

Once he provides the URL where the data is hosted, this will be moved to “ready to merge.”