EleutherAI / the-pile

Stackexchange dataset #28

Closed by sdtblck 3 years ago

sdtblck commented 3 years ago

Stack Exchange dumps are available in XML format at https://archive.org/details/stackexchange, ~15GB compressed. I would imagine they are mostly English.

Will take a bit of work to figure out how to parse the XML to raw text (ideally just Question & Top Answer, or Top n Answers?)

sdtblck commented 3 years ago

Ok - noting down my thoughts here. I may try to take this one on. From what I can tell - the data we're interested in will be contained in the Posts.xml file of each dump.

Some info about the DB structure (from: https://meta.stackexchange.com/questions/2677/database-schema-documentation-for-the-public-data-dump-and-sede)

The data is structured like this:

<row Id="7" PostTypeId="1" AcceptedAnswerId="22" CreationDate="2016-01-12T20:33:56.373" Score="4" ViewCount="38" Body="bodytxt" OwnerUserId="23" LastEditorUserId="119" LastEditDate="2016-01-12T23:05:05.177" LastActivityDate="2016-01-12T23:05:23.980" Title="titletxt" Tags="&lt;bug&gt;&lt;status-completed&gt;" AnswerCount="1" CommentCount="8" ContentLicense="CC BY-SA 3.0" />

where Id = a unique Id for each post, PostTypeId = what type of post it is, can be any of the following:

- 1 = Question
- 2 = Answer
- 3 = Orphaned tag wiki
- 4 = Tag wiki excerpt
- 5 = Tag wiki
- 6 = Moderator nomination
- 7 = "Wiki placeholder" (seems to only be the election description)
- 8 = Privilege wiki

for our purposes, I guess we'll only want to look at PostTypeId 1 (question) & 2 (answer).
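A minimal sketch of reading rows like the one above and keeping only questions and answers. The `<posts>` root wrapper, the trimmed sample rows, and the helper names `iter_posts` and `split_tags` are assumptions for illustration, not the code that ended up in the repo.

```python
import io
import xml.etree.ElementTree as ET

# Trimmed copy of the sample row, plus a made-up tag wiki row (PostTypeId 5).
# The <posts> wrapper is an assumption about how Posts.xml is laid out.
SAMPLE = b"""<posts>
  <row Id="7" PostTypeId="1" AcceptedAnswerId="22" Score="4" Body="bodytxt"
       Title="titletxt" Tags="&lt;bug&gt;&lt;status-completed&gt;" AnswerCount="1" />
  <row Id="9" PostTypeId="5" Body="wiki" />
</posts>"""

WANTED_TYPES = {"1", "2"}  # 1 = Question, 2 = Answer


def iter_posts(source):
    """Stream <row> elements and yield questions/answers as plain dicts."""
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == "row":
            if elem.get("PostTypeId") in WANTED_TYPES:
                yield dict(elem.attrib)
            elem.clear()  # drop the row's attributes once it has been handled


def split_tags(tags):
    """Turn '<bug><status-completed>' into ['bug', 'status-completed']."""
    return [t for t in tags.replace(">", "").split("<") if t]


posts = list(iter_posts(io.BytesIO(SAMPLE)))
print(posts[0]["Id"], split_tags(posts[0]["Tags"]))
```

The XML parser already unescapes `&lt;` and `&gt;`, so attribute values arrive as ordinary strings. Streaming with `iterparse` avoids loading a whole dump file at once, which seems worthwhile given the ~15GB total size.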

Any post with PostTypeId 2 (answer) will also have a 'ParentId' associated with it, where the ParentId is the unique Id of the question it is an answer to.

Any Type 1 (question) post with an 'accepted answer' will also have an 'AcceptedAnswerId' field.

So I guess the high-level process would be to grab all Type 1 posts with an 'AcceptedAnswerId' field & concat all Questions & Accepted answers together.

Should we also grab posts without an accepted answer, and take the most upvoted answer, or is it not worth the hassle?
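The join described above could look roughly like this, assuming each post is already a dict of string attributes keyed by the field names from the schema. The function name `pair_questions` and the `fallback_to_top_voted` switch are hypothetical; the switch covers the open question about posts with no accepted answer.

```python
from collections import defaultdict


def pair_questions(posts, fallback_to_top_voted=False):
    """Return (question, answer) pairs, preferring the accepted answer."""
    by_id = {p["Id"]: p for p in posts}

    # Group answers under the question they belong to via ParentId.
    answers = defaultdict(list)
    for p in posts:
        if p["PostTypeId"] == "2":
            answers[p["ParentId"]].append(p)

    pairs = []
    for q in posts:
        if q["PostTypeId"] != "1":
            continue
        chosen = by_id.get(q.get("AcceptedAnswerId"))
        if chosen is None and fallback_to_top_voted and answers[q["Id"]]:
            # No accepted answer: take the highest-scoring one instead.
            chosen = max(answers[q["Id"]], key=lambda a: int(a["Score"]))
        if chosen is not None:
            pairs.append((q, chosen))
    return pairs


# Made-up posts: question 7 has an accepted answer, question 8 does not.
posts = [
    {"Id": "7", "PostTypeId": "1", "AcceptedAnswerId": "22"},
    {"Id": "22", "PostTypeId": "2", "ParentId": "7", "Score": "3"},
    {"Id": "8", "PostTypeId": "1"},
    {"Id": "30", "PostTypeId": "2", "ParentId": "8", "Score": "1"},
    {"Id": "31", "PostTypeId": "2", "ParentId": "8", "Score": "6"},
]

strict = [(q["Id"], a["Id"]) for q, a in pair_questions(posts)]
relaxed = [(q["Id"], a["Id"]) for q, a in pair_questions(posts, True)]
print(strict, relaxed)
```

This version holds every post in memory, which is fine for a small site dump but would probably need a two-pass or on-disk approach for the largest ones.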

StellaAthena commented 3 years ago

I think we should grab posts with an upvoted answer above some designated threshold, and pull all answers above that threshold. I think it would actually make sense to turn questions with multiple answers into multiple documents, so that each document is a question followed by an answer. Questions would be repeated when they have multiple answers.

StellaAthena commented 3 years ago

@sdtblck @leogao2 @anishthite and I discussed this on discord. We reached a consensus to grab all answers above a threshold number of votes (tentatively 5, but experimentation is needed), displaying the accepted answer first and the rest in decreasing number of votes. One document will consist of one question and all answers that meet the threshold.

Questions without answers that meet the threshold will be excluded. Questions with answers meeting the threshold but no accepted answers will be included.
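A sketch of the selection rule as stated in this comment: keep answers that meet the threshold, put the accepted answer first, then order the rest by decreasing votes. It assumes that `Score` is the vote count being thresholded, that "meets the threshold" means `>=`, and that an accepted answer below the threshold is dropped like any other; `select_answers` and `min_score` are hypothetical names.

```python
def select_answers(question, answers, min_score=5):
    """Return the answers to keep for one question, in display order.

    An empty list means the question is excluded from the dataset.
    """
    kept = [a for a in answers if int(a["Score"]) >= min_score]
    accepted_id = question.get("AcceptedAnswerId")
    # False sorts before True, so the accepted answer (if kept) comes first;
    # the remaining answers follow in decreasing score.
    kept.sort(key=lambda a: (a["Id"] != accepted_id, -int(a["Score"])))
    return kept


# Made-up answers for illustration.
answers = [
    {"Id": "21", "Score": "12"},
    {"Id": "22", "Score": "6"},
    {"Id": "23", "Score": "2"},
    {"Id": "24", "Score": "9"},
]

with_accepted = select_answers({"Id": "7", "AcceptedAnswerId": "22"}, answers)
without_accepted = select_answers({"Id": "8"}, answers)
nothing_kept = select_answers({"Id": "9"}, answers, min_score=50)

print([a["Id"] for a in with_accepted])
print([a["Id"] for a in without_accepted])
print(nothing_kept)
```

Sorting purely by score only requires dropping the first element of the sort key.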

Further comment is welcome, but this is the approach we will take for now.

sdtblck commented 3 years ago

@StellaAthena I think we actually settled on solely sorting by score in the end. I see the most upvoted answer fail to be the best answer less often than I see the 'accepted answer' fail to be the best answer, so sorting by score makes more sense to me.

The repo will be done by tonight.

sdtblck commented 3 years ago

Ok, done https://github.com/EleutherAI/stackexchange_dataset

anishthite commented 3 years ago

Adding the following for documentation purposes:

We also had a discussion over whether to make each QA pair its own document or to include all the chosen answers to each question in a single document.

Basically this:

Q: A1:

Q: A2:

or this:

Q: A1: A2:

Pros for the former:
- More data
- Showing different answers to the same question in different documents gives the model multiple examples of how a prompt might be responded to.
- Generation will give a single definitive answer

Pros for the latter:
- More context, specifically with the ordering of answers
- Answers might reference other answers, which is important context as well
- Generation will give multiple answers; the single-answer generation scheme of the former can still be achieved through post-processing

The consensus was to use the latter method.

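For reference, the two layouts can be sketched as follows. The `Q:` / `A:` / `A1:` labels, the newline separators, and both function names are assumptions for illustration; the answers are assumed to be already filtered and sorted.

```python
def one_doc_per_answer(question, answers):
    """Former layout: repeat the question once per answer."""
    return [f"Q: {question}\nA: {answer}" for answer in answers]


def one_doc_per_question(question, answers):
    """Latter layout: one document holding the question and all its answers."""
    parts = [f"Q: {question}"]
    parts += [f"A{i}: {answer}" for i, answer in enumerate(answers, start=1)]
    return "\n".join(parts)


separate = one_doc_per_answer("How do I X?", ["Use Y.", "Try Z."])
combined = one_doc_per_question("How do I X?", ["Use Y.", "Try Z."])

print(separate)
print(combined)
```

Splitting the combined layout back into single-answer documents is a simple post-processing step, which matches the last point in the list above.
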
StellaAthena commented 3 years ago

@sdtblck this appears to have not made it into datasets.py. Is there a hold up or did you just forget?

sdtblck commented 3 years ago

got distracted :[ had some other things on my plate today. Will get round to it soon.