bitplane opened 1 year ago
This sounds fun and I would like to have a go at doing this, brand new to this project though so I am not sure how I should parse this into a data format that can be understood by the rest of the system. Any pointers on where to look?
That would be great, thanks! Some already completed data scraping and formatting notebooks can be found here to use as reference: https://github.com/LAION-AI/Open-Assistant/tree/main/notebooks
sweet, thanks i will have a go :)
Awesome, data schemas are here:
@mcleantom can you give us a status update please?
Been working on it, and so far I can parse threads in formats like:
== title ==
* Support. I like this idea. —User:Example
** Question: What do you like about it? —User:Example2
*** It seems to fit the spirit of Wikipedia. —User:Example
or
: Support. I like this idea. —User:Example
:: Question: What do you like about it? —User:Example2
::: It seems to fit the spirit of Wikipedia. —User:Example
or
* Support. I like this idea. —User:Example
*: Question: What do you like about it? —User:Example2
*:: It seems to fit the spirit of Wikipedia. —User:Example
or
: Support. I like this idea. —User:Example
:* Question: What do you like about it? —User:Example2
Then, to break up long chains you can also do:
{{od2}} new reply
{{od2|:}} new reply 2
{{od2|::}} new reply 3
{{od|:}} new reply 4
{{od|::}} new reply 5
But then on top of that, you can have someone make a bullet point list:
* One
* Two
** Two point one
* Three
And on top of that, lots of talk pages just don't follow these rules anyway, and it's hard to detect when they don't. One way to detect this would be to see if a reply jumps two indentation levels at a time, i.e.:
: Support. I like this idea. —User:Example
::: It seems to fit the spirit of Wikipedia. —User:Example
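A quick sketch of that check (just my assumption of how to measure it: only the leading run of *, : and # characters counts as indentation):

import re

def indent_level(line):
    # Depth = length of the leading run of wikitext list/indent characters.
    match = re.match(r"[*:#]+", line)
    return len(match.group(0)) if match else 0

def has_indentation_jump(lines):
    # Flag a thread whose reply depth increases by more than one level at a
    # time, which suggests the page isn't following the usual conventions.
    previous = 0
    for line in lines:
        level = indent_level(line)
        if level > previous + 1:
            return True
        previous = level
    return False

print(has_indentation_jump([
    ": Support. I like this idea.",
    "::: It seems to fit the spirit of Wikipedia.",  # jumps from 1 to 3
]))  # True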
Then I need to remove all the syntax elements from the text. However, there are so many different syntax elements (https://en.wikipedia.org/wiki/Help:Cheatsheet) that I need a way of implementing features one by one and skipping text with unimplemented syntax.
I searched around for a while for libraries to help me do this but didn't find any. Looking again just now, though, I have found some libraries that might save me from implementing this myself:
This second library has some good examples of solving the problems I have struggled with myself, e.g.:
>>> import wikitextparser as wtp
>>> parsed = wtp.parse(
... 'text\n'
... '* list item a\n'
... '* list item b\n'
... '** sub-list of b\n'
... '* list item c\n'
... '** sub-list of b\n'
... 'text'
... )
>>> wikilist = parsed.get_lists()[0]
>>> wikilist.items
[' list item a', ' list item b', ' list item c']
So I will try again with one of these libraries this week.
One question: how should I deal with formatting elements, i.e. should I convert <b>Some bold text</b> into Some bold text?
Edit: Those libraries I think will make it a lottttttttttt easier.
Nice work and analysis, thank you :)
Very apt name :joy:
Since there's so much of it, I guess it doesn't hurt to throw a fair portion of the data away. I might be wrong here (someone please correct me if I am!) but I think that even if only a minority of trees remain intact it'll still be a very valuable data source. As long as the filters are documented and we get some kind of measure like "captures x% of talk content" type disclaimer, then it can be improved on in future.
> Should I convert <b>Some bold text</b> into Some bold text?
Doing it twice is probably best - wiki markup to HTML, then HTML to Markdown, to cater for the mixed case. So it'd go '''Some bold text''' -> <b>Some bold text</b> -> **Some bold text**. The HTML is valid in our Markdown output, but I think we'd prefer Markdown over HTML. Or maybe just strip formatting for now if it's too much of a pain.
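For example (just a sketch; pypandoc and markdownify are assumptions on my part, not existing project dependencies, and pandoc itself needs to be installed for pypandoc to work):

import pypandoc
from markdownify import markdownify as html_to_markdown

def wikitext_to_markdown(wikitext):
    # Step 1: wiki markup -> HTML. Pandoc's mediawiki reader should also pass
    # most raw inline HTML (like <b>) through, which covers the mixed case.
    html = pypandoc.convert_text(wikitext, "html", format="mediawiki")
    # Step 2: HTML -> Markdown.
    return html_to_markdown(html).strip()

print(wikitext_to_markdown("'''Some bold text''' and <b>more bold</b>"))
# expected: **Some bold text** and **more bold**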
Links might be worth investigating too. Right at the start of the data there's some [positive statement](Hitler reference) sarcasm going on - I'm not sure if that's a common/cultural thing over there; it might be best to drop threads that have link titles with no matching words in the link.
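Something like this could be a first cut at that filter (a sketch only; the word-overlap rule and function name are made up here, and the link target/title would come from whatever the wikitext parser extracts):

import re

def link_title_matches_target(title, target):
    # Keep a link only if its visible title shares at least one word with the
    # link target; threads containing links that fail this could be dropped.
    words = lambda s: set(re.findall(r"[a-z0-9]+", s.lower()))
    return bool(words(title) & words(target))

print(link_title_matches_target("great guy", "Adolf Hitler"))   # False -> suspicious
print(link_title_matches_target("Main Page", "the main page"))  # True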
But I don't think it has to be perfect for a first pass. That would be nice, but realistically these things are big and will take a lot of work to get right. We'll need peer review to identify weak spots, and to iterate by creating more tickets to address the issues as they're discovered. Obviously do what you can, but IMO it's better to have something and have people improve on it over time than to be overwhelmed by it.
I was overwhelmed by trying to make my own wikitext parser but that second library is really good actually. It has some really helpful functions for:
removing the formatting from the text
>>> from wikitextparser import remove_markup, parse
>>> s = "'''a'''<!--comment--> [[b|c]] [[d]]"
>>> remove_markup(s)
'a c d'
>>> parse(s).plain_text()
'a c d'
finding and iterating over nested lists
>>> import wikitextparser as wtp
>>> parsed = wtp.parse(
... 'text\n'
... '* list item a\n'
... '* list item b\n'
... '** sub-list of b\n'
... '* list item c\n'
... '** sub-list of b\n'
... 'text'
... )
>>> wikilist = parsed.get_lists()[0]
>>> wikilist.items
[' list item a', ' list item b', ' list item c']
>>> wikilist.sublists()
[WikiList('** sub-list of b\n'), WikiList('** sub-list of b\n')]
>>> wikilist.sublists(1)[0].items
[' sub-list of b']
So these two functions should mean I can parse this text a lot faster, and hopefully we can then build up and make improvements to the filtering and cleaning from there.
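For example, something like this (an untested sketch, only using the .items / .sublists(i) calls shown above) could turn a parsed list into nested replies:

import wikitextparser as wtp

def wikilist_to_replies(wikilist):
    # Each top-level item becomes a reply; sublists(i) gives the lists nested
    # under item i, so recursing on them builds the reply tree.
    replies = []
    for i, item in enumerate(wikilist.items):
        children = []
        for sublist in wikilist.sublists(i):
            children.extend(wikilist_to_replies(sublist))
        replies.append({"text": item.strip(), "children": children})
    return replies

parsed = wtp.parse("* list item a\n* list item b\n** sub-list of b\n")
print(wikilist_to_replies(parsed.get_lists()[0]))
# [{'text': 'list item a', 'children': []},
#  {'text': 'list item b', 'children': [{'text': 'sub-list of b', 'children': []}]}]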
And yes, I was planning on just dumping any data that had any error I could detect; there is so much available.
The Wikimedia Foundation released a Python package to easily work with HTML dumps.
https://techblog.wikimedia.org/2023/02/24/from-hell-to-html/
https://gitlab.wikimedia.org/repos/research/html-dumps/
This package can help @mcleantom
Hey, I am going to try and upload my parsed data onto Hugging Face, however I am a bit confused about how I should do this. I have parsed my data as a tree, using a Pydantic model that looks like:
from datetime import datetime
from enum import Enum
from typing import List
from pydantic import BaseModel

class ConversationTreeNodeMetaData(BaseModel):
    username: str
    timestamp: datetime

class RoleEnum(str, Enum):
    prompter = "prompter"
    assistant = "assistant"

class ConversationTreeNode(BaseModel):
    text: str
    role: RoleEnum
    children: List["ConversationTreeNode"]
    metadata: ConversationTreeNodeMetaData

# Needed (in Pydantic v1) so the self-referencing `children` field resolves.
ConversationTreeNode.update_forward_refs()

class ConversationTreeMetaData(BaseModel):
    title: str
    topic: str

class ConversationTree(BaseModel):
    root: ConversationTreeNode
    metadata: ConversationTreeMetaData
So each talk page has the first comment as the root node, then the replies to that are branches, etc.
In the folder /data/datasets it says the format should be a table with instruction-response pairs.
Should I just be looping over the prompt-response pairs of the tree and appending rows to a table?
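For reference, this is roughly what I had in mind (a sketch against the model above; the instruction/response column names are just my guess at the expected table format):

def tree_to_pairs(node):
    # Emit one row per parent -> child edge in the conversation tree; whether
    # to keep only prompter -> assistant edges is still an open question.
    rows = []
    for child in node.children:
        rows.append({"instruction": node.text, "response": child.text})
        rows.extend(tree_to_pairs(child))
    return rows

# usage: rows = tree_to_pairs(conversation_tree.root)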
In v0.0.1-beta48 there is a template for doing the datasets. However, in the most recent version the folder /openassistant/templates is gone and the folder /data/dataset has been made instead. This gets rid of the template, and most of the examples in there seem to just provide a script to upload the data to Hugging Face. Do I need to fill out the code for hub.py, prepare.py and template.py, or do I just need to provide a script to upload the data to Hugging Face?
These could make a nice question/answer data format; they're kind-of tree-like but would need a lot of filtering and formatting into trees.
Some initial research if anyone wants to have a stab at this:
The dump files to use are the ones matching [a-zA-Z]+wiki\-[0-9]+\-pages\-meta\-current[0-9]+\.xml.*\.bz2.
bzip2 --decompress some_file.bz2 creates some_file.
Split the XML by <page> elements, then filter by <ns>1</ns> to get talk pages, and extract the <id> and <text> of each one.
Threads are split by an ==Topic== heading, if there is one set.
Comments are signed like [[User:chummer|chummer]] 05:45, 13 Feb 2005 (UTC). The user link may be an IP address; both are PII.
Replies are prefixed with :s to show how indented they are; this, combined with the above, is how to build the thread.
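Roughly, the extraction step could look something like this (just a sketch: it streams the compressed dump directly rather than decompressing first, the {*} wildcard strips the MediaWiki export XML namespace, and the file name is a placeholder):

import bz2
import xml.etree.ElementTree as ET

def iter_talk_pages(path):
    # Stream a pages-meta-current XML dump and yield (page_id, wikitext) for
    # namespace-1 (talk) pages without loading the whole file into memory.
    with bz2.open(path, "rb") as f:
        for _, elem in ET.iterparse(f):
            if not elem.tag.endswith("}page") and elem.tag != "page":
                continue
            ns = elem.find("{*}ns")
            if ns is not None and ns.text == "1":
                page_id = elem.find("{*}id")
                text = elem.find(".//{*}text")
                yield page_id.text, (text.text if text is not None else "")
            elem.clear()  # free memory as pages are processed

# for page_id, wikitext in iter_talk_pages("some_file.xml.bz2"):
#     ...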