bitplane opened 1 year ago
This sounds fun and I would like to have a go at doing this, brand new to this project though so I am not sure how I should parse this into a data format that can be understood by the rest of the system. Any pointers on where to look?
That would be great, thanks! Some already completed data scraping and formatting notebooks can be found here to use as reference: https://github.com/LAION-AI/Open-Assistant/tree/main/notebooks
sweet, thanks i will have a go :)
Awesome, data schemas are here:
@mcleantom can you give us a status update please?
Been working on it, and so far I can parse threads in formats like:
== title ==
* Support. I like this idea. —User:Example
** Question: What do you like about it? —User:Example2
*** It seems to fit the spirit of Wikipedia. —User:Example
or
: Support. I like this idea. —User:Example
:: Question: What do you like about it? —User:Example2
::: It seems to fit the spirit of Wikipedia. —User:Example
or
* Support. I like this idea. —User:Example
*: Question: What do you like about it? —User:Example2
*:: It seems to fit the spirit of Wikipedia. —User:Example
or
: Support. I like this idea. —User:Example
:* Question: What do you like about it? —User:Example2
Then, to break up long chains you can also do:
{{od2}} new reply
{{od2|:}} new reply 2
{{od2|::}} new reply 3
{{od|:}} new reply 4
{{od|::}} new reply 5
But then on top of that, you can have someone make a bullet point list:
* One
* Two
** Two point one
* Three
And on top of that, lots of talk pages just don't follow these rules anyway, and it's hard to detect when they don't. One way to detect this would be to see if a reply jumps two indentation levels at a time, i.e.:
: Support. I like this idea. —User:Example
::: It seems to fit the spirit of Wikipedia. —User:Example
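A quick sketch of that check (just my assumption of how to measure it: only the leading run of *, : and # characters counts as indentation):

import re

def indent_level(line):
    # Depth = length of the leading run of wikitext list/indent characters.
    match = re.match(r"[*:#]+", line)
    return len(match.group(0)) if match else 0

def has_indentation_jump(lines):
    # Flag a thread whose reply depth increases by more than one level at a
    # time, which suggests the page isn't following the usual conventions.
    previous = 0
    for line in lines:
        level = indent_level(line)
        if level > previous + 1:
            return True
        previous = level
    return False

print(has_indentation_jump([
    ": Support. I like this idea.",
    "::: It seems to fit the spirit of Wikipedia.",  # jumps from 1 to 3
]))  # True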
Then I need to remove all the syntax elements from the text. However, there are so many different syntax elements (https://en.wikipedia.org/wiki/Help:Cheatsheet) that I need a way of implementing features one by one and skipping text with unimplemented syntax.
I searched around for a while for libraries to help me do this but didn't find any. Looking again just now, though, I have found some libraries that might save me from implementing this myself:
This second library has some good examples of solving the problems I have struggled with myself, e.g.:
>>> import wikitextparser as wtp
>>> parsed = wtp.parse(
... 'text\n'
... '* list item a\n'
... '* list item b\n'
... '** sub-list of b\n'
... '* list item c\n'
... '** sub-list of b\n'
... 'text'
... )
>>> wikilist = parsed.get_lists()[0]
>>> wikilist.items
[' list item a', ' list item b', ' list item c']
So I will try again with one of these libraries this week.
One question: how should I deal with formatting elements, i.e. should I convert <b>Some bold text</b> into Some bold text?
Edit: Those libraries I think will make it a lottttttttttt easier.
Nice work and analysis, thank you :)
Very apt name :joy:
Since there's so much of it, I guess it doesn't hurt to throw a fair portion of the data away. I might be wrong here (someone please correct me if I am!) but I think that even if only a minority of trees remain intact it'll still be a very valuable data source. As long as the filters are documented and we get some kind of measure like "captures x% of talk content" type disclaimer, then it can be improved on in future.
> Should I convert <b>Some bold text</b> into Some bold text?
Doing it twice is probably best - wiki markup to HTML, then HTML to Markdown, to cater for the mixed case. So it'd go '''Some bold text''' -> <b>Some bold text</b> -> **Some bold text**. The HTML is valid in our Markdown output, but I think we'd prefer Markdown over HTML. Or maybe just strip formatting for now if it's too much of a pain.
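For example (just a sketch; pypandoc and markdownify are assumptions on my part, not existing project dependencies, and pandoc itself needs to be installed for pypandoc to work):

import pypandoc
from markdownify import markdownify as html_to_markdown

def wikitext_to_markdown(wikitext):
    # Step 1: wiki markup -> HTML. Pandoc's mediawiki reader should also pass
    # most raw inline HTML (like <b>) through, which covers the mixed case.
    html = pypandoc.convert_text(wikitext, "html", format="mediawiki")
    # Step 2: HTML -> Markdown.
    return html_to_markdown(html).strip()

print(wikitext_to_markdown("'''Some bold text''' and <b>more bold</b>"))
# expected: **Some bold text** and **more bold**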
Links might be worth investigating too. Right at the start of the data there's some [positive statement](Hitler reference) sarcasm going on - I'm not sure if that's a common/cultural thing over there; it might be best to drop threads that have link titles with no matching words in the link.
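Something like this could be a first cut at that filter (a sketch only; the word-overlap rule and function name are made up here, and the link target/title would come from whatever the wikitext parser extracts):

import re

def link_title_matches_target(title, target):
    # Keep a link only if its visible title shares at least one word with the
    # link target; threads containing links that fail this could be dropped.
    words = lambda s: set(re.findall(r"[a-z0-9]+", s.lower()))
    return bool(words(title) & words(target))

print(link_title_matches_target("great guy", "Adolf Hitler"))   # False -> suspicious
print(link_title_matches_target("Main Page", "the main page"))  # True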
But I don't think it has to be perfect for a first pass. That would be nice, but realistically these things are big and will take a lot of work to get right. We'll need peer review to identify weak spots, and to iterate by creating more tickets to address the issues as they're discovered. Obviously do what you can, but IMO it's better to have something and have people improve on it over time than to be overwhelmed by it.
I was overwhelmed by trying to make my own wikitext parser but that second library is really good actually. It has some really helpful functions for:
removing the formatting from the text
>>> from wikitextparser import remove_markup, parse
>>> s = "'''a'''<!--comment--> [[b|c]] [[d]]"
>>> remove_markup(s)
'a c d'
>>> parse(s).plain_text()
'a c d'
finding and iterating over nested lists
>>> import wikitextparser as wtp
>>> parsed = wtp.parse(
... 'text\n'
... '* list item a\n'
... '* list item b\n'
... '** sub-list of b\n'
... '* list item c\n'
... '** sub-list of b\n'
... 'text'
... )
>>> wikilist = parsed.get_lists()[0]
>>> wikilist.items
[' list item a', ' list item b', ' list item c']
>>> wikilist.sublists()
[WikiList('** sub-list of b\n'), WikiList('** sub-list of b\n')]
>>> wikilist.sublists(1)[0].items
[' sub-list of b']
So these two functions should mean I can parse this text a lot faster, and hopefully we can then build up and make improvements to the filtering and cleaning from there.
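For example, something like this (an untested sketch, only using the .items / .sublists(i) calls shown above) could turn a parsed list into nested replies:

import wikitextparser as wtp

def wikilist_to_replies(wikilist):
    # Each top-level item becomes a reply; sublists(i) gives the lists nested
    # under item i, so recursing on them builds the reply tree.
    replies = []
    for i, item in enumerate(wikilist.items):
        children = []
        for sublist in wikilist.sublists(i):
            children.extend(wikilist_to_replies(sublist))
        replies.append({"text": item.strip(), "children": children})
    return replies

parsed = wtp.parse("* list item a\n* list item b\n** sub-list of b\n")
print(wikilist_to_replies(parsed.get_lists()[0]))
# [{'text': 'list item a', 'children': []},
#  {'text': 'list item b', 'children': [{'text': 'sub-list of b', 'children': []}]}]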
And yes, I was planning on just dumping any data that had any error I could detect; there is so much available.
The Wikimedia Foundation released a Python package to easily work with HTML dumps.
https://techblog.wikimedia.org/2023/02/24/from-hell-to-html/
https://gitlab.wikimedia.org/repos/research/html-dumps/
This package can help @mcleantom
Hey, I am going to try and upload my parsed data onto Hugging Face, however I am a bit confused about how I should do this. I have parsed my data as a tree, using a Pydantic model that looks like:
from datetime import datetime
from enum import Enum
from typing import List
from pydantic import BaseModel

class ConversationTreeNodeMetaData(BaseModel):
    username: str
    timestamp: datetime

class RoleEnum(str, Enum):
    prompter = "prompter"
    assistant = "assistant"

class ConversationTreeNode(BaseModel):
    text: str
    role: RoleEnum
    children: List["ConversationTreeNode"]
    metadata: ConversationTreeNodeMetaData

# Needed (in Pydantic v1) so the self-referencing `children` field resolves.
ConversationTreeNode.update_forward_refs()

class ConversationTreeMetaData(BaseModel):
    title: str
    topic: str

class ConversationTree(BaseModel):
    root: ConversationTreeNode
    metadata: ConversationTreeMetaData
So each talk page has the first comment as the root node, then the replies to that are branches, etc.
In the folder /data/datasets it says the format should be a table with instruction-response pairs.
Should I just be looping over the prompt-response pairs of the tree and appending rows to a table?
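For reference, this is roughly what I had in mind (a sketch against the model above; the instruction/response column names are just my guess at the expected table format):

def tree_to_pairs(node):
    # Emit one row per parent -> child edge in the conversation tree; whether
    # to keep only prompter -> assistant edges is still an open question.
    rows = []
    for child in node.children:
        rows.append({"instruction": node.text, "response": child.text})
        rows.extend(tree_to_pairs(child))
    return rows

# usage: rows = tree_to_pairs(conversation_tree.root)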
In v0.0.1-beta48 there is a template for doing the datasets. However, in the most recent version the folder /openassistant/templates is gone and the folder /data/dataset has been made instead. This gets rid of the template, and most of the examples in there seem to just provide a script to upload the data to Hugging Face. Do I need to fill out the code for hub.py, prepare.py and template.py, or do I just need to provide a script to upload the data to Hugging Face?
These could make a nice question/answer data format; they're kind-of tree-like but would need a lot of filtering and formatting into trees.
Some initial research if anyone wants to have a stab at this:
The dump files to use are the ones matching [a-zA-Z]+wiki\-[0-9]+\-pages\-meta\-current[0-9]+\.xml.*\.bz2.
bzip2 --decompress some_file.bz2 creates some_file.
Split the XML by <page> elements, then filter by <ns>1</ns> to get talk pages, and extract the <id> and <text> of each one.
Threads are split by an ==Topic== heading, if there is one set.
Comments are signed like [[User:chummer|chummer]] 05:45, 13 Feb 2005 (UTC). The user link may be an IP address; both are PII.
Replies are prefixed with :s to show how indented they are; this, combined with the above, is how to build the thread.
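Roughly, the extraction step could look something like this (just a sketch: it streams the compressed dump directly rather than decompressing first, the {*} wildcard strips the MediaWiki export XML namespace, and the file name is a placeholder):

import bz2
import xml.etree.ElementTree as ET

def iter_talk_pages(path):
    # Stream a pages-meta-current XML dump and yield (page_id, wikitext) for
    # namespace-1 (talk) pages without loading the whole file into memory.
    with bz2.open(path, "rb") as f:
        for _, elem in ET.iterparse(f):
            if not elem.tag.endswith("}page") and elem.tag != "page":
                continue
            ns = elem.find("{*}ns")
            if ns is not None and ns.text == "1":
                page_id = elem.find("{*}id")
                text = elem.find(".//{*}text")
                yield page_id.text, (text.text if text is not None else "")
            elem.clear()  # free memory as pages are processed

# for page_id, wikitext in iter_talk_pages("some_file.xml.bz2"):
#     ...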