LAION-AI / Open-Assistant

OpenAssistant is a chat-based assistant that understands tasks, can interact with third-party systems, and retrieve information dynamically to do so.
https://open-assistant.io
Apache License 2.0

Eksisozluk Dataset (A more popular Reddit-like website, for the Turkish language) #2376

Open kerem0comert opened 1 year ago

kerem0comert commented 1 year ago

Site Description

Ekşi Sözlük is a collaborative hypertext dictionary built on user contributions. Founded in 1999, it can be thought of as a mix between Reddit and Urban Dictionary.

It is by far the most popular internet forum for the Turkish language. Stats from the year 2022 show that the website has seen 363 million unique visitors with over 8 billion individual page visits, with around 3.5 million daily visits on average.

As an online public sphere, Ekşi Sözlük is not only used by thousands to share information on topics ranging from scientific subjects to everyday life issues, but also serves as a virtual socio-political community for communicating disputed political content and sharing personal views. As such, the subjects discussed there range from politics to science and relationships.

Creating the dataset

Eksipy

There exist several APIs to crawl Eksisozluk data; from my previous experience, however, eksipy seems to be a simple and effective solution. It allows easy retrieval of each user entry for any given topic. One can create an automated script to crawl through different topics and create snapshots of .parquet data every so often. A job can also be created that runs daily to crawl that day's discussions for up-to-date data.
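The crawl-and-snapshot idea above could be sketched roughly as follows. This is only an illustration under stated assumptions: `fetch_entries` is a hypothetical callable standing in for whatever eksipy actually exposes (its real API is not shown here), the entry fields are assumed, and the snapshot is written as JSON Lines instead of .parquet to keep the sketch free of third-party dependencies.

```python
import json
from datetime import date
from pathlib import Path
from typing import Callable, Iterable, Optional


def snapshot_topics(
    topics: Iterable[str],
    fetch_entries: Callable[[str], Iterable[dict]],  # hypothetical stand-in for an eksipy call
    out_dir: str,
    day: Optional[date] = None,
) -> Path:
    """Crawl each topic and write all entries into one dated snapshot file."""
    day = day or date.today()
    out_path = Path(out_dir)
    out_path.mkdir(parents=True, exist_ok=True)
    snapshot = out_path / f"eksisozluk-{day.isoformat()}.jsonl"
    with snapshot.open("w", encoding="utf-8") as f:
        for topic in topics:
            for entry in fetch_entries(topic):
                # One JSON object per line; keep Turkish characters unescaped.
                f.write(json.dumps(entry, ensure_ascii=False) + "\n")
    return snapshot
```

A daily job would call `snapshot_topics` once per day with that day's topics; swapping the JSON Lines writer for a Parquet writer would not change the overall shape.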

Overall, with the current state of Open-Assistant Stats showing a significant lack of data in the Turkish language, I believe Eksisozluk will provide a great amount of naturally generated data, ranging from serious to casual.

Dataset Format

Each Eksisozluk thread consists of a topic (baslik) and the entries (giri) that users post under it. In general, each user adds one entry per topic. Common website guidelines say that "each entry should be structured as if it is a dictionary entry" (hence the literal translation of the website name, Eksi Sozluk = Sour Dictionary). In theory, this can provide higher-quality data that aligns with the answers of a virtual assistant. Each entry is only one level deep (in other words, there cannot be replies under an entry), which simplifies the format of the dataset. Another advantage of Eksisozluk is that user content needs to adhere to certain formatting guidelines. This process is moderated, and entries that do not adhere to the rules are deleted. (Note that moderators enforced this more strictly in the past. If the structure of the data is a concern, we can always include older entries that were written under heavier moderation.)

I see that you define several dataset types. In my view, the Eksisozluk Dataset can be structured to work as either a Text-only Dataset or an Instruction Dataset.

Text-only Dataset

Following the structure defined, each individual entry can be the TEXT, the TOPIC can be the SOURCE and METADATA may include the username and the date. As an example, this entry can be structured as follows:

| TEXT (string) | SOURCE (string) | METADATA (string) |
| --- | --- | --- |
| gitar calmak icin kullanilan minik plastik garip nesne. | pena | "{"author":"ssg","date":"1999-02-15"}" |

Here:

- topic: pena (guitar pick)
- entry: gitar calmak icin kullanilan minik plastik garip nesne (tiny plastic strange object used to play the guitar)
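As a minimal sketch, the mapping from a crawled entry to a Text-only row could look like this. The function name and its parameters are made up for illustration; only the three column names and the example values come from the proposal above.

```python
import json


def to_text_only_row(topic: str, entry_text: str, author: str, date: str) -> dict:
    """Map one entry to the TEXT / SOURCE / METADATA columns."""
    # METADATA is stored as a compact JSON string, matching the example row.
    metadata = json.dumps(
        {"author": author, "date": date},
        ensure_ascii=False,
        separators=(",", ":"),
    )
    return {"TEXT": entry_text, "SOURCE": topic, "METADATA": metadata}


row = to_text_only_row(
    topic="pena",
    entry_text="gitar calmak icin kullanilan minik plastik garip nesne.",
    author="ssg",
    date="1999-02-15",
)
```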

Instruction Dataset

Alternatively, each entry can be formulated as a question-answer pair, posed to the assistant. Each topic can be prepended with an appropriate question, which can be selected at random for each entry. Naturally, each row should be in Turkish. Alternative prepending questions could be:

Using the same example entry, we can formulate such a row:

| INSTRUCTION (string) | RESPONSE (string) | METADATA (string) |
| --- | --- | --- |
| pena hakkinda ne dusunuyorsun? | gitar calmak icin kullanilan minik plastik garip nesne. | "{"author":"ssg","date":"1999-02-15"}" |
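The random question prepending could be sketched as below. This is an assumption-laden illustration: the first template is taken from the example row, while the second ("{topic} nedir?", roughly "what is {topic}?") is a hypothetical addition, and the function name and signature are invented for this sketch.

```python
import json
import random
from typing import Optional, Sequence

QUESTION_TEMPLATES = (
    "{topic} hakkinda ne dusunuyorsun?",  # from the example row
    "{topic} nedir?",  # hypothetical extra template, not from the proposal
)


def to_instruction_row(
    topic: str,
    entry_text: str,
    author: str,
    date: str,
    templates: Sequence[str] = QUESTION_TEMPLATES,
    rng: Optional[random.Random] = None,
) -> dict:
    """Turn one entry into an INSTRUCTION / RESPONSE / METADATA row."""
    rng = rng or random.Random()
    # Pick one question template at random and fill in the topic.
    instruction = rng.choice(list(templates)).format(topic=topic)
    metadata = json.dumps(
        {"author": author, "date": date},
        ensure_ascii=False,
        separators=(",", ":"),
    )
    return {"INSTRUCTION": instruction, "RESPONSE": entry_text, "METADATA": metadata}
```

Passing a seeded `random.Random` makes the template choice reproducible across dataset builds.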

Let me know what you think! Looking forward to your feedback and opinions before I get to work.

SUPERMASSlVE commented 1 year ago

That'd be a great gem. Albeit highly biased, the aggregate source of hyperlocalized and structured data is basically gold

kerem0comert commented 1 year ago

> That'd be a great gem. Albeit highly biased, the aggregate source of hyperlocalized and structured data is basically gold

Thanks, although work on this still has not started. If this is a viable idea, I would be happy to have some input from the people who are currently working on creating datasets for OpenAssistant.

ahmetfirat23 commented 2 months ago

@kerem0comert Are there any updates on this?