SEACrowd / seacrowd-datahub

A collaborative project to collect datasets in SEA languages, SEA regions, or SEA cultures.
Apache License 2.0

Create dataset loader for HSE Thai Corpus #113

Open SamuelCahyawijaya opened 10 months ago

SamuelCahyawijaya commented 10 months ago

Dataloader name: hse_thai/hse_thai.py
DataCatalogue: http://seacrowd.github.io/seacrowd-catalogue/card.html?hse_thai

Dataset hse_thai
Description The HSE Thai Corpus is a corpus of modern texts written in the Thai language. The texts, containing 50 million tokens in total, were collected from various Thai websites (mostly news websites). To make it easier for non-Thai speakers to comprehend and use the texts in the corpus, the researchers decided to separate the words in each sentence with spaces. The data for the corpus was collected by means of Scrapy. The PyThai module was used to tokenize the texts. The text in this dataset is encoded in UTF-8. This dataset contains text from two sources: Wikipedia and thaigov.go.th. The former is licensed under a standard Wikipedia license, and the latter under an Open Government License for Thailand.
Subsets -
Languages tha
Tasks Language Modeling, Language Identification
License Apache license 2.0 (apache-2.0)
Homepage https://www.kaggle.com/datasets/rtatman/hse-thai-corpus
HF URL -
Paper URL https://www.kaggle.com/datasets/rtatman/hse-thai-corpus/data
bp-high commented 10 months ago

self-assign

github-actions[bot] commented 9 months ago

Hi, may I know if you are still working on this issue? Please let @holylovenia @SamuelCahyawijaya @sabilmakbar know if you need any help.

bp-high commented 9 months ago

Yep, still working on this issue. I got busy with some things, but I'll try to wrap it up by next week.

sabilmakbar commented 9 months ago

Thanks for letting us know, @bp-high. I'm removing the stale tag for now. Please add the pr-ready tag whenever you have finished your dataloader so that the bot won't mark this issue as stale, or let us know if you need more time for this issue.

github-actions[bot] commented 9 months ago

Hi, may I know if you are still working on this issue? Please let @holylovenia @SamuelCahyawijaya @sabilmakbar know if you need any help.

bp-high commented 9 months ago

Sorry, I couldn't work on this last weekend due to the Christmas holidays and celebrations. I'll try to conclude this this weekend.

sabilmakbar commented 9 months ago

Thanks for the update, @bp-high! No rush on this; please take your time and enjoy your holiday!

github-actions[bot] commented 8 months ago

Hi, may I know if you are still working on this issue? Please let @holylovenia @SamuelCahyawijaya @sabilmakbar know if you need any help.

sabilmakbar commented 8 months ago

Hi @bp-high, may we know the update on this dataloader issue? It's been 3 weeks since the last poke from the SEACrowd stale-checker, and we might consider unassigning if there's no progress update in the next 24 hours.

github-actions[bot] commented 7 months ago

Hi, may I know if you are still working on this issue? Please let @holylovenia @SamuelCahyawijaya @sabilmakbar know if you need any help.

khelli07 commented 7 months ago

self-assign

khelli07 commented 7 months ago

Hi, I wanna ask about this.

On Kaggle, there are two sources for this dataset, namely (a) thai-government-corpus.csv and (b) thai-wikipedia-corpus.csv. Both have "article" and "text" columns. I assume both sources should be combined here. I have two questions:

  1. I'm a bit confused because the (a) dataset has a lot of similar values (see attached screenshot): should we still include these?

  2. For the seacrowd schema, do we need to concat it as "{}-{}".format(article, text) or just take the text column? If we concat, the article value in the (a) dataset is an integer, while in (b) it is a string. How should we process this? (compare the previous screenshot with the following one)

holylovenia commented 7 months ago


Hi @khelli07, I'm also not sure what the content is about since I don't understand Thai. May I ask for your suggestion on this dataset, @mrpeerat and @parinzee? 🙏

mrpeerat commented 7 months ago

  1. I looked at some samples and found that those are duplicate texts. Feel free to pick only one of them.
  2. Looks like the article column is the header of the Wikipedia article. Picking only the text column is fine.

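
The advice above (drop exact duplicates, keep only the "text" column so the mixed int/str "article" column never matters) could be sketched roughly like this. The two inline CSV strings are hypothetical miniature stand-ins for the real thai-government-corpus.csv and thai-wikipedia-corpus.csv files:

```python
import csv
import io

# Hypothetical miniature stand-ins for the two Kaggle CSV files;
# both share the "article" and "text" columns described in the thread.
gov_csv = "article,text\n1,ข่าว ประชา สัมพันธ์\n2,ข่าว ประชา สัมพันธ์\n"
wiki_csv = "article,text\nกรุงเทพ,กรุงเทพ เป็น เมืองหลวง\n"

def load_texts(raw_csv: str) -> list[str]:
    """Read a CSV string and keep only the 'text' column."""
    return [row["text"] for row in csv.DictReader(io.StringIO(raw_csv))]

# Combine both sources, then drop exact duplicate texts while
# preserving the original order.
seen, texts = set(), []
for text in load_texts(gov_csv) + load_texts(wiki_csv):
    if text not in seen:
        seen.add(text)
        texts.append(text)
```

With the toy inputs above, the duplicated government row collapses to one entry, leaving two unique texts.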
khelli07 commented 6 months ago

Hi, I want to ask again. For this dataset, do we count it as local or public? As far as I know, we have to log in to download the dataset, so even though it is accessible to everyone, a login is still required. Another option is the Kaggle API, but it is CLI-based (and of course, you still need to log in: https://github.com/Kaggle/kaggle-api).

holylovenia commented 6 months ago


Hi @khelli07, if it can be solved using CLI, could we make it _LOCAL = False and attach a guide on how to use it to the _DESCRIPTION like this?

khelli07 commented 6 months ago

The main code is done; I just haven't finished the metadata yet. I'll do it in the near future.