SEACrowd / seacrowd-datahub

A collaborative project to collect datasets in SEA languages, SEA regions, or SEA cultures.
Apache License 2.0

Create dataset loader for NUS SMS Corpus #221

Open SamuelCahyawijaya opened 6 months ago

SamuelCahyawijaya commented 6 months ago

Dataloader name: nus_sms_corpus/nus_sms_corpus.py
DataCatalogue: http://seacrowd.github.io/seacrowd-catalogue/card.html?nus_sms_corpus

Dataset: nus_sms_corpus
Description: This is a corpus of SMS (Short Message Service) messages collected for research at the Department of Computer Science at the National University of Singapore. This dataset consists of 67,093 SMS messages taken from the corpus on Mar 9, 2015. The messages largely originate from Singaporeans and mostly from students attending the University. These messages were collected from volunteers who were made aware that their contributions were going to be made publicly available. The data collectors opportunistically collected as much metadata about the messages and their senders as possible, so as to enable different types of analyses.
Subsets: English, Mandarin Chinese
Languages: eng, cmn
Tasks: Language Modeling
License: Unknown (unknown)
Homepage: https://github.com/kite1988/nus-sms-corpus
HF URL: -
Paper URL: https://link.springer.com/article/10.1007/s10579-012-9197-9

reynardryanda commented 6 months ago

self-assign

github-actions[bot] commented 6 months ago

Hi, may I know if you are still working on this issue? Please let @holylovenia @SamuelCahyawijaya @sabilmakbar know if you need any help.

sabilmakbar commented 5 months ago

Hi @reynardryanda, may we know the update on this dataloader issue? It's been 3 weeks since the last poke from the SEACrowd stale-checker, and we might consider unassigning if there's no progress update in the next 24 hours.

github-actions[bot] commented 5 months ago

Hi @, may I know if you are still working on this issue? Please let @holylovenia @SamuelCahyawijaya @sabilmakbar know if you need any help.

reynardryanda commented 4 months ago

I will try to create a PR by this weekend or next week if that's okay with you guys. So sorry for taking so long.

reynardryanda commented 4 months ago

Hello @sabilmakbar or maybe @holylovenia, I think this corpus does not have a clear downstream task. The paper also concluded that the corpus may need further annotation before it can be used in other projects. Any suggestions? Please also check the sample data, in case I am wrong.

holylovenia commented 4 months ago

Hi @reynardryanda, sorry I missed your comment. We can use Tasks.LANGUAGE_MODELING and the ssp schema for unlabeled data like this.

Here's the link to constants.py just in case you want to take a look at other tasks and schemas available.
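
For anyone picking this up, here is a rough sketch of what treating the corpus as unlabeled text could look like. It does not import SEACrowd; it assumes the ssp schema holds only an `id` and a `text` field, and `to_ssp_examples` plus the sample messages are hypothetical placeholders, not taken from the corpus or the codebase.

```python
from typing import Dict, Iterable, Iterator, Tuple

# Assumption: an ssp-style example carries only an id and the raw text.
SSP_FIELDS = ("id", "text")


def to_ssp_examples(messages: Iterable[str]) -> Iterator[Tuple[int, Dict[str, str]]]:
    """Yield (key, example) pairs for unlabeled messages (hypothetical helper)."""
    for idx, text in enumerate(messages):
        # No label field: the message text itself is the only content.
        yield idx, {"id": str(idx), "text": text}


# Made-up placeholder messages, not real corpus entries.
sample = ["Hello, where are you?", "On the way, see you soon"]
examples = list(to_ssp_examples(sample))
```

In an actual dataloader the same loop would presumably sit inside the example generator, with the config's task set to the language modeling constant mentioned above.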

akhdanfadh commented 3 months ago

@reynardryanda may we know if you are still working on this issue? It has already been one month since your last update.

holylovenia commented 3 months ago

I removed @reynardryanda's assignment due to the lack of response. Anyone can take this dataloader now.

akhdanfadh commented 3 months ago

self-assign