IndoNLP / nusa-crowd

A collaborative project to collect datasets in Indonesian languages.
Apache License 2.0
261 stars 62 forks

Create dataset loader for Indonesian Poem Tweets #214

Open SamuelCahyawijaya opened 2 years ago

SamuelCahyawijaya commented 2 years ago

NusaCatalogue: https://indonlp.github.io/nusa-catalogue/card.html?id_poem_tweets

Dataset id_poem_tweets
Description Indonesian Poem Tweets is a dataset crawled from Twitter. The purpose of this data is to create a text generation model for short texts and make sure the generated texts are all coherent and rhythmic.
License CC-BY 4.0

aliakbars commented 2 years ago

self-assign

bryanwilie commented 2 years ago

Hi @aliakbars, are you still working on this? If there's no reply, I will assume inactivity and free the assignment. Thanks!

aliakbars commented 2 years ago

Hi @bryanwilie, yes, I'm still working on this. I'll create the PR ASAP. Sorry for the delay.

bryanwilie commented 2 years ago

No worries @aliakbars, please take your time. Thank you for contributing!

aliakbars commented 2 years ago

I just did some exploratory data analysis. I found that the tweets come from only 6 users (they might be retweets). Also, the data is not filtered yet. Some of the tweets are replies, e.g.

"RT : Siap-siap"

or an image/video, e.g.

"RT : https://t.co/Z6Ls07s1bn"

Should we proceed with this?

It does have local languages, e.g. Sundanese, though.
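
The checks described above (counting distinct users, spotting "RT" prefixes, and spotting tweets that contain only a link) could be sketched roughly as follows. This is a minimal illustration, not the analysis that was actually run: the `user` and `text` field names, the sample rows, and the helper functions are all hypothetical, and the real id_poem_tweets files may be structured differently.

```python
import re
from collections import Counter

# Hypothetical rows; the field names "user" and "text" are assumptions,
# and the third row is a made-up placeholder, not real data.
SAMPLE = [
    {"user": "user_a", "text": "RT : Siap-siap"},
    {"user": "user_a", "text": "RT : https://t.co/Z6Ls07s1bn"},
    {"user": "user_b", "text": "contoh puisi pendek"},
]

# Matches a leading "RT :" or "RT @handle:" marker.
RT_PREFIX = re.compile(r"^RT\s*(@\w+)?\s*:\s*")
URL = re.compile(r"https?://\S+")


def is_retweet(text):
    """True if the tweet starts with an RT marker."""
    return RT_PREFIX.match(text) is not None


def is_media_only(text):
    """True if nothing but links remains after dropping the RT marker."""
    body = RT_PREFIX.sub("", text)
    return URL.sub("", body).strip() == ""


def summarize(rows):
    """Count tweets per user, retweets, and link-only tweets."""
    return {
        "users": Counter(row["user"] for row in rows),
        "retweets": sum(is_retweet(row["text"]) for row in rows),
        "media_only": sum(is_media_only(row["text"]) for row in rows),
    }


if __name__ == "__main__":
    print(summarize(SAMPLE))
```

A summary like this only quantifies the issue; whether to drop such rows in the loader or leave the data unfiltered is a separate decision.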

aliakbars commented 2 years ago

@bryanwilie What do you think about this issue?

aliakbars commented 2 years ago

@SamuelCahyawijaya @holylovenia

SamuelCahyawijaya commented 2 years ago

Hi @aliakbars, thank you for the update, and I apologize for the late reply. We plan to label the quality of all the datasets in NusaCatalogue later on, so for now we can push this one through first.