SEACrowd / seacrowd-datahub

A collaborative project to collect datasets in SEA languages, SEA regions, or SEA cultures.
Apache License 2.0

Create dataset loader for CodeSwitch-Reddit #356

Closed SamuelCahyawijaya closed 7 months ago

SamuelCahyawijaya commented 8 months ago

Dataloader name: codeswitch_reddit/codeswitch_reddit.py
DataCatalogue: http://seacrowd.github.io/seacrowd-catalogue/card.html?codeswitch_reddit

Dataset: codeswitch_reddit
Description: This corpus consists of monolingual English and multilingual (English and one other language) posts from country-specific subreddits, including r/indonesia and r/philippines for Southeast Asia. Posts were manually labeled according to whether or not they contain code-switching.
Subsets: eng_monolingual, cs
Languages: eng, ind, tgl
Tasks: Language Modeling
License: Unknown (unknown)
Homepage: https://github.com/ellarabi/CodeSwitch-Reddit?tab=readme-ov-file
HF URL: -
Paper URL: https://aclanthology.org/D19-1484.pdf

elyanah-aco commented 8 months ago

self-assign

elyanah-aco commented 8 months ago

@holylovenia @SamuelCahyawijaya @sabilmakbar

Do you think the LANGUAGE_IDENTIFICATION task can be implemented for this dataset? Texts in the cs subset contain English and one other language.

holylovenia commented 8 months ago

@holylovenia @SamuelCahyawijaya @sabilmakbar

Do you think the LANGUAGE_IDENTIFICATION task can be implemented for this dataset? Texts in the cs subset contain English and one other language.

Instead of language identification, I think code-switching identification is more appropriate. Could you please add it to constants.py?

Also, it seems that LANGUAGE_MODELING, as listed in the issue ticket, is not the intended task for this dataset. Should we change it to CODE_SWITCHING_IDENTIFICATION? What is your opinion on this, @SamuelCahyawijaya (as the datasheet reviewer) and @elyanah-aco (as the datasheet contributor)?
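
For concreteness, a minimal sketch of what registering the new task could look like, assuming constants.py defines a Tasks enum and (possibly) a task-to-schema mapping; the member names, string values, and mapping below are illustrative, not the repo's actual ones:

```python
from enum import Enum


class Tasks(Enum):
    # ... existing tasks (only a couple shown here for illustration) ...
    LANGUAGE_MODELING = "LM"
    # New task discussed in this thread; the value "CSI" is a placeholder.
    CODE_SWITCHING_IDENTIFICATION = "CSI"


# If constants.py keeps a task-to-schema mapping, the new task would also need an entry.
# Mapping it to the text_multi schema is an assumption based on the discussion below.
TASK_TO_SCHEMA = {
    Tasks.LANGUAGE_MODELING: "SSP",
    Tasks.CODE_SWITCHING_IDENTIFICATION: "TEXT_MULTI",
}
```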

elyanah-aco commented 8 months ago

@holylovenia Adding CODE_SWITCHING_IDENTIFICATION sounds good to me; that task wasn't available when I submitted the dataset. But I think the eng_monolingual subset may still fall under LANGUAGE_MODELING, as it just contains English comments from SEA subreddits.

holylovenia commented 8 months ago

@holylovenia Adding CODE_SWITCHING_IDENTIFICATION sounds good to me; that task wasn't available when I submitted the dataset. But I think the eng_monolingual subset may still fall under LANGUAGE_MODELING, as it just contains English comments from SEA subreddits.

Noted. I've added the "Code-switching Identification" task to the datasheet. Please feel free to reach out to me next time a new task needs to be added.

Thanks for clarifying, @elyanah-aco. Let's keep both the code-switching identification task and the language modeling task.

elyanah-aco commented 8 months ago

@holylovenia

I'll add the CODE_SWITCHING_IDENTIFICATION task using the text_multi schema with language codes as features; does that sound good?

I'm also running into an issue where I want to implement one schema for one subset but not for the other. In this case, I don't want to run seacrowd_ssp on the cs subset or seacrowd_text_multi on eng_monolingual. But currently, the unit tests run all schemas on all subsets. Is there any way to fix this?
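
For the schema itself, here is a minimal sketch of what a text_multi-style feature spec with language codes could look like; it builds the features with datasets.Features directly and the field names are placeholders, so the actual dataloader should use the official text_multi schema helper from seacrowd/utils/schemas.py:

```python
import datasets

# Illustrative feature spec for the cs subset: each post keeps its raw text plus
# the list of language codes it contains (e.g., ["eng", "ind"]).
features = datasets.Features(
    {
        "id": datasets.Value("string"),
        "text": datasets.Value("string"),
        "labels": datasets.Sequence(datasets.Value("string")),
    }
)
```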

SamuelCahyawijaya commented 8 months ago

@holylovenia @elyanah-aco: CMIIW, in this case code-switching identification refers to detecting whether a given sentence is code-switched or not, right? This is the first time I've seen this task; a more common framing would be sequence tagging to identify which span corresponds to which language. Personally, I'd prefer to keep it simple and just keep the LANGUAGE_MODELING task, since the other one is uncommon and there is no split defined for it. Nonetheless, I don't see any potential problem arising from implementing such a task, so we can implement both LANGUAGE_MODELING and CODE_SWITCHING_IDENTIFICATION.
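
To make the distinction concrete, here are two toy instances (made-up text, purely for illustration) contrasting the sentence-level framing described above with the more common sequence-tagging framing:

```python
# Sentence-level code-switching identification: one binary label per post.
sentence_level_example = {
    "text": "Grabe yung traffic today, took me two hours to get home.",
    "is_code_switched": True,
}

# Sequence tagging: each token is tagged with the language it belongs to.
token_level_example = {
    "tokens": ["Grabe", "yung", "traffic", "today", ",", "took", "me", "two", "hours"],
    "lang_tags": ["tgl", "tgl", "eng", "eng", "O", "eng", "eng", "eng", "eng"],
}
```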

I'm also running into an issue where I want to implement one schema for one subset but not for the other. In this case, I don't want to run seacrowd_ssp on the cs subset or seacrowd_text_multi on eng_monolingual. But currently, the unit tests run all schemas on all subsets. Is there any way to fix this?

Don't we implement seacrowd_ssp on the cs subset? You can use --subset_id if you have a subset that supports all the tasks; otherwise, I think you can test it manually using datasets.load_dataset(<dataset_name>, name=<config_name>). If you do so, please explain in the PR how to run the test.
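
For example, a manual check could look like this (the script path and config name are assumptions based on the usual SEACrowd layout and naming pattern, not the final names in the PR):

```python
from datasets import load_dataset

# Load one specific config of the local dataloader script to inspect it by hand.
ds = load_dataset(
    "seacrowd/sea_datasets/codeswitch_reddit/codeswitch_reddit.py",
    name="codeswitch_reddit_cs_seacrowd_text_multi",
)
print(ds)
print(ds["train"][0])  # split name may differ in the actual dataloader
```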

sabilmakbar commented 8 months ago

I think you can test it manually using datasets.load_dataset(<dataset_name>, name=<config_name>). If you do so, please explain in the PR how to run the test.

[screenshot of the unit test code]

Based on the test code, you can pass the --schema arg to make sure the correct schema is used for each config (i.e., the CODE_SWITCHING_IDENTIFICATION schema for the cs subset and the LANGUAGE_MODELING schema for eng_monolingual).

Or, we might consider not implementing eng_monolingual in SEACrowd for Language Modelling, since the language itself doesn't come from the SEA region.

holylovenia commented 8 months ago

Or, we might consider not implementing eng_monolingual in SEACrowd for Language Modelling, since the language itself doesn't come from the SEA region.

Here is a passage I extracted from the paper:

We observed that country-specific subreddits (e.g., r/greece and r/philippines) often contained posts both in English and in the language of the country specified (e.g., Greek and Tagalog, respectively). We thus restricted our extraction to all country-specific subreddits, except for countries with English as a national language, e.g., r/Australia.
...
We also compiled an additional dataset of English monolingual posts from the same country-specific subreddits as our code-switched corpus.

I think even eng_monolingual is (at least partly) from SEA speakers in this case, @sabilmakbar. Is there a way to separate data instances written in SEA subreddits from the rest of eng_monolingual, @elyanah-aco?

elyanah-aco commented 8 months ago

Is there a way to separate data instances written in SEA subreddits from the rest of eng_monolingual?

Yes, we can keep only the following subreddits: indonesia, Philippines, singapore.

Based on the discussion so far, I'll implement the source schema for both subsets, seacrowd_ssp for eng_monolingual, and seacrowd_text_multi for cs.
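
For reference, a rough sketch of how that plan could translate into the dataloader's BUILDER_CONFIGS plus the SEA-subreddit filter; the SEACrowdConfig fields, config names, versions, and the "Subreddit" column name are assumptions based on other SEACrowd dataloaders and the released CSVs, so they may need adjusting:

```python
import datasets

from seacrowd.utils.configs import SEACrowdConfig  # assumed import path, as used in other dataloaders

_DATASETNAME = "codeswitch_reddit"
_SOURCE_VERSION = "1.0.0"    # placeholder versions
_SEACROWD_VERSION = "1.0.0"

# Keep only posts from SEA country-specific subreddits in eng_monolingual.
_SEA_SUBREDDITS = {"indonesia", "Philippines", "singapore"}

# One config per (subset, schema) pair, so each subset only exposes the schema that applies to it.
BUILDER_CONFIGS = [
    SEACrowdConfig(
        name=f"{_DATASETNAME}_eng_monolingual_source",
        version=datasets.Version(_SOURCE_VERSION),
        description="CodeSwitch-Reddit eng_monolingual subset, source schema",
        schema="source",
        subset_id="eng_monolingual",
    ),
    SEACrowdConfig(
        name=f"{_DATASETNAME}_eng_monolingual_seacrowd_ssp",
        version=datasets.Version(_SEACROWD_VERSION),
        description="CodeSwitch-Reddit eng_monolingual subset, SEACrowd ssp schema",
        schema="seacrowd_ssp",
        subset_id="eng_monolingual",
    ),
    SEACrowdConfig(
        name=f"{_DATASETNAME}_cs_source",
        version=datasets.Version(_SOURCE_VERSION),
        description="CodeSwitch-Reddit cs subset, source schema",
        schema="source",
        subset_id="cs",
    ),
    SEACrowdConfig(
        name=f"{_DATASETNAME}_cs_seacrowd_text_multi",
        version=datasets.Version(_SEACROWD_VERSION),
        description="CodeSwitch-Reddit cs subset, SEACrowd text_multi schema",
        schema="seacrowd_text_multi",
        subset_id="cs",
    ),
]


def _is_sea_row(row: dict) -> bool:
    """Return True if an eng_monolingual row comes from a SEA subreddit."""
    # "Subreddit" is an assumed column name in the released CSVs; rename if it differs.
    return row.get("Subreddit") in _SEA_SUBREDDITS
```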

holylovenia commented 7 months ago

Is there a way to separate data instances written in SEA subreddits from the rest of eng_monolingual?

Yes, we can keep only the following subreddits: indonesia, Philippines, singapore.

Based on the discussion so far, I'll implement the source schema for both subsets, seacrowd_ssp for eng_monolingual, and seacrowd_text_multi for cs.

That's great then. Thanks a lot, @elyanah-aco!!