SamuelCahyawijaya closed this issue 7 months ago.
@holylovenia @SamuelCahyawijaya @sabilmakbar Do you think the `LANGUAGE_IDENTIFICATION` task can be implemented for this dataset? Texts in the `cs` dataset contain English and another language.
Instead of language identification, I think code-switching identification is more appropriate. Could you please add it to `constants.py`? Also, it seems like `LANGUAGE_MODELING`, as per the issue ticket, is not the intended task for this dataset. Should we change it to `CODE_SWITCHING_IDENTIFICATION`? What is your opinion on this, @SamuelCahyawijaya (as the datasheet reviewer) and @elyanah-aco (as the datasheet contributor)?
@holylovenia Adding `CODE_SWITCHING_IDENTIFICATION` sounds good to me; that task wasn't available when I submitted the dataset. But I think the `eng_monolingual` subset may still fall under `LANGUAGE_MODELING`, as it just contains English comments from SEA subreddits.
Noted. I've added the "Code-switching Identification" task to the datasheet. Please feel free to reach out to me next time if you need a new task added.
Thanks for clarifying, @elyanah-aco. Let's keep both the code-switching identification task and the language modeling task.
@holylovenia I'll add the `CODE_SWITCHING_IDENTIFICATION` task using the `text_multi` schema with language codes as features; does that sound good?

I'm also running into an issue where I want to implement one schema for one subset, but not for the other. In this case, I don't want to run `seacrowd_ssp` on the `cs` subset or `seacrowd_text_multi` on `eng_monolingual`. But currently, unit tests run all schemas on all subsets. Any way to fix this?
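For reference, here is a minimal sketch of what one example under the proposed `text_multi`-style setup might look like for the `cs` subset. The field names (`id`, `text`, `labels`) are illustrative assumptions, not the exact seacrowd schema definition:

```python
# Hypothetical shape of one code-switched example, where "labels" holds the
# language codes present in the post. Field names are assumptions for
# illustration, not the official seacrowd text_multi schema.
example = {
    "id": "cs-0",
    "text": "Grabe yung traffic today, took me two hours to get home.",
    "labels": ["eng", "tgl"],  # languages appearing in the post
}

# A code-switched post should carry more than one language code.
assert len(example["labels"]) > 1
print(example["labels"])
```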
@holylovenia @elyanah-aco: CMIIW, in this case code-switching identification refers to detecting whether a given sentence is code-switched or not, right? This is the first time I've seen this task; a more common one would be sequence tagging, identifying which span corresponds to which language. Personally, I'd prefer to keep it simple and just keep the `LANGUAGE_MODELING` task, since the other one is uncommon and there is no split defined for it. Nonetheless, I don't see any potential problem from implementing such a task, so we can implement both `LANGUAGE_MODELING` and `CODE_SWITCHING_IDENTIFICATION`.
> I'm also running into an issue where I want to implement one schema for one subset, but not for the other one. In this case, I don't want to run `seacrowd_ssp` on the `cs` subset and `seacrowd_text_multi` on `eng_monolingual`. But currently, unit tests run all schemas on all subsets. Any way to fix this?
Don't we implement `seacrowd_ssp` on the CS subset?

You can use `--subset_id` if you have a subset that supports all the tasks; otherwise, I think you can test it manually using `datasets.load_dataset(<dataset_name>, name=<config_name>)`. If you do so, please explain in the PR how to run the test.
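A sketch of how the per-subset configs could be enumerated for manual testing. The `<subset>_seacrowd_<schema>` naming pattern is an assumption about how the dataloader defines its configs, and the actual `load_dataset` calls are left commented out since they require network access:

```python
# Sketch: map each subset to the one seacrowd schema it should be tested with,
# so we only load matching (subset, schema) pairs instead of the full cross
# product that the unit tests would otherwise try.
SUBSET_TO_SCHEMA = {
    "cs": "text_multi",
    "eng_monolingual": "ssp",
}

def config_names(subset_to_schema):
    """Yield only the valid config names, e.g. 'cs_seacrowd_text_multi'."""
    for subset, schema in subset_to_schema.items():
        yield f"{subset}_seacrowd_{schema}"

names = list(config_names(SUBSET_TO_SCHEMA))
print(names)

# Manual loading would then be (not executed here; requires network):
# from datasets import load_dataset
# for name in names:
#     load_dataset("codeswitch_reddit", name=name)
```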
Based on the test code, you can pass the `--schema` arg to ensure the correct schema is used for each config (i.e., the `CODE_SWITCHING_IDENTIFICATION` schema for the `cs` subset and the `LANGUAGE_MODELING` schema for the `eng_monolingual` subset).
Or, we might consider not implementing `eng_monolingual` in SEACrowd for Language Modelling, since the language itself isn't from the SEA region.
Here is a passage I extracted from the paper:
> We observed that country-specific subreddits (e.g., r/greece and r/philippines) often contained posts both in English and in the language of the country specified (e.g., Greek and Tagalog, respectively). We thus restricted our extraction to all country-specific subreddits, except for countries with English as a national language, e.g., r/Australia.
>
> ...
>
> We also compiled an additional dataset of English monolingual posts from the same country-specific subreddits as our code-switched corpus.
I think even `eng_monolingual` is (at least partly) from SEA speakers in this case, @sabilmakbar. Is there a way to separate data instances written in SEA subreddits from the rest of `eng_monolingual`, @elyanah-aco?
> Is there a way to separate data instances written in SEA subreddits from the rest of `eng_monolingual`?

Yes, we can keep only the following subreddits: `indonesia`, `Philippines`, `singapore`.
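That filter could be as simple as the sketch below. The record layout (a `subreddit` key per row) is an assumption for illustration; note the match is case-sensitive, using the subreddit names exactly as listed:

```python
# Keep only posts from the three SEA country subreddits named above.
# The row layout ({"subreddit": ..., "text": ...}) is an illustrative
# assumption about the raw data, not the actual corpus format.
SEA_SUBREDDITS = {"indonesia", "Philippines", "singapore"}

def keep_sea_only(rows):
    return [row for row in rows if row["subreddit"] in SEA_SUBREDDITS]

rows = [
    {"subreddit": "Philippines", "text": "Any good hiking spots near Manila?"},
    {"subreddit": "greece", "text": "Ferry schedules to Santorini?"},
    {"subreddit": "singapore", "text": "Best hawker centre for laksa?"},
]
print(keep_sea_only(rows))  # drops the r/greece row
```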
Based on the discussion so far, I'll implement the `source` schema for both subsets, `seacrowd_ssp` for `eng_monolingual` and `seacrowd_text_multi` for `cs`.
That's great then. Thanks a lot, @elyanah-aco!!
Dataloader name: `codeswitch_reddit/codeswitch_reddit.py`
DataCatalogue: http://seacrowd.github.io/seacrowd-catalogue/card.html?codeswitch_reddit