SEACrowd / seacrowd-datahub

A collaborative project to collect datasets in SEA languages, SEA regions, or SEA cultures.
Apache License 2.0
55 stars 54 forks source link

Closes #536 | Add/Update Dataloader Onto4All #635

Open patrickamadeus opened 3 months ago

patrickamadeus commented 3 months ago

Closes #536

Checkbox

TESTS

image

NOTES

Please use huggingface-cli or insert your API_KEY to token= parameter in load_dataset method since this is a gated dataset :smile:

patrickamadeus commented 1 month ago

Hi @patrickamadeus, thank you for contributing! I found two problems in this dataloader:

  1. Somehow the conversation data is empty on both _source and _seacrowd_* schemas
Screenshot 2024-05-02 at 8 45 56 AM
  1. For the _seacrowd_* schema, this is the first time we use QA for chat data. It seems to fit well with the data, perhaps @holylovenia have any feedback on this?

Hi @SamuelCahyawijaya ! Thank you for the review 😄 .

Please kindly check the latest commit for the fix.

holylovenia commented 1 month ago
  1. For the _seacrowd_* schema, this is the first time we use QA for chat data. It seems to fit well with the data, perhaps @holylovenia have any feedback on this?

Sorry for the late reply, I missed this mention. Is this supposedly to accommodate a multi-turn chat template with the user, assistant, and system roles?

While this qa schema seems to fit the dataset well, I think it's better if this dataloader has a different task (e.g., MULTI_TURN_CONVERSATION) with a new schema (with a messages variable like this) to facilitate similar datasets in the future. It will prevent this dataset from being overlooked as another QA task too.

What do you think, @patrickamadeus @SamuelCahyawijaya @yongzx?

cc: @sabilmakbar

SamuelCahyawijaya commented 1 month ago

@holylovenia @patrickamadeus @yongzx @sabilmakbar @patrickamadeus : I kinda agree with the chat format as it is more standardized and also supported in the HuggingFace. In this case, should we propose the new schema and adjust the score accordingly?

the schema would be basically consists of input, output, and meta.

One question though, should we also normalize the <ROLE>? Like in this dataset, it use system, human, and gpt. Should it be standardized into something like system, user, and assistant or we keep it as is?

holylovenia commented 1 month ago

should we propose the new schema and adjust the score accordingly?

I'm of this opinion.

the schema would be basically consists of input, output, and meta.

  • input would be in a form of list of dictionary {"role": "<ROLE>", "content": "<CONTENT>" }
  • output would be the expected response of the model, in this case it would be the last turn of conversation from gpt
  • meta can be used for storing other information, like type in this case.

One question though, should we also normalize the <ROLE>? Like in this dataset, it use system, human, and gpt. Should it be standardized into something like system, user, and assistant or we keep it as is?

Let's normalize it for the seacrowd schema.

SamuelCahyawijaya commented 1 month ago

@patrickamadeus : would it be ok for you to create the new schema, and adjust the dataloader accordingly?

patrickamadeus commented 1 month ago

With pleasure @SamuelCahyawijaya !

@holylovenia @patrickamadeus @yongzx @sabilmakbar @patrickamadeus : I kinda agree with the chat format as it is more standardized and also supported in the HuggingFace. In this case, should we propose the new schema and adjust the score accordingly?

the schema would be basically consists of input, output, and meta.

  • input would be in a form of list of dictionary {"role": "<ROLE>", "content": "<CONTENT>" }
  • output would be the expected response of the model, in this case it would be the last turn of conversation from gpt
  • meta can be used for storing other information, like type in this case.

One question though, should we also normalize the <ROLE>? Like in this dataset, it use system, human, and gpt. Should it be standardized into something like system, user, and assistant or we keep it as is?

I will refer the schema from here for now.

patrickamadeus commented 1 month ago

Hi @SamuelCahyawijaya @holylovenia !

Could you please review the new schema and implementation? I named it chat feature for now, feel free to suggest any change!

holylovenia commented 1 month ago

Hi @SamuelCahyawijaya @holylovenia !

Could you please review the new schema and implementation? I named it chat feature for now, feel free to suggest any change!

The schema looks great to me! Let us know if you've separated the schema and new task so we can approve it.

patrickamadeus commented 1 month ago

It's done @SamuelCahyawijaya @holylovenia .

holylovenia commented 1 month ago

It's done @SamuelCahyawijaya @holylovenia .

Could you please link the PR for the new schema and task here, @patrickamadeus?

cc: @sabilmakbar because I'll put more focus on the experiments going forward.

patrickamadeus commented 1 month ago

Oops, sorry!

I put them altogether here in the last commit 😬 @holylovenia .

Should I create a separate PR for it? Sorry for my ignorance.

holylovenia commented 1 month ago

Oops, sorry!

I put them altogether here in the last commit 😬 @holylovenia .

Should I create a separate PR for it? Sorry for my ignorance.

Yes, it'd be great if we could have a separate PR. Thanks in advance, @patrickamadeus!!

patrickamadeus commented 1 month ago

Hi! here is the chat schema PR #679 @sabilmakbar @holylovenia

sabilmakbar commented 1 month ago

Quick question: What's the difference between using this new chat schema and TOD (since we already have it)? If I remember correctly, TOD is a multi-turn dialogue too. Hence, both should be similar in terms of schema.

holylovenia commented 1 month ago

Quick question: What's the difference between using this new chat schema and TOD (since we already have it)? If I remember correctly, TOD is a multi-turn dialogue too. Hence, both should be similar in terms of schema.

TOD relies on belief state and system act apart from the utterances. In practice, most TOD works are a derivative of or follow the WOZ dataset's style, so it would be better to keep that schema for TOD.

holylovenia commented 1 month ago

Hi @patrickamadeus, I would like to let you know that we plan to finalize the calculation of the open contributions (e.g., dataloader implementations) in 31 hours, so it'd be great if we could wrap up the reviewing and merge this PR before then.

cc: @yongzx @SamuelCahyawijaya @sabilmakbar

holylovenia commented 15 hours ago

Hi @patrickamadeus, thank you for contributing to SEACrowd! I would like to let you know that we are still looking forward to completing this PR (and dataloader issues) and maintaining SEACrowd Data Hub. We hope to enable access to as many standardized dataloaders as possible for SEA datasets. ☺️

Feel free to continue the PR whenever you're available, and if you would like to re-assign this dataloader to someone else, just let us know and we can help. 💪

Thanks again!

cc: @yongzx @SamuelCahyawijaya @sabilmakbar