patrickamadeus opened 3 months ago
Hi @patrickamadeus, thank you for contributing! I found two problems in this dataloader:
- Somehow the conversation data is empty on both the `_source` and `_seacrowd_*` schemas.
- For the `_seacrowd_*` schema, this is the first time we use QA for chat data. It seems to fit the data well; perhaps @holylovenia has some feedback on this?
Hi @SamuelCahyawijaya! Thank you for the review 😄.
Please kindly check the latest commit for the fix.
> For the `_seacrowd_*` schema, this is the first time we use QA for chat data. It seems to fit the data well; perhaps @holylovenia has some feedback on this?
Sorry for the late reply, I missed this mention. Is this supposed to accommodate a multi-turn chat template with the `user`, `assistant`, and `system` roles?
While this `qa` schema seems to fit the dataset well, I think it would be better for this dataloader to have a different task (e.g., `MULTI_TURN_CONVERSATION`) with a new schema (with a `messages` variable like this) to facilitate similar datasets in the future. It would also prevent this dataset from being overlooked as just another QA task.

What do you think, @patrickamadeus @SamuelCahyawijaya @yongzx?

cc: @sabilmakbar
@holylovenia @patrickamadeus @yongzx @sabilmakbar: I kind of agree with the chat format, as it is more standardized and also supported in Hugging Face. In this case, should we propose the new schema and adjust the score accordingly?

The schema would basically consist of `input`, `output`, and `meta`:
- `input` would be a list of dictionaries of the form `{"role": "<ROLE>", "content": "<CONTENT>"}`
- `output` would be the expected response of the model; in this case, the last turn of the conversation from `gpt`
- `meta` can be used to store other information, like `type` in this case.

One question though: should we also normalize the `<ROLE>` values? This dataset uses `system`, `human`, and `gpt`. Should they be standardized into something like `system`, `user`, and `assistant`, or should we keep them as is?
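The proposed split into `input`, `output`, and `meta` can be sketched as a small helper. This is only an illustration of the mapping described above; the function name `to_chat_example` and the sample data are hypothetical, not part of the actual dataloader:

```python
# Hypothetical sketch: map one raw conversation (with this dataset's
# `system`, `human`, and `gpt` roles) into the proposed schema.
# All turns except the final `gpt` reply become `input`, the final
# `gpt` turn becomes `output`, and extra fields go into `meta`.
def to_chat_example(conversation, extra):
    *context, last = conversation
    assert last["role"] == "gpt", "expected the final turn to come from gpt"
    return {
        "input": [{"role": t["role"], "content": t["content"]} for t in context],
        "output": last["content"],
        "meta": extra,
    }

raw = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "human", "content": "What is the capital of Indonesia?"},
    {"role": "gpt", "content": "The capital of Indonesia is Jakarta."},
]
example = to_chat_example(raw, {"type": "open_qa"})
```

Here `example["input"]` keeps the first two turns, while `example["output"]` holds the final `gpt` reply.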
> should we propose the new schema and adjust the score accordingly?

I'm of this opinion.
> the schema would basically consist of `input`, `output`, and `meta`:
> - `input` would be a list of dictionaries of the form `{"role": "<ROLE>", "content": "<CONTENT>"}`
> - `output` would be the expected response of the model; in this case, the last turn of the conversation from `gpt`
> - `meta` can be used to store other information, like `type` in this case.
>
> One question though: should we also normalize the `<ROLE>` values? This dataset uses `system`, `human`, and `gpt`. Should they be standardized into something like `system`, `user`, and `assistant`, or should we keep them as is?
Let's normalize it for the `seacrowd` schema.
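A minimal sketch of that normalization, assuming the mapping discussed above (`human` → `user`, `gpt` → `assistant`); the names `ROLE_MAP` and `normalize_roles` are hypothetical, not an existing SEACrowd API:

```python
# Hypothetical mapping from this dataset's native role names to the
# more common system / user / assistant convention.
ROLE_MAP = {"system": "system", "human": "user", "gpt": "assistant"}

def normalize_roles(turns):
    # Rewrite each turn's role, leaving the content unchanged.
    return [{"role": ROLE_MAP[t["role"]], "content": t["content"]} for t in turns]

turns = [
    {"role": "human", "content": "Hello!"},
    {"role": "gpt", "content": "Hi, how can I help?"},
]
normalized = normalize_roles(turns)
```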
@patrickamadeus : would it be ok for you to create the new schema, and adjust the dataloader accordingly?
With pleasure @SamuelCahyawijaya !
> the schema would basically consist of `input`, `output`, and `meta` [...]
I will use the schema from here as a reference for now.
Hi @SamuelCahyawijaya @holylovenia!
Could you please review the new schema and implementation? I named it the `chat` feature for now; feel free to suggest any changes!
> Could you please review the new schema and implementation? I named it the `chat` feature for now; feel free to suggest any changes!
The schema looks great to me! Let us know once you've separated the schema and the new task so we can approve it.
It's done, @SamuelCahyawijaya @holylovenia.
Could you please link the PR for the new schema and task here, @patrickamadeus?
cc: @sabilmakbar because I'll put more focus on the experiments going forward.
Oops, sorry!
I put them altogether here in the last commit 😬 @holylovenia .
Should I create a separate PR for it? Sorry for my ignorance.
> Should I create a separate PR for it? Sorry for my ignorance.
Yes, it'd be great if we could have a separate PR. Thanks in advance, @patrickamadeus!!
Hi! Here is the `chat` schema PR: #679 @sabilmakbar @holylovenia
Quick question: what's the difference between using this new `chat` schema and TOD (since we already have it)? If I remember correctly, TOD is a multi-turn dialogue too, so both should be similar in terms of schema.
> Quick question: what's the difference between using this new `chat` schema and TOD (since we already have it)?
TOD relies on the belief state and system act in addition to the utterances. In practice, most TOD works are derivatives of, or follow the style of, the WOZ dataset, so it would be better to keep that schema for TOD.
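To illustrate the distinction: a WOZ-style TOD turn carries structured annotations beyond the raw text, while the proposed `chat` schema only keeps roles and contents. The field names and values below are illustrative examples, not SEACrowd's actual TOD schema:

```python
# Illustrative only: a WOZ-style TOD exchange annotates each turn with a
# belief state and a system act, which the plain chat schema does not carry.
tod_turn = {
    "user_utterance": "I need a cheap hotel in the city centre.",
    "belief_state": {"hotel-pricerange": "cheap", "hotel-area": "centre"},
    "system_act": {"inform": ["name"], "request": ["stars"]},
    "system_utterance": "Alpha Lodge is available. How many stars?",
}

# The chat schema keeps only the surface conversation.
chat_turns = [
    {"role": "user", "content": tod_turn["user_utterance"]},
    {"role": "assistant", "content": tod_turn["system_utterance"]},
]
```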
Hi @patrickamadeus, I would like to let you know that we plan to finalize the calculation of the open contributions (e.g., dataloader implementations) in 31 hours, so it'd be great if we could wrap up the reviewing and merge this PR before then.
cc: @yongzx @SamuelCahyawijaya @sabilmakbar
Hi @patrickamadeus, thank you for contributing to SEACrowd! I would like to let you know that we are still looking forward to completing this PR (and dataloader issues) and maintaining SEACrowd Data Hub. We hope to enable access to as many standardized dataloaders as possible for SEA datasets. ☺️
Feel free to continue the PR whenever you're available, and if you would like to re-assign this dataloader to someone else, just let us know and we can help. 💪
Thanks again!
cc: @yongzx @SamuelCahyawijaya @sabilmakbar
Closes #536
Checkbox
- [ ] Create the dataloader script `seacrowd/sea_datasets/{my_dataset}/{my_dataset}.py` (please use only lowercase and underscore for dataset folder naming, as mentioned in the dataset issue) and its `__init__.py` within the `{my_dataset}` folder.
- [ ] Provide values for the `_CITATION`, `_DATASETNAME`, `_DESCRIPTION`, `_HOMEPAGE`, `_LICENSE`, `_LOCAL`, `_URLs`, `_SUPPORTED_TASKS`, `_SOURCE_VERSION`, and `_SEACROWD_VERSION` variables.
- [ ] Implement `_info()`, `_split_generators()`, and `_generate_examples()` in the dataloader script.
- [ ] Make sure that the `BUILDER_CONFIGS` class attribute is a list with at least one `SEACrowdConfig` for the source schema and one for a seacrowd schema.
- [ ] Confirm that the dataloader script works with the `datasets.load_dataset` function.
- [ ] Run `python -m tests.test_seacrowd seacrowd/sea_datasets/<my_dataset>/<my_dataset>.py` or `python -m tests.test_seacrowd seacrowd/sea_datasets/<my_dataset>/<my_dataset>.py --subset_id {subset_name_without_source_or_seacrowd_suffix}`.
.TESTS
NOTES
Please use `huggingface-cli` or pass your API key to the `token=` parameter of the `load_dataset` method, since this is a gated dataset :smile: