The format of Multiwoz dataset

amazon-science / tanl

Structured Prediction as Translation between Augmented Natural Languages

Apache License 2.0

The format of Multiwoz dataset #2

Closed. yanzhangnlp closed this issue 2 years ago

yanzhangnlp commented 3 years ago

Hi Giovanni,

Nice work, and thanks for sharing. I am trying to reproduce the results of the DST task. However, the MultiWOZ 2.1 data produced by the preprocessing script from https://github.com/jasonwu0731/trade-dst does not match the format your code expects. Did you apply any additional preprocessing? If so, would you mind sharing the script?

Sincerely, Yan

iambabao commented 3 years ago

I have the same problem with the ACE 05 NER dataset.

I downloaded the ACE 05 NER dataset from the link provided in datasets.py and renamed the files to {split}.ner.json, but it does not work :(

Magolor commented 3 years ago

@iambabao

Yes, but I believe modifying it by simply adding:

if 'label' not in x:
    x['label'] = {
        x['entity_label']: x['span_position'],
    }

could work.
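
For anyone who would rather patch the data files than the loader, here is a minimal sketch of the same idea applied to a whole file. It assumes each {split}.ner.json file is a JSON list of dicts carrying entity_label and span_position keys, which is inferred only from the snippet above. The names add_label_field and patch_file are hypothetical helpers and are not part of tanl.

```python
import json
from pathlib import Path


def add_label_field(examples):
    """Add a 'label' dict to every example that lacks one (hypothetical helper).

    Assumes each example is a dict with 'entity_label' and 'span_position' keys.
    """
    for x in examples:
        if 'label' not in x:
            x['label'] = {x['entity_label']: x['span_position']}
    return examples


def patch_file(path):
    """Rewrite one JSON file in place; assumes the file holds a list of dicts."""
    path = Path(path)
    examples = json.loads(path.read_text(encoding='utf-8'))
    patched = add_label_field(examples)
    path.write_text(json.dumps(patched, ensure_ascii=False), encoding='utf-8')
```

Examples that already have a label key are left untouched, so running the patch twice should be harmless.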

However, @giove91, please add links to all the datasets used in tanl where available. Most of the datasets reported in the paper and defined in datasets.py currently come without an acquisition method, preprocessing scripts, or instructions. I would really appreciate it if you could complete them.

giove91 commented 3 years ago

Hi, thanks for your interest in this project!

@yanzhangnlp We added the instructions to process the Multiwoz dataset (thanks @jasonkrone). Hope this helps!

@iambabao Apparently the version I downloaded from that link is not available anymore (it is different from the version that can be currently downloaded). Thanks @Magolor for providing a possible fix. I'll check and update the instructions.

MerrickWang1 commented 3 years ago

Hi,

The data files provided for the ACE2005 dataset are of .test, .train, and .dev file types. @iambabao how did you obtain .json files?

Here is where I am attempting to obtain the ACE2005 data: https://github.com/ShannonAI/mrc-for-flat-nested-ner/blob/master/ner2mrc/download.md https://drive.google.com/file/d/1iodaJ92dTAjUWnkMyYm8aLEi5hj3cseY/view

Thanks,

iambabao commented 3 years ago

The files are in JSON format; you can directly rename them.
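
A rough sketch of that renaming step, under a few assumptions: the downloaded directory contains exactly one file per split whose name ends in .train, .dev, or .test, and the target name is {split}.ner.json as mentioned earlier in this thread. The function copy_splits is a hypothetical helper, and it copies rather than renames so the original download stays intact.

```python
import shutil
from pathlib import Path

SPLITS = ('train', 'dev', 'test')


def copy_splits(src_dir, dst_dir):
    """Copy each '*.{split}' file in src_dir to '{split}.ner.json' in dst_dir."""
    src_dir, dst_dir = Path(src_dir), Path(dst_dir)
    dst_dir.mkdir(parents=True, exist_ok=True)
    copied = {}
    for split in SPLITS:
        # Expect exactly one file per split, e.g. a name ending in '.train'.
        matches = sorted(src_dir.glob(f'*.{split}'))
        if len(matches) != 1:
            raise FileNotFoundError(
                f'expected exactly one *.{split} file, found {len(matches)}'
            )
        target = dst_dir / f'{split}.ner.json'
        shutil.copyfile(matches[0], target)
        copied[split] = target
    return copied
```

The function raises if a split is missing or ambiguous, which seemed safer than silently picking a file.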

David-Lee-1990 commented 2 years ago

Hey guys, after preprocessing the ACE2005 NER dataset following the guidance above and running TANL, I get F1 = 88.3 (the TANL paper reports 84.9). Is there a bug, or something else going on?

giove91 commented 2 years ago

Interesting! Are the splits correct and have you used the same hyperparameters as in the paper? (50 epochs, initial learning rate 0.0005, ...)