Alibaba-NLP / ACE

[ACL-IJCNLP 2021] Automated Concatenation of Embeddings for Structured Prediction

Test PTB Dependency Parsing Model #47

Open woshiyyya opened 1 year ago

woshiyyya commented 1 year ago

Hi there!

I am trying to test your pretrained dependency parsing model, but I cannot find your processed PTB dataset. Could you share a link to it?

Also, how can I run inference on my own data? For example, how can I feed in one sentence and get its parsing result?

wangxinyu0922 commented 1 year ago

I have just uploaded the PTB dataset to OneDrive.

For inference, you may make a file like this (add dummy tags in the 7th, 8th, and 9th columns) and follow the instructions:

1\tBut\t_\t_\t_\t_\t_\t0\troot\t0:root
2\tI\t_\t_\t_\t_\t_\t0\troot\t0:root
3\tfound\t_\t_\t_\t_\t_\t0\troot\t0:root
4\tthe\t_\t_\t_\t_\t_\t0\troot\t0:root
5\tlocation\t_\t_\t_\t_\t_\t0\troot\t0:root
6\twonderful\t_\t_\t_\t_\t_\t0\troot\t0:root
7\tand\t_\t_\t_\t_\t_\t0\troot\t0:root
7.1\tfound\t_\t_\t_\t_\t_\t0\troot\t0:root
8\tthe\t_\t_\t_\t_\t_\t0\troot\t0:root
9\tneighbors\t_\t_\t_\t_\t_\t0\troot\t0:root
10\tvery\t_\t_\t_\t_\t_\t0\troot\t0:root
11\tkind\t_\t_\t_\t_\t_\t0\troot\t0:root
12\t.\t_\t_\t_\t_\t_\t0\troot\t0:root
woshiyyya commented 1 year ago

Hi Xinyu,

Thanks for uploading the data!

I created a folder named data and put a train.tsv file in it containing the demo case you provided.

I ran: CUDA_VISIBLE_DEVICES=0 python train.py --config config/ptb_parsing_model.yaml --parse --target_dir data --keep_order

But I still got an error:

2022-09-07 02:59:16,391 Reading data from /home/yunxuan2/.flair/datasets/ptb_3.3.0_modified
2022-09-07 02:59:16,391 Train: /home/yunxuan2/.flair/datasets/ptb_3.3.0_modified/train_modified.conllu
2022-09-07 02:59:16,391 Test: /home/yunxuan2/.flair/datasets/ptb_3.3.0_modified/test.conllu
2022-09-07 02:59:16,391 Dev: /home/yunxuan2/.flair/datasets/ptb_3.3.0_modified/dev.conllu
Traceback (most recent call last):
  File "train.py", line 85, in <module>
    config = ConfigParser(config,all=args.all,zero_shot=args.zeroshot,other_shot=args.other,predict=args.predict)
  File "/projects/clio1/probing/ACE/flair/config_parser.py", line 63, in __init__
    self.corpus: ListCorpus=self.get_corpus
  File "/projects/clio1/probing/ACE/flair/config_parser.py", line 329, in get_corpus
    current_dataset=getattr(datasets,corpus)(tag_to_bioes=self.target)
  File "/projects/clio1/probing/ACE/flair/datasets.py", line 360, in __init__
    train = UniversalDependenciesDataset(data_folder/'train_modified.conllu', in_memory=in_memory, add_root=True)
  File "/projects/clio1/probing/ACE/flair/datasets.py", line 1006, in __init__
    assert path_to_conll_file.exists()
AssertionError

Do you know how to fix that?

wangxinyu0922 commented 1 year ago

Have you checked whether the dataset is in the correct place?

lizhou21 commented 1 year ago

Hi Xinyu, is there something wrong with the data format provided? I found that the code token = Token(fields[1], head_id=int(fields[6])) raises ValueError: invalid literal for int() with base 10: '_'.

So I guess that, counting from 0, column 0 is the token id, column 1 is the token, columns 2-5 are "_", column 6 is 0 (dummy tag), column 7 is "_", column 8 is "root" (dummy tag), and column 9 is "0:root" (dummy tag).

Is that right?
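
A minimal sketch of the layout guessed above, assuming the unused columns hold "_" and that the parser reads the head from column 6 (0-indexed), as the quoted Token(...) call suggests. The function names are hypothetical and not part of ACE, and the maintainer has not confirmed this layout in the thread.

```python
def dummy_conllu_block(tokens):
    """Return one sentence as 10 tab-separated columns with dummy parse tags.

    Layout (0-indexed) follows the guess in this comment, not a confirmed spec:
    0 = token id, 1 = token, 2-5 = "_", 6 = "0", 7 = "_", 8 = "root", 9 = "0:root".
    """
    lines = []
    for token_id, form in enumerate(tokens, start=1):
        fields = [str(token_id), form, "_", "_", "_", "_", "0", "_", "root", "0:root"]
        lines.append("\t".join(fields))
    # A blank line terminates the sentence.
    return "\n".join(lines) + "\n\n"


def bad_head_lines(block):
    """Return 1-based line numbers whose column 6 cannot be parsed by int()."""
    bad = []
    for line_no, line in enumerate(block.splitlines(), start=1):
        if not line.strip():
            continue  # skip sentence separators
        fields = line.split("\t")
        try:
            int(fields[6])
        except (IndexError, ValueError):
            bad.append(line_no)
    return bad
```

Calling bad_head_lines on a file's contents before running train.py would flag any line that would trigger the ValueError above; a block produced by dummy_conllu_block yields an empty list.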

lizhou21 commented 1 year ago

After changing the data format, I hit the same AssertionError that woshiyyya reported above. Have you resolved it?

wangxinyu0922 commented 1 year ago

Have you ensured that the path /home/yunxuan2/.flair/datasets/ptb_3.3.0_modified/train_modified.conllu exists? If not, you may download the data above and put it at this path.
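
A quick way to check this before launching train.py is sketched below. It assumes the three file names printed in the log above (train_modified.conllu, dev.conllu, test.conllu); the helper name is hypothetical and not part of ACE or flair.

```python
from pathlib import Path

# File names taken from the "Train:", "Dev:" and "Test:" lines of the log.
EXPECTED_FILES = ("train_modified.conllu", "dev.conllu", "test.conllu")


def missing_corpus_files(data_folder, expected=EXPECTED_FILES):
    """Return the expected file names that are absent from data_folder."""
    folder = Path(data_folder).expanduser()
    return [name for name in expected if not (folder / name).is_file()]
```

For example, missing_corpus_files("/home/yunxuan2/.flair/datasets/ptb_3.3.0_modified") returns an empty list once all three files are in place; any name it returns would make the assert in flair/datasets.py fail.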

lizhou21 commented 1 year ago

Yes, I have done that, and I solved this problem: the target_dir also needs to contain dev/test datasets. Now I can parse the dataset on CPU (very slow), but it fails when I run it on GPU.

It shows:

Traceback (most recent call last):
  File "train.py", line 378, in <module>
    train_eval_result, train_loss = student.evaluate(loader,out_path=Path('outputs/train.'+'.'+tar_file_name+'.conllu'),embeddings_storage_mode="none",prediction_mode=True)
  File "/DM_parser/ACE/flair/models/dependency_model.py", line 1174, in evaluate
    arc_scores, rel_scores = self.forward(batch, prediction_mode=prediction_mode)
  File "/DM_parser/ACE/flair/models/dependency_model.py", line 597, in forward
    self.embeddings.embed(sentences,embedding_mask=self.selection)
  File "/DM_parser/ACE/flair/embeddings.py", line 185, in embed
    embedding.embed(sentences)
  File "/DM_parser/ACE/flair/embeddings.py", line 97, in embed
    self._add_embeddings_internal(sentences)
  File "/DM_parser/ACE/flair/embeddings.py", line 2960, in _add_embeddings_internal
    self._add_embeddings_to_sentences(sentences)
  File "/DM_parser/ACE/flair/embeddings.py", line 3155, in _add_embeddings_to_sentences
    sequence_output, pooled_output, hidden_states = self.model(input_ids, attention_mask=mask, inputs_embeds = inputs_embeds)
  File "/home/anaconda3/envs/ACE_parser/lib/python3.6/site-packages/torch/nn/modules/module.py", line 541, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/anaconda3/envs/ACE_parser/lib/python3.6/site-packages/transformers/modeling_bert.py", line 753, in forward
    input_ids=input_ids, position_ids=position_ids, token_type_ids=token_type_ids, inputs_embeds=inputs_embeds
  File "/home/anaconda3/envs/ACE_parser/lib/python3.6/site-packages/torch/nn/modules/module.py", line 541, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/anaconda3/envs/ACE_parser/lib/python3.6/site-packages/transformers/modeling_roberta.py", line 68, in forward
    input_ids, token_type_ids=token_type_ids, position_ids=position_ids, inputs_embeds=inputs_embeds
  File "/home/anaconda3/envs/ACE_parser/lib/python3.6/site-packages/transformers/modeling_bert.py", line 178, in forward
    inputs_embeds = self.word_embeddings(input_ids)
  File "/home/anaconda3/envs/ACE_parser/lib/python3.6/site-packages/torch/nn/modules/module.py", line 541, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/anaconda3/envs/ACE_parser/lib/python3.6/site-packages/torch/nn/modules/sparse.py", line 114, in forward
    self.norm_type, self.scale_grad_by_freq, self.sparse)
  File "/home/anaconda3/envs/ACE_parser/lib/python3.6/site-packages/torch/nn/functional.py", line 1484, in embedding
    return torch.embedding(weight, input, padding_idx, scale_grad_by_freq, sparse)
RuntimeError: Expected object of device type cuda but got device type cpu for argument #1 'self' in call to _th_index_select

I tried changing

sequence_output, pooled_output, hidden_states = self.model(input_ids, attention_mask=mask, inputs_embeds = inputs_embeds)

to

sequence_output, pooled_output, hidden_states = self.model(input_ids.cuda(), attention_mask=mask.cuda(), inputs_embeds = inputs_embeds)

but I still get the same error.

T T,

wangxinyu0922 commented 1 year ago

You may try to uncomment these lines https://github.com/Alibaba-NLP/ACE/blob/7033e91b5428bfbf33c75a4c81f2336f03115ed8/train.py#L226-L238

lizhou21 commented 1 year ago

Hi Xinyu, I have resolved the problem and successfully applied ACE to parse my data. Thanks for your help.