HITsz-TMG / GEMEL

Official implementation of our LREC-COLING 2024 paper "Generative Multimodal Entity Linking".
https://arxiv.org/abs/2306.12725

How to create prefix_tree_opt.pkl for new dataset? #3

Open zhiweihu1103 opened 6 months ago

zhiweihu1103 commented 6 months ago

Hi folks, nice work! Could you explain how to create prefix_tree_opt.pkl for a new dataset? I want to apply this approach to another task.

Senbao-Shi commented 6 months ago

Hi, thank you for your question.

We use the constructor of the Trie class to build the prefix tree. Make sure every title sequence begins with the 'bos' token and ends with the 'eos' token.

import pickle

from trie import Trie

# tittle_ids: a list of token-id sequences, one per entity title
tree = Trie(tittle_ids)
prefix_tree_dict = tree.trie_dict
with open(prefix_tree_file, 'wb') as f:
    pickle.dump(prefix_tree_dict, f)
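If trie.py follows GENRE's implementation (an assumption; check the repo's own file), the Trie is essentially a nested dict keyed by token ids. A minimal self-contained sketch of the same idea, with made-up toy ids:

```python
class MiniTrie:
    """Toy prefix tree over token-id sequences (illustration only;
    the repo's trie.Trie may differ in details)."""
    def __init__(self, sequences):
        self.trie_dict = {}
        for seq in sequences:
            node = self.trie_dict
            for tok in seq:
                node = node.setdefault(tok, {})

    def get(self, prefix):
        # Allowed next tokens after the given prefix.
        node = self.trie_dict
        for tok in prefix:
            node = node.get(tok, {})
        return list(node.keys())

# Toy ids: 0 = bos, 2 = eos; 'Joe Biden' -> [5, 6], 'Paris' -> [7]
tree = MiniTrie([[0, 5, 6, 2], [0, 7, 2]])
assert tree.get([0]) == [5, 7]   # after bos, either entity may start
assert tree.get([0, 5]) == [6]   # 'Joe' can only continue as 'Biden'
```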
zhiweihu1103 commented 6 months ago

Thanks for your quick reply. If I have a list of new entity names, would you mind telling me how to get the tittle_ids for Trie(tittle_ids)?

Senbao-Shi commented 6 months ago

Just tokenize all the new entity names and get the input_ids as follows:

tittle_ids = [tokenizer(t)['input_ids'] for t in tittle]
zhiweihu1103 commented 6 months ago

Which model do I need as the tokenizer? If OPT, I need the OPT tokenizer; if Llama, the Llama tokenizer. Right?

Senbao-Shi commented 6 months ago

yep

zhiweihu1103 commented 6 months ago

Thx, let me try.

zhiweihu1103 commented 6 months ago

Hi, a further question. If I have the entity 'Joe Biden', the input to Trie should be 'eos Joe Biden eos', right? That is, I need to add eos tokens, with spaces separating them from 'Joe Biden'.

Senbao-Shi commented 6 months ago

Yes, you can use this method, and you can test it with a small amount of data to see if it performs well.
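At the id level, one way to do that formatting is to wrap each title's token ids with the special-token ids before building the trie. A sketch with toy ids; in real use, read the actual ids from tokenizer.bos_token_id / tokenizer.eos_token_id:

```python
# Toy special-token ids standing in for the tokenizer's real
# bos/eos ids (assumption: 0 = bos, 2 = eos).
BOS, EOS = 0, 2

def wrap_title(ids):
    # Ensure every title sequence starts with bos and ends with eos,
    # as required for the prefix tree.
    return [BOS] + list(ids) + [EOS]

assert wrap_title([5, 6]) == [0, 5, 6, 2]
```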

zhiweihu1103 commented 6 months ago

A further question: why do you pad on the right for training, but on the left for dev and test?
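(Not the authors, but for context: the usual reason is that decoder-only LMs append new tokens after the last position, so batched generation needs left padding, while right padding is fine during training because loss at pad positions is masked. A toy illustration:)

```python
PAD = 0  # toy pad-token id
seqs = [[5, 6, 7], [8, 9]]
width = max(len(s) for s in seqs)
right_padded = [s + [PAD] * (width - len(s)) for s in seqs]
left_padded = [[PAD] * (width - len(s)) + s for s in seqs]

# With right padding, the shorter sequence ends in <pad>, so
# generation would continue from a pad token:
assert right_padded[1][-1] == PAD
# With left padding, every sequence ends in a real token:
assert all(s[-1] != PAD for s in left_padded)
```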

zhiweihu1103 commented 6 months ago

Would you mind providing the entity_list you used to create the tree.pkl file? I need to test whether my code is correct.

zhiweihu1103 commented 6 months ago

Let me give more details on how I generated tree_opt.pkl; I hope I can get your help. First, some context on my data:

I use the following code to generate the tree_opt.pkl file:

import pickle
import json

from trie import Trie
from transformers import AutoTokenizer

def create_trie_pkl(data_path, tokenizer_path, output_path):
    entity_name_list = []
    with open(data_path, 'r') as file:
        data = json.load(file)
    for single_data in data:
        entity_name_list.append(single_data['entity_name'])

    tokenizer = AutoTokenizer.from_pretrained(tokenizer_path, use_fast=False)
    tittle_ids = [tokenizer(t)['input_ids'] for t in entity_name_list]
    tree = Trie(tittle_ids)
    prefix_tree_dict = tree.trie_dict
    with open(output_path, 'wb') as f:
        pickle.dump(prefix_tree_dict, f)

if __name__ == '__main__':
    data_path = './kb_entity.json'
    tokenizer_path = './opt-1.3b'
    output_path = './tree_opt.pkl'
    create_trie_pkl(data_path, tokenizer_path, output_path)

However, the generated tree_opt.pkl file is very small (I have uploaded it here as well). Your prefix_tree_opt.pkl is 209 MB, but the tree_opt.pkl I generated is only 2.9 MB. I don't understand how your prefix_tree_opt.pkl was generated. In particular, what exactly does the entity_name_list you used look like? I don't know what went wrong in between. Looking forward to your reply.

zhiweihu1103 commented 6 months ago

If it's convenient, can you leave a WeChat ID? Thank you for your help.

zhiweihu1103 commented 6 months ago

Sorry, I have solved this problem; I will close this issue.

Senbao-Shi commented 6 months ago

Sorry for the late response. We have provided guidelines on how to build and use a prefix tree for constrained decoding.
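For reference, constrained decoding with such a trie is typically wired through the `prefix_allowed_tokens_fn` argument of Hugging Face's `generate()`; how GEMEL does this exactly is in the repo, so the wrapper below is only a hedged sketch over a toy trie dict:

```python
def make_prefix_fn(trie_dict):
    """Return a prefix_allowed_tokens_fn for model.generate():
    given the ids decoded so far, allow only the children of that
    prefix in the trie."""
    def prefix_fn(batch_id, input_ids):
        # generate() passes a tensor; plain lists also work here.
        ids = input_ids.tolist() if hasattr(input_ids, 'tolist') else list(input_ids)
        node = trie_dict
        for tok in ids:
            node = node.get(tok, {})
        return list(node.keys())
    return prefix_fn

# Toy trie: bos(0) -> 5 -> 6 -> eos(2), and bos(0) -> 7 -> eos(2)
trie_dict = {0: {5: {6: {2: {}}}, 7: {2: {}}}}
fn = make_prefix_fn(trie_dict)
assert fn(0, [0, 5]) == [6]
# In real use (assumption: a causal LM loaded via transformers):
# out = model.generate(**inputs, prefix_allowed_tokens_fn=fn)
```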

zhiweihu1103 commented 6 months ago

Great, thanks for your hard work.

zhiweihu1103 commented 6 months ago

Hi folks. I want to know why you append target_embed to item_feat at line 51 of model.py. Isn't there label leakage? Our goal is to predict the target, but the text you input contains the target.

https://github.com/Senbao-Shi/GEMEL/blob/2560d08f866f53134932d0298621122b746a9316/model.py#L51

zhiweihu1103 commented 6 months ago

Hi, would you mind providing the entity_name files you used for WikiDiverse and WikiMEL? I used the code you provided to create the tree.pkl file and found that its size differs greatly from the one you provided. Testing the model with my tree.pkl, the result in the w/o In-context Learning setting is only 65.95 on WikiDiverse; after replacing it with your tree.pkl, the performance is 77.85. I need to know whether my entity_name lists differ from the ones you used. Thx.

zhiweihu1103 commented 6 months ago

I hope this finds you well. Sorry to bother you again; I hope you can share the entity_name_list you used to generate tree.pkl for the WikiDiverse and WikiMEL datasets. Thanks again.

zhiweihu1103 commented 6 months ago

Hi friend, any update on the entity_list?

KarimAsh11 commented 1 day ago

Hello, did you find a solution to this issue? Which entity list is used for the paper? The one from GENRE? Thanks!

zhiweihu1103 commented 1 day ago

No, the author did not reply to me, even though I sent a separate email asking about it. The prefix_tree_opt.pkl I generated myself from the benchmark dataset's entity list is completely different from the one the author provided.

KarimAsh11 commented 1 day ago

Ok thank you. Let's wait and hope for a reply I guess.

zhiweihu1103 commented 1 day ago

No, half a year has passed and there is still no hope, so don't count on it.