TimDettmers / ConvE

Convolutional 2D Knowledge Graph Embeddings resources
MIT License

Indexes for subjects and objects are inconsistent. #41

Closed DimLight1998 closed 5 years ago

DimLight1998 commented 5 years ago

I found that the index of an entity when it is a subject differs from its index when it is an object. I think this is because spodernet doesn't know that the keys e1 and e2_multi1 share the same domain (all entities). This inconsistency would lead to mismatched encodings for subjects and objects.

The inconsistent indexes can be observed by adding these two lines to the main function:

def main():
    if Config.process: preprocess(Config.dataset, delete_data=True)
    input_keys = ['e1', 'rel', 'rel_eval', 'e2', 'e2_multi1', 'e2_multi2']
    p = Pipeline(Config.dataset, keys=input_keys)
    p.load_vocabs()
    vocab = p.state['vocab']

    num_entities = vocab['e1'].num_token

    ######### add lines below ##############
    print(vocab['e1'].token2idx)
    print(vocab['e2_multi1'].token2idx)
    ########################################

    train_batcher = StreamBatcher(Config.dataset, 'train', Config.batch_size, randomize=True, keys=input_keys)
    dev_rank_batcher = StreamBatcher(Config.dataset, 'dev_ranking', Config.batch_size, randomize=False, loader_threads=4, keys=input_keys)
    test_rank_batcher = StreamBatcher(Config.dataset, 'test_ranking', Config.batch_size, randomize=False, loader_threads=4, keys=input_keys)

On the kinship dataset I got:

{'OOV': 0, '': 1, 'person100': 2, 'person80': 3, 'person37': 4, 'person72': 5, 'person49': 6, 'person39': 7, 'person12': 8, 'person87': 9, 'person10': 10, 'person48': 11, 'person63': 12, 'person36': 13, 'person45': 14, 'person68': 15, 'person40': 16, 'person82': 17, 'person90': 18, 'person13': 19, 'person17': 20, 'person69': 21, 'person103': 22, 'person0': 23, 'person65': 24, 'person11': 25, 'person9': 26, 'person92': 27, 'person62': 28, 'person102': 29, 'person66': 30, 'person70': 31, 'person73': 32, 'person18': 33, 'person60': 34, 'person26': 35, 'person50': 36, 'person89': 37, 'person38': 38, 'person81': 39, 'person14': 40, 'person21': 41, 'person53': 42, 'person67': 43, 'person28': 44, 'person24': 45, 'person95': 46, 'person51': 47, 'person3': 48, 'person41': 49, 'person99': 50, 'person96': 51, 'person7': 52, 'person54': 53, 'person15': 54, 'person1': 55, 'person29': 56, 'person78': 57, 'person31': 58, 'person83': 59, 'person33': 60, 'person2': 61, 'person58': 62, 'person52': 63, 'person79': 64, 'person27': 65, 'person32': 66, 'person76': 67, 'person85': 68, 'person101': 69, 'person8': 70, 'person35': 71, 'person23': 72, 'person5': 73, 'person64': 74, 'person77': 75, 'person86': 76, 'person93': 77, 'person91': 78, 'person4': 79, 'person42': 80, 'person22': 81, 'person19': 82, 'person47': 83, 'person20': 84, 'person46': 85, 'person25': 86, 'person75': 87, 'person71': 88, 'person56': 89, 'person43': 90, 'person88': 91, 'person6': 92, 'person57': 93, 'person94': 94, 'person61': 95, 'person16': 96, 'person98': 97, 'person55': 98, 'person97': 99, 'person74': 100, 'person84': 101, 'person30': 102, 'person34': 103, 'person59': 104, 'person44': 105}
{'OOV': 0, '': 1, 'person77': 2, 'person82': 3, 'person63': 4, 'person59': 5, 'person56': 6, 'person80': 7, 'person83': 8, 'person85': 9, 'person90': 10, 'person44': 11, 'person100': 12, 'person97': 13, 'person88': 14, 'person93': 15, 'person103': 16, 'person30': 17, 'person18': 18, 'person35': 19, 'person26': 20, 'person89': 21, 'person32': 22, 'person36': 23, 'person87': 24, 'person81': 25, 'person71': 26, 'person72': 27, 'person78': 28, 'person67': 29, 'person69': 30, 'person74': 31, 'person49': 32, 'person45': 33, 'person37': 34, 'person43': 35, 'person28': 36, 'person96': 37, 'person102': 38, 'person4': 39, 'person16': 40, 'person46': 41, 'person50': 42, 'person17': 43, 'person20': 44, 'person39': 45, 'person66': 46, 'person12': 47, 'person8': 48, 'person48': 49, 'person15': 50, 'person5': 51, 'person0': 52, 'person1': 53, 'person7': 54, 'person10': 55, 'person9': 56, 'person19': 57, 'person34': 58, 'person86': 59, 'person29': 60, 'person73': 61, 'person2': 62, 'person21': 63, 'person13': 64, 'person24': 65, 'person3': 66, 'person14': 67, 'person53': 68, 'person23': 69, 'person62': 70, 'person98': 71, 'person25': 72, 'person40': 73, 'person99': 74, 'person92': 75, 'person27': 76, 'person76': 77, 'person79': 78, 'person31': 79, 'person95': 80, 'person75': 81, 'person33': 82, 'person70': 83, 'person38': 84, 'person42': 85, 'person91': 86, 'person68': 87, 'person41': 88, 'person94': 89, 'person64': 90, 'person101': 91, 'person55': 92, 'person84': 93, 'person60': 94, 'person6': 95, 'person11': 96, 'person54': 97, 'person65': 98, 'person22': 99, 'person51': 100, 'person61': 101, 'person58': 102, 'person57': 103, 'person52': 104, 'person47': 105}
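
To see at a glance which entities disagree, the two mappings can be diffed. This is only a small sketch on plain dicts: `mismatched_tokens` is a hypothetical helper, not part of spodernet, and the sample dicts are short excerpts of the two printed mappings.

```python
def mismatched_tokens(a, b):
    """Return {token: (index_in_a, index_in_b)} for shared tokens whose indices differ."""
    return {tok: (a[tok], b[tok]) for tok in a.keys() & b.keys() if a[tok] != b[tok]}


# Excerpts of vocab['e1'].token2idx and vocab['e2_multi1'].token2idx.
e1_excerpt = {'OOV': 0, '': 1, 'person100': 2, 'person80': 3}
e2_multi1_excerpt = {'OOV': 0, '': 1, 'person100': 12, 'person80': 7}

diff = mismatched_tokens(e1_excerpt, e2_multi1_excerpt)
print(diff)
```

Only the special tokens 'OOV' and '' agree in this excerpt; both entities get different indices in the two vocabularies.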

Is this intentional or a bug?

TimDettmers commented 5 years ago

This is indeed a bit confusing. I was pretty stubborn with my spodernet library and the design is not great.

The thing is that there are different vocabularies, but when you convert tokens to IDs you can specify which vocabulary to use in the key2key variable, which maps a token variable to a specific vocabulary. In other words, e2_multi1 gets mapped to the e1 vocabulary. You can see that in the preprocessing in main.py.

You can see how this key2key variable gets processed if you look in the spodernet ConvertTokenToIdx processor.
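
A minimal sketch of the idea, not spodernet's actual code: the `Vocab` class and `convert` function below are hypothetical stand-ins, and the `key2key` dict only mirrors the e2_multi1 to e1 mapping described above (the real one in main.py may list more keys). It shows that two separately built vocabularies can order entities differently, yet the converted indices stay consistent once the lookup is redirected.

```python
class Vocab:
    """Toy vocabulary that assigns indices in order of first appearance."""

    def __init__(self):
        self.token2idx = {'OOV': 0, '': 1}

    def add(self, token):
        # Only assign a new index the first time a token is seen.
        self.token2idx.setdefault(token, len(self.token2idx))

    def get_idx(self, token):
        return self.token2idx.get(token, 0)  # unknown tokens map to OOV


def convert(sample, vocabs, key2key):
    """Convert each field's tokens to indices, redirecting keys listed in key2key."""
    out = {}
    for key, tokens in sample.items():
        vocab = vocabs[key2key.get(key, key)]  # e.g. 'e2_multi1' -> 'e1'
        out[key] = [vocab.get_idx(t) for t in tokens]
    return out


# Each key has its own vocabulary, filled in a different order.
vocabs = {'e1': Vocab(), 'e2_multi1': Vocab()}
for tok in ['person100', 'person80']:
    vocabs['e1'].add(tok)
for tok in ['person80', 'person100']:
    vocabs['e2_multi1'].add(tok)

key2key = {'e2_multi1': 'e1'}  # assumed mapping, following the description above
sample = {'e1': ['person100'], 'e2_multi1': ['person100', 'person80']}

ids = convert(sample, vocabs, key2key)
print(ids)  # {'e1': [2], 'e2_multi1': [2, 3]}
```

With an empty `key2key`, the same sample would convert e2_multi1 to `[3, 2]`, which is the kind of mismatch the printed `token2idx` dicts suggest.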

So it should work correctly at run-time since e2_multi1 reuses the e1 vocabulary. Does this make sense?