DimLight1998 closed 5 years ago
This is indeed a bit confusing. I was pretty stubborn with my spodernet library and the design is not great.
The thing is that there are different vocabularies, but when you convert tokens to IDs you can specify which vocabulary to use in the `key2key` variable, which maps a token variable to a specific vocabulary. In other words, `e2_multi1` gets mapped to the `e1` vocabulary. You can see that in the preprocessing in main.py.
You can see how this `key2key` variable gets processed if you look at the spodernet ConvertTokenToIdx processor.
So it should work correctly at run-time since `e1_multi1` reuses the `e1` vocabulary. Does this make sense?
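To illustrate the idea of several keys sharing one vocabulary, here is a minimal sketch. It is not spodernet's code: the `Vocab` class, the `convert_tokens_to_idx` function, and the sample entity and relation names are all hypothetical stand-ins, and the real ConvertTokenToIdx processor may work differently in its details.

```python
class Vocab:
    """Toy token-to-index vocabulary (hypothetical, not spodernet's class)."""

    def __init__(self):
        self.token2idx = {}

    def get_idx(self, token):
        # Assign the next free index the first time a token is seen.
        if token not in self.token2idx:
            self.token2idx[token] = len(self.token2idx)
        return self.token2idx[token]


def convert_tokens_to_idx(sample, key2key, vocabs):
    """Convert every field of `sample` with the vocabulary that key2key names."""
    converted = {}
    for key, value in sample.items():
        # Fall back to a per-key vocabulary when no mapping is given.
        vocab_name = key2key.get(key, key)
        vocab = vocabs.setdefault(vocab_name, Vocab())
        if isinstance(value, list):
            converted[key] = [vocab.get_idx(token) for token in value]
        else:
            converted[key] = vocab.get_idx(value)
    return converted


# Both entity keys point at the single 'e1' vocabulary (assumed mapping).
key2key = {'e1': 'e1', 'e2_multi1': 'e1', 'rel': 'rel'}
vocabs = {}

first = convert_tokens_to_idx(
    {'e1': 'alice', 'rel': 'aunt', 'e2_multi1': ['bob', 'carol']},
    key2key, vocabs)
second = convert_tokens_to_idx(
    {'e1': 'bob', 'rel': 'nephew', 'e2_multi1': ['alice']},
    key2key, vocabs)

# 'alice' and 'bob' get the same index as subject and as object.
print(first)
print(second)
```

With a mapping like this, only two vocabularies are ever created (one for entities, one for relations), so an entity gets the same index whether it appears under `e1` or under `e2_multi1`. Without the mapping, each key would fall back to its own vocabulary and the indexes could diverge.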
I found that for the same entity, the index when it is a subject is different from the index when it is an object. I think this is because spodernet doesn't know that the domains of the keys `e1` and `e2_multi1` are the same (all entities). This inconsistency will lead to disordered encoding for subjects and objects.
The inconsistent indexes can be observed by adding these two lines to the `main` function:
On the kinship dataset I got
Is this intentional or a bug?