Closed rdeits closed 5 years ago
I think that, rather, we should replace `synset.synset_type` with `ptr.pos` in
https://github.com/JuliaText/WordNet.jl/blob/56659be70878230b5bb683e8038548b9cea6f791/src/operations.jl#L8
I think that is a mistake in this implementation.
Determining which index to look in for the related synset is a property of the pointer, not of the source synset.
(The question of POS vs synset type is interesting but moot, as the only time they differ is for satellite adjectives, `s`, and those never occur on the right-hand side of a relation (I checked with regex).)
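A minimal sketch of the difference, using hypothetical stand-in types rather than WordNet.jl's actual structs (the names `Pointer`, `Synset`, `index`, `related`, and `related_by_type` are assumptions made up for illustration):

```julia
# Hypothetical stand-ins; these are NOT WordNet.jl's real types.
struct Pointer
    symbol::String   # relation symbol, e.g. "&" (similar to)
    offset::Int      # offset of the target synset
    pos::Char        # which index the target lives in
end

struct Synset
    offset::Int
    pos::Char          # from the file the synset was read from
    synset_type::Char  # from the synset's own line ('s' for satellites)
    pointers::Vector{Pointer}
end

# Index keyed on pos first, then on offset. There is no 's' key.
index = Dict{Char, Dict{Int, Synset}}('a' => Dict{Int, Synset}())

unclothed = Synset(459631, 'a', 'a', Pointer[])
bare = Synset(460031, 'a', 's', [Pointer("&", 459631, 'a')])
index['a'][unclothed.offset] = unclothed
index['a'][bare.offset] = bare

# Look up each target through the pointer's own pos.
related(idx, synset) =
    [idx[ptr.pos][ptr.offset] for ptr in synset.pointers]

# Look up each target through the source's synset_type instead.
related_by_type(idx, synset) =
    [idx[synset.synset_type][ptr.offset] for ptr in synset.pointers]
```

In this sketch `related(index, bare)` finds the `unclothed` synset, while `related_by_type(index, bare)` raises a `KeyError` because the source is a satellite adjective and nothing is indexed under `'s'`.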
Ramblings
A synset's `pos` determines which file it is indexed in; adjectives are in `data.adj`.
A synset's `synset_type` is a marking on its line in that file. For most things it is one-to-one with the `pos`, but for adjectives the `data.adj` file contains both `s` and `a`.
The synset objects themselves in WordNet.jl contain both `pos` (from the file) and `synset_type` (from the line).
For indexing purposes the `pos` is used (both for our internal dict and, as mentioned, for the files).
If we look at 2 lines of `data.adj`:
00459631 00 a 01 unclothed 0 019 ^ 00060656 a 0000 ! 00455759 a 0101 & 00460031 a 0000 & 00460299 a 0000 & 00460521 a 0000 & 00460697 a 0000 & 00460843 a 0000 & 00460973 a 0000 & 00461135 a 0000 & 00461243 a 0000 & 00461363 a 0000 & 00461476 a 0000 & 00461586 a 0000 & 00461779 a 0000 & 00461914 a 0000 & 00461986 a 0000 & 00462109 a 0000 & 00462190 a 0000 & 00462329 a 0000 | not wearing clothing
00460031 00 s 04 bare 0 au_naturel(p) 0 naked 0 nude 0 007 & 00459631 a 0000 + 14479883 n 0401 + 10385098 n 0401 + 14479586 n 0402 + 14479586 n 0403 + 14479586 n 0301 + 14480341 n 0101 | completely unclothed; "bare bodies"; "naked from the waist up"; "a nude model"
We can see `00460031 00 s 04 bare`, which is what you are querying about, having a relation `& 00459631 a 0000`, which reads as "similar to `00459631 a 0000`", which is, I believe, `00459631 00 a 01 unclothed`.
Now the things we need to extract out of a relation entry (`& 00459631 a 0000`) are which index to look in (our dict is currently keyed on `pos`) and the offset.
For this purpose the `a` tells us which index (this is `ptr.pos`), and the `00459631` tells us the offset (which is basically an ID for a synset).
In the trailing `0000`, the first `00` tells us the source and the last `00` tells us the target; idk what that actually means though.
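To make the field breakdown concrete, here is a rough sketch that splits one relation entry into its parts. `parse_pointer` and the field names in the returned tuple are hypothetical and are not taken from WordNet.jl:

```julia
# Hypothetical helper: split one relation entry into its four fields.
function parse_pointer(entry::AbstractString)
    symbol, offset, pos, source_target = split(entry)
    return (
        symbol = String(symbol),          # "&" reads as "similar to"
        offset = parse(Int, offset),      # "00459631" becomes 459631
        pos    = first(pos),              # single character naming the index
        source = source_target[1:2],      # first two characters of "0000"
        target = source_target[3:4],      # last two characters of "0000"
    )
end

ptr = parse_pointer("& 00459631 a 0000")
```

Only `ptr.pos` and `ptr.offset` are needed to find the related synset; the `source` and `target` parts are kept as raw strings here because their meaning is not settled in this thread.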
(x-posted from Slack)
Looking up the similar adjectives for some synsets throws a key error:
This seems to be resolved by just replacing `synset.synset_type` with `synset.synset_type == 's' ? 'a' : synset.synset_type` in https://github.com/JuliaText/WordNet.jl/blob/ee4a0c3fa43fce60d0802711033da8897570ed03/src/operations.jl#L8 but I'm not sure if that's the right thing to do.
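For reference, the workaround amounts to the mapping below. This is only a sketch: `index_key` and the toy `files` dict are hypothetical names, not part of WordNet.jl, and the dict stands in for whatever is keyed on part of speech.

```julia
# Hypothetical helper mirroring the workaround: satellite adjectives ('s')
# are looked up under the plain adjective key ('a'); other types pass through.
index_key(synset_type::Char) = synset_type == 's' ? 'a' : synset_type

# Toy stand-in for a lookup keyed on part of speech; note there is no 's' key.
files = Dict('a' => "data.adj")

adj_file = files[index_key('s')]   # succeeds; files['s'] would throw a KeyError
```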