JuliaText / WordNet.jl

A Julia package for Princeton's WordNet®.

Missing `'s'` key when looking up similar adjectives #15

Closed rdeits closed 5 years ago

rdeits commented 5 years ago

(x-posted from Slack)

Looking up the similar adjectives for some synsets throws a key error:

julia> db = DB()
WordNet.DB

julia> lemma = db['a', "bare"]
bare.a

julia> WordNet.relation(db, synsets(db, lemma)[1], WordNet.SIMILAR_TO)
ERROR: KeyError: key 's' not found
Stacktrace:
 [1] getindex at ./dict.jl:478 [inlined]
 [2] #25 at /home/rdeits/.julia/dev/WordNet/src/operations.jl:10 [inlined]
 [3] iterate at ./generator.jl:47 [inlined]
 [4] _collect(::Array{WordNet.Pointer,1}, ::Base.Generator{Array{WordNet.Pointer,1},getfield(WordNet, Symbol("##25#27")){DB,Synset}}, ::Base.EltypeUnknown, ::Base.HasShape{1}) at ./array.jl:632
 [5] map at ./array.jl:561 [inlined]
 [6] relation(::DB, ::Synset, ::String) at /home/rdeits/.julia/dev/WordNet/src/operations.jl:9
 [7] top-level scope at none:0

This seems to be resolved by just replacing `synset.synset_type` with `synset.synset_type == 's' ? 'a' : synset.synset_type` in https://github.com/JuliaText/WordNet.jl/blob/ee4a0c3fa43fce60d0802711033da8897570ed03/src/operations.jl#L8, but I'm not sure if that's the right thing to do.

oxinabox commented 5 years ago

I think that, rather, we should replace `synset.synset_type` with `ptr.pos` in https://github.com/JuliaText/WordNet.jl/blob/56659be70878230b5bb683e8038548b9cea6f791/src/operations.jl#L8

I think that is a mistake in this implementation. Determining which index to look in for the related synset is a property of the pointer, not of the source synset. (The question of POS vs. synset type is interesting but moot, as the only time they differ is for satellite adjectives `s`, and those never occur on the right-hand side of a relation; I checked with a regex.)
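To make the difference concrete, here is a minimal sketch of the two lookup strategies. It assumes an index keyed first on POS and then on offset, as described in this thread. `ToyPointer`, `ToySynset`, `TOY_INDEX` and both functions are hypothetical stand-ins written for illustration; they are not the actual WordNet.jl types or API.

```julia
# Toy stand-ins for illustration only; NOT the real WordNet.jl types.
struct ToyPointer
    symbol::String   # relation symbol, e.g. "&" for "similar to"
    offset::Int      # offset of the target synset
    pos::Char        # POS of the target, as written in the pointer
end

struct ToySynset
    offset::Int
    pos::Char          # which data file the synset came from ('a' for data.adj)
    synset_type::Char  # type marked on the line; 's' for satellite adjectives
    words::Vector{String}
    pointers::Vector{ToyPointer}
end

unclothed = ToySynset(459631, 'a', 'a', ["unclothed"], ToyPointer[])
bare = ToySynset(460031, 'a', 's', ["bare", "au_naturel", "naked", "nude"],
                 [ToyPointer("&", 459631, 'a')])

# Index keyed on POS, then on offset. Note there is no 's' key.
const TOY_INDEX = Dict{Char,Dict{Int,ToySynset}}(
    'a' => Dict(unclothed.offset => unclothed, bare.offset => bare),
)

# Variant that picks the index from the source synset's type.
related_by_source_type(s::ToySynset, symbol) =
    [TOY_INDEX[s.synset_type][p.offset] for p in s.pointers if p.symbol == symbol]

# Variant that picks the index from the pointer's own POS.
related_by_pointer_pos(s::ToySynset, symbol) =
    [TOY_INDEX[p.pos][p.offset] for p in s.pointers if p.symbol == symbol]

similar = related_by_pointer_pos(bare, "&")   # finds the "unclothed" synset
# related_by_source_type(bare, "&") would throw a KeyError for 's'
```

In this toy setup the first variant fails for any satellite adjective, because the source synset's type `'s'` is never a key of the index, while the pointer's POS always is.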


Ramblings

A synset's `pos` determines which file it is indexed in: adjectives are in `data.adj`. A synset's `synset_type` is a marking on its line in that file. For most things it is one-to-one with the `pos`, but for adjectives the `data.adj` file contains both `s` and `a`.

The synset objects themselves in WordNet.jl contain both `pos` (from the file) and `synset_type` (from the line). For indexing purposes the `pos` is used (both for our internal dict and, as mentioned, for the files).

If we look at 2 lines of `data.adj`:

00459631 00 a 01 unclothed 0 019 ^ 00060656 a 0000 ! 00455759 a 0101 & 00460031 a 0000 & 00460299 a 0000 & 00460521 a 0000 & 00460697 a 0000 & 00460843 a 0000 & 00460973 a 0000 & 00461135 a 0000 & 00461243 a 0000 & 00461363 a 0000 & 00461476 a 0000 & 00461586 a 0000 & 00461779 a 0000 & 00461914 a 0000 & 00461986 a 0000 & 00462109 a 0000 & 00462190 a 0000 & 00462329 a 0000 | not wearing clothing  

00460031 00 s 04 bare 0 au_naturel(p) 0 naked 0 nude 0 007 & 00459631 a 0000 + 14479883 n 0401 + 10385098 n 0401 + 14479586 n 0402 + 14479586 n 0403 + 14479586 n 0301 + 14480341 n 0101 | completely unclothed; "bare bodies"; "naked from the waist up"; "a nude model"  

We can see `00460031 00 s 04 bare`, which is what you are querying about, having a relation `& 00459631 a 0000`, which reads as "similar to `00459631 a 0000`", which I believe is `00459631 00 a 01 unclothed`.
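The reading above can be checked mechanically. The sketch below splits a data-file line into its fields, assuming the layout documented for WordNet data files: offset, lexicographer file number, synset type, a hexadecimal word count, word/lex_id pairs, a decimal pointer count, then four fields per pointer, with the gloss after `|`. `parse_data_line` is a hypothetical helper, not a WordNet.jl function, and it ignores verb frames since adjective lines have none.

```julia
# Hypothetical helper for illustration; not part of WordNet.jl.
function parse_data_line(line::AbstractString)
    body, gloss = split(line, " | "; limit=2)
    f = split(body)
    offset = parse(Int, f[1])
    ss_type = first(f[3])
    w_cnt = parse(Int, f[4]; base=16)           # word count is hexadecimal
    # Words alternate with their lex_id, starting at field 5.
    words = [String(f[5 + 2 * i]) for i in 0:w_cnt-1]
    k = 5 + 2 * w_cnt                           # field holding the pointer count
    p_cnt = parse(Int, f[k])                    # pointer count is decimal
    # Each pointer is four fields: symbol, offset, pos, source/target.
    pointers = [(symbol = String(f[k + 1 + 4 * j]),
                 offset = parse(Int, f[k + 2 + 4 * j]),
                 pos = first(f[k + 3 + 4 * j]),
                 source_target = String(f[k + 4 + 4 * j]))
                for j in 0:p_cnt-1]
    return (offset = offset, ss_type = ss_type, words = words,
            pointers = pointers, gloss = String(strip(gloss)))
end

# The "bare" line quoted above, with the example sentences dropped from the gloss.
bare_line = "00460031 00 s 04 bare 0 au_naturel(p) 0 naked 0 nude 0 007 " *
            "& 00459631 a 0000 + 14479883 n 0401 + 10385098 n 0401 " *
            "+ 14479586 n 0402 + 14479586 n 0403 + 14479586 n 0301 " *
            "+ 14480341 n 0101 | completely unclothed"

syn = parse_data_line(bare_line)
similar_ptrs = filter(p -> p.symbol == "&", syn.pointers)
```

Here `syn.ss_type` is `'s'`, while the single `&` pointer carries the POS `'a'`, which is the mismatch the issue runs into.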

Now, the things we need to extract out of a relation entry (`& 00459631 a 0000`) are which index to look in (our dict is currently keyed on `pos`) and the offset. For this purpose, the `a` tells us which index (this is `ptr.pos`), and the `00459631` tells us the offset (which is basically an ID for a synset). The first `00` tells us the source and the last `00` tells us the target; idk what that actually means, though.
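On the source/target field: going by the WordNet database format documentation (wndb), it holds two two-digit hexadecimal word numbers. The first is the word's position in the source synset and the second its position in the target synset, and `0000` marks a semantic relation between whole synsets rather than between specific words. A minimal sketch of pulling one pointer apart under that reading follows; `parse_pointer` is a hypothetical helper, not a WordNet.jl function.

```julia
# Hypothetical helper for illustration; not part of WordNet.jl.
function parse_pointer(s::AbstractString)
    symbol, offset, pos, st = split(s)
    source_word = parse(Int, st[1:2]; base=16)  # word number in the source synset
    target_word = parse(Int, st[3:4]; base=16)  # word number in the target synset
    return (symbol = String(symbol),
            offset = parse(Int, offset),        # ID of the target synset
            pos = first(pos),                   # which index to look in
            source_word = source_word,
            target_word = target_word,
            semantic = source_word == 0 && target_word == 0)
end

p = parse_pointer("& 00459631 a 0000")  # synset-level "similar to"
q = parse_pointer("+ 14479883 n 0401")  # word-level link: source word 4, target word 1
```

The pair `(p.pos, p.offset)` is all that is needed to find the target synset; the word numbers only matter when a relation holds between specific words.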