Does yy input provide anything over regular tokenized input?

goodmami commented 7 years ago

I'm having trouble using yy mode when the yy tokens are stored in a [incr tsdb()] profile's item:i-input field. I could extract and store only the tokens in the profile, but is there any information in the yy data beyond the surface tokens that Jacy makes use of?
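For reference, the "extract only the tokens" option mentioned above could look roughly like the following. This is a minimal sketch, not code from Jacy or any DELPH-IN tool: the function name `yy_surface_forms` is hypothetical, and the regular expression assumes the usual YY field order (id, start vertex, end vertex, optional `<from:to>` link, a single path number, then the quoted form).

```python
import re

# Assumed YY token layout (hypothetical helper, not part of any DELPH-IN tool):
#   (id, start, end, <from:to>, path, "form", ipos, "lrule", "pos" prob ...)
# The <from:to> link is treated as optional; a single path number is assumed.
_YY_FORM = re.compile(
    r'\(\d+,\s*\d+,\s*\d+,\s*'      # id, start vertex, end vertex
    r'(?:<\d+:\d+>,\s*)?'           # optional character span
    r'\d+,\s*'                      # path
    r'"((?:[^"\\]|\\.)*)"'          # the quoted form (group 1)
)


def yy_surface_forms(yy_string):
    """Return the surface forms of all tokens in a YY string, in order."""
    return _YY_FORM.findall(yy_string)


if __name__ == "__main__":
    example = ('(1, 0, 1, <0:3>, 1, "バクク", 0, "null", "名詞-一般:n-n" 1.0) '
               '(2, 1, 2, <3:4>, 1, "が", 0, "null")')
    print(" ".join(yy_surface_forms(example)))
```

Note that this keeps only the forms; any POS tags in the YY string are discarded.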

fcbond commented 7 years ago

I am pretty sure we use the POS tags for unknown words (only for general nouns and sahen verbal nouns).

The lexical entries are in gle.tdl. I show an example below.

generic_vn-lex := vn-trans1-lex &
  [ SYNSEM.LKEYS.KEYREL.PRED #pred,
    STEM < "_genericvn" >,
    TOKENS.+LIST generic_token_list &
      < [ +POS.+TAGS < "名詞-サ変接続:n-n" >,
          +PRED #pred ] > ].

generic_noun-lex := ordinary-nohon-n-lex &
  [ SYNSEM.LKEYS.KEYREL.PRED #pred,
    STEM < "_generic_noun" >,
    TOKENS.+LIST generic_token_list &
      < [ +POS.+TAGS < "名詞-一般:n-n" >,
          +PRED #pred ] > ].

$ echo "バククが勉強する" | ace -g jacy.dat
NOTE: lexemes do not span position 0 `バクク'!
NOTE: post reduction gap
SKIP: バククが勉強する
NOTE: ignoring `バククが勉強する'

$ echo "バククが勉強する" | python utils/jpn2yy.py | ace -g jacy.dat -y
SENT: (yy mode)
[ LTOP: h0
  INDEX: e2 [ e TENSE: pres MOOD: indicative PROG: - PERF: - ASPECT: default_aspect PASS: - SF: prop ]
  RELS: < [ udef_q_rel<0:3> LBL: h4 ARG0: x3 [ x PERS: 3 ] RSTR: h5 BODY: h6 ]
          [ "_バクク_n_unknown_rel"<0:3> LBL: h7 ARG0: x3 ]
          [ "_benkyou_s_rel"<4:6> LBL: h1 ARG0: e2 ARG1: x3 ARG2: i8 ] >
  HCONS: < h0 qeq h1 h5 qeq h7 > ] ;
(426 utterance_rule-decl-finite 0.892834 0 4
 (425 head_subj_rule -0.567591 0 4
  (422 hf-complement-rule -1.006413 0 2
   (421 quantify-n-rule 0.166081 0 1
    (32 generic_noun-lex 0.000000 0 1
     ("バクク" 18 "token [ +FORM \"バクク\" +FROM \"0\" +TO \"3\" +ID diff-list [ LIST list LAST list ] +POS pos [ +TAGS cons [ FIRST \"名詞-一般:n-n\" REST null ] +PRBS cons [ FIRST \"1.000000\" REST null ] ] +CLASS non_ne [ +INITIAL luk ] +TRAIT generic_trait +PRED \"_バクク_n_unknown_rel\" +CARG \"バクク\" ]")))
   (45 ga 0.150269 1 2
    ("が" 21 "token [ +FORM \"が\" +FROM \"3\" +TO \"4\" +ID diff-list [ LIST list LAST list ] +POS null_pos [ +TAGS null +PRBS null ] +CLASS non_ne [ +INITIAL luk ] +TRAIT native_trait +PRED predsort +CARG \"が\" ]")))
  (424 vn-light-rule 0.853578 2 4
   (46 benkyou-vn 0.000000 2 3
    ("勉強" 22 "token [ +FORM \"勉強\" +FROM \"4\" +TO \"6\" +ID diff-list [ LIST list LAST list ] +POS null_pos [ +TAGS null +PRBS null ] +CLASS non_ne [ +INITIAL luk ] +TRAIT native_trait +PRED predsort +CARG \"勉強\" ]"))
   (423 kuru-lexeme-infl-rule 1.900016 3 4
    (52 suru-light-stem 1.101217 3 4
     ("する" 23 "token [ +FORM \"する\" +FROM \"6\" +TO \"8\" +ID diff-list [ LIST list LAST list ] +POS null_pos [ +TAGS null +PRBS null ] +CLASS non_ne [ +INITIAL luk ] +TRAIT native_trait +PRED predsort +CARG \"する\" ]"))))))
NOTE: 1 readings, added 307 / 121 edges to chart (53 fully instantiated, 39 actives used, 23 passives used) RAM: 1903k

NOTE: parsed 1 / 1 sentences, avg 1903k, time 0.01390s

On Thu, Sep 14, 2017 at 4:15 AM, Michael Wayne Goodman < notifications@github.com> wrote:

I'm having trouble using yy mode when the yy tokens are stored in a [incr tsdb()] profile's item:i-input field. I could extract and store only the tokens in the profile, but is there any information in the yy data beyond the surface tokens that Jacy makes use of?


-- Francis Bond http://www3.ntu.edu.sg/home/fcbond/ Division of Linguistics and Multilingual Studies Nanyang Technological University
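To make concrete what a YY token carries beyond the surface form, here is a minimal sketch that renders (form, POS tag) pairs as a YY string. It is not the implementation of utils/jpn2yy.py: the function name `to_yy` is hypothetical, the field layout follows the YY format as I understand it from the DELPH-IN wiki, and the probability is fixed at 1.0 as an assumption. The POS tag string is the same one that the generic lexical entries above match on.

```python
def to_yy(tokens):
    """Render (form, pos_tag) pairs as a YY token string.

    tokens: list of (surface form, POS tag or None), in sentence order.
    Character offsets assume the forms are concatenated with no
    whitespace, as in unsegmented Japanese text.
    Hypothetical helper for illustration only.
    """
    parts = []
    offset = 0
    for i, (form, pos) in enumerate(tokens):
        start, end = offset, offset + len(form)
        # id, start vertex, end vertex, <from:to>, path, form, ipos, lrule
        fields = '{}, {}, {}, <{}:{}>, 1, "{}", 0, "null"'.format(
            i + 1, i, i + 1, start, end, form)
        if pos is not None:
            # Optional POS tag with an assumed probability of 1.0
            fields += ', "{}" 1.0'.format(pos)
        parts.append("(" + fields + ")")
        offset = end
    return " ".join(parts)


if __name__ == "__main__":
    print(to_yy([("バクク", "名詞-一般:n-n"), ("が", None),
                 ("勉強", None), ("する", None)]))
```

For this sentence the computed character spans (0:3, 3:4, 4:6, 6:8) agree with the +FROM and +TO values in the ACE output above. Dropping the POS field leaves only what plain tokenized input would provide.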

goodmami commented 7 years ago

Thanks for the example. I thought it might be POS info, but I didn't know where it was being used. I tried a different example and didn't see any difference in the ability to parse an unknown, but maybe it was just a bad example.

fcbond commented 7 years ago

Yeah, there are only a couple of unknowns that we can handle.


goodmami commented 7 years ago

To follow up on my original post: parsing YY tokens stored in the i-input field of a [incr tsdb()] profile was failing because I had not passed the -Y option to art in addition to the -y option of ACE:

art -a 'ace -g grm.dat -y' -Y profile/
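If the command is assembled from a script, the two flags are easy to mix up, since -y belongs to the ACE command string passed via -a while -Y belongs to art itself. A minimal sketch, assuming only the flags shown above; the function name `build_art_command` is hypothetical, and the script only prints the command rather than running it:

```python
import shlex


def build_art_command(grammar, profile):
    """Build the argv list for running art over a profile with YY input.

    Hypothetical helper; uses only the flags from the command above.
    """
    ace_cmd = "ace -g {} -y".format(grammar)    # -y: ACE reads YY tokens
    return ["art", "-a", ace_cmd, "-Y", profile]  # -Y: art handles YY input


if __name__ == "__main__":
    # shlex.join (Python 3.8+) quotes the embedded ACE command for the shell
    print(shlex.join(build_art_command("grm.dat", "profile/")))
```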

delph-in / jacy

Does yy input provide anything over regular tokenized input? #53