delph-in / pydelphin

Python libraries for DELPH-IN
https://pydelphin.readthedocs.io/
MIT License

parsing TFS from tokens #349

Open arademaker opened 2 years ago

arademaker commented 2 years ago

Do we have any method to parse the TFS from tokens?

token [
+FORM "cats"
+FROM "4"
+TO "8"
+ID *diff-list* [ LIST *cons* [ FIRST "1" REST *list* ] LAST *list* ]
+TNT null_tnt [ +TAGS *null* +PRBS *null* +MAIN tnt_main [ +TAG "NNS" +PRB "1.0" ] ]
+CLASS alphabetic [ +CASE non_capitalized+lower +INITIAL - ]
+TRAIT token_trait [ 
 +UW -
 +IT italics
 +LB bracket_null [ LIST *list* LAST *list* ]
 +RB bracket_null [ LIST *list* LAST *list* ]
 +LD bracket_null [ LIST *list* LAST *list* ]
 +RD bracket_null [ LIST *list* LAST *list* ]
 +HD token_head [ +TI "<4:8>"
   +LL ctype [ -CTYPE- string ]
   +TG string ] ]
+PRED predsort
+CARG "cats"
+TICK +
+ONSET c-or-v-onset ]

oepen commented 2 years ago

Do we have any method to parse the TFS from tokens?

please see lkb::read-dag() in

http://svn.delph-in.net/lkb/trunk/src/glue/dag.lsp

in [incr tsdb()] this is invoked by tsdb::reconstruct(), which will recreate the full feature structure associated with the derivation, including any information 'infused' into the lexical entries from the underlying token feature structures, e.g. characterization.

arademaker commented 2 years ago

Hi @goodmami and @oepen,

[t.to_dict() for t in result.derivation().preterminals()]
[{'entity': 'the_1',
  'id': 149,
  'score': -1.639588,
  'start': 0,
  'end': 1,
  'type': 'd_-_the_le',
  'form': 'the',
  'tokens': [{'id': 91,
    'tfs': 'token [ +FORM \\"the\\" +FROM \\"0\\" +TO \\"3\\" +ID *diff-list* [ LIST *cons* [ FIRST \\"0\\" REST *list* ] LAST *list* ] +TNT null_tnt [ +TAGS *null* +PRBS *null* +MAIN tnt_main [ +TAG \\"DT\\" +PRB \\"1.0\\" ] ] +CLASS alphabetic [ +CASE non_capitalized+lower +INITIAL + ] +TRAIT token_trait [ +UW - +IT italics +LB bracket_null [ LIST *list* LAST *list* ] +RB bracket_null [ LIST *list* LAST *list* ] +LD bracket_null [ LIST *list* LAST *list* ] +RD bracket_null [ LIST *list* LAST *list* ] +HD token_head [ +TI \\"<0:3>\\" +LL ctype [ -CTYPE- string ] +TG string ] ] +PRED predsort +CARG \\"the\\" +TICK + +ONSET c-or-v-onset ]'}]},...

So I tried the LKB code with the string from the tfs field above. Am I calling it correctly, @oepen?

LKB> (read-dag "token [ +FORM \"the\" +FROM \"0\" +TO \"3\" +ID *diff-list* [ LIST *cons* [ FIRST \"0\" REST *list* ] LAST *list* ] +TNT null_tnt [ +TAGS *null* +PRBS *null* +MAIN tnt_main [ +TAG \"DT\" +PRB \"1.0\" ] ] +CLASS alphabetic [ +CASE non_capitalized+lower +INITIAL + ] +TRAIT token_trait [ +UW - +IT italics +LB bracket_null [ LIST *list* LAST *list* ] +RB bracket_null [ LIST *list* LAST *list* ] +LD bracket_null [ LIST *list* LAST *list* ] +RD bracket_null [ LIST *list* LAST *list* ] +HD token_head [ +TI \"<0:3>\" +LL ctype [ -CTYPE- string ] +TG string ] ] +PRED predsort +CARG \"the\" +TICK + +ONSET c-or-v-onset ]")
NIL

goodmami commented 2 years ago

@arademaker addressing your initial question: no, I don't think I ever got around to adding support for parsing those token structures, but I had thought about it. The delphin.tfs.TypedFeatureStructure class should be capable of containing it once it's parsed, but this TFS format is slightly different from TDL (notice, e.g., that there are no commas between feature values), so we can't just use the TDL parser.
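
For illustration, here is a minimal sketch of a standalone parser for this comma-free format. It is not part of pyDelphin: the names `Node` and `parse_token_tfs` are hypothetical, and the grammar it assumes (a type name optionally followed by a bracketed run of feature/value pairs) is inferred only from the token examples in this thread. It also assumes that string values never contain a double quote themselves, and it only builds a syntactic tree without consulting any grammar or type hierarchy.

```python
import re
from typing import Dict, List, NamedTuple, Tuple


class Node(NamedTuple):
    """Hypothetical container: a type name plus its feature/value pairs."""
    type: str
    features: Dict[str, "Node"]


# A token is a double-quoted string, a bracket, or a run of other characters.
_TOKEN = re.compile(r'"[^"]*"|\[|\]|[^\s\[\]"]+')


def parse_token_tfs(text: str) -> Node:
    """Parse a string such as 'token [ +FORM "cats" ... ]' into a Node tree."""
    # Assumption: quotes may arrive backslash-escaped (as in the derivation
    # dicts shown earlier in the thread), so normalize them first.
    tokens = _TOKEN.findall(text.replace('\\"', '"'))
    node, pos = _parse_node(tokens, 0)
    if pos != len(tokens):
        raise ValueError(f"unexpected trailing token: {tokens[pos]!r}")
    return node


def _parse_node(tokens: List[str], pos: int) -> Tuple[Node, int]:
    """Parse one typed node starting at tokens[pos]; return it and the next index."""
    if pos >= len(tokens) or tokens[pos] in ("[", "]"):
        raise ValueError("expected a type name or string")
    type_name = tokens[pos]  # string values keep their surrounding quotes
    pos += 1
    features: Dict[str, Node] = {}
    if pos < len(tokens) and tokens[pos] == "[":
        pos += 1
        # Features and values simply alternate; no commas separate the pairs.
        while pos < len(tokens) and tokens[pos] != "]":
            name = tokens[pos]
            if name == "[":
                raise ValueError("expected a feature name")
            features[name], pos = _parse_node(tokens, pos + 1)
        if pos >= len(tokens):
            raise ValueError("missing closing bracket")
        pos += 1  # consume the closing bracket
    return Node(type_name, features), pos


example = ('token [ +FORM "cats" +FROM "4" +TO "8" '
           '+CLASS alphabetic [ +CASE non_capitalized+lower +INITIAL - ] '
           '+TICK + +ONSET c-or-v-onset ]')
tree = parse_token_tfs(example)
print(tree.type, tree.features["+FORM"].type)
```

Converting the resulting tree into a delphin.tfs.TypedFeatureStructure is left out here, since the exact mapping of string values and list types would be a design decision for the library.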

oepen commented 2 years ago

So I tried the LKB code with the string from the tfs field above,

do you have the right grammar loaded? recreating the token feature structure requires the type hierarchy and constraints available, i.e. a complete unifier.