delph-in / pydelphin

Python libraries for DELPH-IN
https://pydelphin.readthedocs.io/
MIT License

tokens pos tags #341

Closed arademaker closed 1 year ago

arademaker commented 2 years ago

In response.tokens().tokens, note that all pos fields are empty ([]):

>>> from delphin import ace
>>> response = ace.parse('erg.dat', 'Abrams chased Browne')
NOTE: parsed 1 / 1 sentences, avg 3043k, time 0.29906s
>>> response.tokens().tokens
[YYToken(id=51, start=2, end=3, lnk=<Lnk object <14:20> at 4350564432>, paths=[1], form='Browne', surface=None, ipos=0, lrules=['null'], pos=[]), YYToken(id=53, start=0, end=1, lnk=<Lnk object <0:6> at 4360089024>, paths=[1], form='Abrams', surface=None, ipos=0, lrules=['null'], pos=[]), YYToken(id=54, start=2, end=3, lnk=<Lnk object <14:20> at 4360086384>, paths=[1], form='Browne', surface=None, ipos=0, lrules=['null'], pos=[]), YYToken(id=55, start=1, end=2, lnk=<Lnk object <7:13> at 4360109168>, paths=[1], form='chased', surface=None, ipos=0, lrules=['null'], pos=[]), YYToken(id=58, start=1, end=2, lnk=<Lnk object <7:13> at 4360109504>, paths=[1], form='chased', surface=None, ipos=0, lrules=['null'], pos=[]), YYToken(id=59, start=1, end=2, lnk=<Lnk object <7:13> at 4360106288>, paths=[1], form='chased', surface=None, ipos=0, lrules=['null'], pos=[]), YYToken(id=60, start=2, end=3, lnk=<Lnk object <14:20> at 4360107056>, paths=[1], form='browne', surface=None, ipos=0, lrules=['null'], pos=[]), YYToken(id=61, start=2, end=3, lnk=<Lnk object <14:20> at 4360323136>, paths=[1], form='browne', surface=None, ipos=0, lrules=['null'], pos=[]), YYToken(id=62, start=0, end=1, lnk=<Lnk object <0:6> at 4360323184>, paths=[1], form='abrams', surface=None, ipos=0, lrules=['null'], pos=[]), YYToken(id=63, start=0, end=1, lnk=<Lnk object <0:6> at 4360323232>, paths=[1], form='abrams', surface=None, ipos=0, lrules=['null'], pos=[])]

But if I inspect the response dictionary, I do have the POS tags for each token under the 'initial' key:

>>> response['tokens']
{'initial': '(1, 0, 1, <0:6>, 1, "Abrams", 0, "null", "NNP" 1.0) (2, 1, 2, <7:13>, 1, "chased", 0, "null", "NNP" 1.0) (3, 2, 3, <14:20>, 1, "Browne", 0, "null", "NNP" 1.0)', 'internal': '(51, 2, 3, <14:20>, 1, "Browne", 0, "null") (53, 0, 1, <0:6>, 1, "Abrams", 0, "null") (54, 2, 3, <14:20>, 1, "Browne", 0, "null") (55, 1, 2, <7:13>, 1, "chased", 0, "null") (58, 1, 2, <7:13>, 1, "chased", 0, "null") (59, 1, 2, <7:13>, 1, "chased", 0, "null") (60, 2, 3, <14:20>, 1, "browne", 0, "null") (61, 2, 3, <14:20>, 1, "browne", 0, "null") (62, 0, 1, <0:6>, 1, "abrams", 0, "null") (63, 0, 1, <0:6>, 1, "abrams", 0, "null")'}

Is that the expected behaviour?

arademaker commented 2 years ago

Of course, the most reliable information comes after the grammar's analysis. The TNT tagger tagged can as a modal verb, (4, 3, 4, <13:16>, 1, "can", 0, "null", "MD" 1.0), but the grammar overrides this analysis:

>>> response = ace.parse('erg.dat', 'I opened the can.')
NOTE: parsed 1 / 1 sentences, avg 3556k, time 0.23248s
>>> response['tokens']
{'initial': '(1, 0, 1, <0:1>, 1, "I", 0, "null", "PRP" 1.0) (2, 1, 2, <2:8>, 1, "opened", 0, "null", "VBD" 1.0) (3, 2, 3, <9:12>, 1, "the", 0, "null", "DT" 1.0) (4, 3, 4, <13:16>, 1, "can", 0, "null", "MD" 1.0) (5, 4, 5, <16:17>, 1, ".", 0, "null", "." 1.0)', 'internal': '(76, 0, 1, <0:1>, 1, "I", 0, "null") (81, 1, 2, <2:8>, 1, "opened", 0, "null") (84, 4, 5, <16:17>, 1, ".", 0, "null") (85, 4, 5, <16:17>, 1, ".", 0, "null") (86, 1, 2, <2:8>, 1, "opened", 0, "null") (87, 1, 2, <2:8>, 1, "opened", 0, "null") (88, 2, 3, <9:12>, 1, "the", 0, "null") (89, 2, 3, <9:12>, 1, "the", 0, "null") (90, 3, 4, <13:16>, 1, "can", 0, "null") (91, 3, 4, <13:16>, 1, "can", 0, "null") (92, 0, 1, <0:1>, 1, "i", 0, "null") (93, 0, 1, <0:1>, 1, "i", 0, "null")'}
>>> response.result(0).tree()
['S', ['NP', ['NP', ['i']]], ['VP', ['V', ['V', ['opened']]], ['NP', ['DET', ['the']], ['N', ['N', ['N', ['can']]], ['PT', ['.']]]]]]
arademaker commented 2 years ago

Is there an easy way to get the POS from the trees? Maybe by inspecting only response.result(0).derivation()?

arademaker commented 2 years ago

[screenshot of a derivation tree in which the word can appears under the lexical entry can_n1]

The problem is that interpreting the derivation probably requires some knowledge about the grammar, right? For instance, the word can in the screenshot can reasonably be detected as a noun because of the suffix n1 attached to the can_n1 identifier... this is the lexical entry, right? But for the pronoun I, I don't have any such suffix... Is it an identifier defined in the ERG, @danflick?

danflick commented 2 years ago

The names of individual lexical entries do not (yet) follow consistent conventions, but lexical types do, so you might add the flag --udx when you invoke ace, which causes the derivation tree to include, appended to each lexical entry name, its lexical type. The first field of the lexical type name might be enough to give you the coarse-rained part-of-speech label you want. That full derivation tree is also rather hard for humans to read, so you might try the following invocation of ace, which produces a more readable derivation tree:

ace -g erg.dat -1 --udx --rooted-derivations | sed -e 's/[ LTOP: [^)] ; //' -e 's/"token [ [^)]")/)/g' -e 's/[-0-9.][0-9][0-9] //g'
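
For reference, a minimal PyDelphin sketch along these lines (assumptions: the cmdargs keyword is used to pass --udx through to ACE, the lexical type then appears after an "@" in each entry name, as in dog_n1@n_-_c_le, and the first underscore-delimited field of that type is a usable coarse label):

from delphin import ace

# Ask ACE to append lexical types to lexical entry names in derivations.
response = ace.parse('erg.dat', 'I opened the can.', cmdargs=['--udx'])

deriv = response.result(0).derivation()
for node in deriv.preterminals():
    entry, _, lextype = node.entity.partition('@')
    # e.g. 'n' from 'n_-_c_le'; empty if --udx was not in effect
    coarse = lextype.split('_', 1)[0] if lextype else None
    print(node.terminals()[0].form, entry, lextype, coarse)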

danflick commented 2 years ago

That was intended to be "coarse-grained part-of-speech", but "coarse-rained" does sound more plentiful.

arademaker commented 2 years ago

Thank you @danflick

goodmami commented 2 years ago

Is that the expected behaviour?

Yes. By default, the "internal" tokens are returned. If you provide tokenset='initial' it will instead give you the "initial" tokens, which have the POS (that is, assuming you get both token sets from the processor):

>>> from delphin import ace
>>> response = ace.parse('../erg-2018.dat', 'Abrams chased Browne.')
NOTE: parsed 1 / 1 sentences, avg 3185k, time 0.07318s
>>> response.tokens().tokens
[YYToken(id=63, start=2, end=3, lnk=<Lnk object <14:21> at 139638158685088>, paths=[1], form='Browne.', surface=None, ipos=0, lrules=['null'], pos=[]), YYToken(id=65, start=0, end=1, lnk=<Lnk object <0:6> at 139638158687872>, paths=[1], form='Abrams', surface=None, ipos=0, lrules=['null'], pos=[]), YYToken(id=66, start=1, end=2, lnk=<Lnk object <7:13> at 139638158688016>, paths=[1], form='chased', surface=None, ipos=0, lrules=['null'], pos=[]), YYToken(id=67, start=2, end=3, lnk=<Lnk object <14:21> at 139638158684368>, paths=[1], form='Browne.', surface=None, ipos=0, lrules=['null'], pos=[]), YYToken(id=70, start=1, end=2, lnk=<Lnk object <7:13> at 139638158687824>, paths=[1], form='chased', surface=None, ipos=0, lrules=['null'], pos=[]), YYToken(id=71, start=1, end=2, lnk=<Lnk object <7:13> at 139638157052896>, paths=[1], form='chased', surface=None, ipos=0, lrules=['null'], pos=[]), YYToken(id=72, start=0, end=1, lnk=<Lnk object <0:6> at 139638155280288>, paths=[1], form='abrams', surface=None, ipos=0, lrules=['null'], pos=[]), YYToken(id=73, start=0, end=1, lnk=<Lnk object <0:6> at 139638155265264>, paths=[1], form='abrams', surface=None, ipos=0, lrules=['null'], pos=[]), YYToken(id=74, start=2, end=3, lnk=<Lnk object <14:21> at 139638155270304>, paths=[1], form='browne.', surface=None, ipos=0, lrules=['null'], pos=[]), YYToken(id=75, start=2, end=3, lnk=<Lnk object <14:21> at 139638155267424>, paths=[1], form='browne.', surface=None, ipos=0, lrules=['null'], pos=[])]
>>> response.tokens(tokenset='initial').tokens
[YYToken(id=1, start=0, end=1, lnk=<Lnk object <0:6> at 139638155280192>, paths=[1], form='Abrams', surface=None, ipos=0, lrules=['null'], pos=[('NNP', 1.0)]), YYToken(id=2, start=1, end=2, lnk=<Lnk object <7:13> at 139638155279328>, paths=[1], form='chased', surface=None, ipos=0, lrules=['null'], pos=[('NNP', 1.0)]), YYToken(id=3, start=2, end=3, lnk=<Lnk object <14:20> at 139638155277744>, paths=[1], form='Browne', surface=None, ipos=0, lrules=['null'], pos=[('NNP', 1.0)]), YYToken(id=4, start=3, end=4, lnk=<Lnk object <20:21> at 139638155277504>, paths=[1], form='.', surface=None, ipos=0, lrules=['null'], pos=[('.', 1.0)])]

Is there an easy way to get the POS from the trees? Maybe by inspecting only response.result(0).derivation()?

The Result.tree() method returns the simple syntax tree if it can get it, and it tries to get it in two ways: (1) directly from the response, if provided; (2) otherwise, extracted from the derivation tree, if the node labels are encoded on the derivation. With ACE version 24 or higher PyDelphin will request the node labels on derivations. Dan's suggestion of --udx is useful for getting the lexical types.
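
As a rough illustration (assuming the nested-list tree shape shown earlier, where a leaf is a single-element list holding the word), the label immediately above each word can be read off the tree like this:

def leaf_labels(tree, parent=None):
    """Yield (word, label of the dominating node) pairs from a nested-list tree."""
    label, *children = tree
    if not children:              # a leaf is a single-element list holding the word
        yield label, parent
    else:
        for child in children:
            yield from leaf_labels(child, label)

# With the 'I opened the can.' tree above:
# list(leaf_labels(response.result(0).tree()))
# -> [('i', 'NP'), ('opened', 'V'), ('the', 'DET'), ('can', 'N'), ('.', 'PT')]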

goodmami commented 2 years ago

I will close this as it looks like the question is answered.

arademaker commented 2 years ago
from delphin import itsdb, tsql, derivation

ts = itsdb.TestSuite('../data/own-00')
dt, txt, ip = None, None, None
for row in tsql.select('i-id i-origin i-input i-comment p-input derivation tree', ts):
    dt, txt, ip = row[5], row[2], row[4]   # derivation, i-input, p-input
    break
print(txt)
print(ip)
print(dt)
d = derivation.from_string(dt)
print(d)
[t.to_dict() for t in d.terminals()]

Something seems abnormal here. In the profile I have organize_v1@v_np*_le (the lexical type appended), but in the derivation tree serialized after parsing, that information is lost.

organize anew, as after a setback; 

(1, 0, 1, <0:8>, 1, "organize", 0, "null", "VB" 1.0) (2, 1, 2, <9:13>, 1, "anew", 0, "null", "RB" 1.0) (3, 2, 3, <13:14>, 1, ",", 0, "null", "." 1.0) (4, 3, 4, <15:17>, 1, "as", 0, "null", "IN" 1.0) (5, 4, 5, <18:23>, 1, "after", 0, "null", "IN" 1.0) (6, 5, 6, <24:25>, 1, "a", 0, "null", "DT" 1.0) (7, 6, 7, <26:33>, 1, "setback", 0, "null", "NN" 1.0) (8, 7, 8, <33:34>, 1, ";", 0, "null", ":" 1.0)

(root_strict (2786 hd_imp_c 1.195202 0 8 (2785 hd-aj_scp-pr_c 1.937109 0 8 (2779 hd-aj_int-unsl_c 3.066571 0 3 (2777 hd_optcmp_c -0.155710 0 1 (2776 v_n3s-bse_ilr -0.618458 0 1 (281 organize_v1@v_np*_le 0.000000 0 1 ("organize" 146 "token [ +FORM \"organize\" +FROM \"0\" +TO \"8\" +ID *diff-list* [ LIST *cons* [ FIRST \"0\" REST *list* ] LAST *list* ] +TNT null_tnt [ +TAGS *null* +PRBS *null* +MAIN tnt_main [ +TAG \"VB\" +PRB \"1.0\" ] ] +CLASS alphabetic [ +CASE non_capitalized+lower +INITIAL + ] +....

root_strict (2786 hd_imp_c 1.1952 0 8 (2785 hd-aj_scp-pr_c 1.93711 0 8 (2779 hd-aj_int-unsl_c 3.06657 0 3 (2777 hd_optcmp_c -0.15571 0 1 (2776 v_n3s-bse_ilr -0.618458 0 1 (281 organize_v1 0 0 1 ("organize" 146 "token [ +FORM \"organize\" +FROM \"0\" +TO \"8\" +ID *diff-list* [ LIST *cons* [ FIRST \"0\" REST *list* ] LAST *list* ] +TNT null_tnt [ +TAGS *null* +PRBS *null* +MAIN tnt_main [ +TAG \"VB\" +PRB \"1.0\" ] ] +CLASS alphabetic [ +CASE non_capitalized+lower +INITIAL + ] +TRAIT token_trait [ +UW - +IT italics +LB bracket_null [ LIST *list* LAST *list* ] +RB bracket_null [ LIST *list* LAST *list* ] +LD bracket_null [ LIST *list* LAST *list* ] +RD bracket_null [ LIST *list* LAST *list* ] +HD token_head [ +TI \"<0:8>\" +LL ctype [ -CTYPE- string ] +TG string ] ] +PRED predsort +CARG \"organize\" +TICK + +ONSET c-or-v-onset ]")))) (2778 hd-pct_c 2.24032 1 3 (275 anew_adv1 0 1 2 ("anew" 142 "token [ +FORM \"anew\" +FROM \"9\" +TO \"13\" +ID *diff-list* [ LIST *cons* [ FIRST \"1\" REST *list* ] LAST *list* ] ...

[{'form': 'organize',
  'tokens': [{'id': 146,
    'tfs': 'token [ +FORM \\"organize\\" +FROM \\"0\\" +TO \\"8\\" +ID *diff-list* [ LIST *cons* [ FIRST \\"0\\" REST *list* ] LAST *list* ] +TNT null_tnt [ +TAGS *null* +PRBS *null* +MAIN tnt_main [ +TAG \\"VB\\" +PRB \\"1.0\\" ] ] +CLASS alphabetic [ +CASE non_capitalized+lower +INITIAL + ] +TRAIT token_trait [ +UW - +IT italics +LB bracket_null [ LIST *list* LAST *list* ] +RB bracket_null [ LIST *list* LAST *list* ] +LD bracket_null [ LIST *list* LAST *list* ] +RD bracket_null [ LIST *list* LAST *list* ] +HD token_head [ +TI \\"<0:8>\\" +LL ctype [ -CTYPE- string ] +TG string ] ] +PRED predsort +CARG \\"organize\\" +TICK + +ONSET c-or-v-onset ]'}]},...
goodmami commented 2 years ago

It's not abnormal, but it is under-documented. By default, the string form of a Derivation object (as you'd get from print(d)) calls UDFNode.to_udf() and disables indentation:

https://github.com/delph-in/pydelphin/blob/77fc995f1208a06c9eafd18c2218851fb45d8eac/delphin/derivation.py#L70-L71

What you want is UDFNode.to_udx():

>>> from delphin import derivation
>>> # NOTE: I removed the token strings below to make it simpler
>>> d = derivation.from_string("""(598 np_frg_c -0.031509 0 2 (597 hdn_bnp_c -0.908638 0 2 (596 n-hdn_cpd-pl_c 0.594654 0 2 (593 n_pl_olr -0.107961 0 1 (72 dog_n1@n_-_c_le 0.000000 0 1 ("dogs" 49 50))) (595 w_period_plr 1.262408 1 2 (594 n_ms-cnt_ilr 0.837684 1 2 (62 sleep_n1@n_-_mc_le 0.001044 1 2 ("sleep." 47 48)))))))""")
>>> str(d) == d.to_udf(indent=None)
True
>>> print(d.to_udx())
(598 np_frg_c -0.031509 0 2
 (597 hdn_bnp_c -0.908638 0 2
  (596 n-hdn_cpd-pl_c 0.594654 0 2
   (593 n_pl_olr -0.107961 0 1
    (72 dog_n1@n_-_c_le 0 0 1
     ("dogs")))
   (595 w_period_plr 1.26241 1 2
    (594 n_ms-cnt_ilr 0.837684 1 2
     (62 sleep_n1@n_-_mc_le 0.001044 1 2
      ("sleep.")))))))
arademaker commented 2 years ago

UDF vs UDX, now I get the difference. I will try to find out more about them. I just discovered the preterminals() method; what is the difference between terminals and preterminals? Where can I read about them?

oepen commented 2 years ago

On 6 Aug 2022, at 14:38, Alexandre Rademaker wrote:

UDF vs UDX, now I got the difference.

https://github.com/delph-in/docs/wiki/ItsdbDerivations

arademaker commented 2 years ago

Thank you @oepen. I took the opportunity to make some updates to the wiki; see https://github.com/delph-in/docs/wiki/ItsdbDerivations/_compare/b8be80ea45522def4f254a60ba83dcd66ceea626...713f1f50aa6c13808c5e9af528ec1d32c4cccc7e and check whether you agree.

arademaker commented 2 years ago

Hi @goodmami may I ask a few more things?

  1. The to_dict method of UDX objects gives me a dictionary. In particular, if I consult only the preterminals, I get the field tokens for each preterminal. Why is it named tokens instead of token (singular)? I was trying to find a situation where one or more tokens were grouped in a preterminal, but I didn't find it. At least, look up in 'I look up the can.' is not an example.

  2. The response object gives me the tokens via response.tokens(tokenset='initial').tokens; what is the difference between the internal and initial tokensets?

  3. The derivation of one particular result from a response contains the tokens as its terminals, right? So can I expect a 1-1 correspondence between response.tokens and response.result(0).derivation().terminals()?

oepen commented 2 years ago

I was trying to find a situation where one or more tokens were grouped in a preterminal, but I didn't find it.

if i recall correctly, the preterminals correspond to (instantiated) lexical entries; a multi-word entry like e.g. ad hoc will have two tokens as its terminal daughters.
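
To make this concrete in PyDelphin (a sketch, assuming d is a Derivation parsed as above for an input containing a multi-word entry such as ad hoc): the multi-word entry surfaces as one preterminal whose single terminal carries more than one token record.

for pre in d.preterminals():
    terminal = pre.terminals()[0]     # a preterminal dominates one terminal string
    print(pre.entity, repr(terminal.form), len(terminal.tokens))
# a multi-word entry would print something like:  ad_hoc_a1 'ad hoc' 2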

what is the difference between the internal and initial tokensets?

https://github.com/delph-in/docs/wiki/ErgTokenization_ComplexExample

arademaker commented 2 years ago

Oh, thank you @oepen!! Of course, the difficult question is what will be considered a multi-word entry. I confirmed your example of ad hoc; PyDelphin's derivation.preterminals() gives me:

...
{'form': 'ad hoc',
  'tokens': [{'id': 123,
    'tfs': 'token [ +FORM \\"ad\\" +FROM \\"11\\" +TO \\"13\\" +ID *diff-list* [ LIST *cons* [ FIRST \\"3\\" REST *list* ] LAST *list* ] +TNT null_tnt [ +TAGS *null* +PRBS *null* +MAIN tnt_main [ +TAG \\"FW\\" +PRB \\"1.0\\" ] ] +CLASS alphabetic [ +CASE non_capitalized+lower +INITIAL - ] +TRAIT token_trait [ +UW - +IT italics +LB bracket_null [ LIST *list* LAST *list* ] +RB bracket_null [ LIST *list* LAST *list* ] +LD bracket_null [ LIST *list* LAST *list* ] +RD bracket_null [ LIST *list* LAST *list* ] +HD token_head [ +TI \\"<11:13>\\" +LL ctype [ -CTYPE- string ] +TG string ] ] +PRED predsort +CARG \\"ad\\" +TICK + +ONSET c-or-v-onset ]'},
   {'id': 125,
    'tfs': 'token [ +FORM \\"hoc\\" +FROM \\"14\\" +TO \\"17\\" +ID *diff-list* [ LIST *cons* [ FIRST \\"4\\" REST *list* ] LAST *list* ] +TNT null_tnt [ +TAGS *null* +PRBS *null* +MAIN tnt_main [ +TAG \\"FW\\" +PRB \\"1.0\\" ] ] +CLASS alphabetic [ +CASE non_capitalized+lower +INITIAL - ] +TRAIT token_trait [ +UW - +IT italics +LB bracket_null [ LIST *list* LAST *list* ] +RB bracket_null [ LIST *list* LAST *list* ] +LD bracket_null [ LIST *list* LAST *list* ] +RD bracket_null [ LIST *list* LAST *list* ] +HD token_head [ +TI \\"<14:17>\\" +LL ctype [ -CTYPE- string ] +TG string ] ] +PRED predsort +CARG \\"hoc\\" +TICK + +ONSET c-or-v-onset ]'}]},
...

In the lexicon, this seems to correspond to entries whose ORTH feature is a list with more than one string:

ad_hoc_a1 := aj_-_i_le &
 [ ORTH < "ad", "hoc" >,
   SYNSEM [ LKEYS.KEYREL.PRED "_ad+hoc_a_1_rel",
            PHON.ONSET voc ] ].

BTW, the entry above that in the lexicon.tdl of ERG is

2001_a_space_odyssey_n1 := n_-_pn_le &
 [ ORTH < "2001", "A", "Space", "Odyssey" >,
   SYNSEM [ LKEYS.KEYREL.CARG "2001_a_space_odyssey",
            PHON.ONSET con ] ].

and I confirmed that the third analysis returned by the ERG makes 2001 A Space Odyssey one entry with many tokens.

{'form': '2001 a space odyssey',
  'tokens': [{'id': 195,
    'tfs': 'token [ +FORM \\"2001\\" +FROM \\"0\\" +TO \\"4\\" +ID *diff-list* [ LIST *cons* [ FIRST \\"0\\" REST *list* ] LAST *list* ] +TNT null_tnt [ +TAGS *null* +PRBS *null* +MAIN tnt_main [ +TAG string +PRB string ] ] +CLASS card_or_year_ne [ +INITIAL + ] +TRAIT token_trait [ +UW - +IT italics +LB bracket_null [ LIST *list* LAST *list* ] +RB bracket_null [ LIST *list* LAST *list* ] +LD bracket_null [ LIST *list* LAST *list* ] +RD bracket_null [ LIST *list* LAST *list* ] +HD token_head [ +TI \\"<0:4>\\" +LL ctype [ -CTYPE- string ] +TG string ] ] +PRED predsort +CARG \\"2001\\" +TICK + +ONSET c-onset ]'},
   {'id': 193,
    'tfs': 'token [ +FORM \\"a\\" +FROM \\"5\\" +TO \\"6\\" +ID *diff-list* [ LIST *cons* [ FIRST \\"1\\" REST *list* ] LAST *list* ] +TNT null_tnt [ +TAGS *null* +PRBS *null* +MAIN tnt_main [ +TAG \\"DT\\" +PRB \\"1.0\\" ] ] +CLASS alphabetic [ +CASE capitalized+non_mixed +INITIAL - ] +TRAIT token_trait [ +UW - +IT italics +LB bracket_null [ LIST *list* LAST *list* ] +RB bracket_null [ LIST *list* LAST *list* ] +LD bracket_null [ LIST *list* LAST *list* ] +RD bracket_null [ LIST *list* LAST *list* ] +HD token_head [ +TI \\"<5:6>\\" +LL ctype [ -CTYPE- string ] +TG string ] ] +PRED predsort +CARG \\"A\\" +TICK + +ONSET c-or-v-onset ]'},
   {'id': 189,
    'tfs': 'token [ +FORM \\"space\\" +FROM \\"7\\" +TO \\"12\\" +ID *diff-list* [ LIST *cons* [ FIRST \\"2\\" REST *list* ] LAST *list* ] +TNT null_tnt [ +TAGS *null* +PRBS *null* +MAIN tnt_main [ +TAG \\"NNP\\" +PRB \\"1.0\\" ] ] +CLASS alphabetic [ +CASE capitalized+lower +INITIAL - ] +TRAIT token_trait [ +UW - +IT italics +LB bracket_null [ LIST *list* LAST *list* ] +RB bracket_null [ LIST *list* LAST *list* ] +LD bracket_null [ LIST *list* LAST *list* ] +RD bracket_null [ LIST *list* LAST *list* ] +HD token_head [ +TI \\"<7:12>\\" +LL ctype [ -CTYPE- string ] +TG string ] ] +PRED predsort +CARG \\"Space\\" +TICK + +ONSET c-or-v-onset ]'},
   {'id': 191,
    'tfs': 'token [ +FORM \\"odyssey\\" +FROM \\"13\\" +TO \\"20\\" +ID *diff-list* [ LIST *cons* [ FIRST \\"3\\" REST *list* ] LAST *list* ] +TNT null_tnt [ +TAGS *null* +PRBS *null* +MAIN tnt_main [ +TAG \\"NNP\\" +PRB \\"1.0\\" ] ] +CLASS alphabetic [ +CASE capitalized+lower +INITIAL - ] +TRAIT token_trait [ +UW - +IT italics +LB bracket_null [ LIST *list* LAST *list* ] +RB bracket_null [ LIST *list* LAST *list* ] +LD bracket_null [ LIST *list* LAST *list* ] +RD bracket_null [ LIST *list* LAST *list* ] +HD token_head [ +TI \\"<13:20>\\" +LL ctype [ -CTYPE- string ] +TG string ] ] +PRED predsort +CARG \\"Odyssey\\" +TICK + +ONSET c-or-v-onset ]'}]}

So it seems that multi-token entries are only for the words-with-spaces case, as explained in

Sag, Ivan A., Timothy Baldwin, Francis Bond, Ann Copestake, and Dan Flickinger. "Multiword expressions: A pain in the neck for NLP." In International conference on intelligent text processing and computational linguistics, pp. 1-15. Springer, Berlin, Heidelberg, 2002.

arademaker commented 2 years ago

For the record, regarding the questions above to @goodmami:

  1. a terminal node in the derivation tree can correspond to one or more tokens in the input (stages b and c from https://github.com/delph-in/docs/wiki/ErgTokenization_ComplexExample).

  2. From https://github.com/delph-in/docs/wiki/ErgTokenization_ComplexExample, we have 3 levels of processing. Unfortunately, https://github.com/delph-in/docs/wiki/ErgTokenization is incomplete. I didn't get why:

The parser-internal token mapping phase seeks to rewrite the initial tokens into a form that meets the ERG-internal assumptions about tokenization.

Aren't the REPP rules already adapted to the internal grammar needs? But I gather that at this stage a token may have been instantiated in several ways. In the fragment The cat., the token cat corresponds to 3 different analyses.

Note that only level (a) has a 'flat' form, i.e., forms a single sequence of tokens, whereas levels (b) and (c) will typically take the form of a lattice, i.e., admitting token-level ambiguity.

>>> response.tokens().tokens
[YYToken(id=49, start=1, end=2, lnk=<Lnk object <4:7> at 4973575568>, paths=[1], form='cat', surface=None, ipos=0, lrules=['null'], pos=[]),
 YYToken(id=52, start=2, end=3, lnk=<Lnk object <7:8> at 4973576096>, paths=[1], form='.', surface=None, ipos=0, lrules=['null'], pos=[]),
 YYToken(id=53, start=2, end=3, lnk=<Lnk object <7:8> at 4973575328>, paths=[1], form='.', surface=None, ipos=0, lrules=['null'], pos=[]),
 YYToken(id=54, start=1, end=2, lnk=<Lnk object <4:7> at 4972354048>, paths=[1], form='cat', surface=None, ipos=0, lrules=['null'], pos=[]),
 YYToken(id=55, start=1, end=2, lnk=<Lnk object <4:7> at 4972354528>, paths=[1], form='cat', surface=None, ipos=0, lrules=['null'], pos=[]),
 YYToken(id=56, start=0, end=1, lnk=<Lnk object <0:3> at 4907928160>, paths=[1], form='the', surface=None, ipos=0, lrules=['null'], pos=[]),
 YYToken(id=57, start=0, end=1, lnk=<Lnk object <0:3> at 4949731984>, paths=[1], form='the', surface=None, ipos=0, lrules=['null'], pos=[])]

Unfortunately, I didn't understand how to get anything more from the tokens not used in the derivation (in this case, 49 and 55). Token 54 was used, and its feature structure is attached to the only possible derivation tree for this fragment. But the ORTH feature from the lexical entry was not preserved in any place, right? It would be nice to have it; that would give us the lemma, right?

I learned a lot by calling ACE with

ace -g erg.dat --udx --tsdb-stdout

and inspecting the output.

So the token set initial is populated with the value of :p-input in the ACE --tsdb-stdout format. The set internal is the value of :p-tokens. The only link between the entries in these sets is the <from:to> span. I see one entry in :p-tokens such as:

(54, 1, 2, <4:7>, 1, "cat", 0, "null")

that corresponds to the above:

 YYToken(id=54, start=1, end=2, lnk=<Lnk object <4:7> at 4972354048>, paths=[1], form='cat', surface=None, ipos=0, lrules=['null'], pos=[]),
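
For what it's worth, a small sketch of using that link (assumptions: the Lnk objects carry character spans, and token mapping happened to preserve those spans; tokens whose spans changed will simply get None):

# Index the initial tokens (which carry the TNT tags) by their <from:to> span;
# .data of a character-span Lnk is a (cfrom, cto) pair.
initial_by_span = {tok.lnk.data: tok
                   for tok in response.tokens(tokenset='initial').tokens}

for tok in response.tokens().tokens:     # the internal set, by default
    match = initial_by_span.get(tok.lnk.data)
    print(tok.id, tok.form, match.pos if match else None)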

What documentation describes this structure? Consulting delphin/tokens.py, the last field is the lexical rules, but how can it be populated? In https://pydelphin.readthedocs.io/en/latest/api/delphin.tokens.html, @goodmami doesn't seem to have an answer for that.

  3. For that, the internal tokens (the :p-tokens) are referenced in the derivations, but the :p-input tokens are not. However, the information from the initial tokens is copied into the feature structures of the internal tokens; I see the TNT tag in tnt_main (a small sketch of reading it back follows).
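
A sketch of reading that tag back from the token feature structures attached to a derivation's terminals (my assumption: the tfs strings keep the escaped inner quotes shown above, e.g. +TAG \"VB\"; unquoted values such as +TAG string simply yield no match):

import re

TAG_RE = re.compile(r'\+TAG \\"([^\\"]+)\\"')   # matches e.g. +TAG \"VB\"

for terminal in d.terminals():          # d as parsed from the profile above
    for token in terminal.tokens:
        m = TAG_RE.search(token.tfs or '')
        print(terminal.form, token.id, m.group(1) if m else None)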

So, I still have some questions... but I am learning a lot about this low-level processing. I am curious to see how the LKB handles all of this... maybe @john-a-carroll has some pointers. Is there any way to get the tsdb output in the LKB console from a result?

oepen commented 2 years ago

Aren't the REPP rules already adapted to the internal grammar needs?

the distinction between initial and internal tokenization – as created by REPP and token mapping, respectively – was motivated by a desire for interoperability with third-party tools, e.g. PoS taggers and statistical parsers trained on the venerable PTB. from around 2005 to 2020 or so, dan and i analyzed punctuation as pseudo-affixes, creating a major discrepancy between the initial and internal tokenization universes.

dan has more recently reworked the treatment of punctuation, but some discrepancies remain (for good reasons). also, there is an important formal discrepancy: initial tokenization is a plain sequence, whereas internal tokenization is a lattice. we are fairly convinced we need the ability to introduce token-level ambiguity prior to lexical instantiation, but for invoking a third-party PoS tagger or calling out to a statistical parser it is almost a prerequisite to constrain initial tokenization to a sequence.

But the ORTH feature from the lexical entry was not preserved in any place, right? It would be nice to have it; that would give us the lemma, right?

the ORTH feature is an element of the feature structure associated with the derivation, hence it is preserved (if you will) there. it is not recorded in the derivation proper, as it would be redundant.

i can see that some redundancy at times could be convenient. in my own work with post-processing parsing results, e.g. conversion to bi-lexical dependencies in the SDP and MRP campaigns, i typically had to consider the full context, i.e. actually reconstruct the derivation and probe its feature structure, plus look at the original initial tokenization objects, i.e. the p-input field in the profile. for all i know, reconstruction of derivations is currently only supported in Lisp code :-).

So the token set initial is populated with the value of :p-input in the ACE --tsdb-stdout format. The set internal is the value of :p-tokens. The only link between the entries in these sets is the <from:to> span.

there is an additional link, provided that the token mapping rules are tight: the +ID list on internal token objects references initial tokens by the YY identifier (see below). however, i vaguely recall that there are token mapping corner cases where we may end up with incomplete or otherwise imperfect +ID lists, maybe when token mapping splits up one initial token. from what i remember, that may have been the reason that i went back to the original initial tokens list when post-processing ERG derivations …

What documentation describes this structure?

there is something, including history that is no longer in active use, on this page:

https://github.com/delph-in/docs/wiki/PetInput

goodmami commented 1 year ago

@arademaker I stepped back from the conversation between you and Stephan, but I hope I have not dropped the ball somewhere. Do you have any remaining questions?

goodmami commented 1 year ago

I'm guessing the questions have been answered and I'll close this issue.