GrammaticalFramework / gf-ud

Functions to analyse and manipulate dependency trees, as well as conversions between GF and dependency trees. The main use case is UD (Universal Dependencies), but the code is designed to be completely generic as for annotation scheme. This repository replaces the old gf-contrib/ud2gf code. It is also meant to be used in the 'vd' command of GF and replace the supporting code in gf-core in the future.

Feature request: #auxfun macros (and other #funs too if feasible?) to distinguish word order #23

Open inariksit opened 2 years ago

inariksit commented 2 years ago

Current behaviour: phrases like "Section 10" (apposition) and "10 sections" are treated identically.
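To make the missing information concrete: assuming both phrases get the same relation between the noun and the numeral (an assumption here, the exact label does not matter), the only thing that separates them is which side of the head the dependent sits on. A minimal Python sketch, with a hypothetical helper name and 1-based token positions as in CoNLL-U:

```python
def dependent_side(head_pos: int, dep_pos: int) -> str:
    """Return 'left' or 'right' depending on where the dependent sits relative to its head."""
    if dep_pos == head_pos:
        raise ValueError("a token cannot depend on itself")
    return "left" if dep_pos < head_pos else "right"

# "Section 10": head "Section" is token 1, dependent "10" is token 2.
section_10 = dependent_side(head_pos=1, dep_pos=2)
# "10 sections": dependent "10" is token 1, head "sections" is token 2.
ten_sections = dependent_side(head_pos=2, dep_pos=1)
```

A word-order-aware macro would need access to roughly this left/right distinction.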

inariksit commented 2 years ago

Another issue is the attachment of modifiers. Take a phrase like

"each portion of a building separated by walls"

In dt, I get these two options:

#1
AdjCN
    ( AdvCN ( UseN portion_N )
        ( PrepNP of_Prep
            ( DetCN ( DetQuant IndefArt NumSg ) ( UseN building_N ) )
        )
    )
    ( PassVAgent separate_V
        ( DetCN (DetQuant IndefArt NumPl)  ( UseN wall_N ) )
    ): CN[2,3,4,5,6,8]

#LIN: "portion of a building separated by walls"

#2
AdvNP
    ( DetCN each_Det
        ( AdjCN ( UseN portion_N )
            ( PassVAgent separate_V
                ( DetCN (DetQuant IndefArt NumPl)  ( UseN wall_N ) )
            )
        )
    )
    ( PrepNP of_Prep
        ( DetCN ( DetQuant IndefArt NumSg ) ( UseN building_N ) )
    ): NP[1,2,3,4,5,6,8]
#LIN: "portion separated by walls of a building"

However, dt doesn't contain the NP version of #1, which would simply be DetCN each_Det applied to that tree. I wonder if some pruning step removes the NP version of #1 because it covers as many words as #2? (I tried to run the example without pruneDevTree, but this particular sentence is very long and the program was taking a long time. If you think that might be the reason, I can produce a shorter version of the sentence and try again.)

In any case, I can only imagine that the NP version of #1 is also constructed, but thrown away before it can be prioritised. And I would like to prioritise it, because its attachment matches the word order: both "building" and "walls" are descendants of "portion", but in #1 "building", which is closer to "portion" in the string, is attached first.

inariksit commented 2 years ago

I can solve this particular case with an #auxfun that says: whenever a NOUN has both an acl and an nmod child, attach nmod before acl. But this doesn't scale well.

With an explicit DISTANCE=-1* or similar, I could duplicate that rule to say that whatever is closer to the head in the original word order gets attached first in the tree. This is tedious, but finite: there is a finite number of relations, and a finite number of combinations that appear together in real-life texts.

Could one make a more fundamental change in the algorithm that wouldn't require explicit instructions about word order? For example, ranking higher those trees whose subtrees are attached in order of their distance from the head in the original string. I don't know whether this is feasible at all, or whether it would require too much rewriting. I can get by with auxfuns, just thinking aloud here.
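The ranking idea could look roughly like this. It is only a sketch in Python, not gf-ud code: the function name and the representation of a candidate tree as a list of dependent positions (innermost attachment first) are assumptions made for illustration. The penalty counts pairs attached in the "wrong" order, i.e. a farther dependent attached before a closer one, so lower is better.

```python
from itertools import combinations

def attachment_penalty(head_pos, attach_order):
    """Count pairs where a dependent farther from the head is attached
    before (i.e. deeper than) a dependent closer to the head."""
    dist = [abs(p - head_pos) for p in attach_order]
    return sum(1 for i, j in combinations(range(len(dist)), 2) if dist[i] > dist[j])

# Head "portion" is token 2; "building" (nmod) is token 5, "separated" (acl) is token 6.
tree1 = [5, 6]  # nmod attached first, then acl (as in tree #1)
tree2 = [6, 5]  # acl attached first, then nmod (as in tree #2)

# Candidates with fewer order violations come first.
ranked = sorted([tree2, tree1], key=lambda t: attachment_penalty(2, t))
```

With this scoring, the tree #1 ordering gets penalty 0 and the tree #2 ordering gets penalty 1, so #1 would be preferred without any relation-specific rule.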

inariksit commented 2 years ago

Here's a conllu file to test with

1   Each    each    DET DT  _   2   det _   _
2   portion portion NOUN    NN  Number=Sing 10  nsubj   _   _
3   of  of  ADP IN  _   5   case    _   _
4   a   a   DET DT  Definite=Ind|PronType=Art   5   det _   _
5   building    building    NOUN    NN  Number=Sing 2   nmod    _   _
6   separated   separate    VERB    VBN Tense=Past|VerbForm=Part    2   acl _   _
7   by  by  ADP IN  _   8   case    _   _
8   walls   wall    NOUN    NNS Number=Plur 6   obl _   _
9   is  be  AUX VBZ Mood=Ind|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin   10  cop _   _
10  separate    separate    ADJ JJ  Degree=Pos  0   root    _   SpacesAfter=\n
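To check which attachment order this file implies, the dependents of "portion" can be listed by distance from the head. A small Python sketch, assuming whitespace-separated columns as pasted above (real CoNLL-U uses tabs) and including only the rows relevant to "portion"; the helper name is made up for this example:

```python
ROWS = """\
1 Each each DET DT _ 2 det _ _
2 portion portion NOUN NN Number=Sing 10 nsubj _ _
5 building building NOUN NN Number=Sing 2 nmod _ _
6 separated separate VERB VBN Tense=Past|VerbForm=Part 2 acl _ _
"""

def dependents_by_distance(conllu_text, head_id):
    """Return (distance, form, deprel) for each dependent of head_id, closest first."""
    deps = []
    for line in conllu_text.splitlines():
        cols = line.split()
        if line.startswith("#") or len(cols) != 10:
            continue  # skip comments, blank lines and malformed rows
        tok_id, form, head, deprel = int(cols[0]), cols[1], int(cols[6]), cols[7]
        if head == head_id:
            deps.append((abs(tok_id - head_id), form, deprel))
    return sorted(deps)

order = dependents_by_distance(ROWS, head_id=2)
```

Here `order` lists det, then nmod, then acl, which is the attachment order of tree #1 with each_Det applied on top.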