English PTB to UD 2.0

arademaker commented 4 years ago

Does anyone know what the best approach is to convert a treebank in PTB format to UD 2.0? I found the page https://nlp.stanford.edu/software/stanford-dependencies.html, but it is not clear if the code supports UD 2.0. Suggestions are welcome.

AngledLuffa commented 5 months ago

I'll follow up with my PI on why our current ccomp rules aren't catching begin -> notes in the above sentence. Aside from that, a lot of the low hanging fruit is now cleaned up. Please let me know if/when there's a standard on the numbered lists, since that created a bunch of validation errors in the converter

amir-zeldes commented 5 months ago

I'm willing to update the validator unless @amir-zeldes strongly objects

I'm not sure I follow - you want to add "coulda" etc. as allowed aux to the validator just to help accommodate PTB conversion outputs? Or do you want to make the guidelines say that "coulda" should not be tokenized?

I don't feel very strongly about the first, but I think the second is wrong - it should be tokenized apart just like "gonna" & co., as we've been doing this kind of splitting in UD English for a variety of cases.

As for the first, perhaps it's a bit misleading if the guidelines say that splitting is right - shouldn't the validator serve to warn users that an output is not what UD expects?

nschneid commented 5 months ago

We have been following Penn tokenization for "gonna" etc. right? We can quibble with Penn tokenization policies but they have the advantage of being established, and I worry that making exceptions for rare special cases could lead to surprising/annoying misalignments.

nschneid commented 5 months ago

Please let me know if/when there's a standard on the numbered lists, since that created a bunch of validation errors in the converter

It may not be finalized for a few weeks, but Chris has access to the draft proposal—you can ask him what would be the cleanest way to tweak the rules to pass validation while moving in the direction things are headed (or if he thinks that direction is wrong).

AngledLuffa commented 5 months ago

As for the first, perhaps it's a bit misleading if the guidelines say that splitting is right - shouldn't the validator serve to warn users that an output is not what UD expects?

This seems like a reasonable argument. It's just a situation with another unfixable validator error after using the converter on PTB. I don't mind, since there are a lot of unfixable errors at this point anyway

It may not be finalized for a few weeks, but Chris has access to the draft proposal—you can ask him what would be the cleanest way to tweak the rules to pass validation while moving in the direction things are headed (or if he thinks that direction is wrong).

I would expect the tag changes to lists won't require more linguistic knowledge than I have to implement - what's the draft proposal look like?

amir-zeldes commented 5 months ago

We have been following Penn tokenization for "gonna" etc. right

Indeed, and they segment "gonna" into "gon" + "na", as we do as well (15/15 times), so I think colloquial contractions like this should generally be broken up.

AngledLuffa commented 5 months ago

Indeed, and they segment "gonna" into "gon" + "na", as we do as well (15/15 times), so I think colloquial contractions like this should generally be broken up.

Agreed on that, but if a user gives the converter a tree with one of these contractions as a single word, I think it would be incorrect for the converter to split it for them. Similar to my current belief that it should return the same XPOS the user gives (via the tree) and the UPOS should correspond to the XPOS, even if that means the dependencies created violate the validator's rules about UPOS.

So basically there's a whole bunch of errors the validator will flag on the output of the converter when given PTB unless we start editing the input trees in ways which users would find surprising
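To make the "UPOS should correspond to the XPOS" idea concrete, here is a minimal sketch of a default tag mapping. The table and the name `default_upos` are illustrative assumptions, not the converter's actual rules. A purely tag-based default cannot reproduce lexical exceptions, such as a copula tagged VBZ that EWT labels AUX, and that gap is one source of the validator complaints discussed here.

```python
# Hypothetical default PTB XPOS -> UPOS table (illustrative and incomplete).
DEFAULT_UPOS = {
    "NN": "NOUN", "NNS": "NOUN", "NNP": "PROPN", "NNPS": "PROPN",
    "VB": "VERB", "VBD": "VERB", "VBG": "VERB", "VBN": "VERB",
    "VBP": "VERB", "VBZ": "VERB", "MD": "AUX",
    "DT": "DET", "PDT": "DET", "CD": "NUM",
    "JJ": "ADJ", "JJR": "ADJ", "JJS": "ADJ",
    "RB": "ADV", "IN": "ADP", "PRP": "PRON",
}


def default_upos(xpos):
    """Return the default UPOS for a PTB tag, or "X" when the tag is unknown."""
    return DEFAULT_UPOS.get(xpos, "X")


# The XPOS from the input tree is kept as given; only the UPOS is derived.
for xpos in ("MD", "PDT", "VBZ"):
    print(xpos, default_upos(xpos))
```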

amir-zeldes commented 5 months ago

Yeah, that all sounds right. In terms of silencing the validator about the results of that, I wouldn't lose too much sleep over it if you want to do it, but I sort of find it right if the validator throws a warning, since the output indeed does not correspond to the recommended UD English standard, so users should be warned.

dan-zeman commented 5 months ago

Indeed, and they segment "gonna" into "gon" + "na", as we do as well (15/15 times), so I think colloquial contractions like this should generally be broken up.

Agreed on that, but if a user gives the converter a tree with one of these contractions as a single word, I think it would be incorrect for the converter to split it for them. Similar to my current belief that it should return the same XPOS the user gives (via the tree) and the UPOS should correspond to the XPOS, even if that means the dependencies created violate the validator's rules about UPOS.

So basically there's a whole bunch of errors the validator will flag on the output of the converter when given PTB unless we start editing the input trees in ways which users would find surprising

Seems like we have different expectations of what "a converter to UD" is. FWIW, in my converter of Czech PDT to UD, I want the output to be as good/valid UD as possible given the input. I am not trying to output something that will be as close as possible to the original PDT, just with some UD labeling in places where it does not hurt feelings of users who live outside UD.

AngledLuffa commented 5 months ago

I want the output to be as good/valid UD as possible given the input

Well, we do have a PTB Correcting script which fixes up a bunch of known errors (mostly tags, but could include retokenizing mighta or whatever) in PTB. I could envision connecting that to the converter as an optional feature.

AngledLuffa commented 5 months ago

ps. not that I was offended by your wording, but "hurt feelings" or unmatched expectations tend to express themselves in the form of github issues

nschneid commented 5 months ago

I want the output to be as good/valid UD as possible given the input

Well, we do have a PTB Correcting script which fixes up a bunch of known errors (mostly tags, but could include retokenizing mighta or whatever) in PTB. I could envision connecting that to the converter as an optional feature.

How well-established is this convention in Penn data? Is it just a one-off where they neglected to tokenize "mighta" or is it a repeated thing? I couldn't find other tokens in OntoNotes but not sure if I was searching correctly. If it only comes up once or twice in the Penn trees then I don't think UD should necessarily enact a policy just to accommodate that. But if it's a clear policy of PTB then we should be prepared to either convert or accommodate that tokenization in UD.

sylvainkahane commented 5 months ago

Indeed, and they segment "gonna" into "gon" + "na", as we do as well (15/15 times), so I think colloquial contractions like this should generally be broken up.

Segmenting "gonna" into "gon" + "na" has to be justified. We have already discussed this in #1006. If we look at all the realisations of the lexeme TO (lemma=to) in UD_English-GUM (and if we exclude orthographic variations), we have 4 realisations (gon-na, ought-a, got-ta). It would probably be more justified to consider that TO has only two allomorphs, to and a.

But the real problem is that we don't have any criterion to decide how to segment this kind of words in UD and how to choose the form of their parts.

amir-zeldes commented 5 months ago

It would probably be more justified to consider that TO has only two allomorphs, to and a

Maybe - it wouldn't have been hard to do "gonn a" if we were doing this from scratch, but since Penn corpora already went with "gon na", I don't mind having a third form too much. At least we're consistent with other English corpora this way.

nschneid commented 5 months ago

Plus, as a practical matter, less risk of POS taggers mislabeling the "a" as a determiner!

AngledLuffa commented 5 months ago

could always split it as migh ta woul da etc or maybe that just looks really bad

splitting it as might have seems like a violation of the general principle that we leave pieces in such a form that they get combined to rebuild the original text. although we only generally do that in English, whereas other languages with MWT have their own split schemes that often leave the pieces different from the original
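The principle that the pieces should concatenate back to the original text can be sketched as a small lookup with a round-trip check. The table below is a hypothetical illustration: it only contains the three splits mentioned in this thread (gon-na, got-ta, ought-a), and `split_contraction` is an assumed name, not part of any converter.

```python
# Hypothetical split table, limited to the segmentations named in the thread.
SPLITS = {
    "gonna": ("gon", "na"),
    "gotta": ("got", "ta"),
    "oughta": ("ought", "a"),
}


def split_contraction(token):
    """Split a known contraction while keeping the original casing.

    Unknown tokens (for example "mighta", whose split is still debated)
    are returned unchanged as a single-element list.
    """
    pieces = SPLITS.get(token.lower())
    if pieces is None:
        return [token]
    cut = len(pieces[0])
    parts = [token[:cut], token[cut:]]
    # Round-trip check: the pieces must rebuild the surface form exactly.
    assert "".join(parts) == token
    return parts


print(split_contraction("Gonna"))
print(split_contraction("mighta"))
```

Slicing the surface token, rather than emitting the table entries, is what guarantees the round trip even for capitalized input.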

AngledLuffa commented 4 months ago

i spent longer than necessary fixing up ccomp relations (and therefore the associated UD converter errors) in sentences such as

"Working on this issue is a pain," complained AngledLuffa

ccomp(complained, pain)

One sentence that still goes wonky from PTB is the following:

( (S
    (S-TPC-2
      (NP-SBJ
        (NP (DT Those) )
        (VP (VBN employed)
          (NP (-NONE- *) )
          (PP-LOC-CLR (IN in)
            (NP (JJ state-funded) (JJ special) (NNS programs) ))))
      (VP (VBN increased)
        (PP-EXT (IN by)
          (NP (CD 7,400) ))
        (PP-DIR (TO to)
          (NP (CD 65,200) ))
        (PP-TMP (IN in)
          (NP (DT the) (JJ same) (NN period) )))
      (, ,) )
    (NP-SBJ (DT the) (NNP Directorate) )
    (VP (VBD said)
      (SBAR (-NONE- 0)
        (S (-NONE- *T*-2) )))
    (. .) ))

In this case, I believe increased is mistagged and should be VBD, since it is something that actively happened rather than the participle use. Does that sound right?

I can make a new release of CoreNLP which greatly reduces the number of errors in converted PTB once I wrap up this tiny change, but completely eliminating them with a deterministic converter is optimistic

AngledLuffa commented 4 months ago

... ultimately I don't see a difference in the verb usage in the following sentences, but I'm happy to be told how to count the angels dancing on this pin:

It adopted_VBD a takeover plan ...
He collaborated_VBD with ...
He launched_VBD into ...

Most yields ... moved_VBN in the opposite direction
The board increased_VBN by one
Those employed in state-funded special programs increased_VBN by ...
The dollar gained_VBN against most foreign currencies

Compare to the following, which is a condition or something the NP had done to it rather than something the NP did:

With Japan's cash-flush banks aligned_VBN ...
... had their budgets cut_VBN in half ...
... more than 6.6 million ADRs traded_VBN

nschneid commented 4 months ago

Most yields ... moved_VBN in the opposite direction
The board increased_VBN by one
Those employed in state-funded special programs increased_VBN by ...
The dollar gained_VBN against most foreign currencies

These should all be VBD. PTB has a lot of tagger errors that annotators missed.
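One way to look for more of these is to list every VBN that heads the VP of a clause with an overt subject and no auxiliary above it. The following is a rough sketch under that assumption, not the PTB correcting script mentioned earlier; `parse` and `vbn_candidates` are hypothetical helpers, and the tree is a trimmed copy of the one quoted above. Small clauses such as "had their budgets cut_VBN" would also match, so the output is a review list rather than an automatic fix.

```python
import re


def parse(text):
    """Parse one PTB bracketed tree into nested [label, child, ...] lists."""
    tokens = re.findall(r"\(|\)|[^\s()]+", text)
    pos = 0

    def node():
        nonlocal pos
        pos += 1  # consume "("
        label = ""
        if tokens[pos] not in ("(", ")"):
            label = tokens[pos]
            pos += 1
        children = []
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                children.append(node())
            else:
                children.append(tokens[pos])  # leaf word
                pos += 1
        pos += 1  # consume ")"
        return [label] + children

    return node()


def is_preterminal(node):
    return isinstance(node, list) and len(node) == 2 and isinstance(node[1], str)


def vbn_candidates(node):
    """Yield VBN words heading the VP of an S that has an NP-SBJ child."""
    if not isinstance(node, list):
        return
    label, children = node[0], node[1:]
    if label.split("-")[0] == "S":
        has_subject = any(
            isinstance(c, list) and c[0].startswith("NP-SBJ") for c in children
        )
        for c in children:
            if (has_subject and isinstance(c, list)
                    and c[0].split("-")[0] == "VP" and len(c) > 1
                    and is_preterminal(c[1]) and c[1][0] == "VBN"):
                yield c[1][1]
    for c in children:
        yield from vbn_candidates(c)


TREE = """( (S
    (S-TPC-2
      (NP-SBJ
        (NP (DT Those) )
        (VP (VBN employed)
          (NP (-NONE- *) )
          (PP-LOC-CLR (IN in)
            (NP (JJ state-funded) (JJ special) (NNS programs) ))))
      (VP (VBN increased)
        (PP-EXT (IN by)
          (NP (CD 7,400) )))
      (, ,) )
    (NP-SBJ (DT the) (NNP Directorate) )
    (VP (VBD said)
      (SBAR (-NONE- 0)
        (S (-NONE- *T*-2) )))
    (. .) ))"""

print(list(vbn_candidates(parse(TREE))))
```

On this tree only "increased" is reported; "employed" is skipped because its VP sits under an NP rather than directly under a clause.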

AngledLuffa commented 4 months ago

Good, glad to know it wasn't me misunderstanding. Thanks for checking

AngledLuffa commented 4 months ago

Are we happy with the conversion of about an hour in this following EWT sentence? advmod(hour, about) det(hour,an)

There are quite a few trees in PTB which have the about an inside a QP, presumably treating it similarly to about one, and that makes our converter want to give it a nummod relation. However, the validator doesn't like that at all

# sent_id = reviews-288930-0003
# text = Try the 360 restraunt u spin in the cn tower with a beautiful view the sky pod elevator is about an hour line up in the summer
1       Try     try     VERB    VB      Mood=Imp|VerbForm=Fin   0       root    0:root  _
2       the     the     DET     DT      Definite=Def|PronType=Art       4       det     4:det   _
3       360     360     NUM     CD      NumForm=Digit|NumType=Card      4       nummod  4:nummod        _
4       restraunt       restaurant      NOUN    NN      Number=Sing|Typo=Yes    1       obj     1:obj   CorrectForm=restaurant
5       u       you     PRON    PRP     Abbr=Yes|Case=Nom|Person=2|PronType=Prs 6       nsubj   6:nsubj CorrectForm=you
6       spin    spin    VERB    VBP     Mood=Ind|Number=Sing|Person=2|Tense=Pres|VerbForm=Fin   1       parataxis       1:parataxis     _
7       in      in      ADP     IN      _       10      case    10:case _
8       the     the     DET     DT      Definite=Def|PronType=Art       10      det     10:det  _
9       cn      cn      PROPN   NNP     Number=Sing     10      compound        10:compound     _
10      tower   tower   PROPN   NNP     Number=Sing     6       obl     6:obl:in        _
11      with    with    ADP     IN      _       14      case    14:case _
12      a       a       DET     DT      Definite=Ind|PronType=Art       14      det     14:det  _
13      beautiful       beautiful       ADJ     JJ      Degree=Pos      14      amod    14:amod _
14      view    view    NOUN    NN      Number=Sing     6       obl     6:obl:with      _
15      the     the     DET     DT      Definite=Def|PronType=Art       18      det     18:det  _
16      sky     sky     NOUN    NN      Number=Sing     17      compound        17:compound     _
17      pod     pod     NOUN    NN      Number=Sing     18      compound        18:compound     _
18      elevator        elevator        NOUN    NN      Number=Sing     24      nsubj   24:nsubj        _
19      is      be      AUX     VBZ     Mood=Ind|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin   24      cop     24:cop  _
20      about   about   ADV     RB      _       22      advmod  22:advmod       _
21      an      a       DET     DT      Definite=Ind|PronType=Art       22      det     22:det  _
22      hour    hour    NOUN    NN      Number=Sing     24      compound        24:compound     _
23      line    line    NOUN    NN      Number=Sing     24      compound        24:compound     _
24      up      up      NOUN    NN      Number=Sing     1       parataxis       1:parataxis     _
25      in      in      ADP     IN      _       27      case    27:case _
26      the     the     DET     DT      Definite=Def|PronType=Art       27      det     27:det  _
27      summer  summer  NOUN    NN      Number=Sing     24      obl     24:obl:in       _

nschneid commented 4 months ago

Are we happy with the conversion of about an hour in this following EWT sentence? advmod(hour, about) det(hour,an)

Yes that's correct, it was specifically implemented in an earlier EWT release: UniversalDependencies/UD_English-EWT#168
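For checking relations like these against a CoNLL-U file, a small reader that prints `deprel(head, dependent)` triples is often enough. This is a minimal sketch: `read_relations` is a hypothetical helper, and the fragment is three token lines copied from the EWT sentence above. Real CoNLL-U is tab-separated; the whitespace fallback is only there because pasted text tends to lose tabs.

```python
def read_relations(conllu):
    """Return (deprel, head_form, dependent_form) triples from CoNLL-U text."""
    rows = []
    for line in conllu.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and sentence-level comments
        cols = line.split("\t") if "\t" in line else line.split()
        # Keep plain word lines only (no multiword ranges or empty nodes).
        if len(cols) == 10 and cols[0].isdigit():
            rows.append(cols)
    forms = {cols[0]: cols[1] for cols in rows}
    forms["0"] = "ROOT"
    # A head outside the fragment is reported as "?".
    return [(cols[7], forms.get(cols[6], "?"), cols[1]) for cols in rows]


FRAGMENT = """
20      about   about   ADV     RB      _       22      advmod  22:advmod       _
21      an      a       DET     DT      Definite=Ind|PronType=Art       22      det     22:det  _
22      hour    hour    NOUN    NN      Number=Sing     24      compound        24:compound     _
"""

for deprel, head, dep in read_relations(FRAGMENT):
    print(f"{deprel}({head}, {dep})")
```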

AngledLuffa commented 4 months ago

In terms of about an hour and its relatives, they are often (but not always) parsed into a structure such as

(NP (QP (RB about) (DT an)) (NN hour))

and for these I should probably get stuff like

(NP (QP (RB about) (DT a)) (NN month))
advmod(month, about)
det(month, a)

(NP (QP (RB virtually) (DT no)) (NN one))
advmod(one, virtually)
det(one, no)

(NP (QP (RB about) (DT a)) (NN week))
advmod(week, about)
det(week, a)

(NP (QP (RB nearly) (DT every)) (NN day))
advmod(day, nearly)
det(day, every)

Basically, a QP with no CD seems to be the first thing to look for... but then there are PDT that get captured if you just search for QP not over CD, such as in

(NP (QP (RB Not) (PDT all)) (DT those))

That looks pretty similar ... but the not gets treated differently from other RB in EWT, such as

# sent_id = reviews-079375-0007
... not all people ...
15      not     not     PART    RB      _       16      advmod  16:advmod       _
16      all     all     DET     DT      PronType=Tot    17      det     17:det  _
17      people  people  NOUN    NNS     Number=Plur     22      nsubj   22:nsubj        _

Also in EWT there is

# sent_id = reviews-190256-0005
... in about half the time quoted ...
7       about   about   ADV     RB      _       8       advmod  8:advmod        _
8       half    half    DET     PDT     NumForm=Word|NumType=Frac|PronType=Ind  10      det:predet      10:det:predet   _
9       the     the     DET     DT      Definite=Def|PronType=Art       10      det     10:det  _
10      time    time    NOUN    NN      Number=Sing     2       obl     2:obl:in        _

# sent_id = weblog-blogspot.com_dakbangla_20041028153019_ENG_20041028_153019-0001
28      almost  almost  ADV     RB      _       29      advmod  29:advmod       _
29      all     all     DET     PDT     PronType=Tot    32      det:predet      32:det:predet   _
30      the     the     DET     DT      Definite=Def|PronType=Art       32      det     32:det  _
31      Hindu   hindu   ADJ     JJ      Degree=Pos      32      amod    32:amod _
32      families        family  NOUN    NNS     Number=Plur     23      nmod    23:nmod:of|36:nsubj     _

which makes me think that if the structure is (QP RB DT) then the RB should depend on the thing the DT depends on, whereas if the structure is (QP RB PDT) then the RB depends directly on the PDT. Does that sound about right? It would also seem that the PDT has the det:predet relation, as opposed to a number type relation which the QP has been inducing in our converter.

Although there is also this:

# sent_id = reviews-036133-0002
# newpar id = reviews-036133-p0002
# text = I bought about half of the furniture I own from this place.
1       I       I       PRON    PRP     Case=Nom|Number=Sing|Person=1|PronType=Prs      2       nsubj   2:nsubj _
2       bought  buy     VERB    VBD     Mood=Ind|Number=Sing|Person=1|Tense=Past|VerbForm=Fin   0       root    0:root  _
3       about   about   ADV     RB      _       4       advmod  4:advmod        _
4       half    half    NOUN    NN      Number=Sing|NumForm=Word|NumType=Frac   2       obj     2:obj   _
5       of      of      ADP     IN      _       7       case    7:case  _
6       the     the     DET     DT      Definite=Def|PronType=Art       7       det     7:det   _
7       furniture       furniture       NOUN    NN      Number=Sing     4       nmod    4:nmod:of       _

It doesn't seem very different from I bought about half the furniture, and yet in one case half is the head of the NP whereas in the other furniture would have been treated as the head of the NP. I don't like it. Maybe the PP is enough to cause that difference, though.

There'd also be the question of if a word such as all shows up without another DT, such as

(ADJP virtually_RB all_DT) corn_NN seeds_NNS

Here, does that get treated as advmod(all, virtually) or advmod(seeds, virtually)? I don't like the second option. Maybe it can be distinguished here because it wasn't put in a QP.

What would be the relations for (NP (QP twice_PDT as many) stuff) or (NP (QP more than half_PDT) stuff)? Does the position of the PDT affect the relations? Found an example in EWT:

# sent_id = answers-20111108100523AA1i7no_ans-0011
14      less    less    ADJ     JJR     Degree=Cmp|ExtPos=ADV   16      advmod  16:advmod       _
15      than    than    ADP     IN      _       14      fixed   14:fixed        _
16      half    half    DET     PDT     NumForm=Word|NumType=Frac|PronType=Ind  18      det:predet      18:det:predet   _
17      a       a       DET     DT      Definite=Ind|PronType=Art       18      det     18:det  _
18      cm      cm      NOUN    NN      Number=Sing     13      obj     13:obj  _

Yet another different scheme is the phrase yet another, such as:

EWT example for (NP yet another NN)

(NP (ADJP (RB yet) (JJ another)) (NN police) (NN dispatch))
72      yet     yet     ADV     RB      _       73      advmod  73:advmod       _
73      another another DET     DT      PronType=Ind    75      det     75:det  _
74      police  police  NOUN    NN      Number=Sing     75      compound        75:compound     _
75      dispatch        dispatch        NOUN    NN      Number=Sing     69      obl     69:obl:for      _

Yet another is not even always parsed in a QP in WSJ:

(NP (RB yet) (DT another) ... (NN phenomenon))
(NP (QP (RB yet) (DT another)) (NN step))
(NP (ADJP (RB yet) (DT another)) (NN example))
(NP (RB yet) (DT another) (NN landscape) (NN architect))
(NP (RB yet) (DT another) (NNP Marlowe) (NN book))
(NP (ADJP (RB yet) (DT another) (NN week)))
(NP (ADJP (RB yet) (DT another)) (JJ unsettling) (NN parallel))
(NP (RB yet) (DT another) (NN setback))

so this particular example is highly annoying

ones that might legitimately be nummod: half a dozen ___, baker's dozen ___

Basically, need to work on some generalizable rules for rearranging / searching those subtrees

nschneid commented 4 months ago

These are good questions. I am not an expert on how PTB uses QPs and have been frustrated at the lack of documentation on the UD treatment of these kinds of constructions.

Basically, it seems to me some simple principles are:

predeterminers (PDT) should attach as det:predet and may have advmod dependents
plain determiners (DT) should attach as det and should NOT have advmod dependents (At least this should usually be true. There are at least some instances of "not all" and "yet another" that violate this—maybe we should change them?)
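The two principles above can be sketched as a tiny rule function. This is a hedged illustration, not the converter's code: `qp_relations` is a hypothetical name, the input is a flat list of (tag, word) pairs taken from inside the QP, and the "not all" and "yet another" exceptions found in EWT are deliberately not modeled.

```python
def qp_relations(qp, noun):
    """Map a CD-less QP to (deprel, head, dependent) triples.

    qp   -- list of (tag, word) pairs found inside the QP
    noun -- the head noun of the enclosing NP
    """
    # If the QP contains a predeterminer, adverbs attach to it.
    predet = next((word for tag, word in qp if tag == "PDT"), None)
    rels = []
    for tag, word in qp:
        if tag == "PDT":
            rels.append(("det:predet", noun, word))
        elif tag == "DT":
            rels.append(("det", noun, word))
        elif tag == "RB":
            # Plain determiners take no advmod, so without a PDT
            # the adverb goes to the noun instead.
            rels.append(("advmod", predet or noun, word))
    return rels


print(qp_relations([("RB", "about"), ("DT", "an")], "hour"))
print(qp_relations([("RB", "almost"), ("PDT", "all")], "families"))
```

The first call reproduces advmod(hour, about) and det(hour, an); the second gives advmod(all, almost) and det:predet(families, all), matching the EWT examples quoted earlier.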

You are right that semantically, "half the students" and "half of the students" are very similar, but the second involves a PP so syntactically speaking, they have different heads.

AngledLuffa commented 4 months ago

Related question: what to do about "about half the time"? There's an example I found in PTB which is parsed like this:

          (VP (VBG rising)
            (NP
              (NP
                (QP (IN about) (PDT half) (DT a) )
                (NN point) )

My exploration of EWT has found something kind of similar...

# sent_id = reviews-190256-0005
# text = They had the work done in about half the time quoted which made me and my wife extremely happy.
7       about   about   ADV     RB      _       8       advmod  8:advmod        _
8       half    half    DET     PDT     NumForm=Word|NumType=Frac|PronType=Ind  10      det:predet      10:det:predet   _
9       the     the     DET     DT      Definite=Def|PronType=Art       10      det     10:det  _
10      time    time    NOUN    NN      Number=Sing     2       obl     2:obl:in        _

Another similar example in PTB:

                (NP
                  (QP (PDT half) )
                  (DT a) (NN percentage) (NN point) )

whereas the PTB revision changes it to

(QP (PDT half) (DT a))

But of course in typical PTB style this gets annotated differently elsewhere, such as

                      (NP (PDT half) (DT a) (NN percentage) (NN point) ))

      (VP (VBG declining)
        (NP-EXT (PDT half) (DT a) (NN point) )
        (ADVP-TMP (RB semiannually) )
        (PP-DIR (TO to)
          (NP (JJ par) ))))

        (NP (IN about) (PDT half) (DT a) (NN point) )    # IN??  why am I doing this to myself

So, just in general, we want this pattern?

det(time, the)
det:predet(time, half)
advmod(half, about)

AngledLuffa commented 4 months ago

yet another does seem unique among the RB DT thing pattern. At first, I was thinking of just searching for a|an|the, but nearly every other DT has the advmod pointing to the thing rather than the DT. For example:

12      nearly  nearly  ADV     RB      _       14      advmod  14:advmod       _
13      every   every   DET     DT      PronType=Tot    14      det     14:det  _
14      day     day     NOUN    NN      Number=Sing     11      obl:tmod        11:obl:tmod     SpaceAfter=No

AngledLuffa commented 4 months ago

(sorry for the repeated small messages)

also, this one is a bit different, with an IN in the middle:

(NP (QP just_RB over_IN a_DT) decade_NN)

I believe over and a would both attach to decade, with advmod(over, just)

nschneid commented 4 months ago

Yes agree with all these suggestions

AngledLuffa commented 3 months ago

Digging into one of the many tiny cases left, there's a tree which sounds a bit like Yoda:

( (SINV
    (ADVP (RB Also) )
    (VP-TPC-2 (VBN excluded)
      (NP (-NONE- *-1) ))
    (VP (MD will)
      (VP (VB be)
        (VP (-NONE- *T*-2) )))
    (NP-SBJ-1
      (NP (NNS investments) )
      (PP (IN in)
...

In this case, I expect the correct dependencies would be excluded as the root, aux(excluded, will), aux:pass(excluded, be)?

I have a change which fixes that one tree (and no others in PTB)

nschneid commented 3 months ago

Yup! This is a subject-dependent inversion.

AngledLuffa commented 3 months ago

There was only one of them in PTB, interestingly. Maybe it was the only one with that particular parse.

I came across an oddity in our converter when fixing that one... apparently the results can be different depending on the object identity of the dependency objects, which changed when I created new objects to resolve that dependency. Long story short, in the following sentence, where should what-8 attach?

( (S
    (NP-SBJ-1 (DT The) (JJ Soviet) (NNS purchases) )
    (VP (VBP are)
      (ADJP-PRD (JJ close)
        (PP (TO to)
          (S-NOM
            (NP-SBJ (-NONE- *-1) )
            (VP (VBG exceeding)
              (SBAR
                (WHNP-2 (WP what) )
                (S
                  (NP-SBJ (DT some) (NNS analysts) )
                  (VP (VBD had)
                    (VP (VBN expected)
                      (S
                        (NP-SBJ (DT the) (NNP Soviet) (NNP Union) )
                        (VP (TO to)
                          (VP (VB buy)
                            (NP (-NONE- *T*-2) )
                            (NP-TMP
                              (NP (DT this) (NN fall) )
                              (, ,)
                              (NP
                                (NP (DT the) (NN season) )
                                (SBAR
                                  (WHPP-3 (IN in)
                                    (WHNP (WDT which) ))
                                  (S
                                    (NP-SBJ (PRP it) )
                                    (ADVP-TMP (RB usually) )
                                    (VP (VBZ buys)
                                      (NP
                                        (NP (JJ much) )
                                        (PP (IN of)
                                          (NP
                                            (NP (DT the) (NN corn) )
                                            (SBAR
                                              (WHNP-4 (-NONE- 0) )
                                              (S
                                                (NP-SBJ (PRP it) )
                                                (VP (VBZ imports)
                                                  (NP (-NONE- *T*-4) )
                                                  (PP-DIR (IN from)
                                                    (NP (DT the) (NNP U.S.) ))))))))
                                      (PP-TMP (-NONE- *T*-3) ))))))))))))))))))
    (. .) ))

the two candidates which our converter produces are either obj(expected-12, what-8) or obj(buy-17, what-8)

it should attach to buy, right?

AngledLuffa commented 3 months ago

Also, for "as much as": should this generally be tagged & parsed the same as in this sentence? Looks pretty consistent in EWT

# sent_id = weblog-blogspot.com_alaindewitt_20040929103700_ENG_20040929_103700-0076
# text = We should know as much as we can.
1       We      we      PRON    PRP     Case=Nom|Number=Plur|Person=1|PronType=Prs      3       nsubj   3:nsubj _
2       should  should  AUX     MD      VerbForm=Fin    3       aux     3:aux   _
3       know    know    VERB    VB      VerbForm=Inf    0       root    0:root  _
4       as      as      ADV     RB      _       5       advmod  5:advmod        _
5       much    much    ADJ     JJ      Degree=Pos      3       obj     3:obj   _
6       as      as      SCONJ   IN      _       8       mark    8:mark  _
7       we      we      PRON    PRP     Case=Nom|Number=Plur|Person=1|PronType=Prs      8       nsubj   8:nsubj _
8       can     can     AUX     MD      VerbForm=Fin    5       advcl   5:advcl:as      SpaceAfter=No
9       .       .       PUNCT   .       _       3       punct   3:punct _

(edit: sometimes much is JJ, sometimes RB in EWT)

nschneid commented 3 months ago

"The Soviet purchases are close to exceeding what₂ some analysts had expected the Soviet Union to buy __₂ this fall": I think this is a free relative, so in the basic dependencies "what" should be the obj of "exceeding", with PronType=Rel, and "expected" should be its acl:relcl. In the enhanced dependencies "what" should also attach as the obj of "buy".
"as much as": yeah this looks like a standard comparative. "much" can be ADJ if it implies "much stuff", e.g. "We should know as much (information) as we can". Or it can be ADV, e.g. "People don't read magazines as much as they used to."

UniversalDependencies / docs

English PTB to UD 2.0 #717