UniversalDependencies / UD_English-EWT

English data
Creative Commons Attribution Share Alike 4.0 International

list dependency for an apparent appositive #536

AngledLuffa closed this issue 2 months ago

AngledLuffa commented 3 months ago
# sent_id = email-enronsent17_01-0014
# text = 7. Stephen Covey, author, The Seven Habits of Highly Effective People
1       7       7       NUM     LS      NumForm=Digit|NumType=Card      3       discourse       3:discourse     SpaceAfter=No
2       .       .       PUNCT   .       _       1       punct   1:punct _
3       Stephen Stephen PROPN   NNP     Number=Sing     0       root    0:root  _
4       Covey   Covey   PROPN   NNP     Number=Sing     3       flat    3:flat  SpaceAfter=No
5       ,       ,       PUNCT   ,       _       6       punct   6:punct _
6       author  author  NOUN    NN      Number=Sing     3       list    3:list  SpaceAfter=No    <----
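
For anyone who wants to check the excerpt programmatically, here is a minimal sketch that reads the token lines above and reports which tokens carry a given relation. It assumes whitespace-separated columns, as they appear in this thread (real CoNLL-U files are tab-separated), and it only covers the six tokens quoted here. `parse_tokens` and `find_deprel` are hypothetical helper names, not part of any UD tool.

```python
# The ten CoNLL-U columns, in order.
CONLLU_FIELDS = ["id", "form", "lemma", "upos", "xpos",
                 "feats", "head", "deprel", "deps", "misc"]

# Only the tokens quoted in the issue; the rest of the sentence is not shown there.
SAMPLE = """\
# sent_id = email-enronsent17_01-0014
1 7 7 NUM LS NumForm=Digit|NumType=Card 3 discourse 3:discourse SpaceAfter=No
2 . . PUNCT . _ 1 punct 1:punct _
3 Stephen Stephen PROPN NNP Number=Sing 0 root 0:root _
4 Covey Covey PROPN NNP Number=Sing 3 flat 3:flat SpaceAfter=No
5 , , PUNCT , _ 6 punct 6:punct _
6 author author NOUN NN Number=Sing 3 list 3:list SpaceAfter=No
"""


def parse_tokens(text):
    """Return one dict per token line, skipping comments and blank lines."""
    tokens = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        cols = line.split()  # assumption: no field contains whitespace
        if len(cols) != len(CONLLU_FIELDS):
            continue  # skip anything that is not a ten-column token line
        tokens.append(dict(zip(CONLLU_FIELDS, cols)))
    return tokens


def find_deprel(tokens, deprel):
    """Return (dependent form, head form) pairs for the given relation."""
    by_id = {t["id"]: t for t in tokens}
    return [(t["form"], by_id[t["head"]]["form"])
            for t in tokens
            if t["deprel"] == deprel and t["head"] in by_id]


tokens = parse_tokens(SAMPLE)
print(find_deprel(tokens, "list"))  # [('author', 'Stephen')]
```

Running it on the excerpt shows the one `list` edge under discussion: "author" attached to "Stephen".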

nschneid commented 3 months ago

The document has a bunch of "NAME, JOB TITLE" combos. I'm not sure if appos works because it requires the two nominals to be reversible.

nschneid commented 2 months ago

The relevant part of the document:

Here are the top ten most requested eSpeakers.

  1. Jack Welch, CEO, General Motors
  2. Scott McNeally, CEO, Sun Microsystems
  3. Satisfied Enron Customers
  4. Stephen Covey, author, The Seven Habits of Highly Effective People
  5. Oprah Winfrey, talkshow host

and so on.

I think list is defensible here. These are not really sentences, but structured data with values separated by commas.
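
To make the "structured data" reading concrete, here is a minimal sketch that treats each item as a number followed by comma-separated values. It assumes plain `N. value, value, ...` lines like the ones quoted above; `split_item` is a hypothetical helper for illustration only, and a naive comma split would mis-handle any value that itself contains a comma.

```python
import re

# A list item: leading number, a period, then the rest of the line.
ITEM = re.compile(r"^\s*(\d+)\.\s+(.*)$")


def split_item(line):
    """Return (item number, list of comma-separated values), or None if no match."""
    match = ITEM.match(line)
    if match is None:
        return None
    number, rest = match.groups()
    values = [value.strip() for value in rest.split(",")]
    return int(number), values


print(split_item("4. Stephen Covey, author, The Seven Habits of Highly Effective People"))
print(split_item("3. Satisfied Enron Customers"))
```

Under this view "author" is just the second value in a record, on a par with the name and the book title, which is the intuition behind keeping `list` rather than `appos`.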

amir-zeldes commented 2 months ago

Why is "7." tokenized apart? It doesn't actually contain a period, right? I thought it was just a list marker as a whole.

nschneid commented 2 months ago

Punctuation in list item markers is tokenized off in EWT.

amir-zeldes commented 2 months ago

Hm, not sure if we have the energy to standardize this, but it does seem jarring to me, since it really doesn't mean anything. In ON they are mostly untokenized, though I see there are quite a few exceptions. GUM-style corpora are 100% untokenized as well.
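
The two conventions mentioned in this thread can be contrasted with a small sketch. This is only an illustration of the difference for markers of the form digits plus a period; it is not the tokenizer used by EWT, GUM, or any other corpus, and `tokenize_marker` is a hypothetical name.

```python
import re

# A numeric list marker such as "7." (digits followed by a single period).
MARKER = re.compile(r"^(\d+)(\.)$")


def tokenize_marker(marker, split_punct):
    """Tokenize a list marker, optionally splitting off the trailing period."""
    match = MARKER.match(marker)
    if match is not None and split_punct:
        return [match.group(1), match.group(2)]  # period becomes its own token
    return [marker]  # marker kept as a single token


print(tokenize_marker("7.", split_punct=True))   # ['7', '.']
print(tokenize_marker("7.", split_punct=False))  # ['7.']
```

With `split_punct=True` the period becomes a separate token, as in the excerpt at the top of this issue, where "7" is tagged LS and "." attaches to it as punct.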

nschneid commented 2 months ago

Moved the tokenization discussion to a new issue: #543

The question for this issue is whether we need to change list to appos. I don't see a clear justification for that.

amir-zeldes commented 2 months ago

> The question for this issue is whether we need to change list to appos. I don't see a clear justification for that.

Oh, certainly, wasn't trying to argue about that, I just noticed the LS thing. Thanks for opening the other issue!