XPOS? - Githubissues

AngledLuffa commented 1 year ago

Do you have any interest in converting the XPOS to a formalism similar to that in PTB, EWT, GUM, etc? Totally fine if you have a use for the XPOS which precludes that, but I note there are many blanks in the current XPOS annotations, and the treebank would be much more compatible with its English neighbors if it used the same style of XPOS.

LarsAhrenberg commented 1 year ago

I don't want to change the XPOS annotation in UD_English-LinES as it is needed for backward compatibility with the original treebank and with UD_Swedish-LinES.

nschneid commented 1 year ago

I agree with @AngledLuffa that the dataset would be more useful if its XPOS column followed the standard of other English corpora (i.e., PTB tagset). What if the current XPOS were retained in MISC for backward compatibility?

nschneid commented 1 year ago

For the record, the contents of the XPOS column are not really part-of-speech tags at all: they are morphological features in a non-UD notation, more or less redundant with the FEATS features. I definitely think the appropriate place for this is in a custom feature in the MISC column (e.g. LinESMorph=SG-NOM) so that users working with many UD corpora at a time don't use the XPOS column inappropriately.

# sent_id = en_lines-ud-dev-doc2-3322
# text = Quinn guessed her age at around twenty.
1   Quinn   Quinn   PROPN   SG-NOM  Number=Sing 2   nsubj   _   _
2   guessed guess   VERB    PAST    Mood=Ind|Tense=Past|VerbForm=Fin    0   root    _   _
3   her she PRON    P3SG-GEN    Case=Acc|Gender=Fem|Number=Sing|Person=3|PronType=Prs   4   nmod:poss   _   _
4   age age NOUN    SG-NOM  Number=Sing 2   obj _   _
5   at  at  ADP _   _   7   case    _   _
6   around  around  ADP _   _   7   case    _   _
7   twenty  twenty  NUM CARD-PL NumType=Card    2   obl _   SpaceAfter=No
8   .   .   PUNCT   Period  _   2   punct   _   _
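For illustration only, here is a minimal sketch of what moving the current XPOS values into MISC could look like. It assumes tab-separated CoNLL-U token lines with ten columns; the function name `move_xpos_to_misc` is hypothetical, and the `LinESMorph` key is just the example name suggested above, not an agreed convention.

```python
def move_xpos_to_misc(line: str, key: str = "LinESMorph") -> str:
    """Copy a token line's XPOS value into MISC as key=value, then blank XPOS.

    Hypothetical helper: lines that are not 10-column token lines
    (comments, blank lines) are returned unchanged, minus the newline.
    """
    line = line.rstrip("\n")
    cols = line.split("\t")
    if len(cols) != 10:
        return line
    xpos, misc = cols[4], cols[9]
    if xpos == "_":
        return line  # nothing to move
    attrs = [] if misc == "_" else misc.split("|")
    attrs.append(f"{key}={xpos}")
    cols[4] = "_"
    cols[9] = "|".join(attrs)
    return "\t".join(cols)


token = "7\ttwenty\ttwenty\tNUM\tCARD-PL\tNumType=Card\t2\tobl\t_\tSpaceAfter=No"
print(move_xpos_to_misc(token).split("\t")[9])
# prints: SpaceAfter=No|LinESMorph=CARD-PL
```

The sketch leaves XPOS blank afterwards; filling it with PTB-style tags would be a separate step that this code does not attempt.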

dan-zeman commented 1 year ago

> don't use the XPOS column inappropriately

Inappropriately? There is no such thing :-) XPOS is treebank-specific. Some of you guys seem to put too much weight on it. If you want cross-treebank compatibility, you should use UPOS+FEATS. Just because the other English treebanks opted for the Penn tagset in XPOS does not mean that it must be used in all English treebanks.

And for the record, in many languages XPOS contains morphological information, so conversion of XPOS to UD has to go partly to UPOS and partly to FEATS. (After all, even in the Penn tagset, the distinction between NN and NNS is a morphological feature.) There is definitely no need to put this in the MISC column.
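To make the NN/NNS point concrete, here is a small sketch of how a Penn-style tag splits into a UPOS part and a FEATS part. The table covers only four noun tags and is an illustrative assumption, not the conversion table used by any particular treebank; real conversions also need the word form and context. The names `PTB_TO_UD` and `split_xpos` are hypothetical.

```python
from typing import Optional, Tuple

# Illustrative subset only: the coarse category goes to UPOS,
# the singular/plural distinction goes to FEATS.
PTB_TO_UD = {
    "NN": ("NOUN", "Number=Sing"),
    "NNS": ("NOUN", "Number=Plur"),
    "NNP": ("PROPN", "Number=Sing"),
    "NNPS": ("PROPN", "Number=Plur"),
}


def split_xpos(xpos: str) -> Optional[Tuple[str, str]]:
    """Return (UPOS, FEATS) for a covered Penn tag, or None if not covered."""
    return PTB_TO_UD.get(xpos)


print(split_xpos("NNS"))  # prints: ('NOUN', 'Number=Plur')
```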

nschneid commented 1 year ago

Hmm, I was under the impression that users might aim to build POS taggers based on the XPOS column for large-scale evaluations (i.e. testing within-treebank but doing so across lots of treebanks) and misinterpret the results if some of them are blanks rather than tags.
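One way to guard against that kind of misreading is to measure how much of a treebank's XPOS column is blank before evaluating on it. A rough sketch, assuming tab-separated CoNLL-U input; the function name `blank_xpos_ratio` is made up for this example:

```python
def blank_xpos_ratio(conllu_text: str) -> float:
    """Fraction of regular tokens whose XPOS column is '_'."""
    total = blank = 0
    for line in conllu_text.splitlines():
        if not line or line.startswith("#"):
            continue  # sentence boundary or comment
        cols = line.split("\t")
        # Skip anything that is not a plain token line, including
        # multiword ranges (e.g. 1-2) and empty nodes (e.g. 1.1).
        if len(cols) != 10 or not cols[0].isdigit():
            continue
        total += 1
        if cols[4] == "_":
            blank += 1
    return blank / total if total else 0.0


sample = "\n".join([
    "# text = her age at around",
    "1\ther\tshe\tPRON\tP3SG-GEN\t_\t2\tnmod:poss\t_\t_",
    "2\tage\tage\tNOUN\tSG-NOM\tNumber=Sing\t0\troot\t_\t_",
    "3\tat\tat\tADP\t_\t_\t2\tcase\t_\t_",
    "4\taround\taround\tADP\t_\t_\t2\tcase\t_\t_",
])
print(blank_xpos_ratio(sample))  # prints: 0.5
```

The sample is a made-up fragment for demonstration, not a sentence from the treebank.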

If XPOS is really just any morphosyntactic information that can be treebank-specific, perhaps this definition should be broadened: "XPOS optionally contains a language-specific part-of-speech tag, normally from a traditional, more fine-grained tagset".

(A practical advantage of standardizing XPOS within-language is that common scripts relying on them can be used, e.g. for generating enhanced dependencies. But of course it is up to the treebank maintainers to determine whether they want to use those scripts.)

LarsAhrenberg commented 1 year ago

I've always thought that any models you build from UD treebanks should use UPOS and FEATS and avoid the XPOS column. Thanks @dan-zeman for making this clear. In any case, changing the values in that column would require a manual review of parser output for which I don't have time presently.

dan-zeman commented 1 year ago

> XPOS within-language is that common scripts relying on them can be used

That's exactly what I believe should be strongly discouraged. Common scripts should rely on the other columns.

dan-zeman commented 1 year ago

> perhaps this definition should be broadened

Good point. Clarification added.

amir-zeldes commented 1 year ago

> That's exactly what I believe should be strongly discouraged. Common scripts should rely on the other columns

Just wanted to join and say I think that's going a bit far - some languages have XPOS in common across TBs and it makes sense to use that information, especially because XPOS is often manually checked, while UPOS, at least for older datasets, is usually the product of an automatic conversion. For languages like English, XPOS is a relatively stable standard, while UPOS changes quite a bit, so I usually base scripts that need POS tags on XPOS...

dan-zeman commented 1 year ago

> > That's exactly what I believe should be strongly discouraged. Common scripts should rely on the other columns
>
> Just wanted to join and say I think that's going a bit far - some languages have XPOS in common across TBs and it makes sense to use that information, especially because XPOS is often manually checked, while UPOS, at least for older datasets, is usually the product of an automatic conversion. For languages like English, XPOS is a relatively stable standard, while UPOS changes quite a bit, so I usually base scripts that need POS tags on XPOS...

But the goal of UD should be that UPOS is reliable, stable, and together with FEATS conveys all important information from XPOS. I understand that if you do not trust UPOS in a dataset, you look at XPOS instead, but from the UD perspective I do not think this approach should be supported. Instead of making it a policy that XPOS matters, we should fix UPOS and FEATS in the datasets.

AngledLuffa commented 1 year ago

While that's an admirable goal, from a modeling perspective it's much, much easier to embed 40 xpos instead of 20 upos and therefore capture the tense of verbs or the plurality of nouns. I know that there's similar information in the features, and more information beyond that we could also make use of, but that's a lot more work.
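As a rough sketch of the "more work" alternative: a single fine-grained tag can be derived by joining UPOS with a chosen subset of FEATS, which gives one symbol to embed while still capturing tense and number. The function name `composite_tag` and the default feature subset are assumptions made for this example, not something any particular tool is known to do.

```python
from typing import Iterable


def composite_tag(upos: str, feats: str,
                  keep: Iterable[str] = ("Number", "Tense", "VerbForm")) -> str:
    """Join UPOS with selected FEATS entries into one tag string."""
    if feats == "_":
        return upos
    keep = set(keep)
    kept = [f for f in feats.split("|") if f.split("=", 1)[0] in keep]
    return "|".join([upos] + kept)


print(composite_tag("VERB", "Mood=Ind|Tense=Past|VerbForm=Fin"))
# prints: VERB|Tense=Past|VerbForm=Fin
print(composite_tag("NOUN", "Number=Sing"))
# prints: NOUN|Number=Sing
```

How many distinct tags this produces depends on which features are kept, so the size of the resulting tagset would need to be checked per treebank.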

Having said that, I also see Lars's point that there's a specific use for this kind of xpos in LinES, and I'm way, way too lazy / busy to try to maintain a parallel annotation of any kind (and not seeing any momentum for anyone else doing it, either)

jnivre commented 1 year ago

I completely agree with Dan on this. It is not an admirable goal, it is the only conceivable goal for UD.

Joakim

amir-zeldes commented 1 year ago

> It is not an admirable goal, it is the only conceivable goal for UD.

I respectfully disagree - upos and feats consistency is one goal for UD, sure, and I have nothing against it. But the developers of UD treebanks in a single language can and should IMO aspire to convergence in xpos, just as they should aspire to have consistency in MISC annotations and other non-universal parts of their TBs. There are sometimes reasons for important differences between TBs, but generally speaking, the more consistent the better.

nschneid commented 1 year ago

I agree with @amir-zeldes that for practical purposes it's nice if treebank maintainers within a language can agree to standardize XPOS. But it sounds like the UD standard does not make any recommendations regarding XPOS standardization.

UniversalDependencies / UD_English-LinES

XPOS? #8