Open rhdunn opened 1 year ago
Note: Sentence n01111021 has a form 1.4bn. -- Other treebanks, such as EWT, treat 1.4 and bn as two separate tokens. The bn is NumType=Card|NumForm=Word in EWT
Any thoughts on what to make of 16bn? Should it also be split into two separate tokens? I'm not sure changing that tokenization is in our purview.
I updated some here
https://github.com/UniversalDependencies/UD_English-PUD/pull/24
but have not done the Roman words yet
Any thoughts on what to make of 16bn?
Your validation script missed V and X as Roman numerals.
My validation script does detect V and X. The issue is that the ones my script didn't identify are PROPN+CD rather than NUM+CD. My script was going on the UPOS tags listed in https://universaldependencies.org/u/feat/NumType.html. I should adjust my check to detect the use of NumType on any UPOS other than PUNCT and SYM.
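The adjusted check described above could look roughly like this. It is a minimal sketch, not the code from the validator: the function name `tokens_with_numtype` is hypothetical, and it assumes standard tab-separated, 10-column CoNLL-U input with `# sent_id = ...` comment lines.

```python
def tokens_with_numtype(lines, ignored_upos=frozenset({"PUNCT", "SYM"})):
    """Collect every token carrying NumType, whatever its UPOS (hypothetical helper).

    Returns (sent_id, token_id, form, upos, xpos, numtype) tuples, so that
    combinations such as PROPN+CD show up alongside NUM+CD.
    """
    sent_id = None
    hits = []
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith("# sent_id = "):
            sent_id = line[len("# sent_id = "):]
            continue
        if not line or line.startswith("#"):
            continue
        cols = line.split("\t")
        if len(cols) != 10:
            continue  # not a token row
        tok_id, form, _lemma, upos, xpos, feats = cols[:6]
        if upos in ignored_upos or feats == "_":
            continue
        # FEATS is a '|'-separated list of Name=Value pairs.
        feat_map = dict(f.split("=", 1) for f in feats.split("|") if "=" in f)
        if "NumType" in feat_map:
            hits.append((sent_id, tok_id, form, upos, xpos, feat_map["NumType"]))
    return hits
```

Grouping the returned tuples by their (upos, xpos) pair makes any pair other than the expected one easy to spot.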
That does indicate that w05007004 has inconsistent UPOS for the Roman numerals at tokens 15, 18, and 21. Token 21 should really be PROPN to be consistent with the PTB rules that the other treebanks like EWT use.
Oh, I hadn't even noticed that. I wonder if those are still supposed to have NumForm and NumType when they are of this tag. @nschneid or @amir-zeldes any thoughts on labeling Roman numerals when used as PROPN?
Hm, that's another inconsistency between GUM and EWT then: in GUM, Roman numerals after monarchs, WWII, etc. are CD+NUM, not PROPN (the rest of the name is PROPN).
Doing a search, it looks like EWT is consistent with GUM in using CD+NUM for these -- e.g. email-enronsent07_01-0045 -- so it makes sense to use that to be consistent. PROPN+CD looks like it is only used in the PUD treebank.
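For reference, a form-level Roman numeral test and a check for the CD+NUM convention agreed on above might be sketched as follows. Both function names are hypothetical, and the pattern only covers upper-case, standard-form numerals up to 3999. Note that a form such as "I" is ambiguous with the pronoun, so the form test alone cannot decide the tag.

```python
import re

# Standard-form Roman numerals (1-3999), upper case only.
ROMAN_RE = re.compile(r"M{0,3}(CM|CD|D?C{0,3})(XC|XL|L?X{0,3})(IX|IV|V?I{0,3})")

def is_roman_numeral(form):
    """True if the form looks like a Roman numeral (hypothetical helper)."""
    # The pattern also matches the empty string, so reject that explicitly.
    return bool(form) and ROMAN_RE.fullmatch(form) is not None

def roman_tag_mismatch(form, upos, xpos):
    """True for a Roman numeral tagged CD whose UPOS is not NUM, e.g. PROPN+CD."""
    return is_roman_numeral(form) and xpos == "CD" and upos != "NUM"
```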
That's pretty easy to update as well. Added that to the previous Roman change:
https://github.com/UniversalDependencies/UD_English-PUD/pull/25
Mind rerunning the script on the new dev branch now that we've merged multiple changes?
^ fixed the stray EWT cases
@AngledLuffa I've published my script at https://github.com/rhdunn/conllu-en-validator.
I now get the following output:
$ ../conllu-en-validator/validate --language en --validator form en_pud-ud-test.conllu | grep -F "NumType=Card"
ERROR: Sentence n01005023 token 7 -- invalid NUM with NumType=Card|NumForm=Digit form '103.7'
ERROR: Sentence n01022027 token 20 -- invalid NUM with NumType=Card|NumForm=Digit form '1.5'
ERROR: Sentence n01043005 token 23 -- invalid NUM with NumType=Card|NumForm=Digit form '1.5'
ERROR: Sentence n01043014 token 8 -- invalid NUM with NumType=Card|NumForm=Digit form '1.4'
ERROR: Sentence n01043027 token 12 -- invalid NUM with NumType=Card|NumForm=Digit form '1.5'
ERROR: Sentence n01099035 token 9 -- invalid NUM with NumType=Card|NumForm=Digit form '6.30'
ERROR: Sentence n01111021 token 25 -- invalid NUM with NumType=Card|NumForm=Digit form '1.4'
ERROR: Sentence n01131007 token 3 -- invalid NUM with NumType=Card|NumForm=Digit form '5.7'
ERROR: Sentence w01029015 token 15 -- invalid NUM with NumType=Card|NumForm=Word form 'yellowish'
ERROR: Sentence w01096013 token 22 -- invalid NUM with NumType=Card|NumForm=Digit form '7.5'
ERROR: Sentence n03001030 token 10 -- invalid NUM with NumType=Card|NumForm=Digit form '23.45'
ERROR: Sentence n03010012 token 18 -- invalid NUM with NumType=Card|NumForm=Digit form '15.5'
Note: Aside from yellowish -- which is an error -- these are because my script is expecting 1.5, etc. to be annotated as NumType=Frac.
I can change that. Is anything needed other than NumType=Frac, or is that the complete feature?
I can also update the tag on yellowish, I suppose.
Hopefully my PI is okay with the idea that I spend quite a bit of time during one week once every six months around the next UD deadline @manning
Quite a few are still tagged with Card in EWT:
29 4.5 4.5 NUM CD NumForm=Digit|NumType=Card 30 compound 30:compound _
30 billion billion NUM CD NumForm=Word|NumType=Card 28 nummod 28:nummod SpaceAfter=No

24 $ $ SYM $ _ 13 parataxis 13:parataxis SpaceAfter=No
25 13.9 13.9 NUM CD NumForm=Digit|NumType=Card 26 compound 26:compound SpaceAfter=No
26 M million NUM CD Abbr=Yes|NumForm=Word|NumType=Card 24 nummod 24:nummod _
27 from from ADP IN _ 28 case 28:case _
28 $ $ SYM $ _ 24 nmod 24:nmod:from SpaceAfter=No
29 11.5 11.5 NUM CD NumForm=Digit|NumType=Card 30 compound 30:compound SpaceAfter=No
30 M million NUM CD Abbr=Yes|NumForm=Word|NumType=Card 28 nummod 28:nummod SpaceAfter=No

14 May May PROPN NNP Number=Sing 10 obl 10:obl:on _
15 30th 30th NOUN NN Number=Sing|NumType=Ord 14 nummod 14:nummod _
16 @ @ ADP IN _ 17 case 17:case SpaceAfter=No
17 2.975 2.975 NUM CD NumForm=Digit|NumType=Card 10 obl 10:obl SpaceAfter=No

14 will will AUX MD VerbForm=Fin 15 aux 15:aux _
15 last last VERB VB VerbForm=Inf 3 conj 3:conj:and _
16 1.5 1.5 NUM CD NumForm=Digit|NumType=Card 17 nummod 17:nummod _
17 hours hour NOUN NNS Number=Plur 15 obl:tmod 15:obl:tmod _
And then there are phone numbers:
7 832.676.3177 832.676.3177 NUM CD NumForm=Digit|NumType=Card 5 appos 5:appos _
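To see how widespread each annotation is before picking a policy, one could tally NumType values by the shape of the form. The following is only a sketch: the helper names and the three shape categories are assumptions for illustration, and it expects tab-separated CoNLL-U rows.

```python
import re
from collections import Counter

INTEGER_RE = re.compile(r"\d+")
DECIMAL_RE = re.compile(r"\d+\.\d+")

def form_shape(form):
    """Coarse shape of a numeric form: 'integer', 'decimal', or 'other'."""
    if INTEGER_RE.fullmatch(form):
        return "integer"
    if DECIMAL_RE.fullmatch(form):
        return "decimal"
    return "other"  # e.g. phone numbers such as 832.676.3177, or words

def tally_numtype_by_shape(lines):
    """Count (shape, NumType) pairs over all NUM tokens (hypothetical helper)."""
    counts = Counter()
    for line in lines:
        cols = line.rstrip("\n").split("\t")
        if len(cols) != 10 or cols[3] != "NUM":
            continue
        feats = dict(f.split("=", 1) for f in cols[5].split("|") if "=" in f)
        counts[(form_shape(cols[1]), feats.get("NumType", "_"))] += 1
    return counts
```

Run over a whole treebank file, the counts for ("decimal", "Card") versus ("decimal", "Frac") show which convention that treebank currently follows.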
NumType=Frac appears to only occur on written fraction words: half, third, tenth, etc.
Calling in the cavalry:
@nschneid @amir-zeldes
NumType=Frac appears to only occur on written fraction words: half, third, tenth, etc.
This is definitely how NumType=Frac was originally meant, but I'm not sure whether the consensus of English treebank maintainers has shifted towards including 1.5 and such. I'm pretty sure it has been discussed somewhere.
I do recall that discussion as well. It also appears to be implemented that way in GUM, but not EWT or PUD
UniversalDependencies/docs#884
The 1.5 etc. forms would be NumType=Frac|NumForm=Digit according to the features, which is also how they are annotated in GUM.
I can easily modify my validation script so that 1.2, etc. are allowed on NumType=Card and to report errors if digit forms with NumType=Frac are used, or whatever the consensus is for these.
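Since the expected value is still under discussion, the check could take the policy as a parameter. This is a sketch under that assumption, not the validator's actual code; `check_decimal` and its `decimal_numtype` argument are hypothetical names.

```python
import re

DECIMAL_RE = re.compile(r"\d+\.\d+")

def check_decimal(form, feats, decimal_numtype="Frac"):
    """Return an error message if a decimal digit form has the wrong NumType.

    decimal_numtype is the value the chosen policy expects on forms such as
    1.5: "Frac" (as in GUM) or "Card". Returns None when nothing is wrong
    or when the form is not a plain decimal.
    """
    if not DECIMAL_RE.fullmatch(form):
        return None
    feat_map = dict(f.split("=", 1) for f in feats.split("|") if "=" in f)
    actual = feat_map.get("NumType")
    if actual != decimal_numtype:
        return f"form '{form}' has NumType={actual}, expected NumType={decimal_numtype}"
    return None
```

Switching the treebank convention then only means changing the one argument, not the check itself.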
@AngledLuffa switched EWT to use Frac for decimals like "1.2" in UniversalDependencies/UD_English-EWT@2faee04. Is that the consensus?
Validation issues:
Note: The numbers such as 7.5 should be NumType=Frac|NumForm=Digit to be consistent with the GUM treebank.
Note: Sentence n01111021 has a form 1.4bn. Other treebanks, such as EWT, treat 1.4 and bn as two separate tokens. The bn is NumType=Card|NumForm=Word in EWT.
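If the tokenization were changed to match EWT, forms like 1.4bn could be found and split with something along these lines. This is a minimal sketch: the function name and the suffix list are assumptions for illustration (only bn and M appear in this thread), and it does not renumber token IDs or heads.

```python
import re

# Assumed magnitude suffixes; extend as needed for the treebank at hand.
MAGNITUDE_SUFFIXES = ("bn", "M")
SPLIT_RE = re.compile(
    r"(\d+(?:\.\d+)?)(" + "|".join(map(re.escape, MAGNITUDE_SUFFIXES)) + r")"
)

def split_number_suffix(form):
    """Split '1.4bn' into ('1.4', 'bn'); return None if the form does not match."""
    match = SPLIT_RE.fullmatch(form)
    if match is None:
        return None
    return match.group(1), match.group(2)
```

Restricting the pattern to a known suffix list keeps ordinals such as 30th from being split by mistake.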