Closed AngledLuffa closed 2 months ago
We also tokenize dollar symbols, which are pronounced "dollars", and this would be pronounced "degrees/NNS Fahrenheit/NNP", so I think the tokenization is correct this way.
Most English manuals of style agree on no space between the degree symbol and C or F (there are even Unicode code points: U+2103 ℃ DEGREE CELSIUS and U+2109 ℉ DEGREE FAHRENHEIT). So we can assume that writing the space there ("° F" and "° C") is always a typo.
However, the text metadata should contain the original raw (untokenized) text, including possible typos. If there was "350 ° F (177 ° C)" in the original text, we need to preserve it in the text metadata.
At the original URL https://www.wikihow.life/Prepare-Quinoa#Cooking-In-the-Oven, I can see "350 °F (177 °C)." However, the document was added to GUM on 2016-09-19 and I don't know how to search the history of WikiHow pages. So either the original document contained a typo in 2016, which is now fixed, or the GUM authors introduced the typo. In the latter case, the typo should be fixed (we want to preserve only typos originating from the original document, not from UD processing).
Yet another question is whether we want to treat "°F" and "°C" as a single (syntactic) word in UD or two. I haven't decided yet which one I prefer. I see that UD_English-CTeTex treats it as two words (even when originally written together, e.g. "-20°C to +55°C"). UD_English-ParTUT has one instance of "-40°" as a single word (which I don't like) followed by "C", but this was written originally with a space before C. If we decide for the two-words annotation, as @amir-zeldes suggested above, we would need to treat the single-character Unicode symbols (U+2103 ℃ DEGREE CELSIUS) as multi-word tokens.
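To make the point about the single-character symbols concrete, here is a minimal Python sketch using only the standard `unicodedata` module. The helper name `describe` is made up for illustration. It shows that U+2103 and U+2109 are each one code point, and that their Unicode compatibility decomposition (NFKC) is the degree sign followed by a plain letter. Normalizing would change the raw text, so this only illustrates the relationship between the two spellings and is not a suggested preprocessing step.

```python
import unicodedata


def describe(ch):
    """Return (Unicode name, number of code points, NFKC form) for a string."""
    return unicodedata.name(ch), len(ch), unicodedata.normalize("NFKC", ch)


for symbol in ("\u2103", "\u2109"):
    name, length, nfkc = describe(symbol)
    # Each symbol is one code point; NFKC expands it to "°" plus a letter.
    print(f"U+{ord(symbol):04X} {name}: {length} code point(s), NFKC = {nfkc!r}")
```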
If we choose the two-word annotation guidelines, text written without the space can be annotated either as a multiword token or with SpaceAfter=No (I would prefer the latter, but as noted above, it cannot be applied to U+2103), and text written with the space should get CorrectSpaceAfter=No, according to the UD guidelines on typos.
If we choose the single-word annotation guidelines, text written with the space would use deprel=goeswith.
I am against using multiword tokens for anything that is an orthographic convention rather than a morphosyntactic process. IMO symbols such as ° resemble punctuation in that they might orthographically lack a separating space but still be considered distinct syntactic words. So SpaceAfter=No seems like the best option if we treat °F as two syntactic words.
While I don't think conversion to pronounced words is necessarily determinative for the treatment of symbols/notation (e.g. "2.6" would be read as three words), it is a heuristic that can serve as a default when we recognize individual symbol characters as mapping to individual words. So I don't see a problem with the GUM policy.
> I am against using multiword tokens for anything that is an orthographic convention
But there are already many cases when MWTs are used only for orthographic conventions, e.g. "cannot" vs "can not".
> symbols e.g. ° resemble punctuation in that they might orthographically lack a separating space but still be considered distinct syntactic words. So SpaceAfter=No seems like the best option if we treat °F as two syntactic words.
I agree (and I wrote that I would also prefer SpaceAfter=No in this case). I was mentioning MWTs as an alternative mostly because I don't see any other option when we choose the two-words solution, but the original text contains the U+2103 DEGREE CELSIUS Unicode character ℃. This is really a single character and a single code point (not even a combining mark plus a character C, although separating combining marks into another word would be problematic as well). The concatenation of tokens in CoNLL-U (and spaces) must render exactly the original text. I admit, there is no U+2103 in the current data, so we may ignore this for now.
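The requirement that tokens plus spaces reproduce the original text can be sketched as a small check. This is a simplified illustration, not the official UD validator: the function name `detokenize` and the (FORM, MISC) pair input are assumptions, and it expects surface tokens only (a multiword-token range line in place of its syntactic words).

```python
def detokenize(tokens):
    """Rebuild the raw sentence text from (FORM, MISC) pairs.

    A token is followed by one space unless its MISC column carries the
    exact attribute SpaceAfter=No.
    """
    parts = []
    for form, misc in tokens:
        parts.append(form)
        attrs = misc.split("|") if misc != "_" else []
        # Exact match on the attribute, so CorrectSpaceAfter=No (a typo
        # annotation) is not mistaken for SpaceAfter=No.
        if "SpaceAfter=No" not in attrs:
            parts.append(" ")
    return "".join(parts).rstrip()


# Two-word analysis of the text seen on the current web page.
tokens = [
    ("350", "_"), ("°", "SpaceAfter=No"), ("F", "_"),
    ("(", "SpaceAfter=No"), ("177", "_"), ("°", "SpaceAfter=No"),
    ("C", "SpaceAfter=No"), (")", "SpaceAfter=No"), (".", "_"),
]
print(detokenize(tokens))
```

With this annotation the result is "350 °F (177 °C).". A token pair annotated with CorrectSpaceAfter=No instead would still render with the space, which is what preserving an original typo requires.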
> So I don't see a problem with the GUM policy.
If the original sentence included no space before F and C and these were added when importing to UD, especially if this was a systematic policy, I consider it a problem, which should be fixed.
I meant the GUM policy of treating ° as a separate syntactic word. I don't think it's GUM policy to alter the original text string.
Hi again - indeed, GUM has the policy of representing the original text as accurately as possible; however, this is done by students completely manually, so it is not an error-free process!
I had a look - unfortunately, wikihow does not make older versions of its pages available, apparently as a policy, but I got lucky: I had an older, less edited version of the original data locally, so I was able to determine this was an error. Wikivoyage is easier, since we can just look at the history.
The degree symbol appears in three documents, and they behave a little differently:
177 °C
24°C
30s°F!
The last two can be viewed directly in the version corresponding to the dateCollected metadatum:
https://en.wikivoyage.org/w/index.php?title=Phoenix&oldid=3050999
I'll go ahead and fix it in the source files, the fix will propagate on the next release. Thanks for catching this!
PS - I also think SpaceAfter is the way to go here, the same as how we treat punctuation in English in general, as opposed to clitics.
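For anyone who wants to look for similar cases in raw text, here is a rough Python sketch that flags a degree sign separated by whitespace from a following C or F. The pattern and the function name `find_spaced_degrees` are assumptions for illustration. A hit is only a candidate: whether the space came from the source page or from the import still has to be checked against the original document.

```python
import re

# Degree sign, then one or more whitespace characters, then C or F as a
# standalone letter (the word boundary avoids matching e.g. "° Celsius").
SPACED_DEGREE = re.compile(r"°\s+[CF]\b")


def find_spaced_degrees(text):
    """Return every '° C' / '° F' style match found in text."""
    return [match.group(0) for match in SPACED_DEGREE.finditer(text)]


print(find_spaced_degrees("350 ° F (177 ° C)"))
print(find_spaced_degrees("350 °F (177 °C)."))
```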
In a sentence such as this
Is this a satisfactory tokenization of "° F"?
If I look up degrees F on Wikipedia, for example, it has sentences such as
which makes me think "°F" should be stuck together