UniversalDependencies / UD_English-GUM

Tokenization of degree symbols #83

Closed AngledLuffa closed 2 months ago

AngledLuffa commented 2 months ago

In a sentence such as this:

# text = Preheat oven to 350 ° F (177 ° C).

Is this a satisfactory tokenization of ° F?

If I look up degrees F on Wikipedia, for example, it has sentences such as:

but the original paper suggests the lower defining point, 0 °F

which makes me think °F should be kept together as one token.

amir-zeldes commented 2 months ago

We also tokenize dollar symbols, which are pronounced "dollars", and this would be pronounced "degrees/NNS Fahrenheit/NNP", so I think the tokenization is correct this way.

martinpopel commented 2 months ago

Most English manuals of style agree on no space between the degree symbol and C or F (there are even Unicode code points: U+2103 ℃ DEGREE CELSIUS and U+2109 ℉ DEGREE FAHRENHEIT). So we can assume that writing the space there ("° F" and "° C") is always a typo.
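For reference, here is a small sketch of how those two code points relate to the two-character spelling. It uses only Python's standard `unicodedata` module; the `describe` helper is a hypothetical name introduced for illustration. NFKC normalization is shown only to make the compatibility mapping visible, not as a suggestion to normalize the treebank text, since that would alter the raw string.

```python
import unicodedata


def describe(ch: str) -> tuple[str, str]:
    """Return the Unicode name of `ch` and its NFKC (compatibility) form."""
    return unicodedata.name(ch), unicodedata.normalize("NFKC", ch)


for ch in ("\u2103", "\u2109"):
    name, compat = describe(ch)
    # Show the code points of the compatibility form one by one.
    points = " ".join(f"U+{ord(c):04X}" for c in compat)
    print(f"U+{ord(ch):04X} {name} -> {compat!r} ({points})")
```

Each single code point maps to a DEGREE SIGN (U+00B0) followed by a plain letter, which is the sequence the style manuals recommend writing without a space.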

However, the text metadata should contain the original raw (untokenized) text, including possible typos. If there was "350 ° F (177 ° C)" in the original text, we need to preserve it in the text metadata.

At the original URL https://www.wikihow.life/Prepare-Quinoa#Cooking-In-the-Oven, I can see "350 °F (177 °C)." However, the document was added to GUM on 2016-09-19 and I don't know how to search the history of WikiHow pages. So either the original document contained a typo in 2016, which is now fixed, or the GUM authors introduced the typo. In the latter case, the typo should be fixed (we want to preserve only typos originating from the original document, not from UD processing).

Yet another question is whether we want to treat "°F" and "°C" as a single (syntactic) word in UD or two. I haven't decided yet which one I prefer. I see that UD_English-CTeTex treats it as two words (even when originally written together, e.g. "-20°C to +55°C"). UD_English-ParTUT has one instance of "-40°" as a single word (which I don't like) followed by "C", but this was written originally with a space before C. If we decide for the two-words annotation, as @amir-zeldes suggested above, we would need to treat the single-character Unicode symbols (U+2103 ℃ DEGREE CELSIUS) as multi-word tokens.

nschneid commented 2 months ago

I am against using multiword tokens for anything that is an orthographic convention rather than a morphosyntactic process. IMO symbols e.g. ° resemble punctuation in that they might orthographically lack a separating space but still be considered distinct syntactic words. So SpaceAfter=No seems like the best option if we treat °F as two syntactic words.

While I don't think conversion to pronounced words is necessarily determinative for the treatment of symbols/notation (e.g. "2.6" would be read as three words), it is a heuristic that can serve as a default when we recognize individual symbol characters as mapping to individual words. So I don't see a problem with the GUM policy.

martinpopel commented 2 months ago

> I am against using multiword tokens for anything that is an orthographic convention

But there are already many cases when MWTs are used only for orthographic conventions, e.g. cannot vs can not.

> symbols e.g. ° resemble punctuation in that they might orthographically lack a separating space but still be considered distinct syntactic words. So SpaceAfter=No seems like the best option if we treat °F as two syntactic words.

I agree (and I wrote that I would also prefer SpaceAfter=No, in this case). I was mentioning MWTs as an alternative mostly because I don't see any other option when we choose the two-words solution, but the original text contains the U+2103 DEGREE CELSIUS Unicode character ℃. This is really a single character and a single code point (not even a combining mark plus a character C, although separating combining marks into another word would be problematic as well). The concatenation of tokens in CoNLL-U (and spaces) must render exactly the original text. I admit, there is no U+2103 in the current data, so we may ignore this for now.
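A minimal sketch of that round-trip requirement, assuming the standard 10-column CoNLL-U layout with FORM in column 2 and MISC in column 10. The helpers `row` and `detokenize` are hypothetical names, and the token rows are illustrative rather than copied from GUM. It covers both options discussed here: SpaceAfter=No between two ordinary tokens, and a multiword-token range line whose surface form is the single character ℃.

```python
def row(tok_id: str, form: str, space_after: bool = True) -> str:
    """Build one CoNLL-U line with placeholder values in the unused columns."""
    misc = "_" if space_after else "SpaceAfter=No"
    return "\t".join([tok_id, form] + ["_"] * 7 + [misc])


def detokenize(conllu_sentence: str) -> str:
    """Rebuild the raw text from the FORM and MISC columns of one sentence."""
    pieces = []
    skip_until = 0  # last word ID covered by the current multiword token
    for line in conllu_sentence.strip().splitlines():
        if line.startswith("#"):
            continue  # comment lines such as "# text = ..."
        cols = line.split("\t")
        tok_id, form, misc = cols[0], cols[1], cols[9]
        if "." in tok_id:
            continue  # empty nodes have no surface form
        if "-" in tok_id:
            # Multiword token: use its surface form, skip the words inside it.
            skip_until = int(tok_id.split("-")[1])
        elif int(tok_id) <= skip_until:
            continue
        pieces.append(form)
        if "SpaceAfter=No" not in misc.split("|"):
            pieces.append(" ")
    return "".join(pieces).rstrip()


two_words = "\n".join([row("1", "350"), row("2", "°", space_after=False), row("3", "F")])
with_typo = "\n".join([row("1", "350"), row("2", "°"), row("3", "F")])
as_mwt = "\n".join([row("1", "177"), row("2-3", "\u2103"), row("2", "°"), row("3", "C")])

print(detokenize(two_words))  # 350 °F
print(detokenize(with_typo))  # 350 ° F
print(detokenize(as_mwt))     # 177 ℃
```

Comparing the result of `detokenize` with the sentence's `# text` value is one way to catch a space that was introduced during import rather than present in the source.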

> So I don't see a problem with the GUM policy.

If the original sentence included no space before F and C and these were added when importing to UD, especially if this was a systematic policy, I consider it a problem, which should be fixed.

nschneid commented 2 months ago

I meant the GUM policy of treating ° as a separate syntactic word. I don't think it's GUM policy to alter the original text string.

amir-zeldes commented 2 months ago

Hi again - indeed, GUM has the policy of representing the original text as accurately as possible; however, this is done by students completely manually, so it is not an error-free process!

I had a look - unfortunately, wikiHow does not make older versions of its pages available, apparently as a policy, but I got lucky: I had an older, less edited version of the original data locally, so I was able to determine that the space was an error on our side. Wikivoyage is easier, since we can just look at the history.

The degree symbol appears in three documents, and they behave a little differently:

The last two can be viewed directly in the version corresponding to the dateCollected metadatum:

https://en.wikivoyage.org/w/index.php?title=Phoenix&oldid=3050999

I'll go ahead and fix it in the source files; the fix will propagate with the next release. Thanks for catching this!

PS - I also think SpaceAfter=No is the way to go here, the same as how we treat punctuation in English in general, as opposed to clitics.