Closed OsmanMutlu closed 2 years ago
Thanks. What was the current behavior you saw with no ascii normalization?
Since ord('ı')
is 305, it was throwing out of bounds error for screenname_embed = self.screenname_embed(screenname)
in text_model.py
Thank you, @OsmanMutlu .
@zijwang this will never be an issue with Twitter usernames as they are always ASCII, but can be an issue with usernames from other platforms.
I'm unsure if the proper fix is to convert to ASCII or simply replace non-ASCII usernames with the null/empty token. Given the model was trained only on Twitter (i.e., ASCII) usernames, I want to be careful here of unexpected side effects.
Well, it seems that we had a bug in our Twitter API code and uppercase 'I' characters were turned into lowercase 'ı' characters instead of 'i' characters for screen names. Otherwise, you are correct that screen names are always in ASCII. Sorry for the misunderstanding.
Ah, thanks for sharing that @OsmanMutlu . It's still a good issue to be aware of if the code is ever applied to non-Twitter usernames. For now, however, I think we can close the PR without merging. My concern is that I'm not sure how well the model will perform on non-ASCII usernames from somewhere other than Twitter. For now, I'd like to leave it causing an explicit error with non-ASCII usernames.
In the future, we could probably have a warning and then proceed. :pray:
We need to asciify the screen_name since it can contain non-ASCII characters and the screenname_embed input size (EMBEDDING_INPUT_SIZE_ASCII in consts.py) is 128 (which covers the lower half of ASCII characters)
An example is below. Notice that screen_name starts with an 'ı' which is a non-ASCII Turkish character. {"description": "ileri", "id": "1", "lang": "tr", "name": "ılyas", "screen_name": "ılyas02805430"}