ASCIIfy screen_name in dataset.py

euagendas / m3inference

A deep learning system for demographic inference (gender, age, and individual/person) that was trained on massive Twitter dataset using profile images, screen names, names, and biographies

http://www.euagendas.org

GNU Affero General Public License v3.0

145 stars 57 forks source link

ASCIIfy screen_name in dataset.py #34

Closed OsmanMutlu closed 2 years ago

OsmanMutlu commented 2 years ago

We need to asciify the screen_name since it can contain non-ASCII characters and the screenname_embed input size (EMBEDDING_INPUT_SIZE_ASCII in consts.py) is 128 (which covers the lower half of ASCII characters)

An example is below. Notice that screen_name starts with an 'ı' which is a non-ASCII Turkish character. {"description": "ileri", "id": "1", "lang": "tr", "name": "ılyas", "screen_name": "ılyas02805430"}

zijwang commented 2 years ago

Thanks. What was the current behavior you saw with no ascii normalization?

OsmanMutlu commented 2 years ago

Since ord('ı') is 305, it was throwing out of bounds error for screenname_embed = self.screenname_embed(screenname) in text_model.py

computermacgyver commented 2 years ago

Thank you, @OsmanMutlu .

@zijwang this will never be an issue with Twitter usernames as they are always ASCII, but can be an issue with usernames from other platforms.

I'm unsure if the proper fix is to convert to ASCII or simply replace non-ASCII usernames with the null/empty token. Given the model was trained only on Twitter (i.e., ASCII) usernames, I want to be careful here of unexpected side effects.

OsmanMutlu commented 2 years ago

Well, it seems that we had a bug in our Twitter API code and uppercase 'I' characters were turned into lowercase 'ı' characters instead of 'i' characters for screen names. Otherwise, you are correct that screen names are always in ASCII. Sorry for the misunderstanding.

computermacgyver commented 2 years ago

Ah, thanks for sharing that @OsmanMutlu . It's still a good issue to be aware of if the code is ever applied to non-Twitter usernames. For now, however, I think we can close the PR without merging. My concern is that I'm not sure how well the model will perform on non-ASCII usernames from somewhere other than Twitter. For now, I'd like to leave it causing an explicit error with non-ASCII usernames.

In the future, we could probably have a warning and then proceed. :pray: