faker-js / faker

Generate massive amounts of fake data in the browser and node.js
https://fakerjs.dev

Weird email and username in Chinese locale package #1105

Closed shtse8 closed 1 year ago

shtse8 commented 2 years ago

Describe the bug

Email and username should not use Chinese characters, even in the Chinese locale package. Practically no one uses Chinese characters in an email address or username, even among Chinese speakers.

Reproduction

code

// import { faker } from '@faker-js/faker';
import { faker } from '@faker-js/faker/locale/zh_CN'

export const USERS: User[] = []

export function createRandomUser(): User {
  return {
    userId: faker.datatype.uuid(),
    username: faker.internet.userName(),
    email: faker.internet.email(),
    avatar: faker.image.avatar(),
    password: faker.internet.password(),
    birthdate: faker.date.birthdate(),
    registeredAt: faker.date.past(),
  }
}

Array.from({ length: 1 }).forEach(() => {
  USERS.push(createRandomUser())
})

console.log(USERS)

output

[
  {
    userId: '88d30bb6-c783-4e56-8ffc-6778ec6e1c0a',
    username: '钰轩.侯68',
    email: '明杰_彭@gmail.com',
    avatar: 'https://cloudflare-ipfs.com/ipfs/Qmd3W5DuhgHirLHGVixi6V76LhCkZUz6pnFt5AJBiyvHye/avatar/765.jpg',
    password: 'UdVxsDkMWFajEId',
    birthdate: 1964-10-12T19:43:31.378Z,
    registeredAt: 2022-04-27T11:56:33.741Z
  }
]

Additional Info

No response

Shinigami92 commented 2 years ago

https://en.wikipedia.org/wiki/International_email#Email_addresses 🤔

ST-DDT commented 2 years ago

@shtse8 Is there a "trivial" way to translate/transcribe Chinese words to English/Latin letters?

Are usernames also in Latin letters? I think I have seen mostly Chinese usernames (display names) in Chinese forums (In the few I have ever visited).

shtse8 commented 2 years ago

https://en.wikipedia.org/wiki/International_email#Email_addresses 🤔

That is not the case. In theory, and according to the standard, it is possible to support Chinese in domains, usernames and email addresses, but it is not practical. Chinese is very difficult to input compared to other languages.

@shtse8 Is there a "trivial" way to translate/transcribe Chinese words to English/Latin letters?

Are usernames also in Latin letters? I think I have seen mostly Chinese usernames (display names) in Chinese forums (In the few I have ever visited).

Most of the time it is not possible to use Chinese in an email address or username on any site. Sites won't accept such input because it is hard to handle technically: parsing Chinese is relatively difficult. It is also much easier to type in English, which can be entered directly from the keyboard, one character at a time.

In the Chinese-speaking world, there are many ways to transcribe a Chinese name into English. In Hong Kong, we use our English name or the Cantonese phonetic spelling of our name on our ID card. For example, the surname 陳 is Chan, the surname 張 is Cheung, and the first name 恒 is Hang. So someone called 張恒 might use "Cheung Hang" as his English name, "cheunghang" as his username, and "cheunghang@gmail.com" as his email.

Many of us also have a real English name that we chose ourselves, like Peter or Simon. So if 張恒 takes Peter as his English name, he might use Peter Cheung as his English display name, and also use it for his username and email.

In Mainland China, Taiwan and other Mandarin-speaking places like Singapore and Malaysia, people use Pinyin (the Mandarin phonetic spelling). For example, the surname 陳 (Traditional Chinese) or 陈 (Simplified Chinese) is Chen, the surname 張 or 张 is Zhang, and the first name 恒 is Heng. So someone called 張恒 might use "Zhang Heng" as his English name, "zhangheng" as his username, and "zhangheng@qq.com" as his email.

Let's take a look at Douyin (抖音), the Chinese version of TikTok: https://www.douyin.com/user/MS4wLjABAAAAvOpuhpSOPCAvoa6Slgg54m1DtiTBR4ac003SlM86yoxlmMF3AnnF2c8LzHEocAMj

抖音号 is the username on the platform. This user picked Sariel_740399. I guess Sariel is her English name and 740399 is something meaningful to her, like a birthday.

https://www.douyin.com/user/MS4wLjABAAAApDszKVp0whQtJRUaaDmKnrshCmZ5gwZwcXXnvYsAUFE

This user picked wobushixumengjie. Her Chinese name is shown as 洁梦徐, but in Chinese the family name comes first, so her real Chinese name should be 徐梦洁; she just entered it in reverse. The Pinyin of 徐梦洁 is Xu Meng Jie, which is part of her username. wobushi is the Pinyin (Wo Bu Shi) of 我不是, meaning "I am not".

Hope this helps faker generate more realistic fake data.
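The convention described above (romanize the name, join the syllables, use the result as username and email local part) could be sketched roughly as follows. The `PINYIN` table, `romanize` and `romanizedEmail` are hypothetical names made up for illustration and are not part of faker; a real implementation would need a full romanization dictionary and would have to deal with characters that have more than one reading.

```typescript
// Hypothetical lookup table covering only the characters used in this thread.
// A real implementation would need a complete romanization dictionary.
const PINYIN: Record<string, string> = {
  '張': 'zhang',
  '张': 'zhang',
  '恒': 'heng',
  '徐': 'xu',
  '梦': 'meng',
  '洁': 'jie',
};

// Romanize a name character by character. Returns undefined when any
// character is missing from the table, so the caller can fall back.
function romanize(name: string): string | undefined {
  const parts: string[] = [];
  for (const char of name) {
    const syllable = PINYIN[char];
    if (syllable === undefined) {
      return undefined;
    }
    parts.push(syllable);
  }
  return parts.join('');
}

// Build an ASCII email address from a Chinese name, e.g. for a qq.com mailbox.
function romanizedEmail(name: string, domain: string): string | undefined {
  const local = romanize(name);
  return local === undefined ? undefined : `${local}@${domain}`;
}
```

With this table, `romanize('張恒')` yields `zhangheng` and `romanizedEmail('張恒', 'qq.com')` yields `zhangheng@qq.com`, matching the Mandarin example above. A Cantonese table would be needed to get `cheunghang`.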

Shinigami92 commented 2 years ago

Just my opinion and idea:

I feel like this is out of scope for faker itself. It uses a simple algorithm right now where a first name and last name are just inserted into the email. Faker is not a converter library that specifically converts Chinese names to English ones.

So my proposal (and we can freely discuss about that) would be:

Create/use a package to convert Chinese names to their English counterparts and pass them into the email function of faker.

ST-DDT commented 2 years ago

IMO we could probably add a locale like en_CN that contains some Chinese-sounding (first?/)last names, so it is possible to generate Peter Cheung as the "English" version of the Chinese name, which will then be used to generate the email.

However, it would be up to the user to explicitly select this locale, because technically it is not Chinese anymore, and phonetically converting the text probably takes more than 50 lines of code. And some users might explicitly want Chinese usernames and email addresses, because they have to verify that their system works with those as well. (In Germany, it is possible to use the umlauts äöüß in email addresses. Yes, it is rare, but some people prefer it over the ASCII-converted variants (ae, oe, ue, sz).)

export function createRandomUser(): User {
  return {
    userId: fakerZH.datatype.uuid(),
    username: fakerEN_CN.internet.userName(),
    email: fakerEN_CN.internet.email(),
    avatar: fakerZH.image.avatar(),
    password: fakerZH.internet.password(),
    birthdate: fakerZH.date.birthdate(),
    registeredAt: fakerZH.date.past(),
  }
}

If we add some kind of internal workaround, to delegate to the English Faker ourselves, then we won't be able to split faker into individual locale modules anymore.

@shtse8 What do you think about the en_CN locale approach?

import-brain commented 2 years ago

@shtse8 Is there a "trivial" way to translate/transcribe Chinese words to English/Latin letters?

Are usernames also in Latin letters? I think I have seen mostly Chinese usernames (display names) in Chinese forums (In the few I have ever visited).

There is a romanization system for Chinese characters called "pinyin" as @shtse8 said, but I'm not sure if there's an easy way to transliterate characters into it. I'll look into it.

Edit: Problem is, some Chinese characters have multiple ways to pronounce them based on context :/

Shinigami92 commented 2 years ago

and just one google search away, typing in pinyin npm, the first result is: https://www.npmjs.com/package/pinyin

and there are even alternative packages

so I think this is currently the best workaround for now

According to this answer on Stack Overflow: https://stackoverflow.com/a/760151/6897682, we might want to think about an option to allow/disallow non-English letters and switch strategy based on that. I wouldn't like to have a special case just for Chinese in our code base.

ST-DDT commented 2 years ago

Today another "affected" method and locale showed up: internet.domainWord() https://discord.com/channels/929487054990110771/929544565348777984/990970477138833428

We might have to add an option onlyAscii or similar to some of the internet methods.

schw4rzlicht commented 2 years ago

Especially with internet.domainWord() (or internet.domain() for that matter) it's kind of annoying b/c it leads to our CI failing over and over again (as we validate domain inputs) and always b/c of the word jalapeño which is randomly appearing.

From what I understand, not all TLDs are even accepting internationalized domain names (wiki), so I think it is out of scope for faker to determine which are and keep track of that. Imo, domain words should just not include non-ASCII chars to keep it simple.
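As a rough illustration of the `onlyAscii` idea, the sketch below filters a word list down to ASCII letters, digits and hyphens before picking a domain word. `DOMAIN_WORDS`, `isAsciiDomainWord` and `pickDomainWord` are hypothetical names, and `onlyAscii` is only the option floated in this thread, not an existing faker parameter.

```typescript
// Hypothetical word list; "jalapeño" stands in for any non-ASCII entry.
const DOMAIN_WORDS: string[] = ['quick', 'jalapeño', 'silver', 'café'];

// True when the word consists only of ASCII letters, digits and hyphens.
function isAsciiDomainWord(word: string): boolean {
  return /^[a-z0-9-]+$/i.test(word);
}

// Pick a word from the list. `random` returns a float in [0, 1) and is
// injectable so the sketch can be seeded or tested.
function pickDomainWord(
  words: string[],
  random: () => number,
  options: { onlyAscii?: boolean } = {}
): string {
  const pool = options.onlyAscii ? words.filter(isAsciiDomainWord) : words;
  if (pool.length === 0) {
    throw new Error('No domain words left after filtering');
  }
  return pool[Math.floor(random() * pool.length)];
}
```

Called as `pickDomainWord(DOMAIN_WORDS, Math.random, { onlyAscii: true })`, this can never return `jalapeño`, which is the behaviour a CI job validating domain input would want. Leaving the option off keeps the current behaviour for users who need to test internationalized domain names.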

matthewmayer commented 1 year ago

Perhaps locales which aren't in an ASCII script should optionally be able to provide an alternative set of ASCII first names and last names, to be used in contexts that require ASCII, like email addresses? For example zh_CN, ar, el.
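A minimal sketch of what that could look like, assuming a made-up `NameData` shape with optional `asciiFirstNames` / `asciiLastNames` fields (faker's real locale definitions are structured differently, and the sample names are placeholders):

```typescript
// Hypothetical locale shape; the ascii* fields are the optional alternatives.
interface NameData {
  firstNames: string[];
  lastNames: string[];
  asciiFirstNames?: string[];
  asciiLastNames?: string[];
}

const zhCN: NameData = {
  firstNames: ['恒'],
  lastNames: ['张'],
  asciiFirstNames: ['heng'],
  asciiLastNames: ['zhang'],
};

// A locale that is already ASCII does not need to provide alternatives.
const de: NameData = { firstNames: ['Lisann'], lastNames: ['Meier'] };

// Build an email local part, preferring the ASCII alternatives when present.
// `pick` chooses one entry from a list (random in practice).
function emailLocalPart(
  locale: NameData,
  pick: (items: string[]) => string
): string {
  const first = pick(locale.asciiFirstNames ?? locale.firstNames);
  const last = pick(locale.asciiLastNames ?? locale.lastNames);
  return `${first}.${last}`.toLowerCase();
}
```

Display names would keep using `firstNames` and `lastNames`, so only the contexts that require ASCII change.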

matthewmayer commented 1 year ago

Sample output for

    Object.keys(faker.locales).forEach(locale=>{faker.setLocale(locale); console.log(`${locale}: ${faker.internet.email()}`)})
af_ZA: Harvey_Ferreira60@gmail.com
ar: .@yahoo.com
az: Kellie_Hansen@yahoo.com
cz: Krytof9@atlas.cz
de: Lisann_Tsamonikian@yahoo.com
de_AT: Lenja2@gmail.com
de_CH: Marlies29@hotmail.com
el: .@gmail.com
en: Isobel40@yahoo.com
en_AU: Eliza_Edwards@yahoo.com
en_AU_ocker: Oliver46@gmail.com
en_BORK: Vita.Buckridge78@yahoo.com
en_CA: Fausto18@gmail.com
en_GB: Adrienne.Konopelski@yahoo.com
en_GH: person.female_first_name.Kusi@hotmail.com
en_IE: Oswaldo.Dietrich@hotmail.com
en_IN: Baalaaditya15@yahoo.co.in
en_NG: Titi.Christian94@yahoo.com
en_US: Erik83@hotmail.com
en_ZA: Amelia_Connelly33@yahoo.com
es: Esteban93@gmail.com
es_MX: Mayte.Ruiz@nearbpo.com
fa: 72@yahoo.com
fi: Oskari.Hmlinen@hotmail.com
fr: Flavie_Nguyen@hotmail.fr
fr_BE: Freda7@advalvas.be
fr_CA: Daija_Osinski@yahoo.ca
fr_CH: Arion13@hotmail.com
ge: _@posta.ge
he: 14@gmail.com
hr: David.Zdelar48@gmail.com
hu: Dina63@outlook.com
hy: .@gmail.com
id_ID: Paul_OKeefe@yahoo.co.id
it: Igor24@libero.it
ja: 太一.中村73@gmail.com
ko: 71@yahoo.co.kr
lv: Grover_Kshlerin@apollo.lv
mk: 44@hotmail.com
nb_NO: Herman_Strand@yahoo.com
ne: Raju26@gmail.com
nl: Nick.Janssen33@gmail.com
nl_BE: Amy44@gmail.com
pl: Gerald.Urbanowicz44@yahoo.com
pt_BR: Lvia98@yahoo.com
pt_PT: Edgar60@mail.pt
ro: Trenton46@hotmail.com
ru: Seamus.Carter@yahoo.com
sk: Bethany.Parisian@zoznam.sk
sv: Monica.Axelsson@gmail.com
tr: Brbars1@yahoo.com
uk: Hellen_Price34@ukr.net
ur: .@gmail.com
vi: VinhDiu.Mai@yahoo.com
zh_CN: 鑫鹏_宋@gmail.com
zh_TW: 樂駒76@hotmail.com
zu_ZA: Maphikelela.Mabhida@hotmail.com

I note there are two groups of locales with slightly different problems:

zh_CN, zh_TW and ja contain unstripped non-ASCII characters.

ar, el, fa, ge, he, hy, ko, mk, ur are stripped down and generally only contain the characters _.0123456789, often giving an invalid address like .@gmail.com

matthewmayer commented 1 year ago

The difference seems to come down to the fact that faker.helpers.slugify has some exceptions for Japanese and Chinese characters

https://github.com/faker-js/faker/blame/next/src/modules/helpers/index.ts#L37

slugify(string: string = ''): string {
    return string
      .replace(/ /g, '-')
      .replace(/[^\一-龠\ぁ-ゔ\ァ-ヴー\w\.\-]+/g, '');
  }

Note the Chinese and Japanese characters here are not stripped but Cyrillic, Arabic, Korean are:

faker.helpers.slugify("ABCD123 靖琪 結衣 용환.예 Саве.Панговски زینہ81") //'ABCD123-靖琪-結衣-.-.-81'

matthewmayer commented 1 year ago

... and that was originally introduced here: https://github.com/faker-js/faker/commit/0d3809d4c83f9f5c29d99040df84b7353fe32255

It seems to have caused more problems than it solved, so perhaps that could be reverted, and a more general solution found for all the non-ascii-ish locales.
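To make the difference concrete, here is a standalone sketch comparing the behaviour quoted above with what a revert might look like. `slugifyWithCjkException` uses a pattern equivalent to the quoted one (redundant backslashes dropped), and `slugifyAsciiOnly` is a hypothetical variant without the exception; neither function name exists in faker.

```typescript
// Equivalent to the quoted pattern: CJK ideographs and kana are exempt
// from stripping, every other non-word character is removed.
function slugifyWithCjkException(input: string): string {
  return input
    .replace(/ /g, '-')
    .replace(/[^一-龠ぁ-ゔァ-ヴー\w.\-]+/g, '');
}

// Hypothetical variant without the exception: only ASCII word characters,
// dots and hyphens survive (\w is ASCII-only without the `u` flag).
function slugifyAsciiOnly(input: string): string {
  return input
    .replace(/ /g, '-')
    .replace(/[^\w.\-]+/g, '');
}
```

With the exception, `鑫鹏 宋` becomes `鑫鹏-宋`; without it, the same input collapses to `-`, the same kind of empty result the Korean and Cyrillic locales already get. Reverting alone therefore makes the output consistent across scripts, but still leaves invalid local parts to be handled some other way.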

ST-DDT commented 1 year ago

I don't think that @example.com is any more useful than <InsertChineseCharactersHere>@example.com.

matthewmayer commented 1 year ago

as a simple solution, in non-ascii locales you could just make a purely random localPart for email addresses like two letters, followed by 5-8 numbers, e.g.

mj1234415@example.com

... at least it would be a valid email address.
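A minimal sketch of that fallback, assuming a hypothetical helper named `randomLocalPart` (this is not the code from any actual PR):

```typescript
// Generate a purely random ASCII local part: two lowercase letters followed
// by 5 to 8 digits. `random` returns a float in [0, 1) and is injectable so
// the sketch can be seeded or tested.
function randomLocalPart(random: () => number = Math.random): string {
  const letters = 'abcdefghijklmnopqrstuvwxyz';
  const digits = '0123456789';
  let result = '';
  for (let i = 0; i < 2; i++) {
    result += letters.charAt(Math.floor(random() * letters.length));
  }
  const digitCount = 5 + Math.floor(random() * 4); // 5, 6, 7 or 8
  for (let i = 0; i < digitCount; i++) {
    result += digits.charAt(Math.floor(random() * digits.length));
  }
  return result;
}
```

An address would then be built as `${randomLocalPart()}@example.com`. The result carries no trace of the generated person's name, which is the trade-off of this approach.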

matthewmayer commented 1 year ago

I created #1554 as a tentative solution for this. I'm not sure it would be the best long-term solution, but it at least means that all locales return valid ASCII email addresses.

kz-d commented 1 year ago

Email and username should not use Chinese characters, even in the Chinese locale package. Practically no one uses Chinese characters in an email address or username, even among Chinese speakers.

At least for email addresses, the same goes for Japan. (If you enter a Japanese email address, it will be rejected by validation, even on most systems used in Japan.)

as a simple solution, in non-ascii locales you could just make a purely random localPart for email addresses like two letters, followed by 5-8 numbers

I think this fix will help!

matthewmayer commented 1 year ago

Thanks @kz-d, good to get a Japanese opinion too :) I guess the #1554 PR will help with #1437 as well.