faker-js / faker

Generate massive amounts of fake data in the browser and node.js
https://fakerjs.dev

Lorem EN `word.ts` contains a single character in the pool #3261

Open konarx opened 17 hours ago

konarx commented 17 hours ago

Pre-Checks

Describe the bug

I was using lorem.word for testing, and I had a failing test. I was confident enough to ping the developer:

- "Hey, you have a bug here."
- "No, you are providing a single character when I expect a word of at least two characters."
- "No, I don't. I use a library that specifically uses random words, not chars."
- "Yes, you are. Here, see the payload you provided yourself."
- "Ah..."

And here I am :) There is an 'a' character here, which is NOT a word, so I do not think it should be in this pool.

Minimal reproduction code

No response

Additional Context

No response

Environment Info

-

Which module system do you use?

Used Package Manager

npm

ST-DDT commented 15 hours ago

FFR:

https://github.com/faker-js/faker/blob/467bd83dbd2f34dbd5080c45de18901823f83469/src/locales/en/lorem/word.ts#L2

The English locale does have other one-character (non-lorem) words as well (most prominently "I" and "a"). Do you consider those to be words? To be clear, I'm not against removing it, I just wish to understand your use case a bit more. Because if we remove "a" from the list, then maybe someone else will consider two-letter words to be too short.

If you need words of a certain length, have you tried faker.word.sample({length: { min: 2, max: 1000 }})? Or do you specifically need a similar feature for the lorem words?
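For the lorem case, a minimal sketch of the same idea: filter the pool by length before sampling. This assumes a small stand-in pool; `loremPool`, `wordsWithLength`, and `sampleWord` are hypothetical names for illustration, not part of Faker's API.

```typescript
// Hypothetical stand-in for a lorem word pool; the real list lives in the linked word.ts.
const loremPool: readonly string[] = ['a', 'ab', 'et', 'qui', 'lorem', 'ipsum'];

// Hypothetical helper: keep only words whose length lies within [min, max].
function wordsWithLength(
  pool: readonly string[],
  min: number,
  max: number = Infinity
): string[] {
  return pool.filter((word) => word.length >= min && word.length <= max);
}

// Hypothetical helper: pick one word uniformly at random from the filtered candidates.
// `rng` is injectable so the behavior can be tested deterministically.
function sampleWord(
  pool: readonly string[],
  min: number,
  rng: () => number = Math.random
): string {
  const candidates = wordsWithLength(pool, min);
  if (candidates.length === 0) {
    throw new Error(`No word with at least ${min} characters in the pool`);
  }
  return candidates[Math.floor(rng() * candidates.length)];
}

// Usage: never yields the single-character entry 'a'.
const label = sampleWord(loremPool, 2);
```

Filtering first and then sampling keeps the distribution uniform over the remaining words, which a "retry until long enough" loop would also achieve, but without the unbounded retries.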

konarx commented 13 hours ago

Do you consider those to be words?

I understand your perspective, but this approach might be a bit abstract. For instance, the character I can also represent a Roman numeral, so it feels more suited to be treated as a character rather than a word in the traditional sense. Personally, I find it a bit misleading to keep single-character elements like I and a in a word pool—they're more accurately handled within a character set or pool rather than a word list.

If you're aiming for control over word length, I’d recommend focusing on ensuring that single characters don’t get pulled into word contexts, rather than adjusting word definitions. That way, we can keep words to truly represent terms rather than individual characters.

ST-DDT commented 12 hours ago

Thanks for sharing your opinion. This is really useful in understanding the expectations of our users, their thoughts and decision making processes.

Is it possible for you to share

ST-DDT commented 12 hours ago

In a sense, these one-character words have found the exact issue they are meant to find: differences in how specific terms are understood, and perhaps room to improve the documentation and specification.

Do you expect to get them when you ask for a word? In this case: no.

And more importantly: do you think of them when you define the input as "a word"? Do your/our users think of them? Should you/they? How do we communicate that to our respective users? Most two-letter words aren't any more useful or valid by themselves than one-letter words.

"I" and "a" are in the English dictionary, so at least some people consider them to be words. I'll consult a Latin lexicon later, and we will discuss this issue in the next team meeting.

matthewmayer commented 12 hours ago

"a" is a valid Latin word like "a populo" (by the people) as is "e" (e pluribus unum).

ST-DDT commented 12 hours ago

@matthewmayer Would you expect lorem.word() to return these one-letter words? And what is your opinion regarding a word length parameter?

matthewmayer commented 12 hours ago

In a sense, these one character words have found the exact issue they are meant to find.

I agree with this. Having the one character word led to a conversation between two people which led to a better understanding of what the actual requirements for a parameter were. That's a good thing.

Similarly having words like jalapeño in the English word list might help uncover a hidden requirement that a "word" is supposed to be ASCII #1538

konarx commented 9 hours ago

"a" is a valid Latin word like "a populo" (by the people) as is "e" (e pluribus unum).

This is probably the most accurate explanation; thank you, @matthewmayer.

In my case, I opted not to use en/word/adjective.ts because I needed to create a simple Label: just a straightforward string without special characters that could serve as an indicator. Since the adjective list includes hyphenated (-) terms like black-and-white and extra-large, it didn't quite fit my needs. So, avoiding those entries was the better choice for me.
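A minimal sketch of that label requirement, assuming "simple" means ASCII letters only and at least two characters. `samplePool` and `isPlainLabel` are hypothetical names for illustration; the entries are taken from examples mentioned in this thread, not from an actual Faker list.

```typescript
// Hypothetical mixed pool built from the examples discussed in this thread.
const samplePool: readonly string[] = [
  'black-and-white',
  'extra-large',
  'brave',
  'jalapeño',
  'a',
];

// Hypothetical predicate: ASCII letters only (no hyphens, no accented
// characters) and at least `minLength` characters long.
function isPlainLabel(word: string, minLength: number = 2): boolean {
  return word.length >= minLength && /^[A-Za-z]+$/.test(word);
}

// Keep only the entries that satisfy the label requirement.
const labelCandidates = samplePool.filter((word) => isPlainLabel(word));
```

Making the requirement an explicit predicate documents what "a word" means for the consumer, instead of relying on what a given pool happens to contain.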

However, we can all agree that some words and characters overlap categories, which might be a bit confusing. Ideally, each string should fit the closest category. For example, Nick is both a name and a word, but I wouldn't expect to find it in the name pool (it's not there; it's just an example off the top of my head).

Thank you all for the insights and the clarification!