datafaker-net / datafaker

Generating fake data for the JVM (Java, Kotlin, Groovy) has never been easier!
https://www.datafaker.net
Apache License 2.0
1.09k stars 151 forks source link

generate phone numbers using Google library "libphonenumber" #1249

Closed asolntsev closed 3 weeks ago

asolntsev commented 1 month ago

This library already contains phone number patterns for all countries. It's reasonable to reuse them instead of re-inventing the wheel.

Now PhoneNumberTest is stable, and doesn't need all these "max allowed errors count" hacks.

P.S. Maybe it's a good idea not to include this PR into the nearest release. It's quite a big change, maybe postpone it to a next major release?

erikpragt-connectid commented 1 month ago

Sorry, I'd like to veto this PR. I'm against bringing in new production libraries into Datafaker. It took me a lot of time to get most of them out, and only the Regex one is a special case for this. If you really want this, we could discuss making an SPI for this or so, that when you add a dependency to datafaker-phone-numbers it would use this library instead?

asolntsev commented 1 month ago

I'm against bringing in new production libraries into Datafaker.

Interesting take. :) And what's the reason of not including USEFUL libraries? Do you really want just to do the same job again?

NB! Look at PhoneNumberTest: DataFaker almost always generates an invalid phone number. For some countries, in 100% cases, and for most countries - in 77% cases.

@erikpragt-connectid Is there any reason in the world why generating invalid data is better than adding a dependency?

erikpragt-connectid commented 1 month ago

I'm against bringing in new production libraries into Datafaker.

Interesting take. :) And what's the reason of not including USEFUL libraries? Do you really want just to do the same job again?

There are many useful libraries, but that's not a reason to include them. Maybe we should have used Jackson to generate JSON and XML, and Apache CSV for the CSV files, and Hibernate for the SQL generation?

The Datafaker should be minimal library, small and non-invasive, and no dependencies. No dependencies is hard since the regex library and snakeyml are libraries which are hard to avoid since a large part of the codebase is built around it, and probably too complex to embed in the code itself.

It's also never the goal of Datafaker to provide 100% real data (Most likely the credit cards generated won't work either) The goal is to provide real looking data, and introducing a library for a niche problem (increasing the validity of some phone numbers a bit) isn't worth it. If this functionality is extremely important for an end user, there's nothing to stop them using google phone numbers directly, they don't need Datafaker for that.

asolntsev commented 1 month ago

there's nothing to stop them using google phone numbers directly, they don't need Datafaker for that

No, "libphonenumber" cannot generate random phone number. It can only validate phone numbers. It's exactly the niche of Datafaker to generate data.

Maybe we should have used Jackson to generate JSON and XML, and Apache CSV for the CSV files, and Hibernate for the SQL generation?

I agree that adding Jackson or Hibernate would be a bad idea BECAUSE generation of json/xml/sql is NOT the primary goal of DataFaker. BUT generation of random data IS the primary goal of DataFaker.

I believe this is a reasonable general rule: it's reasonable to add a dependency if:

  1. This library helps to achieve the primary goal, and
  2. This library is hard to replace with self-written code.

I think "libphinenumber" exactly matches both criterias (and Jackson/Hibernate do not match).

snuyanzin commented 1 month ago

Despite the fact I understand the rationale of usage libphonenumber I would agree that it's better not to add it. Datafaker provides a number of different generators and adding a library just to improve one of them could be overkill in future. For instance if we add a lib for phone numbers why not for addresses and etc... Then finally it will tend to be a super fat... At the same time not every user uses phone number generator at least once...

From the side as it was mentioned such lib could improve quality of generated data.

Thus I would suggest to think towards kind of plugin api (SPI or whatever). We already have custom provider support we allows to create a provider with better quality. May be we need to think about more fine-grained api allowing to use more accurate custom code for some providers (currently just brainstorming)

P.S.: I would even remove snakeyaml and regexp libs or made them pluggable, however it came with javafaker and requires large investment to start using them only optionally...

asolntsev commented 1 month ago

@snuyanzin @erikpragt-connectid Ok, would you mind creating a new repository like https://github.com/datafaker-net/datafaker-phones? Who will be responsible for releasing new versions then? Someone will need to always remember to release datafaker-phones.jar every time when they release datafaker.jar It's easy to forget. Are you sure it's convenient?

P.S. I sincerely don't understand why you want to avoid dependencies. There is absolutely no reasons for this. Sorry, it's not normal. All projects have dependencies. Look around, the world is global. Countries use each other's products. Projects use each other as dependencies. Etc.

snuyanzin commented 1 month ago

Ok, would you mind creating a new repository like https://github.com/datafaker-net/datafaker-phones? Who will be responsible for releasing new versions then? Someone will need to always remember to release datafaker-phones.jar every time when they release datafaker.jar It's easy to forget. Are you sure it's convenient?

should we have a new repo? I guess having another module depending on a library would be ok. Since then there could be a core module without extra dependencies and some other with. At the same time since everything in one repo then all could be released together and there is no issue with forgetting to release something

asolntsev commented 1 month ago

Ok, you suggest to make a multi-module project from datafaker? Yes, sure, why not. I can modify the PR accordingly.

bodiam commented 1 month ago

Well, do you really need the library at runtime if all you need is the formats? As in, if we would use the lib phone number at build time to extract the formats, would that be an option?

kingthorin commented 1 month ago

libohonenumber's metadata is available as an xml file. I guess we could create a GitHub workflow to pull it and update our yaml(s) from time to time.

Though that seems like a lot of extra work to for some reason avoid including a 350k dependency. (I know you're worried about bloat but I suspect we could come up with some reasonable criteria/rationale to be applied in this and future cases.)

bodiam commented 1 month ago

Javafaker had a ton of dependencies. Most of the dependencies had bugs or CVEs, which caused Javafaker to have CVEs too. Besides that, dependencies can have dependencies, and those dependencies can conflict with other dependencies.

Maybe "no extra dependencies" is too strict, but I wouldn't know where to draw the line. Just keep in mind we have around 500.000 downloads per month, so that's the impact we can have on people.

kingthorin commented 1 month ago

Javafaker had a ton of dependencies. Most of the dependencies had bugs or CVEs, which caused Javafaker to have CVEs too. Besides that, dependencies can have dependencies, and those dependencies can conflict with other dependencies.

Understood, I wouldn't want us to get into modern dependency hell like Node projects 😀

Maybe "no extra dependencies" is too strict, but I wouldn't know where to draw the line. Just keep in mind we have around 500.000 downloads per month, so that's the impact we can have on people.

That's a good stat to know, and definitely helps contextualize things.

As for criteria, based just on what's been mentioned so far.

Of course we can edit or add more 🤷‍♂️

bodiam commented 1 month ago

Thanks for that writeup, appreciate it! Given that it doesn't introduce a whole tree of dependencies (see the merits of dependency free libraries? 😉) and the size is small, and all of the above, plus that we live in a democracy, and "code wins", I'm happy to change my mind on this. Maybe I'll maakt a small note in the contributing section (be mindful when introducing new libraries, or something like that).

What do you think @snuyanzin ?

bodiam commented 1 month ago

Small question though : is the result still repeatable? I think I saw something in the PR about having to generate a phone number a couple of times to create a valid one?

asolntsev commented 1 month ago

is the result still repeatable

Good question. But yes, the result is highly repeatable (all tests pass).

But yes, inside we have a loop a'la "up to 100 retries". The reason is that regex is not enough to generate a 100% valid phone number. We generate a random phone using regex, but then need to validate it using "libphonenumber" because some generated numbers are still not valid because of area codes etc.

asolntsev commented 4 weeks ago

@bodiam @snuyanzin You decide. I can prepare another PR which adds "libphonenumber" generation as a separate module, no problems. It may be even good to show others how they could add other modules as well.

The only drawback is that such PR would move all existing sources to a subfolder, something like "core" or "engine" or just "datafaker". It's because Maven multi-module architecture requires the root project have type "pom", and submodules can have type "jar".

bodiam commented 4 weeks ago

I think the current way of doing it is fine, no need to overcomplicate things. I'm happy if you want to merge it.