brianvoe / gofakeit

Random fake data generator written in go
MIT License
4.49k stars, 263 forks

Better address #196

Open brianvoe opened 2 years ago

brianvoe commented 2 years ago

Right now the generated city may not be in the generated state, the zip code may not match either, and the full address isn't a good representation of actual usage.

dhartford commented 1 year ago

Most of the implementations I've seen with this capability require a data file (CSV, pipe-delimited, whatever) that has all three columns for US-based city, state, and zip code, so a row can be returned as a single tuple/struct equivalent. Different countries, different solutions.

Pseudo example for the U.S.:

mycity, ZZ, 98765
mysuburb, ZZ, 98765
mymetro, ZZ, 98789

Note: yes, in the U.S., multiple cities can share the same zip code.

Usually the state is either spelled out in full or given as a state code, so a helper function may be needed to convert back and forth depending on the use case.
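A minimal sketch of what such a conversion helper might look like. The names `StateName` and `StateCode` are hypothetical, not part of gofakeit, and the table only lists three states for illustration.

```go
package main

import (
	"fmt"
	"strings"
)

// stateNames maps a two-letter code to the full state name.
// Only a few entries are shown; a real table would cover every state.
var stateNames = map[string]string{
	"CA": "California",
	"NY": "New York",
	"TX": "Texas",
}

// stateCodes is the reverse lookup (lower-cased name -> code),
// built once from stateNames so the two tables cannot drift apart.
var stateCodes = func() map[string]string {
	m := make(map[string]string, len(stateNames))
	for code, name := range stateNames {
		m[strings.ToLower(name)] = code
	}
	return m
}()

// StateName returns the full name for a state code (case-insensitive).
func StateName(code string) (string, bool) {
	name, ok := stateNames[strings.ToUpper(strings.TrimSpace(code))]
	return name, ok
}

// StateCode returns the two-letter code for a full state name (case-insensitive).
func StateCode(name string) (string, bool) {
	code, ok := stateCodes[strings.ToLower(strings.TrimSpace(name))]
	return code, ok
}

func main() {
	name, _ := StateName("ny")
	code, _ := StateCode("Texas")
	fmt.Println(name, code)
}
```

Deriving the reverse map from the forward one keeps a single source of truth, at the cost of a small amount of work at start-up.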

It should be parameterized to support range-based randomization that focuses on the desired need, i.e. picking only from one or several state codes, or one or several zip codes, to avoid over-sized randomization followed by trimming later.
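To make the idea concrete, here is a rough sketch of a city/state/zip row plus a filtered random pick. Everything in it is an assumption for illustration: the `Location` and `Filter` types, `RandomLocation`, and `newRand` are hypothetical names, and the rows are the placeholder values from the pseudo example plus one made-up extra row.

```go
package main

import (
	"fmt"
	"math/rand"
)

// Location keeps city, state, and zip together so they always agree.
type Location struct {
	City  string
	State string // two-letter state code
	Zip   string
}

// locations is placeholder data, not real places.
var locations = []Location{
	{"mycity", "ZZ", "98765"},
	{"mysuburb", "ZZ", "98765"},
	{"mymetro", "ZZ", "98789"},
	{"othertown", "YY", "12345"},
}

// Filter narrows the candidates before picking.
// An empty slice means "no restriction" for that field.
type Filter struct {
	States []string
	Zips   []string
}

func contains(list []string, s string) bool {
	for _, v := range list {
		if v == s {
			return true
		}
	}
	return false
}

// newRand returns a seeded generator so results are reproducible.
func newRand(seed int64) *rand.Rand {
	return rand.New(rand.NewSource(seed))
}

// RandomLocation picks uniformly among the rows matching f.
// It reports false when no row matches.
func RandomLocation(r *rand.Rand, f Filter) (Location, bool) {
	var matches []Location
	for _, loc := range locations {
		if len(f.States) > 0 && !contains(f.States, loc.State) {
			continue
		}
		if len(f.Zips) > 0 && !contains(f.Zips, loc.Zip) {
			continue
		}
		matches = append(matches, loc)
	}
	if len(matches) == 0 {
		return Location{}, false
	}
	return matches[r.Intn(len(matches))], true
}

func main() {
	r := newRand(1)
	if loc, ok := RandomLocation(r, Filter{States: []string{"ZZ"}}); ok {
		fmt.Printf("%s, %s %s\n", loc.City, loc.State, loc.Zip)
	}
}
```

Filtering before the random pick means the caller never has to generate a large sample and throw most of it away.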

I'm not sure where other implementations get those data files however...so if someone knows, that might be the first step!

P.S. I would not recommend 'real' street addresses for this library, but a 'real' city/state/zip combination would at least work for geo-heatmap analytics kinds of work, even when randomized, as long as the street address is ignored.

brianvoe commented 1 year ago

I completely agree. If anyone has that data and can make it easily usable, I'm open to seeing that PR.

rashmi-tondare commented 1 year ago

Hi @brianvoe, I'd be interested in working on this issue. I was trying to find some data sources for addresses and came across this repo - https://github.com/dr5hn/countries-states-cities-database - and wanted to get your thoughts on it. Do you think it would be a good idea to use the countries-states-cities.json file as a data source? We could use a random int to pick the country first, then a state within it, then a city within that state, and so on.
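A rough sketch of that country, then state, then city walk. The nested types and the data below are made up for illustration and are not taken from the countries-states-cities.json file, whose actual schema may differ; `RandomAddress` and `newRand` are hypothetical names.

```go
package main

import (
	"fmt"
	"math/rand"
)

// State and Country form a simple nested hierarchy.
type State struct {
	Name   string
	Cities []string
}

type Country struct {
	Name   string
	States []State
}

// countries is invented placeholder data.
var countries = []Country{
	{Name: "Countryland", States: []State{
		{Name: "North State", Cities: []string{"Alpha", "Beta"}},
		{Name: "South State", Cities: []string{"Gamma"}},
	}},
	{Name: "Otherland", States: []State{
		{Name: "East State", Cities: []string{"Delta", "Epsilon"}},
	}},
}

// Address is the consistent result of one walk down the hierarchy.
type Address struct {
	Country, State, City string
}

// newRand returns a seeded generator so results are reproducible.
func newRand(seed int64) *rand.Rand {
	return rand.New(rand.NewSource(seed))
}

// RandomAddress picks each level from the children of the previous one,
// so the city always belongs to the state and the state to the country.
// It assumes every country has at least one state and every state at
// least one city; real data would need guards for empty lists.
func RandomAddress(r *rand.Rand) Address {
	c := countries[r.Intn(len(countries))]
	s := c.States[r.Intn(len(c.States))]
	city := s.Cities[r.Intn(len(s.Cities))]
	return Address{Country: c.Name, State: s.Name, City: city}
}

func main() {
	a := RandomAddress(newRand(1))
	fmt.Printf("%s, %s, %s\n", a.City, a.State, a.Country)
}
```

Note that this picks countries uniformly, so a city in a small country is far more likely to come up than any single city in a large one.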

brianvoe commented 1 year ago

I do like the idea. I'm going to give you some of the challenges I see, and you (or anyone else) can let me know your thoughts.

  1. Data size - I just looked at the countries-states-cities.json file and it's 36 MB. Even if we removed all the parts we didn't need and got the size down to 10 MB, I feel like that is still too much to add to everyone's app, especially since most people may not even use address data. If we could get it under 1 MB, then maybe that's something we can work with.
  2. Licensing - I'm not too familiar with Open Data Commons licensing, but we would have to figure that out as well.

rashmi-tondare commented 1 year ago

  1. Data size - We could use go-bindata, which converts any file into Go code: the file data becomes a byte slice, along with helper functions to fetch and decode it into in-memory structs. I tried out a small PoC and it generated a 13.8 MB file from the 36 MB JSON file. We can clean up the data to keep only the fields we need, which would bring down the generated file size considerably. However, this data doesn't contain zip codes, so if those are essential we will need to look for another source.
  2. Licensing - Would that really be an issue, since ODC allows for both commercial and private use?

brianvoe commented 1 year ago

  1. One of the things I was trying to cross-reference is a population for each city. If I knew population sizes, I could include only the top 20 or so cities in each state, and that would significantly lower the size of the data. My town only has 10,000 or so people in it, and I don't think I would care whether it was included in a random data selection. I want to try to stay away from something like go-bindata, just for simplicity's sake.
  2. As far as the license goes, I don't know if I can maintain an MIT license and import something else that isn't MIT. But if we only use a subset of that data, I'm not sure licensing is still an issue. I never really know where that line is. If a package has a simple function that, let's say, adds two numbers together, can no one else have a function that does the same thing? I don't know.

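The "top 20 or so cities in each state" trimming could be sketched roughly as below, assuming a population figure were available for each city (which is exactly the data still missing in this thread). The `City` type, `TopCitiesPerState`, and the sample rows are hypothetical.

```go
package main

import (
	"fmt"
	"sort"
)

// City is a hypothetical row with a population figure attached.
type City struct {
	Name       string
	State      string
	Population int
}

// TopCitiesPerState groups cities by state and keeps at most n per
// state, ordered from largest to smallest population.
func TopCitiesPerState(cities []City, n int) map[string][]City {
	if n < 0 {
		n = 0
	}
	byState := make(map[string][]City)
	for _, c := range cities {
		byState[c.State] = append(byState[c.State], c)
	}
	for state, list := range byState {
		sort.Slice(list, func(i, j int) bool {
			return list[i].Population > list[j].Population
		})
		if len(list) > n {
			list = list[:n]
		}
		byState[state] = list
	}
	return byState
}

func main() {
	// Made-up sample data.
	cities := []City{
		{"Bigtown", "ZZ", 900000},
		{"Midtown", "ZZ", 120000},
		{"Smallville", "ZZ", 10000},
		{"Lonelyburg", "YY", 5000},
	}
	for state, list := range TopCitiesPerState(cities, 2) {
		fmt.Println(state, len(list))
	}
}
```

A one-off script like this could run at data-preparation time, so only the trimmed result would need to ship with the package.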
rashmi-tondare commented 1 year ago

  1. We could clean it up to include just 20 random cities in each state. I think that should be fine, even if it means we end up skipping some popular cities, as long as we keep the right city-state-country relation. We could mention this explicitly in the documentation.
  2. This is a very valid point, and honestly I don't know what the right answer is. But as indicated in these two issues, the author seems to be fine with people using modified versions of the data as long as it's credited in the README:
     2.1 https://github.com/dr5hn/countries-states-cities-database/issues/179
     2.2 https://github.com/dr5hn/countries-states-cities-database/issues/272

brianvoe commented 1 year ago

OK, sounds good to me. I'll try to get this file size as small as possible. I'll see if I can find some sort of list that indicates population or popularity and cross-reference that. We'll figure it out.

rashmi-tondare commented 1 year ago

Hi @brianvoe, just wanted to touch base with you regarding the data clean up. Is there anything I can do to help with that? Were you able to find a data source for city-wise population?

brianvoe commented 1 year ago

Sorry, I haven't been able to look into this. I am trying to finish a new implementation of another open source project I have called SlimSelect. Once I am done over there, I can switch back and try to figure this out.

If anyone has time, please try to look into this. Even just finding population data would be a huge help in getting this feature implemented.

rashmi-tondare commented 1 year ago

Found something here: world-cities. From the description:

This datapackage only list cities above 15,000 inhabitants

The json file is ~2MB and has city, state, country info. We'll still have to cross-reference some other data for the pincodes, but let me know if this seems like a usable, reliable data source.

brianvoe commented 1 year ago

I think this is a good try, but it still doesn't allow me to lower the output limit based on population. I think a 15,000-inhabitant cutoff would still leave the data too large for this open source package.