Meekohi closed this issue 4 years ago
I think a better list would be Urban areas as defined by United States Census Bureau (https://en.wikipedia.org/wiki/List_of_United_States_urban_areas). Charlottesville is on the list.
Works for me 👍 Let me know if there's anything I can help with.
Looking closer I think this might not be a great plan. The "Metro Area" is often a concatenated name, but doesn't include all major cities inside that Metro Area.
For example, the Anniston–Oxford metropolitan area contains as its major cities Anniston, Oxford, and Jacksonville.
I think someone would be much more likely to use "Jacksonville" than "Anniston-Oxford" in this case.
Script to get names of Metro Areas with over 50,000 population:
curl -s "https://www2.census.gov/geo/docs/reference/ua/ua_list_all.txt" | awk -F ',|( +)' '{ if($4 > 50000) {print $2} }' | uniq
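For anyone who wants to try the field splitting offline, here is a minimal sketch on a few made-up rows. It assumes the Census file separates columns with runs of two or more spaces and that the name column looks like `Name, ST Urbanized Area`. The sample rows, their numbers, and the helper names `sample_rows` and `names_over` are hypothetical and not taken from the real file.

```shell
#!/bin/sh
# Hypothetical rows: codes and populations are invented for illustration,
# and the column layout is an assumption about the Census file.
sample_rows() {
  cat <<'EOF'
UACE  NAME                                 POP    HU
00001  Anniston--Oxford, AL Urbanized Area  80000  35000
00002  Smallville, KS Urban Cluster         12000  5000
00003  Charlottesville, VA Urbanized Area   90000  40000
EOF
}

# Split each row on a comma or on two-or-more spaces, skip the header row,
# and print the name field ($2) when the population field ($4) exceeds $1.
names_over() {
  awk -F ',|[ ][ ]+' -v min="$1" 'NR > 1 && $4 + 0 > min { print $2 }'
}

sample_rows | names_over 50000
```

Writing the separator as `[ ][ ]+` keeps both spaces explicit in the pattern, so it still means "two or more spaces" if whitespace gets collapsed somewhere along the way.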
@Meekohi
Hello!
May I use this list in the British speller?
I am maintaining the GB speller.
Also, what are the names with two "--" and with "-"s? Should I delete them?
Thank you!
Ohhhh.. I already added your first list in 2018.
Is this a new one with more names?
http://proofingtoolgui.org/en_GB_README.html
"Cities from US: On V2.65 I added tons of cities in the US with a 10 000+ population, since they are in valid English. This list was supplied by Michael Holroyd on Kevin Atkinson's GitHub."
Yes to be clear -- the second list above I just posted is based on the "US Census Urban Area" as Kevin suggested -- but I don't actually think it's a good idea to use, I prefer the first list which is of "US Census Cities". Just wanted to post it here so everyone could discuss (the US Census data is annoying to parse without dumb tricks like that `awk` script). A lot of the urban areas are named by squishing together a few of the bigger cities' names with hyphens.
Definitely welcome to use any of the work from here! Glad it was useful!
@Meekohi the idea is to get some missing names. Major cities are already included for the most part. 10,000+ population is too small, and your list seems to exclude many unincorporated areas (for example Ellicott City), some of which may have significance. Some states (such as Maryland, and in particular Howard County) don't have a lot of incorporated towns. If you include all those, the list will be huge and include many areas most people have probably never heard of.
If you can find a way to extract Core Cities from each Metro Area I will consider including those also.
Thanks for the feedback @kevina, I'll investigate.
I don't know how to judge what might count as "too many new words to include", but I would personally err on the side of adding more place names rather than fewer.
Unfortunately it seems quite challenging to filter "Census Designated Places" (such as Ellicott City) by population using the census data.
Best list based on the Urban Areas I could get was to just replace the `--` with a newline:
curl -s -B "https://www2.census.gov/geo/docs/reference/ua/ua_list_all.txt" | iconv -f iso8859-1 -t utf-8 | awk -F ',|( +)' '{ if(NR > 1 && $4 > 50000) {print $2} }' | LC_ALL=C sed $'s/--/\\\n/g' | sed 's/ County//g' | sort | uniq
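A small offline sketch of the split step, in case it helps to see it on known input. It swaps the `sed $'...'` newline trick for an `awk` `gsub`, which is a substitution for portability and not the command above. The helper name `split_names` and the three input lines are hypothetical.

```shell
#!/bin/sh
# Break concatenated urban-area names on "--", drop the " County" suffix,
# then sort and de-duplicate the resulting single names.
split_names() {
  awk '{ gsub(/--/, "\n"); print }' \
    | sed 's/ County//g' \
    | LC_ALL=C sort \
    | uniq
}

printf '%s\n' 'Anniston--Oxford' 'Athens-Clarke County' 'Oxford' | split_names
```

The `sort` comes before `uniq` because `uniq` only drops adjacent duplicate lines; here the second `Oxford` would otherwise survive.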
@Meekohi Hello!
Could you please repost the cities here with your new script? I noticed that in the previous list there were some issues with characters as they appeared as "?" and the "--".
Thank you!
Aberdeen
Abilene
Aguadilla
Akron
Albany
Albuquerque
Alexandria
Allentown
Alton
Altoona
Amarillo
Ames
Anaheim
Anchorage
Anderson
Angleton
Ann Arbor
Anniston
Antioch
Apotgan
Appleton
Arecibo
Arlington
Arroyo Grande
Asheville
Atascadero
Athens-Clarke
Atlanta
Atlantic City
Auburn
Augusta-Richmond
Aurora
Austin
Avon Park
Avondale
Bakersfield
Baltimore
Bangor
Barceloneta
Barnstable Town
Baton Rouge
Battle Creek
Bay City
Beaumont
Beckley
Bel Air North
Bel Air South
Bellingham
Beloit
Bend
Benton Harbor
Berwick
Beverly Hills
Billings
Binghamton
Birmingham
Bismarck
Blacksburg
Bloomington
Bloomsburg
Boise City
Bonita Springs
Boston
Boulder
Bowling Green
Bradenton
Bremerton
Bridgeport
Bristol
Brownsville
Brunswick
Bryan
Buffalo
Burlington
Cabo Rojo
Calexico
California
Camarillo
Canton
Cape Coral
Cape Girardeau
Carbondale
Carson City
Cartersville
Casa Grande
Casper
Cathedral City
Cedar Rapids
Chambersburg
Champaign
Charleston
Charlotte
Charlotte Amalie
Charlottesville
Chattanooga
Chesapeake Ranch Estates
Cheyenne
Chicago
Chico
Cincinnati
Citrus Springs
Clarksville
Cleveland
Coeur d'Alene
College Station
Colorado Springs
Columbia
Columbus
Concord
Connellsville
Conroe
Conway
Corpus Christi
Corvallis
Covington
Cumberland
Dallas
Dalton
Danbury
Danville
Daphne
Davenport
Davis
Dayton
Daytona Beach
DeKalb
Decatur
Dededo
Delano
Deltona
Denton
Denver
Des Moines
Detroit
Dothan
Dover
Dubuque
Duluth
Durham
East Stroudsburg
Eau Claire
El Centro
El Paso
El Paso de Robles (Paso Robles)
Eldersburg
Elizabethtown
Elkhart
Elmira
Elyria
Erie
Eugene
Eustis
Evansville
Fair Plain
Fairbanks
Fairfield
Fairhope
Fajardo
Fargo
Farmington
Fayetteville
Fitchburg
Flagstaff
Flint
Florence
Florida
Florida Ridge
Fond du Lac
Fort Collins
Fort Smith
Fort Walton Beach
Fort Wayne
Fort Worth
Frederick
Fredericksburg
Fresno
Gadsden
Gainesville
Gastonia
Gilroy
Glens Falls
Goldsboro
Goodyear
Grand Forks
Grand Island
Grand Junction
Grand Rapids
Grants Pass
Grayslake
Great Falls
Greeley
Green Bay
Greensboro
Greenville
Grover Beach
Guayama
Gulfport
Hagerstown
Hammond
Hanford
Hanover
Harlingen
Harrisburg
Harrisonburg
Hartford
Hattiesburg
Hazleton
Hemet
Henderson
Hesperia
Hickory
High Point
Hightstown
Hilton Head Island
Hinesville
Holland
Homosassa Springs
Hot Springs
Houma
Houston
Howell
Huntington
Huntsville
Idaho Falls
Imbéry
Indianapolis
Indio
Iowa City
Isabela
Ithaca
Jackson
Jacksonville
Janesville
Jefferson City
Johnson City
Johnstown
Jonesboro
Joplin
Juana Díaz
Kahului
Kailua (Honolulu)
Kalamazoo
Kaneohe
Kankakee
Kansas City
Kennewick
Kenosha
Killeen
Kingsport
Kingston
Kissimmee
Knoxville
Kokomo
La Crosse
La Porte
Lacey
Lady Lake
Lafayette
Lake Charles
Lake Forest
Lake Havasu City
Lake Jackson
Lakeland
Lancaster
Lansing
Laredo
Las Cruces
Las Vegas
Lawrence
Lawton
Layton
Lebanon
Lee's Summit
Leesburg
Leominster
Lewiston
Lewisville
Lexington Park
Lexington-Fayette
Lima
Lincoln
Little Rock
Livermore
Lodi
Logan
Lompoc
Long Beach
Longmont
Longview
Lorain
Los Angeles
Los Lunas
Louisville
Louisville/Jefferson
Lubbock
Lynchburg
Machanao
Macon
Madera
Madison
Manchester
Mandeville
Manhattan
Mankato
Mansfield
Manteca
Marysville
Mauldin
Mayagüez
McAllen
McHenry
McKinney
Medford
Melbourne
Memphis
Menifee
Merced
Mesa
Miami
Michigan City
Middletown
Midland
Milwaukee
Minneapolis
Mission Viejo
Missoula
Mobile
Modesto
Monessen
Monroe
Monterey
Montgomery
Morgan Hill
Morgantown
Morristown
Mount Vernon
Muncie
Murfreesboro
Murrieta
Muskegon
Myrtle Beach
Nampa
Napa
Nashua
Nashville-Davidson
Navarre
New Bedford
New Bern
New Haven
New London
New Orleans
New York
Newark
Newburgh
Normal
Norman
North Charleston
North Port
Norwich
Oakland
Ocala
Odessa
Ogden
Oklahoma City
Olympia
Omaha
Orem
Orlando
Oshkosh
Owensboro
Oxford
Oxnard
Palm Bay
Palm Coast
Palmdale
Panama City
Parkersburg
Pascagoula
Pasco
Pensacola
Peoria
Petaluma
Philadelphia
Phoenix
Pine Bluff
Pittsburgh
Pittsfield
Pocatello
Ponce
Port Arthur
Port Charlotte
Port Huron
Port Orange
Port St. Lucie
Porterville
Portland
Portsmouth
Pottstown
Poughkeepsie
Prescott
Prescott Valley
Providence
Provo
Pueblo
Racine
Radcliff
Raleigh
Rapid City
Reading
Redding
Reno
Richmond
Riverside
Roanoke
Rochester
Rock Hill
Rockford
Rocky Mount
Rogers
Rome
Round Lake Beach
Sabana Grande
Sacramento
Saginaw
Salem
Salinas
Salisbury
Salt Lake City
San Angelo
San Antonio
San Bernardino
San Clemente
San Diego
San Francisco
San Germán
San Jose
San Juan
San Luis Obispo
San Marcos
San Sebastián
Santa Barbara
Santa Clarita
Santa Cruz
Santa Fe
Santa Maria
Santa Rosa
Sarasota
Saratoga Springs
Savannah
Schenectady
Scranton
Seaside
Seattle
Sebastian
Sebring
Sheboygan
Sherman
Shreveport
Sierra Vista
Simi Valley
Simpsonville
Sioux City
Sioux Falls
Slidell
Socastee
South Bend
South Lyon
Spartanburg
Spokane
Spring Hill
Springdale
Springfield
St. Augustine
St. Cloud
St. George
St. Joseph
St. Louis
St. Paul
St. Petersburg
Stamford
State College
Staunton
Steubenville
Stockton
Sumter
Syracuse
Tallahassee
Tampa
Tavares
Temecula
Temple
Terre Haute
Texarkana
Texas City
The Villages
The Woodlands
Thousand Oaks
Titusville
Toledo
Topeka
Tracy
Trenton
Tucson
Tulsa
Turlock
Tuscaloosa
Tutu
Twin Rivers
Tyler
Uniontown
Urban Honolulu
Utica
Vacaville
Valdosta
Vallejo
Vero Beach South
Victoria
Victorville
Villas
Vineland
Virginia Beach
Visalia
Waco
Waldorf
Walla Walla
Warner Robins
Washington
Waterbury
Waterloo
Watertown
Watsonville
Wausau
Waynesboro
Weirton
Wenatchee
West Bend
West Valley City
Westminster
Wheeling
Wichita
Wichita Falls
Williamsburg
Williamsport
Wilmington
Winchester
Winston-Salem
Winter Haven
Woodland
Worcester
Wright
Yakima
Yauco
York
Youngstown
Yuba City
Yuma
Zephyrhills
The above is based on:
curl -s -B "https://www2.census.gov/geo/docs/reference/ua/ua_list_all.txt" | iconv -f iso8859-1 -t utf-8 | awk -F ',|( +)' '{ if(NR > 1 && $4 > 50000) {print $2} }' | LC_ALL=C sed $'s/--/\\\n/g' | sed 's/ County//g' | sort | uniq
Wow fascinating -- GitHub is reducing the two spaces before the `+` into one even in code. Clearly a bug. The above isn't safe to copy/paste. Use:
https://gist.github.com/Meekohi/d03154775c68aac00470419fe2a6a5ac
Working around Github's madness:
curl -s "https://www2.census.gov/geo/docs/reference/ua/ua_list_all.txt" | iconv -f iso8859-1 -t utf-8 | awk -F ',|(\ \ +)' '{ if(NR > 1 && $4 > 50000) {print $2} }' | sed $'s/--/\\\n/g' | sed 's/ County//g' | sort | uniq
@Meekohi you might be able to upload a plain text file as an attachment. If not, then copying and pasting the list into a gist should avoid the problem.
Here is the list I think I am going to use; it was derived from https://www2.census.gov/geo/docs/reference/ua/ua_list_ua.txt with some manual cleanups.
The fact that I am very selective in adding words has been rehashed many times over. A large word list can cause problems in that it can clutter the suggestion list and also mask more common words. For example, "Stookey", a township in IL that barely makes the list, is very close to the more common word "stocky". See http://app.aspell.net/lookup-freq?words=Stookey.
In order for me to add small towns they need to have some significance beyond just their population, and I would want to figure out how to include Census Designated Places, as they are fairly significant.
Yes it can be frustrating for your home town to be marked as misspelled, but in nearly any spellchecker that can be rectified by adding it to your personal wordlist.
Nevertheless I am not satisfied with the list derived solely from US Census Urban Area so I will leave this issue open.
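To put a rough number on the "Stookey" versus "stocky" concern, here is a minimal sketch that computes a plain Levenshtein edit distance between two words, ignoring case. This is only an illustration of how close the two spellings are; it is not how aspell scores suggestions, and the helper name `edit_distance` is hypothetical.

```shell
#!/bin/sh
# Case-insensitive Levenshtein distance between two words, using the
# standard two-row dynamic-programming table.
edit_distance() {
  awk -v a="$1" -v b="$2" 'BEGIN {
    a = tolower(a); b = tolower(b)
    la = length(a); lb = length(b)
    for (j = 0; j <= lb; j++) prev[j] = j
    for (i = 1; i <= la; i++) {
      cur[0] = i
      for (j = 1; j <= lb; j++) {
        cost = (substr(a, i, 1) == substr(b, j, 1)) ? 0 : 1
        best = prev[j - 1] + cost                         # match or substitute
        if (prev[j] + 1 < best) best = prev[j] + 1        # delete from a
        if (cur[j - 1] + 1 < best) best = cur[j - 1] + 1  # insert into a
        cur[j] = best
      }
      for (j = 0; j <= lb; j++) prev[j] = cur[j]
    }
    print prev[lb]
  }'
}

edit_distance Stookey stocky
```

A small distance like this is the kind of near miss that lets a rare place name mask a typo of a common word.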
Thanks @kevina -- I support using the US Census Urban Area as well. Here is the result of
curl -s "https://www2.census.gov/geo/docs/reference/ua/ua_list_all.txt" | iconv -f iso8859-1 -t utf-8 | awk -F ',|(\ \ +)' '{ if(NR > 1 && $4 > 50000) {print $2} }' | sed $'s/--/\\\n/g' | sed 's/ County//g' | sort | uniq > urban_area.txt
`--` characters (urban_area.txt)
Would a version also split on spaces be more useful?
No. I may do some additional processing on the list to split them, but it needs to be done with care. For example each place should take a possessive form, but that should only be added to the last part in most cases.
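As a rough sketch of the "possessive only on the last part" idea: split each name on spaces, emit every part, and emit a possessive form for the final part only. How the word list actually generates possessives is not described here, so this is an assumption; it also ignores the exceptions implied by "in most cases" and any special handling of names ending in "s". The helper name `split_with_possessive` is hypothetical.

```shell
#!/bin/sh
# For each non-empty line, print every space-separated part on its own
# line, then print the last part again with "'s" appended.
split_with_possessive() {
  awk 'NF > 0 {
    for (i = 1; i <= NF; i++) print $i
    print $NF "\047s"   # \047 is the apostrophe character
  }'
}

printf '%s\n' 'Ann Arbor' 'Baton Rouge' | split_with_possessive
```

Single-word names pass through the same path: the name is printed once, followed by its possessive.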
Roger that -- let me know if there's anything else I can help with.
For now, I did not split names with spaces; parts of the more common names are already added (for example "Los" and "Angeles"). I only added them to en_US.
I was able to extract the population of CDPs from the U.S. Census Bureau; see https://github.com/en-wl/places-us. At a later date I will likely consider adding some additional names from that list; see https://github.com/en-wl/wordlist/issues/254.
I would like to have "Charlottesville" added to the `.60` dictionaries at least (it currently is only in the `.70`'s), especially given its recent attention in the news: http://app.aspell.net/lookup?dict=en_US;words=charlottesville
I propose we add all locations with a population of more than 10,000 according to the most recent US census (2010).
https://www.census.gov/data/datasets/2017/demo/popest/total-cities-and-towns.html#ds
Cleaned up list is below in order of population with duplicates removed (obviously most of the ones at the top are already in the dictionary).