coderholic / django-cities

Countries and cities of the world for Django projects
MIT License
920 stars 374 forks

[WIP] Boundary Fields #159

Open george-silva opened 7 years ago

george-silva commented 7 years ago

Goals

Considerations

  1. The boundary field is nullable and of type MultiPolygon, to accurately represent places made up of multiple parts, such as Japan, which consists of several islands;
  2. The boundary field was added to the Place model, so all inheritors have the field as well.
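As an illustrative sketch only (not the exact diff in this PR), the change described in these considerations might look like the following, assuming the abstract Place base model in this project:

```python
# Illustrative sketch: a nullable MultiPolygon boundary on the
# abstract Place base model, inherited by all place types.
from django.contrib.gis.db import models

class Place(models.Model):
    name = models.CharField(max_length=200)
    # Nullable so records without boundary data still import cleanly;
    # MultiPolygon covers multi-part territories such as Japan.
    boundary = models.MultiPolygonField(null=True, blank=True)

    class Meta:
        abstract = True
```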

TODO

  1. [x] Fields added to the models;
  2. [x] Changed ROOT_URLCONF from test_project.urls to test_app.urls - when trying to migrate, the test app would not find test_project.urls, so this is kind of a fix;
  3. [x] Changed WSGI_APPLICATION from test_project.wsgi.application to test_app.wsgi.application. Also kind of a fix.
  4. [ ] Determine the best source for GIS data;
  5. [ ] Change the import command so it can handle boundary data;

Edit (by blag): Changed checklists into a GitHub-flavored Markdown TODO list so it gets a progress bar on the PR list page

blag commented 7 years ago

I would use Geonames shapes_simplified_low.zip file for country boundaries. It's got two tab-separated columns: geonameid and geojson, so the existing get_data function should handle it just fine. You can deserialize the geojson with django-geojson or geojson.
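As a minimal sketch of reading that file, assuming the two-column tab-separated layout described above (the header line name is an assumption), each GeoJSON string can be deserialized with the stdlib and then handed to GeoDjango's GEOSGeometry:

```python
import json

def iter_country_shapes(path):
    """Yield (geonameid, geojson_dict) pairs from Geonames'
    shapes_simplified_low.txt (tab-separated: geonameid, geojson)."""
    with open(path, encoding="utf-8") as f:
        next(f)  # skip the header line
        for line in f:
            geonameid, geojson_str = line.rstrip("\n").split("\t", 1)
            # GEOSGeometry(geojson_str) could turn this into a geometry
            # for the boundary field; here we just deserialize the JSON.
            yield int(geonameid), json.loads(geojson_str)
```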

Importing other sources of boundary data is probably going to be more involved. I'll look into that.

blag commented 7 years ago

I thought I had configured Travis to run our tests on all pull requests, but that option was turned off.

I've turned it back on. If you push anything more to this pull request, it should run the tests automatically against Python 2.7 and 3.3-3.6 on Django 1.7-1.10.

I'll be adding tests with Django 1.11 Real Soon Now (tm); no later than its official release.

blag commented 7 years ago

Closing and reopening to try to kick off a Travis run.

blag commented 7 years ago

@george-silva If you don't want to push any more commits, but want to run Travis tests, it may work if you close and reopen this PR. I don't really have time right now to debug why Travis isn't working, but it's something I'll focus on fixing tonight or tomorrow.

blag commented 7 years ago

This repo has some of the GeoJSON files we need for the US:

https://github.com/jgoodall/us-maps/tree/master/geojson

The original source for those files also has information for "Urban Areas" and "Consolidated Cities" from the 2000 & 2010 US census:

http://www.census.gov/geo/maps-data/data/tiger-line.html

I'm still looking around for sources for info for other countries.

george-silva commented 7 years ago

Hello @blag!

I think the geonames source for countries will work out just fine.

I'll check today if OSM has the state/city data.

Thanks for the tip regarding Travis. I'll keep an eye on it.

blag commented 7 years ago

@george-silva Sorry, I wasn't clear: yep, I agree, let's use the Geonames data for countries, period.

For boundaries of country subdivisions (eg: regions and below), I would also like to use OSM data wherever possible - it's comprehensive (international), highly precise, clearly licensed, and legally unencumbered.

Check out these per-country boundary files:

https://mapzen.com/data/borders/

Using those per-country dumps, we could import boundaries for selected countries only; or we could download the entire planet file if all boundaries are chosen.

The OSM wiki has a good explanation on administrative levels:

https://wiki.openstreetmap.org/wiki/Tag:boundary%3Dadministrative

and if I'm reading that correctly, it means we could pick out boundaries for regions, subregions, cities, and districts from a single boundary file.

I'm still not sure where we could get postal code areas for countries. That's where that source comes in. The zcta5.json file has exactly what we want, but only for the US. Finding postal code boundaries for every other country might be more challenging.

george-silva commented 7 years ago

@blag well, I guess the combination of the OSM wiki + Mapzen's data will suffice.

The hard part is organizing the data that is in the wiki.

Here's my proposal for this:

  1. Ignore postal areas for now;
  2. Continents and countries are consistent. We can use OSM data, and I think it will be easy to download/use;
  3. For all the other levels (region/city), we use OSM data from Mapzen. It will be easy to replicate their infrastructure if they decide to take it down later on;
  4. We'll create a dict mapping in our code to specify which model corresponds to each boundary type in OSM.

Something like:

CITIES_BOUNDARY_MAPPING = {
    # country code: list of administrative levels that correspond with our proposed boundaries
    'bra': [4, 8],
    'foo': [3, 6],
}

# or

CITIES_BOUNDARY_MAPPING = {
    # country code: dict mapping each of our boundary types to an OSM administrative level
    'bra': {'region': 4, 'city': 8},
    # etc
}

If you check the docs on OSM's wiki, you can see that in Brazil admin level 4 corresponds to states and 8 to cities.

This way we can download and use the correct data, and we don't need to map out all of the countries upfront. We can let other users add their own mappings. And since it's a setting, they don't even need to open a PR; they can configure it in their own project.
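A sketch of how the second form of that setting could be consumed at import time; the helper name and the default mapping here are hypothetical, not part of this PR:

```python
# Hypothetical default when a country has no explicit entry.
DEFAULT_BOUNDARY_MAPPING = {'region': 4, 'city': 8}

CITIES_BOUNDARY_MAPPING = {
    'bra': {'region': 4, 'city': 8},
    'foo': {'region': 3, 'city': 6},
}

def admin_level_for(country_code, model_name, overrides=CITIES_BOUNDARY_MAPPING):
    """Resolve which OSM admin_level maps to one of our boundary types
    for a given country, falling back to the default mapping."""
    mapping = overrides.get(country_code.lower(), DEFAULT_BOUNDARY_MAPPING)
    return mapping.get(model_name)
```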

The downside of this approach is that we require an extra configuration step, and if a user wants to use different administrative divisions (in Brazil's case, say, regions, states, and macro-regions) we won't be able to support it.

What do you think about that?

blag commented 7 years ago

That all sounds good. I would like to make sure we have good default options, to minimize the number of options people have to change.

blag commented 7 years ago

We may be able to use Zillow's data for District objects in the US:

http://www.zillow.com/howto/api/neighborhood-boundaries.htm

Although the CC-BY-SA 3.0 license may not work for some of our users.

I'm not trying to focus just on the US, but there doesn't seem to be high-quality corresponding data for other countries.

blag commented 7 years ago

And this may work for cities:

http://www.gisgraphy.com/

Openstreetmap data extract by country

  • ...
  • Extracted the shape of more than 160,000 cities and localities from Quattroshapes with their associated geonames Id

george-silva commented 7 years ago

@blag quattroshapes is interesting.

I've downloaded Quattroshapes and Mapzen's country data to check them out.

Findings:

  1. Why is it so important to have a geonames id? From what I've seen in the models, we don't store them. Is this the code model attribute, which varies from model to model? If so, it might be fine.
  2. Mapzen's data is perfect. The problem is matching it against our currently downloaded data. For Brazilian states, the ref tag was the one used to associate the state code. I'm not sure if that holds true for all. The schema we discussed earlier would work perfectly, though it might need to be a dict of dicts, where the user specifies the admin level and the field on which the join between the datasets will be made (Quattroshapes might need the same settings);
  3. Mapzen's data can be downloaded per country;
  4. Quattroshapes means that for countries, states, and cities we need to download 3 shapefiles: admin0, 1, and 2. We need to download them in full, but we can filter at import time if the user/dev only wants a single country;

Which data is best

  1. They look pretty much the same, but OSM data might be updated more frequently;
  2. Quattroshapes data is smaller: ~300MB zipped. Brazil alone is 48MB from Mapzen;
  3. Quattroshapes needs to be downloaded in full;
  4. OSM data is updated more often;

Import strategies

  1. We require PostGIS, so GDAL is also available. That means we have the LayerMapping utility. We only need to map the ID in question and the geometry field;
  2. Regardless of the boundary data source, filtering must be done against the data already imported from GeoNames. I imagine a loop over the selected countries and their child entities, running the import process for each.
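For strategy 1, the LayerMapping field mapping could look roughly like this; the attribute names ('boundary', 'code', 'ref') and the shapefile name are assumptions about the source layer, not verified against the actual data:

```python
# Hypothetical LayerMapping field mapping: model field -> layer attribute.
boundary_mapping = {
    'boundary': 'MULTIPOLYGON',  # model geometry field <- layer geometry
    'code': 'ref',               # join key: OSM "ref" tag -> our code field
}
# With PostGIS/GDAL available this would be used roughly as:
#   LayerMapping(Region, 'brazil_admin4.shp', boundary_mapping).save()
# though matching against rows already imported from GeoNames would
# still need a custom update loop, since LayerMapping inserts new rows.
```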

Suggestions? Considerations?

I wanted to look at the options first before writing any code.

Just to be clear, my preference is OSM/Mapzen. It will be trickier, but I think it's a good source, configurable, etc.

george-silva commented 7 years ago

@blag this might be better (also from Mapzen): https://whosonfirst.mapzen.com/

nvkelso commented 7 years ago

Who's On First is the successor to Quattroshapes and includes neighbourhoods and postal codes. Mapzen has multiple staff working on the project, and it's seen huge progress over the last 18 months. There are multiple download options (metafiles, bundles).

Please let us know how we can be of help.

blag commented 7 years ago

@george-silva

We import the geonameid as the primary key for continents, countries, regions, subregions, cities, and alternative names. For some reason that doesn't hold true for districts, even though the data source includes that information. At this point I want to stay backwards compatible for existing users, but I have been thinking of eventually creating a release that isn't backwards compatible, and one of the changes I want to make is to use their geonameid as their id.

The rest sounds good.

@nvkelso Awesome, thanks for all of your hard work! It would help us if you included the geonameids in your data, and separated your data by country so we only need to download/import the minimum amount of data.

nvkelso commented 7 years ago

Metafiles have country ids so you can filter and download.

Geonames ids are in the concordances lists.


george-silva commented 7 years ago

Ok, I've managed to understand Who's On First data.

What we'll need here is to:

  1. Download the CSV metafile for the places of interest. We'll need to filter on common fields, so the best bet here is country ISO codes;
  2. Once we download the metafile (it needs to be fully downloaded), we can capture the WOF ID and URL for that country;
  3. We load the country boundary via the API (since we are filtering per country, I guess that's the easy way - we can also download via AWS);
  4. We download the metafiles for the other layers of interest (state/region and county);
  5. Filter those metafiles based on the first obtained ID and grab a list of URLs where the actual data is located;
  6. Loop over the records, grab each one with a GeoJSON serializer/deserializer, filter the current database data based on geonames, and update.

Quite involved process :smile:
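The metafile filtering in the steps above could be sketched like this; the column names ('id', 'iso_country', 'path') and the base URL are assumptions about the WOF CSV schema, to be checked against the real files:

```python
import csv

def wof_records_for_country(metafile_path, iso_code, base_url):
    """Filter a Who's On First metafile down to one country and build
    (wof_id, data_url) pairs from each row's relative path."""
    records = []
    with open(metafile_path, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            if row.get("iso_country", "").upper() == iso_code.upper():
                records.append((int(row["id"]), base_url + row["path"]))
    return records
```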

I'll start some new modules to do all this work.

nvkelso commented 7 years ago

We're also experimenting with "bundles" per placetype (though those are downloads for the entire planet):

blag commented 7 years ago

@george-silva Sounds good, thanks for taking this on. ❤️

@coderholic Do you have any advice/recommendations for us?

george-silva commented 7 years ago

Hello guys. I'll be at a customer's office, so this might take a while to get done.

I'm still up for it, but this week might be a little busy.

blag commented 7 years ago

Whoops, I hit the wrong button there.

Sorry I haven't been attentive lately - job interviews. I should have some free time to check this out next week.

george-silva commented 7 years ago

@blag no problem. I'll get back to this next week. These two past weeks I have been traveling extensively.

blag commented 7 years ago

@george-silva This looks good so far, except for the changes in test_project/test_app/settings.py. Is there some reason you're changing those in this PR?

I'm still interviewing for jobs, but I might have time to flesh out the import script a bit more in the next few weeks.

adamhaney commented 5 years ago

Hello all, I've just started helping out with project maintenance and I'd like to ask: is this PR dead? If someone is still working on it I'll gladly keep it open, but otherwise I'm going to close it to clean up dangling PRs. If I don't hear back in the next 7 days I'll assume that this work has been abandoned.

Thanks,

Adam