It's just for performance. I figured since the data sets are huge, it would be better to split them up into multiple tables, with the 70MB postal codes split by country and the 250MB alternate names split by language and geotype.
I figured that for most queries on a postal code you would be searching a specific country for a code, rather than searching all countries for a code. With this setup the primary keys can be the codes themselves.
Similarly for alternate names I figured that in the general case you would be targeting a specific geotype and language (e.g. alt names of a City in French) rather than searching through all geotypes and languages. Under this setup each alternate name can have a direct foreign key to its geotype, whereas with a single table it would have to use a GenericForeignKey, and then Country, City, etc. would all need a reverse GenericRelation. Users also browse your site in one language, and you're usually querying one language at a time (e.g. if you detect Japanese characters in the input then search the Japanese table), so I optimized the database design for querying one language at a time.
The downside to this approach is a proliferation of tables and some limitations on cross-country, cross-language searching.
The magic definitely makes the code harder to read and harder to maintain (eg. PostalCode can't take advantage of the new Place base class), so I'd really like to explore ways of replacing it.
For PostalCode it's not too much of a problem. You'll get just as good performance as separate tables by putting a composite key on code and country. The only downside is that Django still doesn't support composite primary keys. To ensure consistent and unique primary keys we could have a save() method that sets the pk to the concatenation of code and country, or generate a hash or UUID based on the code and country. It's not ideal to have duplicate data, but I think it's preferable to the magic, not to mention the other benefits of having all of the postcode data in a single table.
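A minimal sketch of that save() approach might look like this (assuming a CharField pk and the existing Country model; the field names are illustrative):

# models.py sketch...
from django.db import models

class PostalCode(models.Model):
    id = models.CharField(max_length=30, primary_key=True)
    country = models.ForeignKey(Country)
    code = models.CharField(max_length=20)

    def save(self, *args, **kwargs):
        # Derive a stable pk from country + code (e.g. "GB.SW20 8SF"),
        # so re-imports always produce the same key
        self.id = '{}.{}'.format(self.country.code, self.code)
        super(PostalCode, self).save(*args, **kwargs)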
For AlternativeNames it is a little more complicated due to the GenericForeignKey (although the performance of that won't be so bad with the new prefetch_related in Django 1.4). We could have an AlternativeName base class, and then a separate AlternativeName class for each geotype (eg. CityAlternativeName, RegionAlternativeName etc). A simpler approach might be to have AlternativeName as a ManyToManyField on the Place base class, and then Django will automatically create the join tables for each pairing. The tradeoff is that there's no database constraint tying each AlternativeName to a single Place, or enforcing a unique AlternativeName/locale pair per Place. I think my preference is probably to go with a GenericForeignKey, as it makes the least compromises, gives the cleanest code, and with prefetch_related I think the performance should be good enough for most use cases.
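A rough sketch of the GenericForeignKey option, using the contenttypes API as it stands in Django 1.4 (field names are illustrative):

from django.contrib.contenttypes import generic
from django.contrib.contenttypes.models import ContentType
from django.db import models

class AlternativeName(models.Model):
    name = models.CharField(max_length=255)
    language = models.CharField(max_length=10)
    # Generic link to any geotype (Country, Region, City, ...)
    content_type = models.ForeignKey(ContentType)
    object_id = models.PositiveIntegerField()
    place = generic.GenericForeignKey('content_type', 'object_id')

# Each geotype gets a reverse relation, which is what makes
# City.objects.prefetch_related('alt_names') work:
class City(Place):
    alt_names = generic.GenericRelation(AlternativeName)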
Are there any other alternatives I haven't thought of, or potential problems I've missed?
The generic key will work fine. Yeah, the postal code pk can just be the country code + postal code; that will ensure the rows match up with geonames when updating.
Can you handle this redesign? This is a bit much for me since I'm not going to be using cities; I just didn't want to saddle you with a botched design like the hierarchy issue.
There's some code in util that creates the models dynamically; you can remove that stuff. Oh, and one simple thing to add: since the alt names and postal codes are optional, the import functions should check the settings before downloading -- no point in downloading those large files if there's nothing to import.
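Something along these lines (the setting name and surrounding method are hypothetical):

from django.conf import settings

def import_postal_codes(self):
    # Skip the ~70MB download entirely if postal codes aren't enabled
    if not getattr(settings, 'CITIES_POSTAL_CODES', None):
        self.logger.info("Postal codes disabled in settings, skipping import")
        return
    # ... download and import as before ...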
BTW when you're ready to upload your version to PyPI just tell me and I'll take my cities package down from there.
Yeah sure I can make the changes. Thanks for everything you've done so far! I'll update this ticket once I've made the changes.
The only downside is that Django still doesn't support composite primary keys. To ensure consistent and unique primary keys we could have a save() method that sets the pk to the concatenation of code and country, or generate a hash or UUID based on the code and country.
This is true, but you should not be using composite primary keys, for good reason! They force you to store data in two places at once for every record that references PostalCode. Instead, consider using a normal auto-int primary key and setting unique_together on postcode and country. Django will generate the uniqueness constraint and Postgres will use that hint to generate a composite index for those two keys on the table, greatly speeding up lookups for the country/postcode pair.
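i.e. something like (a sketch, field names assumed):

class PostalCode(models.Model):
    # implicit auto-int "id" primary key
    code = models.CharField(max_length=20)
    country = models.ForeignKey(Country)

    class Meta:
        unique_together = (('code', 'country'),)  # composite unique index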
For AlternativeNames, having a table per geotype is perfectly fine, great even! However, the locale split makes it hard to write queries that intelligently group on language, so that can definitely be refactored.
The problem with an auto-int primary key is that the key won't be consistent if the source data changes and you reimport the data, or if you add data for different countries in a different order. I'd definitely prioritise consistency and idempotency over speed and space. Having said that, the primary key for PostalCodes is currently just the code, which is fine as long as you don't import data for two countries that share postal codes (not sure if there are multiple countries with the same codes?). I'm definitely open to alternative suggestions.
I've yet to make any changes to the AlternativeNames stuff, so suggestions there are very much welcome.
@coderholic auto-int primary keys do not preclude consistency, especially if you take the time to implement the natural keys machinery on the altname tables. When re-importing data, a lookup is never done against the primary key; it's done against the postcode, which merely happens to be the primary key at the moment. The auto-int PKs are there for the database's benefit, not the programmer's, so when doing re-imports and modifications, just don't look at the PKs; they take care of themselves. If you're ever having to massage PKs in import scripts, you're doing it VERY wrong.
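For reference, the natural keys machinery is just a custom manager method plus a natural_key() on the model, e.g. (a sketch using the fields assumed above):

class PostalCodeManager(models.Manager):
    def get_by_natural_key(self, country_code, code):
        return self.get(country__code=country_code, code=code)

class PostalCode(models.Model):
    objects = PostalCodeManager()
    code = models.CharField(max_length=20)
    country = models.ForeignKey(Country)

    def natural_key(self):
        return (self.country.code, self.code)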
The issue is with ForeignKeys (ie. if the PK changes for a certain place due to reimporting the data then there'll be an inconsistency). It's unlikely to be a very common operation, and it's one that some databases won't even allow you to do due to foreign key constraints, but that's the idea behind having consistent PKs.
The PK should just be the default auto-int. When re-importing, build a _postal_code_index[country_code + postal_code] for all postal code objects, much like how the country and region indexes are built. Use the postal code object from the index if it exists, otherwise create a new postal code.
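i.e. roughly (a sketch; only the index name comes from the description above, the surrounding import code is assumed):

# Build the index once per import run
self._postal_code_index = dict(
    (pc.country.code + pc.code, pc) for pc in PostalCode.objects.all())

# Then, for each line of the data file:
key = country_code + items[1]
try:
    pc = self._postal_code_index[key]   # existing row, pk unchanged
except KeyError:
    pc = PostalCode(country=country, code=items[1])  # new row, new pk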
BTW in the code changes you have:
pc.region = country.region_set.get(name=items[3])
pc.subregion = pc.region.subregion_set.get(name=items[5])
I know it's simpler, but the old code is more robust because it uses region codes instead of names and logs any errors. It's also much faster because it uses the region_index instead of db lookups:
# Find region, search highest level first
item_offset = 4
for level, level_name in reversed(list(enumerate(Region.levels))):
    if not items[item_offset + level*2]: continue
    try:
        code = '.'.join([country_code] + [items[item_offset + i*2] for i in range(level+1)])
        region = self.region_index[code]
        setattr(pc, level_name, region)
    except:
        self.logger.log(logging.DEBUG if level else logging.WARNING,  # Escalate if level 0 failed
                        "Postal code: {}, {}: Cannot find {}: {}".format(pc.country, pc.code, level_name, code))
Also, the postal code region names need to be put back because many regions aren't in the db; in fact the db doesn't even have the districts. In geonames there's no direct linkage between postal codes and regions:
For many countries lat/lng are determined with an algorithm that searches the place names in the main geonames database using administrative divisions and numerical vicinity of the postal codes as factors in the disambiguation of place names. For postal codes and place name for which no corresponding toponym in the main geonames database could be found an average lat/lng of 'neighbouring' postal codes is calculated.
# models.py...
# Region names for each admin level, region may not exist in DB
region_name = models.CharField(max_length=100, db_index=True)
subregion_name = models.CharField(max_length=100, db_index=True)
district_name = models.CharField(max_length=100, db_index=True)
# cities.py...
pc.name = items[2]
pc.region_name = items[3]
pc.subregion_name = items[5]
pc.district_name = items[7]
# admin.py...
list_display = ['code', 'name', 'region_name', 'subregion_name', 'district_name']
search_fields = ['code', 'name', 'region_name', 'subregion_name', 'district_name']
@coderholic
The issue is with ForeignKeys (ie. if the PK changes for a certain place due to reimporting the data then there'll be an inconsistency). It's unlikely to be a very common operation, and it's one that some databases won't even allow you to do due to foreign key constraints, but that's the idea behind having consistent PKs.
Why would the PK ever change? The pk is just the table's linking value. Consider this:
@twoolie The PKs of the postcode table could change if you reimported them and decided to include additional countries. eg. Initially you import postcodes for just the US, and PK 1...N are created for the US postcodes. You later decide to add Australian postcodes too, and recreate the whole postcode table. The Australian postcodes are now PK 1...N and the US postcodes are PK N+1...M, so any foreign keys to the postcodes would be broken.
@Kometes Why does the region name string need to be put back? It doesn't seem to matter if the region is in the database or not - if it is it'll get assigned, and if not it won't. Having a region FK and a separate region name is only likely to cause confusion.
@coderholic
on_delete=models.SET_NULL
)
@twoolie I'm totally with you, but it is something that's come up before. See https://github.com/coderholic/django-cities/pull/29
@coderholic from the linked pull:
The issue isn't an index, it's that every time you re-import the postcodes the id and postcode might match up differently.
I don't see how this can happen. The default behavior of Django when deleting rows is to cascade deletes to rows linked by foreign key.
For example last time round I had GB postcode "SW20 8SF" with a primary key of 707662. Next import round, primary key 707662 pointed at TN1 6EG or something. If you assign postcodes by ForeignKey you would want them to stay the same surely?
Yes, but all rows pointing at the old PK (707662) would have been deleted! And when adding the new postcode with a new key (7077662) you would need to regenerate anything pointing at it anyway!
The better solution is to regenerate using get_or_create with the query part based on the postcode. This will update all postcodes in place without changing any foreign key links!
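i.e. (a sketch, with field names assumed from earlier in the thread):

pc, created = PostalCode.objects.get_or_create(
    country=country, code=items[1],
    defaults={'name': items[2]})
# Existing postcodes keep their auto-int pk, so nothing referencing
# them breaks; only genuinely new postcodes get a new pk.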
@twoolie Only if you delete it via the Django ORM, not if you delete the rows from the database directly, or drop the table and recreate it. To be clear, I'm totally in support of your suggested change; I'm just providing a bit of background on why it's not currently like that. Input from @Kometes here would be good though, as he made the change away from an auto-inc PK.
I changed the PKs to match geonames so that consistency is guaranteed in all cases of dropping / re-importing. What's the problem with having the PK as country_code + postal_code? Is it really that much slower?
@Kometes The problem is that not all of the GIS backends support composite primary keys, but all support composite indexing. All support enforcing auto-int PK constraints (ON DELETE CASCADE is DB level, not ORM level). (Unless you're on MySQL and you decide to turn that off :angry:)
@Kometes sorry, I couldn't find another way to message you - I want to push the latest version to PyPI but you're the owner. Can you assign ownership back to me, or delete the existing project? Thanks!
Currently the optional postcode and alternative name models are dynamically generated based on configuration details provided in settings.py, and a separate model is created for each of the different configuration options. I think the code would be much more readable and maintainable if this magic is removed.
@Kometes sorry for not flagging this at the time you issued your pull request - I should have been paying closer attention but was super busy at the time. It'd be great if you could give me some insight into why you took this approach, and why you feel it works better than having single PostalCode and AltName models with a country and locale field respectively, instead of a model for every country and locale.