Strip down Ashlar data model

jeancochrane commented 6 years ago

Ashlar's models currently include tables and fields that are specific to the DRIVER implementation. For better abstraction, these tables should be removed from Ashlar and moved to DRIVER.

jeancochrane commented 6 years ago

https://github.com/azavea/ashlar/pull/85 goes a long way toward stripping down the data model. There are a few fields in the models that I'm still unsure about, however. I'm curious to hear from @ddohler about which of these fields is necessary for Ashlar.

`Record`

Extraneous location fields:

location_text = models.CharField(max_length=200, null=True, blank=True)
city = models.CharField(max_length=50, null=True, blank=True)
city_district = models.CharField(max_length=50, null=True, blank=True)
county = models.CharField(max_length=50, null=True, blank=True)
neighborhood = models.CharField(max_length=50, null=True, blank=True)
road = models.CharField(max_length=200, null=True, blank=True)
state = models.CharField(max_length=50, null=True, blank=True)

objects field (not sure what this does?):

objects = models.GeoManager()

`Boundary` and `BoundaryPolygon`

Are the Boundary model and the BoundaryPolygon model necessary for Ashlar? These seem DRIVER-specific. If we decide that we indeed want to keep these in Ashlar, then we should bring the geography upload component into the schema editor.

ddohler commented 6 years ago

Awesome, thanks for digging into this. I think that location_text is a good one to keep around as long as we're limiting ourselves to Point geometries -- many, if not absolutely all, applications that want to store point geometries will benefit from having a text address along with the geometry. However, I had forgotten about city, city_district and friends, and I think those can be removed -- they were specific to DRIVER. I think they were added to make the CSV exports easier to work with, but I don't remember perfectly.

I think the objects field needs to stay; that specifies that when you do Record.objects.filter() you get an instance of GeoManager rather than the default Django Model Manager.

As far as Boundary and BoundaryPolygon, I could go either way. On the one hand, they're obviously distinct from the core ~Ashlar~ Grout functionality. On the other hand, filtering points that lie within a particular polygon is a super common requirement and it would be nice not to have to build that functionality from scratch for every new Grout project. So I think the main argument for taking them out would be that we should keep Grout very simple at the start in order to stay efficient, and we can add functionality later. The main argument I see for keeping them is that they already exist and they seem pretty likely to be useful in the future, so we're just making more work for ourselves if we take them out now because we'll probably have to add them back in later. What do you think?

jeancochrane commented 6 years ago

Thanks for the detailed data model audit @ddohler! Agreed on location_text, that definitely makes sense to keep.

objects seems like it's deprecated as of Django 1.9, since GeoDjango now has a different approach to database queries. I'm not totally sure how you're supposed to migrate to the new approach, but between the old docs for GeoQuerySet and the new docs for GeoDjango database functions I think we can manage to strip out objects and move to the new database functions. If you agree then I can go ahead and include that as part of the work of this issue.

I agree with your point about Boundary and BoundaryPolygon being pretty basic components of any geospatial application, so I can't really think of a good reason to decline to include them out of the box. One thing I'm wondering about, however, is that if we do choose to extend ~Ashlar~ Grout to store polygons, then what role will the Boundary models play? One interesting approach might be to factor out all geospatial data into their own models (i.e. make a model and related manager for Point in addition to Boundary) and then turn Record.geom into a foreign key referencing one of those models. That seems like a pleasing abstraction, but it also might make filtering/searching a little more difficult -- something to consider if/when we decide to start working on supporting polygons in Records.

One follow-up question on Boundary models: what's the purpose of Boundary.data_fields and BoundaryPolygon.data? Is that a distinct schema/record pattern for boundaries?

ddohler commented 6 years ago

Oh, that's really interesting -- I haven't had to use the new GeoDjango functions yet, but that seems like a much better way to go about spatial lookups; I really like Django's database function API as far as I've encountered it in other contexts. Based on those links it sounds to me like we can indeed remove objects. I don't think we're using too many of the legacy functions that have been removed, so I'm hopeful that it'll be an easy migration.

The idea about making geometries a foreign key is interesting, but I do agree -- it could make filtering kind of clunky. My (fairly hazy still) idea for adding support for other geometries was slightly different; I was thinking of making the flexible schema part into an abstract base class, and then creating several different Models that inherit from it. So maybe something like this (very off the cuff):

class FlexibleSchemaRecord(models.Model):
    schema = models.ForeignKey(RecordSchema)
    class Meta:
        abstract = True

class PointRecord(FlexibleSchemaRecord):
    geom = models.PointField()

class PolygonRecord(FlexibleSchemaRecord):
    geom = models.PolygonField()

...

I haven't put a lot of thought into how filtering and such things would work but anyway, just another possible option for when/if we start on that.

The data_fields and data fields are used to support importing arbitrary Shapefile data into a table with a fixed column layout. Shapefiles can have any number of columns, so those two fields allow us to store the raw Shapefile columns in a JSONB store so that we can import any Shapefile that a user uploads. I believe it works this way: data_fields stores an array of strings denoting which columns are available in the Shapefile. data stores a dict with the keys corresponding to column names and values corresponding to the value for a particular geometry. So it's not quite as fancy as the full Grout schema design, but it serves a similar purpose, tailored to the needs of importing Shapefiles.

jeancochrane commented 6 years ago

This is in much better shape thanks to https://github.com/azavea/grout/pull/4 and https://github.com/azavea/grout/pull/5. The remaining work we discussed here is reflected in #31. Closing!

azavea / grout-2018-fellowship