stable identifier for parcels "across runs"

tbuckl commented 9 years ago

@fscottfoti @mkreilly @janowicz, we've talked a lot about this feature, so it makes sense to put some notes here.

@fscottfoti's current thinking is that hashes of the parcel centroids should be unique, so the user will always be able to say whether a parcel from a given run is identical to a parcel from another run.

But I'm still a bit unclear on how to write the story for this feature, so I don't know how we will say when it is complete.

I think the story is: as a person who is modeling the state of parcels over time, I would like to be able to say whether any given parcel that I am describing is identical to another parcel, so that I can improve the quality of the predictions that I am making about the state of that (all?) parcel(s).

It seems that one issue was that a user would assign attributes to a parcel at some point in the modeling process, and they would later try to apply those attributes to another set of parcels and be unable to do so because the parcel table had changed, and therefore the unique identifiers had changed. @mkreilly could you clarify what percentage difference or similarity would be acceptable when joining parcels across tables? That might help us define what successful completion of this story looks like.

One previous attempt at keeping a parcel's ID the same was to keep an ID column on the table, with a unique name, generated in some early process, and then just make sure that this ID column remained on the table in all cases where parcels were used in the modeling process.

Another approach is to use the hash of the geometry column. For example: https://github.com/synthicity/bayarea_urbansim/blob/a0cdcee377500198645d468e130541e32a08a3dd/data_regeneration/match_aggregate.py#L762-L771. However, when we compared the geom_ids from @janowicz's (Windows 7) laptop to those generated by the MTC Windows Server 2012, only 1/3 of the parcels were exactly identical. On the other hand, across Linux machines built in exactly the same way, more than 95% of the geometry IDs are identical.

Other ideas for keeping a parcel's ID the same include using a geohash (http://postgis.net/docs/manual-dev/ST_GeoHash.html) or similar.
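To make the geohash idea concrete, here is a minimal pure-Python sketch of the standard geohash encoding applied to a single centroid. It assumes the centroid is given as latitude/longitude in degrees, which may not match the projection of the parcel table. The function name `geohash_encode` and the sample coordinate are made up for illustration, and this is not the PostGIS function itself. A geohash names a grid cell, so two parcels whose centroids fall in the same cell at the chosen precision would still collide.

```python
BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz"  # geohash alphabet (no a, i, l, o)


def geohash_encode(lat, lon, precision=12):
    """Encode a lat/lon pair (degrees) as a geohash string of `precision` chars."""
    lat_lo, lat_hi = -90.0, 90.0
    lon_lo, lon_hi = -180.0, 180.0
    chars = []
    bits = 0        # accumulates 5 bits per output character
    bit_count = 0
    use_lon = True  # bits alternate, starting with longitude

    while len(chars) < precision:
        if use_lon:
            mid = (lon_lo + lon_hi) / 2
            if lon >= mid:
                bits = (bits << 1) | 1
                lon_lo = mid
            else:
                bits = bits << 1
                lon_hi = mid
        else:
            mid = (lat_lo + lat_hi) / 2
            if lat >= mid:
                bits = (bits << 1) | 1
                lat_lo = mid
            else:
                bits = bits << 1
                lat_hi = mid
        use_lon = not use_lon
        bit_count += 1
        if bit_count == 5:
            chars.append(BASE32[bits])
            bits = 0
            bit_count = 0

    return "".join(chars)


# Hypothetical centroid, not taken from the parcel data
print(geohash_encode(57.64911, 10.40744, precision=5))  # u4pru
```

Longer geohashes extend shorter ones, so the precision can be raised until collisions disappear without invalidating the shorter prefixes.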

There might be 2 notions of time that are relevant: parcel time and database time. For example, let's assume that parcel A has an attribute something=1 at time-1 in the parcel table. If we discover, at time-2, that we were incorrect, and that at time-1 parcel A in fact had something=2, do we revise the time-1 parcel table? Or do we only revise the time-2 table? This could be more complicated if "something" is actually the geometry of the parcel, or if the parcel splits.
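One way to picture the two notions of time is to store both on every attribute record, so that a correction is appended instead of overwriting the earlier row. The sketch below is only an illustration of that idea: every name in it (`ParcelFact`, `valid_time`, `recorded_time`, `value_as_of`) is hypothetical and not part of the project.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ParcelFact:
    parcel_id: str
    attribute: str
    value: int
    valid_time: int     # "parcel time": when the value was true of the parcel
    recorded_time: int  # "database time": when the value was written down


def value_as_of(facts, parcel_id, attribute, valid_time, known_at):
    """Most recently recorded value for valid_time, using only facts recorded by known_at."""
    candidates = [
        f for f in facts
        if f.parcel_id == parcel_id
        and f.attribute == attribute
        and f.valid_time == valid_time
        and f.recorded_time <= known_at
    ]
    if not candidates:
        return None
    return max(candidates, key=lambda f: f.recorded_time).value


facts = [
    ParcelFact("A", "something", 1, valid_time=1, recorded_time=1),
    # Correction discovered at time-2 about the state at time-1
    ParcelFact("A", "something", 2, valid_time=1, recorded_time=2),
]

print(value_as_of(facts, "A", "something", valid_time=1, known_at=1))  # 1
print(value_as_of(facts, "A", "something", valid_time=1, known_at=2))  # 2
```

With this layout both answers to the question above stay available: the time-1 table as it was believed at time-1, and the time-1 table as corrected at time-2.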

janowicz commented 9 years ago

Fletcher and I worked a bit on this earlier in the week. We tried an approach involving the centroid coordinates and the area of each parcel (controlling the precision of each to ensure consistency across machines): create a string with x, y, and area concatenated together (each with defined precision), and then hash this string. This ended up being unique for all parcels except one pair (and we can drop one of these two parcels since they nearly completely overlap).

The next step for me (to be done today or tomorrow) is to do test runs of data regeneration on different machines and check that the resulting IDs are the same. Our hypothesis is that controlling the precision will help to achieve the same IDs across machines.

We added area because, after visually examining where centroids were falling, we found examples of centroids being essentially in the same place even though the parcel geometries were different (for example, where a parcel representing a common area surrounds another parcel). Adding area differentiated the parcels in the examples that were visually examined, and this was confirmed after looking at the resulting IDs.
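The common-area case can be reproduced with a small standard-library sketch. It is not the function used in the data regeneration scripts: the name `parcel_id`, the fixed-decimal formatting, and the coordinates are all assumptions made for illustration.

```python
import hashlib


def parcel_id(x, y, area, precision=5):
    """MD5 hex digest of 'x,y,area', each formatted to `precision` decimals."""
    key = f"{x:.{precision}f},{y:.{precision}f},{area:.{precision}f}"
    return hashlib.md5(key.encode("utf-8")).hexdigest()


# A small parcel and a ring-shaped common-area parcel around it:
# same centroid, different area (made-up numbers)
inner = parcel_id(100.0, 200.0, 50.0)
common_area = parcel_id(100.0, 200.0, 350.0)

# The same small parcel with noise far below the chosen precision,
# standing in for small numeric differences between machines
jittered = parcel_id(100.0 + 1e-9, 200.0 - 1e-9, 50.0 + 1e-9)

print(inner != common_area)  # True: area separates the two parcels
print(inner == jittered)     # True: sub-precision noise is rounded away
```

Rounding does not remove the problem entirely: a value that sits very close to a rounding boundary can still land on different sides of it on different machines, so the cross-machine test runs remain necessary.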

janowicz commented 9 years ago

The function to be tested:


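import hashlib

import numpy as np


def get_md5_hexdigest(value):
    # Assumed helper (not shown in this thread): MD5 hex digest of a string
    return hashlib.md5(value.encode('utf-8')).hexdigest()


def generate_unique_id_from_geom(x, y, area, precision=5, hash_values=False):
    # x, y, area: pandas Series sharing the same parcel index

    # Keep only rows with non-null coordinates and positive area. A single mask
    # keeps the three series aligned (filtering each one separately would
    # introduce NaN identifiers where the indexes no longer match).
    valid = x.notnull() & y.notnull() & (area > 0)
    x, y, area = x[valid], y[valid], area[valid]

    # Identifier string "x,y,area", each rounded to `precision` decimals
    identifier = x.round(precision).astype('str') + ',' + \
                 y.round(precision).astype('str') + ',' + \
                 area.round(precision).astype('str')

    n_unique = len(np.unique(identifier))
    if len(identifier) > n_unique:
        print('Non-unique id values present.  %s rows and %s unique values.' % (len(identifier), n_unique))

    if hash_values:
        identifier = identifier.apply(get_md5_hexdigest)
    return identifier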

tbuckl commented 9 years ago

Thanks @janowicz. I think that's functionally very similar to this approach.

In any case, I think the key here will be writing tests.
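One form such a test might take is to compare the IDs generated on two machines and report the share that agree, which also gives a number to hold against whatever join threshold is agreed on. A minimal sketch, with made-up IDs and a hypothetical helper `match_rate`:

```python
def match_rate(ids_a, ids_b):
    """Fraction of the distinct IDs in ids_a that also appear in ids_b."""
    set_a, set_b = set(ids_a), set(ids_b)
    if not set_a:
        return 0.0
    return len(set_a & set_b) / len(set_a)


# Made-up IDs standing in for the output of two data regeneration runs
laptop_ids = ["a1", "b2", "c3", "d4"]
server_ids = ["a1", "b2", "x9", "d4"]

rate = match_rate(laptop_ids, server_ids)
print(rate)  # 0.75
```

A test could then assert that `rate` is at or above the agreed threshold, or exactly 1.0 if the IDs are meant to be fully reproducible.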

akselx commented 9 years ago

Remarkable that centroids could be identical for parcels with different geometries.

This may be a hack, and it may be vulnerable if precision does indeed vary by system. It relies on PostGIS's own ST_AsEWKT, which renders a geometry (the whole geometry, not just the centroid) as a string; I then pull that into pandas and hash it there.

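import hashlib

import pandas as pd
from sqlalchemy import create_engine

engine = create_engine('postgresql://postgres:xxxxxx@localhost:5432/gisdb')
# ST_AsEWKT renders each full geometry as text (the column comes back as st_asewkt)
county_area = pd.read_sql('select gid,ST_AsEWKT(geom) from zones1454', engine)

def hashbrown(s):
    # SHA-1 hex digest of the EWKT string; hashlib needs bytes, so encode first
    return hashlib.sha1(s.encode('utf-8')).hexdigest()

county_area.st_asewkt.apply(hashbrown).head()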
Out[20]:
0    32d484fa10ca4ea9f91dd385759ef0e3e57524c2
1    cb69cf3774eb47edf949c113f38162a703e0f2ce
2    389f9b65bdb42dcba7a8955301c1969cb0fa0b27
3    90db87b0f2922f3407f7fcc93d2c06b2d060143a
4    b8d5ff99f8d97e678d4d4fa2d506ff3b2de409c4
Name: st_asewkt, dtype: object

tbuckl commented 9 years ago

@akselx that's the approach taken here: https://github.com/synthicity/bayarea_urbansim/blob/master/data_regeneration/match_aggregate.py#L762-L768

UDST / bayarea_urbansim

stable identifier for parcels "across runs" #56