agrc / forklift

:tractor::package::sparkles: Slinging data all over the place :tractor::package::sparkles:

Preserve Global IDs #341

Closed · stdavis closed this issue 7 months ago

stdavis commented 3 years ago

From the illustrious @gregbunce:

[screenshot: suggestion to enable the Preserve Global IDs geoprocessing environment setting]

I wonder if this setting should be enabled globally in the base forklift project, or just in the AGOL pallet.

The docs mention that this setting only applies to enterprise geodatabase data when using the append tool. I'm not sure if that refers to the input, output, or both parameters. Forklift does use the append tool to update destination data.

@gregbunce Have you tried this setting when using the append tool to pull data from Internal into an FGDB?

gregbunce commented 3 years ago

Yes, I do use it when pulling data from Internal into an FGDB and it works as advertised - globalids are preserved.

thanks for getting this going. i think preserving these IDs is a huge step toward better data collaboration.

stdavis commented 3 years ago

Looks like we may run into issues if the destination dataset doesn't have the appropriate unique index on the global id field. Here's my test:

import arcpy

dest = r'W:\forklift\samples\data\sampledestination.gdb\Counties'
source = r'W:\forklift\samples\data\internal.sde\SGID.BOUNDARIES.Counties'

arcpy.env.preserveGlobalIds = True

print('truncating')
arcpy.TruncateTable_management(dest)

print('appending')
arcpy.Append_management(source, dest, 'NO_TEST')

which threw this error:

truncating
appending
Traceback (most recent call last):
  File "forklift\test.py", line 12, in <module>
    arcpy.Append_management(source, dest, 'NO_TEST')
  File "C:\Program Files\ArcGIS\Pro\Resources\ArcPy\arcpy\management.py", line 6153, in Append
    raise e
  File "C:\Program Files\ArcGIS\Pro\Resources\ArcPy\arcpy\management.py", line 6150, in Append
    retval = convertArcObjectToPythonObject(gp.Append_management(*gp_fixargs((inputs, target, schema_type, field_mapping, subtype, expression), True)))
  File "C:\Program Files\ArcGIS\Pro\Resources\ArcPy\arcpy\geoprocessing\_base.py", line 512, in <lambda>
    return lambda *args: val(*gp_fixargs(args, True))
arcgisscripting.ExecuteError: ERROR 003340: The target dataset must have a GlobalID field with a unique index in order to use the Preserve GlobalIDs geoprocessing environment setting.
Failed to execute (Append).

Maybe a try/except that checks for the error number and retries the append without the env variable set?
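
Something like this minimal sketch, keying off error 003340 from the traceback above (the helper name is made up):

import arcpy

def append_with_preserved_global_ids(source, destination):
    # hypothetical helper: try the append with Preserve Global IDs enabled
    # and fall back to a plain append if the destination is missing the
    # unique index on its GlobalID field (error 003340)
    arcpy.env.preserveGlobalIds = True

    try:
        arcpy.Append_management(source, destination, 'NO_TEST')
    except arcpy.ExecuteError as error:
        if '003340' not in str(error):
            raise

        arcpy.env.preserveGlobalIds = False
        arcpy.Append_management(source, destination, 'NO_TEST')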

I also realized that forklift only uses Truncate Table when it's wired up to change detection. For everything else, it uses update and insert cursors, so I ran this test:

import arcpy

dest = r'W:\forklift\samples\data\sampledestination.gdb\Counties'
source = r'W:\forklift\samples\data\internal.sde\SGID.BOUNDARIES.Counties'

arcpy.env.preserveGlobalIds = True

print('truncating')
arcpy.TruncateTable_management(dest)

print('insert cursor')
fields = [f.name for f in arcpy.ListFields(source) if not f.name.startswith('Shape')] + ['Shape@']
source_rows = arcpy.da.SearchCursor(source, fields)
with arcpy.da.InsertCursor(dest, fields) as cursor:
  for row in source_rows:
    cursor.insertRow(row)

source_rows.reset()
print(next(source_rows))
print(next(arcpy.da.SearchCursor(dest, fields)))

which didn't throw any errors, but the global ids were not preserved:

truncating
insert cursor
(12, '03', 2010031010.0, 2010.0, 'CACHE', 5.0, 'North', 112656, 128289, '{AD3015BE-B3C9-4316-B8DC-03AFBB56B443}', '49005', 2, <Polygon object at 0x294afcc1908[0x294a738fb70]>)
(1, '03', 2010031010.0, 2010.0, 'CACHE', 5.0, 'North', 112656, 128289, '{87545C8A-D34D-43E1-A336-7285E4D0B194}', '49005', 2, <Polygon object at 0x294afd61648[0x294af944f90]>)
gregbunce commented 3 years ago

that makes sense. speed bump. i wonder if we should iterate through all the layers in Internal where GlobalID has no unique index and assign a unique index to that field (see the sketch at the end of this comment).

also, should we ensure all layers in Internal have a GlobalID?
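
A minimal sketch of that index audit, assuming an SDE connection file for Internal (the path and index name are made up):

import arcpy

arcpy.env.workspace = r'W:\forklift\samples\data\internal.sde'  # assumed connection file

for feature_class in arcpy.ListFeatureClasses():
    global_id_fields = arcpy.ListFields(feature_class, field_type='GlobalID')

    if not global_id_fields:
        print(f'{feature_class}: no GlobalID field')
        continue

    field_name = global_id_fields[0].name
    has_unique_index = any(
        index.isUnique and field_name in [field.name for field in index.fields]
        for index in arcpy.ListIndexes(feature_class)
    )

    if not has_unique_index:
        # adding an index in an enterprise geodatabase typically requires
        # connecting as the data owner
        arcpy.AddIndex_management(feature_class, field_name, 'globalid_unique_idx', 'UNIQUE')
        print(f'{feature_class}: added unique index on {field_name}')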

steveoh commented 3 years ago

If you did that, the truncate and load update methods would keep creating larger and larger ids. I think some of our workflows would need to change.

gregbunce commented 3 years ago

creating larger objectids?

steveoh commented 3 years ago

> creating larger objectids?

Global ids. Which kind of makes them useless, right?

gregbunce commented 3 years ago

yeah, i see your point: if the data is updated in Internal via truncate and load then the globalids continually get recreated, making them only useful within the current update cycle.

hmm... i'm trying to tie this idea of consistent globalids across datasets (Internal, AGOL/OpenData, fgdb downloads, etc.) into your chalkdust remake, as well as into promoting more collaboration/feedback with our users... using a stable id to keep us all on the same page about which feature we're talking about.

for the roads data, the globalid works b/c I'll be preserving the globalids in Internal based on what's in UTRANS, where they are generated. this will allow us to have a consistent and stable globalid across UTRANS, Internal, and the NG911 datasets. this globalid is used in the NG911 error reporting and it helps tie the feedback/fixes back to the source UTRANS data. it would be nice if the roads globalid was consistent in AGOL/OpenData as well. again, thinking about feedback and collaboration in the AGOL/OpenData world... users can recommend changes/fixes and we can tie them back to the UTRANS segment based on the globalid.

long-winded, I know. but your point is taken. aside from the roads data, the current update process of truncate and load kind of throws a monkey wrench into this for most other datasets.

Granted, it would still be nice to preserve the ids (keep them consistent between Internal and AGOL/OpenData) for the roads dataset since those will remain stable.

to make that happen, i guess that means we'd have to ensure that all existing data in Internal that has a GlobalID field has a unique index on it, right? that would prevent the preserve ids environment setting from erroring (assuming the roads data is updated in AGOL/OpenData via the append method @stdavis described above, and not the insert cursor method)?

stdavis commented 3 years ago

I think there is a possibility of making this work for roads, at least in the short term. I think we should add some checking to forklift for global id fields and the required index, then apply the preserve ids env var for each append individually. This wouldn't fix it for stuff that's updated via the insert cursor, but that definitely doesn't apply to roads or anything else coming out of Internal.
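
A sketch of what that per-append check might look like (the function name is hypothetical, not forklift's actual API):

import arcpy

def can_preserve_global_ids(destination):
    # Preserve Global IDs only works when the destination has a GlobalID
    # field backed by a unique index, so check both before enabling it
    global_id_fields = [field.name for field in arcpy.ListFields(destination, field_type='GlobalID')]

    if not global_id_fields:
        return False

    return any(
        index.isUnique and global_id_fields[0] in [field.name for field in index.fields]
        for index in arcpy.ListIndexes(destination)
    )

Each append could then set arcpy.env.preserveGlobalIds = can_preserve_global_ids(dest) just before calling Append_management.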

I did verify that global ids are preserved when forklift pushes the data to AGOL by checking that they match between features in the forklift staging FGDBs and the corresponding hosted feature layer in AGOL.

I'll work on a PR to make this happen.

stdavis commented 8 months ago

We are running into this same issue in the WRI forklift instance. Global IDs are not being preserved in the Range Trends data (SiteInfo). This is a problem because they are used to link to the range trends site (e.g. https://dwrapps.utah.gov/rangetrend/rtstart?SiteID=%7B3E045EEF-FB32-47FD-8BCD-8BB4501F58E6%7D).

This instance does not use change detection.

I wonder if the fix is to apply the same logic in change detection: https://github.com/agrc/forklift/blob/d415ed4fe2d88b4c966b9ec38d6fa43ca654dc4f/src/forklift/change_detection.py#L71-L84

to the core forklift update: https://github.com/agrc/forklift/blob/d415ed4fe2d88b4c966b9ec38d6fa43ca654dc4f/src/forklift/core.py#L105
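
Roughly, a sketch of that idea with placeholder names (not forklift's actual core API), assuming the change detection code just enables the environment setting around its truncate and append:

import arcpy

def update(source, destination):
    # placeholder sketch of the core update: enable Preserve Global IDs
    # around the truncate and append, mirroring change_detection.py
    arcpy.env.preserveGlobalIds = True

    try:
        arcpy.TruncateTable_management(destination)
        arcpy.Append_management(source, destination, 'NO_TEST')
    finally:
        arcpy.env.preserveGlobalIds = False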

steveoh commented 8 months ago

Could we calculate a field with those values as a string or something to preserve them for now?
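
A minimal sketch of that stopgap, run against the source data before the load (the path and field name are made up):

import arcpy

site_info = r'C:\wri\data.gdb\SiteInfo'  # hypothetical path to the source data

# copy the GlobalID values into a plain text field so they ride along as an
# ordinary attribute through the truncate and load (GUID strings are 38 chars)
arcpy.AddField_management(site_info, 'OriginalGlobalID', 'TEXT', field_length=38)
arcpy.CalculateField_management(site_info, 'OriginalGlobalID', '!GlobalID!', 'PYTHON3')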

stdavis commented 8 months ago

> Could we calculate a field with those values as a string or something to preserve them for now?

That could be a good fallback, but I think I've come up with a solution. I'm about to submit a PR.