crackernutter / EsriRESTScraper

A Python class that scrapes ESRI Rest Endpoints and exports data to a geodatabase
MIT License

Kudos and questions and possible issue #4

Closed pkdavidson closed 7 years ago

pkdavidson commented 7 years ago

Awesome work! This needs more widespread distribution.
I'll post a comment on GeoNet in the next few days. I had Googled for something like this in the past few months without any luck. This time my search turned up your work, thankfully, and it should save me a few days of effort (or a few weeks...).

I have a few questions and potential issues.

I have a REST endpoint I need to query and pull data inside our Portal. The full path w/ the shortest query that gives me what I need is: 'http://coagisweb.cabq.gov/arcgis/rest/services/public/Barricades/MapServer/8/query?where=FacilityOwner+LIKE+%27%25ABCWUA%25%27+AND+Timing+%3C%3E+%27PastPermit%27&outFields=*&orderByFields=dmdsde.dmd.v_EdgeSoft_Line_Permits_PROD.DATE_Created&f=json'

To put that query into what I first thought should be your query format: queryList = ["FacilityOwner LIKE '%ABCWUA%'" , "Timing <> 'PastPermit'"]

First question: after digging around a bit, it looks like your concept of a query list is meant to return blocks of data in order to deal with the standard 1000-record limit. The idea is not to build a series of ANDs or ORs that create a "mildly complex" query, but to chunk the queries into blocks of data.

Hence, in my case, I should be using: queryList = ["FacilityOwner LIKE '%ABCWUA%' AND Timing <> 'PastPermit'"] rather than what I wrote previously. Correct?
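To make the chunking concrete for anyone else who lands here, something like the following is what the query list seems to actually be for (the OBJECTID field and ranges are purely hypothetical):

```python
# Non-overlapping queries, each expected to return fewer than 1000
# records; the scraper runs them serially and appends the results.
queryList = [
    "OBJECTID >= 0 AND OBJECTID < 1000",
    "OBJECTID >= 1000 AND OBJECTID < 2000",
    "OBJECTID >= 2000",
]
```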

Second question (or call it a suggested improvement and a quick-hack question): to add in the orderByFields I am utilizing above, I would just need to modify your rValues as:

```python
rValues = {"where": query, "f": "json", "returnCountOnly": "false", "orderByFields": orderBy, "outFields": "*"}
```

And modify the function definition as:

```python
def updateFeatureClass(self, featureClass, query=["1=1"], append=False, userFields=[], orderBy=""):
```

This of course is a pretty quick hack. Your code looks good, so it shouldn't be too painful to add in some error checking on orderBy. I think I just need to verify it's a string?
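Maybe something as simple as this (just a sketch):

```python
# Minimal validation before orderBy is added to rValues (sketch):
if not isinstance(orderBy, str):
    raise TypeError("orderBy must be a string of comma-separated field names")
```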

Ditto for many other fields one might need in the rValues?

Final question: I noticed the userFields=[] in the function definition of updateFeatureClass. Can you expand on that? Is it something we should be aware of? As near as I can tell from a quick glance at the code, it's a list of fields to be ignored when creating or updating the feature class?

Final note: Your example map service appears to have been renamed from: nhss_haz to igems_haz

At least that simple mod worked for me.

Now for my two issues: I've run into the same problem previously described with "." being transferred to "_" in the FGDB schema. I believe the "." in names is an artifact of earlier Esri versions (9.x or earlier). In fact, the source of the data is one of Esri's first 10 user accounts, so there's no telling how far back the schema might go. The server is 10.2.2.

I did a quick hack on the code to convert any "." from the REST endpoint into "_" in order to avoid a schema mismatch in the FGDB. From what I've come across so far, dealing with the "." requires adding one line of code in two locations.

Then I was getting geometry errors. I tracked it down to what I'm pretty sure is a typo in your __getGeometry function: for Polyline, instead of getMultiGeometry, it's getMuliGeometry, which throws an undefined-name error.

I'll spend more time on this tomorrow at work and see if the simple hacks I made need more effort.

It does look like I have this working now. I'll try to do it right and push it back at you with the mods in place so you can decide if they need more effort. Or how you would want to integrate them into your code, if at all.

Cool stuff. Just saved me a lot of work! Our Portal is https and a lot of the locally published data is http only, so it's often easiest to bring things in house.
Great work for your first GitHub share!

crackernutter commented 7 years ago

Paul - thanks so much! I'm really glad folks find this useful. It's proven invaluable to my work. I'll try to address your questions here:

Regarding queries, you are correct: the idea of an array of queries is to allow a user to input multiple, non-overlapping queries in case a single query returns more than 1000 records. Your first example, ["FacilityOwner LIKE '%ABCWUA%'" , "Timing <> 'PastPermit'"], essentially translates to "FacilityOwner LIKE '%ABCWUA%' OR Timing <> 'PastPermit'", because the class hits the endpoint with each of those queries serially, instead of what you want, "FacilityOwner LIKE '%ABCWUA%' AND Timing <> 'PastPermit'". Hope that makes sense.
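In other words, the behavior is roughly this (a sketch of the effect, not the literal implementation):

```python
# Each entry in queryList is sent as its own where clause and the
# results are appended, so two entries behave like a UNION (an OR).
for query in queryList:
    rValues = {"where": query, "f": "json", "outFields": "*"}
    # ...fetch the records for this chunk and insert them...
```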

You can certainly add an orderBy parameter to the main function, but since I am just pulling the data into my own internal geodatabase, the order of the results is not important (since any geoprocessing or operations done on the table will ostensibly be done after the data is pulled in). You might want to add a time filter or something else to limit your results, but ordering the results doesn't really change anything. If you ever instantiate a cursor on your table, you can order the results there, but for just pulling in data, I don't think it's necessary.

The userFields parameter is something I added much later, and you are right: it's a list of fields to be ignored, fields that don't exist in the source dataset but do in the destination. Since I'm using this to hit public sites, and the schemas of the public REST endpoints I'm caching occasionally change, I had added a SchemaMismatch exception to alert the user if a REST endpoint diverged from the destination table. However, I found later that I wanted my destination table to have some additional fields. For instance, a common workflow might be a) cache records from a public REST endpoint to an internal table, and b) perform some other operation on the records that computes the value of a different field (a field that doesn't exist in the source table). So I added the userFields parameter so the function wouldn't throw a SchemaMismatch error when it sees additional fields in the destination table that aren't in the source. Hopefully that makes sense too!
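As a quick illustration (the scraper instance, geodatabase path, and "DistanceToHQ" field below are all hypothetical):

```python
# The destination table carries an extra computed field that the source
# endpoint doesn't have; listing it in userFields tells the schema check
# to ignore it instead of raising SchemaMismatch.
scraper.updateFeatureClass(r"C:\data\cache.gdb\permits",
                           query=["1=1"],
                           userFields=["DistanceToHQ"])
```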

Thanks for alerting me to the two issues. I changed the MuliGeometry to MultiGeometry, and I'll update the ReadMe with the new REST endpoint. And I've never run into the issue with "." being converted to "_" (at least I don't remember), but if that is happening, a fix is most appreciated!

Thanks again for the feedback!

pkdavidson commented 7 years ago

Hey Jay: thanks for the quick response; your answers all make sense. I agree the orderBy is not a necessary function, since you can always add an index, etc. on the new side. OTOH, I've found that starting with clean, ordered data is often a more optimal setup. In our case we're talking a few hundred records max, so me worrying about optimal is a waste of my time. ;-)

The userFields parameter is a great idea and greatly increases the functionality. Like you said, it seems like most times we pull others' data, we end up adding some fields in order to increase the functional use of the data.

I'll pull a fork and look into putting in my minor mods for the simple "." to "_" fix. I should vet them a bit more, and they may be unique enough that rather than merging them into master, you would just leave them documented in this thread.
I'll have to think about this some more. I think the question is: any time there is a "." in a field (attribute) name, does that indicate a map service created from an older dataset, and will that "." always be converted to a "_" in the new FGDB? If so, my simple fix might take care of this. This will require some digging into Esri history, I suspect. And then the question is whether it's worth it, or whether it's better to just be aware of the issue and modify your code accordingly on a case-by-case basis.

Or add in a class flag that indicates whether to utilize the fix, etc. I might go that route, something like the sketch below.
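(The sanitizeFieldNames flag here is purely hypothetical, not in the current code:)

```python
# Hypothetical flag gating the "." -> "_" replacement:
def updateFeatureClass(self, featureClass, query=["1=1"], append=False,
                       userFields=[], sanitizeFieldNames=False):
    ...
    updateFields = [f['name'] for f in self.updateFields]
    if sanitizeFieldNames:
        updateFields = [f.replace('.', '_') for f in updateFields]
```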

Basically I modified two functions, starting with the instantiate-cursor section of updateFeatureClass(...), which originally reads:

```python
# instantiate cursor
updateFields = [f['name'] for f in self.updateFields]
cursor = arcpy.da.InsertCursor(featureClass, updateFields)
```

I added one line, as follows:

```python
# instantiate cursor
updateFields = [f['name'] for f in self.updateFields]
updateFields = [f.replace('.', '_') for f in updateFields]  # pkd, 9/25/16
cursor = arcpy.da.InsertCursor(featureClass, updateFields)
```

You might vet that line of code, as I'm an old-school C guy and the new pythonista lingo is still somewhat foreign to me. I basically guessed at that code, tried it out, and verified it did what I wanted via a debugger and watch variables.

I did basically the identical thing in __matchSchema(...), where the code just before the return was:

```python
fClassFields.insert(0, 'Shape@')
objFields = [f['name'] for f in self.updateFields]
return sorted(fClassFields) == sorted(objFields)
```

and the added line gives:

```python
fClassFields.insert(0, 'Shape@')
objFields = [f['name'] for f in self.updateFields]
objFields = [f.replace('.', '_') for f in objFields]  # pkd, 9/25/16
return sorted(fClassFields) == sorted(objFields)
```

One thing that occurs to me as a possibly useful improvement, and one that would probably deal with this issue in a more robust and general way, would be to allow for some way to map field names from the REST endpoint to the FGDB.

For example, field names in the published map service I'm grabbing from are somewhat lengthy: 'dmdsde.dmd.v_EdgeSoft_Line_Permits_PROD.DATE_Created'

We can alias them on our end in the FGDB and in the web app we'd publish, but I'd be happy to just have them much simpler in the schema, for example: DATE_CREATED

This would probably require a fair bit of work to make it generic enough to be useful. Something like that could probably be worked out so that you could bring over only certain fields rather than all the attributes. Pie in the sky...
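Just to sketch what I mean (the fieldMap idea is purely hypothetical, not a proposal for a finished API):

```python
# Map long source field names to simpler destination names; fields not
# listed pass through unchanged (or could be skipped entirely).
fieldMap = {
    "dmdsde.dmd.v_EdgeSoft_Line_Permits_PROD.DATE_Created": "DATE_CREATED",
}
destFields = [fieldMap.get(f['name'], f['name']) for f in self.updateFields]
```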

For now, I'm going with my minor mods above.... :-) This has already saved me a lot of effort...