agrc / palletjack

A library for updating AGOL data from various external sources
https://agrc.github.io/palletjack/palletjack/
MIT License

Geocode a dataframe using the Web API geocoder #16

Closed · jacobdadams closed 2 years ago

jacobdadams commented 2 years ago

Adds an APIGeocoder class for geocoding dataframes. Also adds transform.py module in preparation for restructuring around the ETL steps (probably in v3.0.0)
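
For context, the row-wise pattern a dataframe geocoder like this wraps looks roughly like the sketch below; the endpoint shape, response keys, and column names are illustrative assumptions rather than palletjack's actual API.

import pandas as pd
import requests

# Illustrative only: geocode one dataframe row against the Web API geocoder.
# The URL format, 'apiKey' parameter, and response keys are assumptions here.
def geocode_row(row, api_key):
    url = f"https://api.mapserv.utah.gov/api/v1/geocode/{row['street']}/{row['zone']}"
    result = requests.get(url, params={'apiKey': api_key}).json().get('result', {})
    location = result.get('location', {})
    return pd.Series({'x': location.get('x'), 'y': location.get('y'), 'score': result.get('score')})

addresses = pd.DataFrame({'street': ['123 S Main St'], 'zone': ['84111']})
# geocoded = addresses.join(addresses.apply(geocode_row, axis=1, api_key='your-key'))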

jacobdadams commented 2 years ago

> This is super cool. I wonder if this is a bit of a DOS hazard though.

Yeah, that is a concern. I'm thinking that if you're interested in DOSing us, though, it'd be trivial to either remove any rate limiters/sleepers we put in this, or build your own DOSing script using the API console and/or the example code in the repos from a couple years ago (which I pulled from to create this).

At the risk of passing the buck, can we implement sleeping/abuse prevention on the server side? Somehow track requests per second from a given API key and slow them down after a threshold and softban after a larger threshold?
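
A server-side limiter along those lines could be a per-key sliding-window counter; the sketch below is purely illustrative, and the thresholds are invented.

import time
from collections import defaultdict, deque

SLOW_THRESHOLD = 10     # requests per window before responses get delayed
SOFTBAN_THRESHOLD = 50  # requests per window before responding with 429
WINDOW_SECONDS = 1.0

_recent_requests = defaultdict(deque)

def check_rate(api_key):
    """Record one request and return 'ok', 'slow', or 'banned' for this key."""
    now = time.monotonic()
    window = _recent_requests[api_key]
    window.append(now)
    while window and now - window[0] > WINDOW_SECONDS:
        window.popleft()
    if len(window) > SOFTBAN_THRESHOLD:
        return 'banned'  # caller would answer with 429 Too Many Requests
    if len(window) > SLOW_THRESHOLD:
        return 'slow'    # caller would sleep a bit before answering
    return 'ok'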

codecov[bot] commented 2 years ago

Codecov Report

Merging #16 (336e99c) into main (85d4bd0) will increase coverage by 1.62%. The diff coverage is 98.88%.

@@            Coverage Diff             @@
##             main      #16      +/-   ##
==========================================
+ Coverage   87.09%   88.72%   +1.62%     
==========================================
  Files           5        6       +1     
  Lines         527      612      +85     
  Branches       73       85      +12     
==========================================
+ Hits          459      543      +84     
  Misses         65       65              
- Partials        3        4       +1     
Impacted Files                Coverage Δ
src/palletjack/utils.py       94.49% <97.91%> (+2.55%) ↑
src/palletjack/__init__.py    100.00% <100.00%> (ø)
src/palletjack/transform.py   100.00% <100.00%> (ø)
src/palletjack/updaters.py    91.61% <100.00%> (ø)

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data

stdavis commented 2 years ago

> Yeah, that is a concern. I'm thinking that if you're interested in DOSing us, though, it'd be trivial to either remove any rate limiters/sleepers we put in this, or build your own DOSing script using the API console and/or the example code in the repos from a couple years ago (which I pulled from to create this).

I think that the more likely scenario is an inadvertent DOS attack. I wonder if something like this [the throttling in the geocoder toolbox] would be worth adding.

steveoh commented 2 years ago

I do plan to implement rate limiting in the API with the new funding but it's not going to happen in the short term.

I do realize it's easy to abuse the API at the moment, and we've implemented throttles in all the clients we've made so far, so I would like to see something here as well.
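
Those client-side throttles generally amount to a short pause before each request; a minimal sketch of that pattern, with an arbitrary delay range:

import random
import time

def throttled(func, *args, delay_range=(0.015, 0.03), **kwargs):
    """Pause briefly before calling func so row-by-row requests don't arrive as one burst."""
    time.sleep(random.uniform(*delay_range))
    return func(*args, **kwargs)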

jacobdadams commented 2 years ago

Ok, that makes sense, I didn't realize the concern was inadvertent DOSing or that we were throttling elsewhere. Thanks for the link, @stdavis. I'll implement something like that.

Also, the CI pipeline is failing with arcgis==2.0.* (reported here). I'm going to revert to 1.9.* for now.

jacobdadams commented 2 years ago

Ok @stdavis and @steveoh , let me know if you think this rate limiting is sufficient (it's basically the same as the geocoder toolbox Scott linked to).

jacobdadams commented 2 years ago

Belay that; I really should be capturing the other info (score, match addr, etc). More to come...

steveoh commented 2 years ago

I have a hunch the code you've added is not going to make a difference with the number of requests being made. I don't understand how apply works, and I'm having a hard time finding the information I would need to know for sure. I think it will still create a large burst of HTTP requests from all the rows, each with a given sleep offset, which is not what you were intending.

It would be safer, in my opinion, to loop over the series and perform the geocode with the randomized delay instead of using apply, at least until we are sure how apply executes the function.

As an FYI, when rate limiting is implemented we will need to handle 429 (Too Many Requests) response codes in the clients we have created.
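
Putting those two points together, a loop-based version with a randomized delay and basic 429 handling might look like the sketch below; the endpoint, key parameter, delay range, and retry count are all illustrative assumptions.

import random
import time

import requests

def geocode_with_backoff(url, params, max_retries=3):
    """Make one geocode request, backing off and retrying when the server answers 429."""
    for attempt in range(max_retries):
        response = requests.get(url, params=params)
        if response.status_code != 429:
            return response
        time.sleep(2 ** attempt)  # wait longer after each Too Many Requests answer
    return response

def geocode_addresses(street_zone_pairs, api_key):
    """Loop over the addresses (rather than using apply) with a randomized delay between calls."""
    results = []
    for street, zone in street_zone_pairs:
        time.sleep(random.uniform(0.015, 0.03))  # spread the requests out a little
        response = geocode_with_backoff(
            f'https://api.mapserv.utah.gov/api/v1/geocode/{street}/{zone}',
            {'apiKey': api_key},
        )
        results.append(response.json().get('result'))
    return results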

steveoh commented 2 years ago

So I take that all back. It looks like apply executes in sequence, so it should be all good. My test was:

import pandas as pd
from time import sleep

df = pd.DataFrame(range(5))

def check(row):
    # one-second pause per row makes the execution order easy to observe
    sleep(1)
    print(f'row {row} finished')

df[0].apply(check)
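
Run as-is, the 'row ... finished' lines appear one at a time, about a second apart, which is what shows apply calling the function for each row in turn rather than firing them all at once.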

If you're planning to do more work, will you convert to a draft until you're ready to merge?

jacobdadams commented 2 years ago

Yeah, I've heard df.apply described as a fancy wrapper for iterating over rows (often in the context of how it's a lot slower than a purely vectorized operation).
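
A quick illustration of that distinction:

import pandas as pd

df = pd.DataFrame({'x': range(100_000)})

# apply: the Python function runs once per element, one after another
doubled_with_apply = df['x'].apply(lambda value: value * 2)

# vectorized: a single operation over the whole column, handled in compiled code
doubled_vectorized = df['x'] * 2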

Forgot about the draft feature for a minute there. Changes are in, reverting to open.

jacobdadams commented 2 years ago

Ok, error handling has been beefed up. How's this look?

jacobdadams commented 2 years ago

Converting to draft to investigate iterating over rows instead of apply for progress tracking.

jacobdadams commented 2 years ago

Ok, let's get this out the door.