Streets-Data-Collaborative / OpenStreetCam-GeoParsing-Tool

Create a tool that, given a city, can pull every OpenStreetCam (OSC) track sequence ID associated with that city.
Apache License 2.0

Define a getAlltracks() function that takes a city's name as input and returns a list of all `sequence_id`s from that city #4

Closed: dmarulli closed this issue 6 years ago

dmarulli commented 6 years ago

This function can loop through the lat/lng list returned by getIntersection() to generate inputs for getNearbytracks().
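
A minimal sketch of that wiring, assuming getIntersection() takes the city name and returns (lat, lng) pairs and getNearbytracks() returns a list of sequence IDs (both signatures are assumptions at this point):

def getAlltracks(city):
    # Hypothetical sketch: one nearby-tracks request per intersection.
    sequence_ids = []
    for lat, lng in getIntersection(city):
        sequence_ids.extend(getNearbytracks(lat, lng))
    return sequence_ids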

charlie-moffett commented 6 years ago

import requests

url = 'http://openstreetcam.org/nearby-tracks'

def getNearbytracks(lat, lng):
    # Query OSC's nearby-tracks endpoint for tracks around the given
    # point and return their sequence IDs.
    data = {'lat': lat, 'lng': lng, 'distance': '5',
            'myTracks': 'false', 'filterUserNames': 'false'}
    r = requests.post(url=url, data=data)
    extract = r.json()
    try:
        sequences = extract['osv']['sequences']
    except KeyError:
        # No tracks near this point.
        return []
    return [seq['sequence_id'] for seq in sequences]

charlie-moffett commented 6 years ago

I've pushed getAlltracks.py to the repo. I handled deduping by instantiating the container for sequence IDs (`city_sids`) as a set instead of a list.
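
For illustration, the dedup change would amount to something like this sketch (not the pushed getAlltracks.py itself; getIntersection()'s signature is assumed as above):

def getAlltracks(city):
    city_sids = set()  # a set silently drops duplicate IDs
    for lat, lng in getIntersection(city):
        city_sids.update(getNearbytracks(lat, lng))
    return city_sids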

For the city of Sebastopol, which had ~750 coordinate pairs from getIntersection(), I'm getting 20 tracks. Next I'll run it for Berkeley, which has ~6k coordinate pairs; the function took 4 minutes or so for Sebastopol, which is why I started with a smaller city. Do you have any advice on how I might improve performance here?

dmarulli commented 6 years ago

Okay, nice.

Hm, my guess would be that the rate-limiting step is the network request itself, as opposed to anything on our end. I'm not immediately sure there's much to do in this case without altering our basic approach of collecting all the tracks in a city by sending one request per intersection.

I would say that unless inspiration strikes, let's not worry too much about runtime for now and keep moving forward. If it has to run for hours for large cities, that isn't much of a problem, because when we do this processing "for real" it will take place on a dedicated remote server, not a local laptop.

This does bring up another common web-scraping challenge we may need to address, though: for large cities we will be pinging the OSC servers with many requests, which they may not like if too many come in too quickly. The simple workaround is to add a short sleep between requests (0.5s or 1s should probably be fine), as in the sketch below. We don't have to implement this immediately if we aren't seeing any errors, but I'm surfacing the problem/solution now in case it does come up.
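
If throttling does become necessary, the delay would slot straight into the request loop; a sketch against the getAlltracks() loop above (the pause length is a guess):

import time

def getAlltracks(city, pause=0.5):
    city_sids = set()
    for lat, lng in getIntersection(city):
        city_sids.update(getNearbytracks(lat, lng))
        time.sleep(pause)  # brief pause so we don't hammer the OSC servers
    return city_sids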

If things are behaving well apart from speed, let's move on to setting up a function to push the sequence_ids up to a Google Form (#5). After digging into this a bit, you may find you need some credentials; if so, I can get those to you once we know what they are.
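
For what it's worth, pushing rows into a Google Form usually means POSTing against the form's public formResponse endpoint; the form ID and entry field ID below are placeholders that would come from the actual form in #5:

import requests

# Hypothetical sketch: FORM_ID and entry.123456 are placeholders.
# A public form accepts anonymous POSTs; a restricted one is where
# credentials would come in.
FORM_URL = 'https://docs.google.com/forms/d/e/FORM_ID/formResponse'

def pushSequenceIds(sequence_ids):
    for sid in sequence_ids:
        requests.post(FORM_URL, data={'entry.123456': str(sid)})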

charlie-moffett commented 6 years ago

Kicking the tires a bit and things aren't yet behaving well. Will whip into shape before moving to #5.

charlie-moffett commented 6 years ago

I've fixed getAlltracks() and pushed the changes to the repo. Moving on to issue #5 this afternoon; I'll let you know which credentials, if any, I need after diving in.