GreenInfo-Network / caliparks.org

Mapping social media in parks and other open spaces
http://www.caliparks.org/

Pull instagram geolocation data from DB for area near Congressional District 39 #664

Closed danrademacher closed 6 years ago

danrademacher commented 6 years ago

This is not directly connected to the website, but does require use of the backing data. Bill to the Caliparks maintenance job in QB

RLF wants us to make a map of connections between the 39th Congressional District and Instagram posts in public parks along the coast. @tsinn will make the final map (similar to a County>Coast thing we did for UCLA last year), but first we need to get some data.

Ultimately, we need all the IG posts in coastal parks where the user's home location is inside the 39th congressional district.

Pulled down current House districts to P (Mac path: P/proj_p_s/ResourcesLegacyFund/CALIPARKS/instagramMapping2018).

But for the county map, Stamen provided the data preprocessed to the best-guess county centroid of the IG user. So the first question is what we have in the database to allow that. Maybe a user profile city name or similar?

If it's a string that needs geocoding, this could get time-consuming and costly. In that case, we might want to restrict posts to coastal parks first (with help from Tim to recall how we did that last time), geocode only those, and then filter to the 39th district.
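The restrict-then-geocode idea above could be sketched roughly like this. This is only a sketch: `geocode_city` and `in_district_39` are hypothetical stand-ins for whatever geocoder and point-in-polygon test we end up using, and the input is assumed to already be limited to coastal-park photos.

```python
def users_in_district(photo_records, geocode_city, in_district_39):
    """photo_records: iterable of (user_id, home_city_string) pairs for
    photos already restricted to coastal parks.

    geocode_city and in_district_39 are hypothetical callables:
    geocode_city(city) -> (lon, lat) or None; in_district_39((lon, lat)) -> bool.
    Returns the set of user ids whose geocoded home falls in the district.
    """
    cache = {}      # geocode each distinct city string only once (calls cost money/time)
    matched = set()
    for user_id, city in photo_records:
        if not city:
            continue  # no profile city string to work with
        if city not in cache:
            cache[city] = geocode_city(city)
        lon_lat = cache[city]
        if lon_lat and in_district_39(lon_lat):
            matched.add(user_id)
    return matched
```

The cache matters because many users will share the same city string, so the number of paid geocoder calls is the number of distinct cities, not the number of photos.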

Here's the county map we did last time for reference: [image]

End product would likely be a single dot in 39th radiating out to many parks.

gregallensworth commented 6 years ago

The coastal_photos table has these fields:

 photo_id     | character varying(40) | 
 metadata     | json                  | 
 geom         | geometry(Point,4326)  | 
 superunit_id | integer               | 

The photo metadata does not seem to include much personal detail such as home location, just the poster's username and full name:

{
"type":"image",
"id":"1176141650918717185_7070844",
"attribution":null,
"tags":[],
"location":{"latitude":39.392445,"name":"Jackson State Forest","longitude":-123.648923,"id":1024907777},
"comments":{"count":0,"data":[]},
"filter":"Normal","created_time":"1454427032","link":"https://www.instagram.com/p/BBSfpmcjTcB/",
"likes":{"count":7,"data":[
    {"username":"nrcross_","profile_picture":"https://scontent.cdninstagram.com/t51.2885-19/s150x150/12224454_182667155409552_1826488624_a.jpg","id":"42351896","full_name":"Nathan Cross"},
    {"username":"dgirard10","profile_picture":"https://scontent.cdninstagram.com/t51.2885-19/s150x150/11262620_422668554595655_1255753760_a.jpg","id":"12478249","full_name":"Dylan Girard"},
    ...etc...
]},
"images":{"low_resolution":{"url":"https://scontent.cdninstagram.com/t51.2885-15/s320x320/e35/12407205_1594417670778980_137398085_n.jpg","width":320,"height":320},"thumbnail":{"url":"https://scontent.cdninstagram.com/t51.2885-15/s150x150/e35/12407205_1594417670778980_137398085_n.jpg","width":150,"height":150},"standard_resolution":{"url":"https://scontent.cdninstagram.com/t51.2885-15/s640x640/sh0.08/e35/12407205_1594417670778980_137398085_n.jpg","width":640,"height":640}},
"users_in_photo":[
    {"position":{"y":0.708,"x":0.749333333},"user":{"username":"sarahlizbeth","profile_picture":"https://scontent.cdninstagram.com/t51.2885-19/11055503_632051400260313_1875680088_a.jpg","id":"7067991","full_name":""}}
    ...etc...
],
"caption":{"created_time":"1454427032","text":"Hiking",
"from":{"username":"jeff_quinn","profile_picture":"https://scontent.cdninstagram.com/t51.2885-19/10950568_1565864193682783_157122195_a.jpg","id":"7070844","full_name":"Jeff Quinn"},"id":"1176141660691444784"},
"user":{"username":"jeff_quinn","profile_picture":"https://scontent.cdninstagram.com/t51.2885-19/10950568_1565864193682783_157122195_a.jpg","id":"7070844","full_name":"Jeff Quinn"}}
}

I show 1,826,552 photos at present. It would not be overly difficult to export the metadata in some format, then write a Python program to tease it apart and get at what we do have.

Still, this does not seem to include anything to further target the poster's home location.
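For reference, a minimal version of that "tease it apart" step might look like the following. This assumes the export produces one metadata JSON document per line, with the `user` field shaped as in the sample above; everything else about the export format is an assumption.

```python
import json

def distinct_posters(metadata_lines):
    """Collect the distinct posters from exported photo metadata.

    metadata_lines: iterable of strings, each one JSON document with the
    fields shown in the sample above (assumed export format).
    Returns {user_id: (username, full_name)}.
    """
    users = {}
    for line in metadata_lines:
        meta = json.loads(line)
        user = meta.get("user") or {}
        if "id" in user:
            # dict keyed by id deduplicates users who posted many photos
            users[user["id"]] = (user.get("username"), user.get("full_name"))
    return users
```

Running this over the full export would also answer the open question below of how many distinct users the 1.8M photos represent.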

gregallensworth commented 6 years ago

Given the user-ID, in theory one could brute-force the Instagram API to get at user details.

API output

https://www.instagram.com/developer/endpoints/users/#get_users

The API output seems not to include much personal info: their name, username, and profile picture (which we already have in the photo metadata); the URL of their website; and their "bio" blurb.

API limits

I don't know yet how many distinct users are represented in the 1.8M photos. Part of such a "user info scraper" would include a caching mechanism so as not to re-fetch the same username multiple times. Still, hitting the Instagram API 1 million times, or even 250,000 times, could be a violation of their TOU, as well as being time-consuming.
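The caching mechanism could be as simple as a wrapper around whatever fetch function we end up with, so each distinct user id hits the API at most once per run (a sketch only; `fetch_user` is a placeholder, not a real client):

```python
def cached(fetch_user):
    """Wrap a user-fetching callable so each distinct user id is
    fetched at most once; repeat lookups come from the in-memory cache."""
    cache = {}
    def wrapper(user_id):
        if user_id not in cache:
            cache[user_id] = fetch_user(user_id)  # the only place the API is hit
        return cache[user_id]
    return wrapper
```

For a long-running scrape we'd likely persist the cache to disk as well, so an interrupted run doesn't re-fetch everything; the in-memory dict is the minimal version of the idea.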

Per chat, Stamen had at some point connected users to coordinates or at least counties. Let's connect with them and see if they have any notes on that process, which could be helpful.

danrademacher commented 6 years ago

Here's the word from Stamen -- simpler and more impressionistic, but fine for this use case:

I don't know if I ever assigned "home" counties for users. Rather, we just looked at any users that showed up in parks within each county, and then looked at which coastal parks they also showed up in. So, if we're drawing the connection between Riverside county and a park in, say, Ventura, we just look for any username that shows up in both places. It's possible the user's "home" location is actually in Ventura and they just happened to visit Riverside once. Or it's possible their home is in Sacramento and they just happened to visit both Riverside and Ventura. In all those cases, they'd show up as a link, but we can't distinguish which scenario is which. Also, we were only using the corpus of photos that we harvested in parks, so a user would have to have visited a park in Riverside to even show up in the database. If they live in Riverside, but only ever visited parks in Ventura, we wouldn't show them as a link, because we have no evidence of them being in Riverside.

So in this case, the query would be: users that appear in coastal parks (the coastal_photos table) who also appear in parks that fall inside the 39th District.

That seems much more straightforward.

gregallensworth commented 6 years ago

To rephrase for myself, the desired end result is:

This would be the count of distinct Instagram user IDs found among photos for that park superunit_id, where that same user ID is seen at least once in a photo for a park within Congressional District 39.

Also noted that this is likely to be replicated for other congdists.
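Treating each photo as a (user_id, superunit_id) pair, the calculation above reduces to a set intersection. A sketch, assuming we've separately derived which superunit_ids are coastal parks and which fall inside the district (the real run would be a query against coastal_photos, but the logic is the same):

```python
from collections import defaultdict

def users_per_coastal_park(photos, coastal_units, district39_units):
    """photos: iterable of (user_id, superunit_id) pairs.
    coastal_units / district39_units: sets of superunit_ids.

    Returns {coastal superunit_id: count of distinct users for that park
    who also appear in at least one photo in a District-39 park}.
    """
    # Users seen at least once in any District-39 park
    district_users = {u for u, su in photos if su in district39_units}
    # For each coastal park, the distinct users who are also district users
    per_park = defaultdict(set)
    for u, su in photos:
        if su in coastal_units and u in district_users:
            per_park[su].add(u)
    return {su: len(users) for su, users in per_park.items()}
```

Replicating this for other congressional districts just means swapping in a different `district39_units` set.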

Steps:

gregallensworth commented 6 years ago

I did the calculations as expected, and have placed them onto GreenInfo's internal file storage for reference. Let's discuss!