codeforokc / school-finder

Geolocation-based web app for locating schools and school districts near you
MIT License

Using mapbox for school data querying #6

Closed joekarl closed 9 years ago

joekarl commented 9 years ago

So after looking into what we can do with mapbox directly, here's what I have

Perhaps @jvrousseau or @DevinClark can chime in with thoughts.

jvrousseau commented 9 years ago

@joekarl upload of mbtiles or via MapBox studio does not require the standard plan.

joekarl commented 9 years ago

Right but as far as I can tell using the api requires the standard plan.

jvrousseau commented 9 years ago

@joekarl if we query the dataset we can use turfjs to find the nearest location (if we are talking about proximity to any schools). We can also query what polygon a user is in to determine where a child would feed into based on their location.
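For illustration, the nearest-location lookup could be sketched as below. This is plain Python rather than turfjs so that it runs without dependencies; `haversine_km` and `nearest_feature` are hypothetical helper names, and the school names and coordinates are made up.

```python
import math

EARTH_RADIUS_KM = 6371.0088  # mean Earth radius


def haversine_km(lon1, lat1, lon2, lat2):
    """Great-circle distance between two lon/lat points, in kilometres."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def nearest_feature(lon, lat, collection):
    """Return (feature, distance_km) for the closest Point feature.

    Returns (None, None) when the collection has no Point features.
    """
    best, best_km = None, None
    for candidate in collection["features"]:
        geom = candidate.get("geometry") or {}
        if geom.get("type") != "Point":
            continue  # polygons and lines are ignored in this sketch
        flon, flat = geom["coordinates"][:2]  # GeoJSON order is [lon, lat]
        km = haversine_km(lon, lat, flon, flat)
        if best_km is None or km < best_km:
            best, best_km = candidate, km
    return best, best_km


# Made-up sample data, not real school locations.
SCHOOLS = {
    "type": "FeatureCollection",
    "features": [
        {"type": "Feature", "properties": {"name": "School A"},
         "geometry": {"type": "Point", "coordinates": [-97.52, 35.47]}},
        {"type": "Feature", "properties": {"name": "School B"},
         "geometry": {"type": "Point", "coordinates": [-97.40, 35.60]}},
    ],
}

closest, closest_km = nearest_feature(-97.51, 35.48, SCHOOLS)
print(closest["properties"]["name"], round(closest_km, 2))
```

A linear scan like this is probably adequate for a few hundred points; a spatial index would only matter for much larger datasets.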

joekarl commented 9 years ago

So we'd still have to store the geojson somewhere for turf to get to it, correct?

jvrousseau commented 9 years ago

You wouldn't be able to get point data from the Surface API?

joekarl commented 9 years ago

That just gives you data at a single point. It won't tell you about data around a point (which we'd need when the user is outside a specific area).

jvrousseau commented 9 years ago

@joekarl yup. So only polyline/polygon data is queryable. Correct me if I'm wrong, but do we really need to query the points? Don't we need to query what schools are associated with the specific location?

i.e., my child goes to elementary school A based on the location of the user on the map.
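A rough sketch of that "which zone is the user in" lookup, assuming the attendance zones are available as GeoJSON Polygon or MultiPolygon features. It uses a standard ray-casting test; the helper names and the two square zones are hypothetical, and points exactly on a boundary are not handled specially.

```python
def point_in_ring(lon, lat, ring):
    """Ray-casting test: is the point inside a closed ring of [lon, lat] pairs?"""
    inside = False
    j = len(ring) - 1
    for i in range(len(ring)):
        xi, yi = ring[i][:2]
        xj, yj = ring[j][:2]
        # Does the edge (j -> i) straddle the horizontal line through the point?
        if (yi > lat) != (yj > lat):
            x_cross = xi + (lat - yi) * (xj - xi) / (yj - yi)
            if lon < x_cross:
                inside = not inside
        j = i
    return inside


def point_in_polygon(lon, lat, geometry):
    """True if the point is inside a GeoJSON Polygon or MultiPolygon geometry."""
    if geometry["type"] == "Polygon":
        polygons = [geometry["coordinates"]]
    elif geometry["type"] == "MultiPolygon":
        polygons = geometry["coordinates"]
    else:
        return False
    for rings in polygons:
        if not rings:
            continue
        outer, holes = rings[0], rings[1:]
        if point_in_ring(lon, lat, outer) and not any(
            point_in_ring(lon, lat, hole) for hole in holes
        ):
            return True
    return False


def zone_for(lon, lat, zones):
    """Return the first zone feature containing the point, or None."""
    for feature in zones["features"]:
        if point_in_polygon(lon, lat, feature["geometry"]):
            return feature
    return None


def square(west, south, east, north):
    """Build a closed rectangular ring (sample data helper)."""
    return [[west, south], [east, south], [east, north], [west, north], [west, south]]


# Two made-up, side-by-side attendance zones.
ZONES = {
    "type": "FeatureCollection",
    "features": [
        {"type": "Feature", "properties": {"school": "Elementary A"},
         "geometry": {"type": "Polygon", "coordinates": [square(-97.6, 35.4, -97.5, 35.5)]}},
        {"type": "Feature", "properties": {"school": "Elementary B"},
         "geometry": {"type": "Polygon", "coordinates": [square(-97.5, 35.4, -97.4, 35.5)]}},
    ],
}

match = zone_for(-97.55, 35.45, ZONES)
print(match["properties"]["school"] if match else "outside all zones")
```

When `zone_for` returns `None`, the user is outside every known zone, which is the case where some other fallback would be needed.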

joekarl commented 9 years ago

Right, but that data isn't clean. When you're looking for the nearest school and you're just outside a district, you're screwed. Also, going forward the plan is to include fire/police/library, which is just point data (I believe).

jvrousseau commented 9 years ago

Yes data needs to be cleaned. In fact a new dataset needs to be created for inner district boundaries like http://www.mooreschools.com/Page/35446 so we can determine what school they go to...if you are just outside a district the closest school by distance would not do you any good.

I think other public service buildings are a different story. Maybe geojson files stored on the GitHub project?

joekarl commented 9 years ago

Well the other thing in all of this is trying to cut down any manual step. Data should be able to be grabbed from the city website and stored someplace the app can get to it automatically. That way it can be updated weekly/monthly without having to have a person be involved. I'm cool with storing static geojson up in s3 or something and if we can automate the mapbox process I'm cool with that as well. Just need to avoid any step that requires manual cleaning or manual data upload.

Also need to think about cost/maintenance. We're already looking at having a webserver to automate data ingest and potentially to serve the app. Perhaps a db isn't a terrible thing to add to the mix, and postgis would solve most of the things we're talking about here...

jvrousseau commented 9 years ago

So the github-pages/MapBox suggestion revolved around not having a monthly cost vs having an application/static files/ingest/database server along with that maintenance. Yes, a postgis database would do most of what we need, but at a cost, right?

With regards to the geojson, I'm pretty sure s3 is a bit overkill. You can store/serve those through a static github page.

Overall, I would think a bit of manual updating/cleaning up of some data would be much more manageable than having a cost for a webserver in perpetuity...

joekarl commented 9 years ago

Yeah there's definitely some management involved if we go the traditional route (server + db) as well as cost. But those things are going to exist anyways. There has to be some sense of infrastructure/monitoring for this stuff as well. Not to mention costs of hostname(s) and other ongoing costs. That's something we have to live with.

With heroku the webserver cost is 0 so not super concerned there.

The big question on running a db is how much will it cost (the maintenance should be easy on RDS). With the AWS free tier you can get a single RDS instance for free per month (750 hours) which is significantly cheaper than what heroku offers.

So still looking at 0 to very minimal cost to run this thing.

Again I'm not opposed to doing everything static, but if we can't update the data in an automated fashion, we'll be in trouble for maintenance long term. With these projects we have to assume high churn on people as all the time is voluntary. Anything that is a manual step is asking for things to be lost in translation.

+1 for serving static files through gh pages though.

//cc @jagthedrummer

joekarl commented 9 years ago

Also there's a good possibility we can get some free db love from heroku (may be mid January though).

jagthedrummer commented 9 years ago

Yeah, the ongoing cost vs ongoing effort tradeoff is definitely something we want to be aware of. I don't know where the right balance is, but I suspect it would be easier to get several people to donate a small amount of money each month to keep things running than it would be to find a few people who want to be on-call for supporting the infrastructure. (That may be me assuming that my own personal preference is what others would want. So....)

jvrousseau commented 9 years ago

@joekarl As I said before, I'm up for whatever and postgis would make things relatively easy when creating direct interfaces into the data, so this is definitely not a right/wrong answer. That being said, I think it's a mistake to just assume we have to have a heroku and/or aws presence.

@jagthedrummer what infrastructure would we need to support? I imagine the only maintenance (outside of framework/web app changes) anyone would have to make is updating the geo data (either geojson via github, uploading to mapbox, or updating OSM).

Not trying to be difficult here, just playing a little devil's advocate...

joekarl commented 9 years ago

@jvrousseau I don't think we have to have an actual server, but if things like data ingest are going to be automated, then it's gotta run somewhere. I just don't like the idea of having to kick off the data ingest manually.

But setting the actual data ingest aside (there's another issue for that), let's focus on the actual querying of the data. From our conversations I think we're looking at three options (and correct me if I'm wrong).

1) Query all data from the client, have it all stored in Mapbox

2) Query all data from the client, have it all stored in GH pages or S3 or somewhere

3) Query all data on the server, accessed via API, have it all stored in Postgis


To me the first option seems to have the most unknowns. The Surface API is in private beta. The Geocoding API is in public beta. Will we have to store the entire dataset on our end as geojson for Turfjs to be able to do the distance querying? (if Mapbox can do this for us great but I don't see it). Also how do we update the Mapbox map without having someone pull down Mapbox Studio and manually do it? That being said, it requires the least infrastructure. Mapbox will handle

The second option is really really good. It doesn't take any server infrastructure, and the data is static and can be updated programmatically (though updating a git repo programmatically isn't fun; I've done it before, and updating S3 is much easier).

Option three is the "traditional" approach with an API and database. It's kind of a fallback. This does incur cost with running infrastructure and we have to maintain said infrastructure.


So for me, options 2 and 3 are the viable ones. There's just too much unknown with trying to build this on top of option 1, because core features that we need are in beta and the Surface API isn't even public.

I would vote for shooting for option 2 with option 3 as a fallback.

jagthedrummer commented 9 years ago

@jvrousseau Well, in my experience just having a box online ends up involving a lot of "unexpected" maintenance here and there. Things like Heartbleed pop up and require immediate attention. Or if it's a postgres/postgis server maybe a security vuln is discovered there. Or the drive in the box fails. Or whatever. I just know that personally I've spent orders of magnitude less time dealing with these types of things since I've moved all of my personal projects to the heroku/hosted-service combo.

mkchandler commented 9 years ago

Thanks for putting together all those options @joekarl. I also like the second and third options the best.

Another thing to keep in mind is the extra data that we would like to store along with schools or other locations in the future. While I don't want to let in too much scope creep, we pretty much know we want to store some extra data.

Also, will it be easy to add additional points like city ward districts, fire/police stations and libraries in the future? Would they just get added to the same geojson file (or whatever format)? And would that start to be a penalty if we were querying that large a file from the client every time? I would expect most usage to be from mobile devices on 4G/LTE.

jagthedrummer commented 9 years ago

I'm also leaning towards 2 or 3. And for the sake of simplicity in getting started I'm leaning towards 2.

That would let us bypass any type of server side requirements for the moment and focus on getting a useable interface working. Just doing that will help a lot of us understand the problem space better, and be more equipped to make reasonable decisions as we move down the road. I think it's safe to assume that the school district data, in whatever format we need to transform it to, will be less than ~400k (hopefully that's way high). I can live with that to start. If we add two or three more data sets that are the same size we'll probably want to revisit this decision and consider having a hefty server component.

So, if we go with 2, then the "data ingest" piece becomes "download a file from data.okc.gov into the repo" or "download the file and transform it to something (geojson?)", yeah?
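The "transform it to geojson" step might look roughly like this. The format of the city's export isn't stated in this thread, so this sketch assumes a CSV with hypothetical `name`, `latitude` and `longitude` columns and leaves the download itself out; `rows_to_geojson` and `csv_text_to_geojson` are illustrative names.

```python
import csv
import io
import json


def rows_to_geojson(rows, lon_field="longitude", lat_field="latitude"):
    """Turn dict-like rows into a GeoJSON FeatureCollection of Points.

    Rows without usable coordinates are skipped rather than raising, so a
    single bad record does not break an unattended run.
    """
    features = []
    for row in rows:
        try:
            lon = float(row[lon_field])
            lat = float(row[lat_field])
        except (KeyError, TypeError, ValueError):
            continue
        # Everything that is not a coordinate column becomes a property.
        props = {k: v for k, v in row.items() if k not in (lon_field, lat_field)}
        features.append({
            "type": "Feature",
            "properties": props,
            "geometry": {"type": "Point", "coordinates": [lon, lat]},
        })
    return {"type": "FeatureCollection", "features": features}


def csv_text_to_geojson(text, **kwargs):
    """Parse CSV text (header row required) and convert it to GeoJSON."""
    return rows_to_geojson(csv.DictReader(io.StringIO(text)), **kwargs)


# Made-up sample input; the second row has no coordinates and is dropped.
SAMPLE_CSV = "name,latitude,longitude\nSchool A,35.47,-97.52\nSchool B,,\n"

collection = csv_text_to_geojson(SAMPLE_CSV)
print(json.dumps(collection, indent=2))
```

Writing the result into the repo (or S3) would then be a separate step from the transform, which keeps the transform easy to test on its own.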

jvrousseau commented 9 years ago

great input @joekarl. Thanks a lot. You're absolutely right about the unknowns for number 1.

We are probably pretty good for quite a while until we hit a fairly large geojson file. We could split up the files if we add other datasets so we don't initially load a monolithic geojson file.
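One way to split things up, sketched under the assumption that each feature carries a property naming its dataset. The `kind` property, the helper names, and the sample features are all hypothetical.

```python
import json
from collections import defaultdict


def split_by_property(collection, key, default="other"):
    """Group features by a property value into separate FeatureCollections."""
    groups = defaultdict(list)
    for feature in collection["features"]:
        value = (feature.get("properties") or {}).get(key, default)
        groups[str(value)].append(feature)
    return {
        name: {"type": "FeatureCollection", "features": feats}
        for name, feats in groups.items()
    }


def serialized_size_bytes(collection):
    """Size of the compact JSON encoding, as a rough download-cost estimate."""
    return len(json.dumps(collection, separators=(",", ":")).encode("utf-8"))


def point(kind, lon, lat):
    """Build a made-up Point feature for the sample data."""
    return {"type": "Feature", "properties": {"kind": kind},
            "geometry": {"type": "Point", "coordinates": [lon, lat]}}


COMBINED = {
    "type": "FeatureCollection",
    "features": [
        point("school", -97.52, 35.47),
        point("school", -97.40, 35.60),
        point("library", -97.51, 35.46),
    ],
}

parts = split_by_property(COMBINED, "kind")
for name, part in sorted(parts.items()):
    # Each part could be written out as its own file, e.g. f"{name}.geojson".
    print(name, len(part["features"]), serialized_size_bytes(part))
```

The client would then fetch only the file for the layer the user asks about, and the size helper gives a quick check against whatever size budget is agreed on.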

@jagthedrummer +1

mkchandler commented 9 years ago

I say we go with option 2 then! @joekarl has already started an issue to track the upload/transform process.

joekarl commented 9 years ago

Cool cool, I think we got it hashed out, closing.