UrbanCCD-UChicago / plenario

API for geospatial and time aggregation across multiple open datasets.
http://plenar.io
MIT License

Use cases for weather data #29

Closed jcgiuffrida closed 10 years ago

jcgiuffrida commented 10 years ago

Here is some detail about how @meemking and I envision using weather data in Plenario.

Filter by other datasets

Perhaps the most important use case is the ability to use weather data to filter observations in other datasets - e.g., if the user only cares about murders that occurred when the sun was shining and it was at least 95 degrees out. One way to implement this would be to designate which fields in a dataset can be filtered on, perhaps with a flag on the column when the dataset is first imported, and give the user the ability to add filters on the /explore page.

For instance, imagine that under the "Aggregate by:" select field there was a button to add a filter, which would display a select field for choosing a dataset, then another for choosing an attribute from that dataset (from a pre-selected list of attributes). It would then display buttons or fields to let the user say things like "=cloudy", ">=95", "!=0". The user could then add more filters in the same way, which would be joined together under a single WHERE ... AND ... clause. The final query would filter out points whose nearest weather observation (in space and time) does not meet those conditions.
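As a rough sketch of the query such filters would generate - with hypothetical table and column names (crimes, weather_observations, weather_stations, location, observed_at, sky_condition, drybulb_fahrenheit), not our actual schema - it could look something like this in PostGIS-flavored SQL:

```sql
-- Sketch only: filter crime points by the nearest weather observation in
-- space and time. All table/column names here are illustrative placeholders.
SELECT c.*
FROM crimes AS c
JOIN LATERAL (
    SELECT o.sky_condition, o.drybulb_fahrenheit
    FROM weather_observations AS o
    JOIN weather_stations AS s ON s.wban_code = o.wban_code
    WHERE o.observed_at BETWEEN c.occurred_at - INTERVAL '1 hour'
                            AND c.occurred_at + INTERVAL '1 hour'
    ORDER BY s.location <-> c.location        -- PostGIS nearest-neighbor operator
    LIMIT 1
) AS w ON TRUE
WHERE w.sky_condition = 'CLR'                 -- "the sun was shining"
  AND w.drybulb_fahrenheit >= 95;             -- "at least 95 degrees"
```

Each user-added filter would simply contribute another predicate to that final WHERE ... AND ... clause.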

Obviously, this can only be done with datasets that are guaranteed to have complete spatial and temporal coverage, which is where Matt's imputation script can be particularly helpful.

It would also be great to have a "day/night" filter which could even use dawn/dusk time information from weather.

Further functions and use cases are below; these may be things we can start implementing sooner without major overhauls.

Functionality

Feel free to use this as a starting point, let us know what will be easy/difficult to implement, and what questions you have. This is only meant to get us to think more broadly about how to incorporate sensor data and how to perform queries on multiple tables.

evz commented 10 years ago

OK, I'm going to work with the Quality Controlled Local Climatological Data from here: http://cdo.ncdc.noaa.gov/qclcd/QCLCD?prior=N which has a couple of limitations but I think will be useful for us to get started with. The limitations are that the data only goes back as far as July 1996 and is about 2-3 days behind the current date. The main reason to start with this is that I've already figured out how to parse and load it into our data structure.

One huge question that is outstanding is how we are actually going to decide which weather observation to use for a given time and place. I can use that voronoi thing that I came up with a while back, but in order for it to be historically accurate, we're going to need to recalculate it for every query, since not all weather stations will have observations for every query. I'm not sure yet how computationally expensive this is going to be, but it's something we'll need to consider since it sounds like we'll want to support historical data as far back as we can. However, I'm also not sure where we can get hourly weather observations going back farther than those we get from the QCLCD data I pointed out above, so maybe this will end up being a moot point.
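To make "historically accurate" concrete: the set of stations feeding the Voronoi calculation would itself depend on the query window, so each query would first need something like the availability check below (table and column names are placeholders, not our actual schema):

```sql
-- Sketch only: which stations actually reported anything during the query
-- window? This set is what a per-query Voronoi recalculation would be built
-- from. Table/column names are illustrative placeholders.
SELECT DISTINCT s.wban_code, s.location
FROM weather_stations AS s
JOIN weather_observations AS o ON o.wban_code = s.wban_code
WHERE o.observed_at >= '1997-01-01'
  AND o.observed_at <  '1997-02-01';
```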

fgregg commented 10 years ago

Why will every weather station not have an observation for every query?

Because weather stations are set up and closed? Or because weather stations have interruptions in their service? Or because weather stations have different reporting frequencies?

Or some other reason?


jcgiuffrida commented 10 years ago

Are there any coverage shapefiles available from the NOAA website or elsewhere? The spatial aggregation issue may be a moot point entirely if it's already been solved. Voronoi may be overkill - we just need to assign every point to a weather station, and there may be shapefiles that already do this. (Voronoi also ignores geographical features, like elevation and mountain ranges, that could drastically affect weather.)

Just a thought in case anyone has looked into this before.

jcgiuffrida commented 10 years ago

And to @fgregg's point, is the main issue that some weather stations go offline for days or weeks at a time, or that every so often an observation gets lost? Those two problems might require very different solutions.

evz commented 10 years ago

For most of the NCDC products, they give you an identifier in the form of either a USAF code or a WBAN code (a code adopted in the 1950s by the various agencies that collect weather info in the US). The "master list" is here: http://www.ncdc.noaa.gov/homr/reports/platforms but that only includes weather stations that are currently operational. There is a historical file that you can get via FTP here: http://www1.ncdc.noaa.gov/pub/data/noaa/ish-history.csv.

The only reason I brought this up is because, at least for the historical file, there are stations that were taken out of service at some point (and one would assume you would not see any more observations after that date). So, if you perform a query that could include data from a station that was in service during that time, it would make sense to be able to get that data along with the other data.

I understand that there are limitations with the voronoi calculation, but I haven't seen any shapefiles that give the "effective area" for a weather observation. I'm pretty sure they don't really care much, especially because weather predictions are made using radar of what's happening right now (or playbacks of what happened yesterday, etc.) and the weatherman on TV just puts the numbers on a map and it's up to the observer to interpret what that means. The problem only arises when you're attempting a spatial join with another dataset that has point-level data.

I'm hoping that we can get dense enough coverage, at least for the US, that topography won't really play too much of a role in skewing the observations. If you look at the map that I generated just for Chicago and vicinity, it's already pretty dense.

fgregg commented 10 years ago

I would recommend the following.

Later, more sophisticated interpolations can be used, if desired.


evz commented 10 years ago

OK, I'm working this out in a feature branch here: https://github.com/datamade/plenario/tree/weather_redux

Now that I have weather stations, I can start working out how to apply a grid, etc. as @fgregg suggested above. I'll see if I can tackle that this afternoon.
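A first cut at the grid step might just precompute, for each grid cell, the nearest weather station, so that lookup doesn't have to happen per point at query time. Sketch only, with placeholder table/column names (grid_cells, weather_stations, centroid, location):

```sql
-- Sketch only: assign each grid cell to its nearest weather station so the
-- mapping can be precomputed once instead of per point at query time.
-- Table/column names are illustrative placeholders.
SELECT g.cell_id,
       (SELECT s.wban_code
        FROM weather_stations AS s
        ORDER BY s.location <-> g.centroid     -- PostGIS nearest-neighbor operator
        LIMIT 1) AS nearest_wban
FROM grid_cells AS g;
```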

jcgiuffrida commented 10 years ago

Here's how Brett wants us to think about it, which is a little backwards from my post above. Apologies for the length, but thought it best to get the message out now rather than next Tuesday.

Say the user wants to look at a slice in space and time - like potholes on Lake Shore Drive this past weekend. Rather than thinking about this as adding one dataset to another, remember they're all already part of the same urban environment. So we ask Plenario what intersects this slice - what can help us color this picture of potholes on LSD during Lolla. (No pun intended.) There will be weather attributes that occurred over the same spatial and temporal indices, as well as things like crimes, census attributes, and - more relevantly, though far in the future - traffic, accident reports, and tweets. So in selecting potholes, we also pull in a ton of other attributes from other datasets that are related by those spatial and temporal indices.

It's sort of a mindset that I think we've been filling out for a while - the core of Plenario is not the PostgreSQL database that sucks in all this open data, but the fact that it's all united by a single spatial index and a single temporal index, so the user is encouraged to look at a slice in space and time holistically, rather than by constructing SQL filters and queries on the fly.

How this works in practice is something we need to figure out. What do we display to the user? How do they download the data?

For instance, maybe once they say they're interested in potholes on LSD during Lolla, we look at what datasets intersect (have data over) that polygon over that time period and present it all in zipped csvs. If a dataset like weather doesn't have full coverage there, so be it - we can offer to impute the missing values or just present them as-is. But we want the user to see potholes as one aspect of a rich urban environment, not as one side of an equation whose other side is just weather. That can imply misleading causality, and doesn't do justice to the hard work we've put into this platform.

The current functionality is great - it tells the story of an area in space and time. What we need to add to that, according to Brett, is the ability to start with one dataset, like potholes, make that spatial and temporal selection, and then see what else helps fill in that story. So it's both "Tell me the story of LSD during Lolla" and "Tell me the story of potholes on LSD during Lolla." Again, apologies for the pun.

@meemking, please fill in with your takeaways from this morning if I'm not making sense.

fgregg commented 10 years ago

Right now, the user can select an area, time resolution, and time window and get a variety of times series for that area. That is an intersection of data in time and space.

Are you talking about something different?


fgregg commented 10 years ago

@evz, if we are not filtering, then we probably don't need to create the grid. We can calculate time series on the fly for arbitrary polygons using that nearest neighbor query.
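For example, an on-the-fly daily temperature series for an arbitrary polygon might look roughly like this (placeholder names; :query_polygon, :start_date, and :end_date stand in for bound parameters):

```sql
-- Sketch only: daily temperature series for an arbitrary polygon using the
-- single nearest station, with no precomputed grid. Names are placeholders.
WITH nearest AS (
    SELECT s.wban_code
    FROM weather_stations AS s
    ORDER BY s.location <-> ST_Centroid(:query_polygon)
    LIMIT 1
)
SELECT date_trunc('day', o.observed_at) AS day,
       avg(o.drybulb_fahrenheit)        AS avg_temp_f
FROM weather_observations AS o
JOIN nearest USING (wban_code)
WHERE o.observed_at BETWEEN :start_date AND :end_date
GROUP BY 1
ORDER BY 1;
```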

jcgiuffrida commented 10 years ago

@fgregg, yes, and that functionality is great. What would also be good is a way to then zero in on a certain dataset, like the "View" buttons already allow us to do, see what datasets intersect those particular values in that dataset, and be able to download and/or display relevant attributes. So you make your initial spatio-temporal query, you click "View" on Potholes, you see the heatmap, and it gives you an option to download the Potholes point data with attributes appended from other datasets - attributes that perhaps could be left to the user's discretion, like summary statistics on other datasets that intersect that spatio-temporal query. For instance, we could append to each pothole row a value for the temperature at the time and place the pothole was reported, values for the mean temperature and total precipitation over the past week in that location, and a count of car crashes within a certain spatial and temporal radius of that pothole report.
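A rough sketch of what the weather part of that appended download could look like, with placeholder table/column names (potholes, weather_observations, weather_stations, reported_at, hourly_precip), is below; the car-crash count would be another lateral join in the same spirit:

```sql
-- Sketch only: append the nearest observation's temperature plus a trailing
-- week of summary stats to each pothole row for download.
-- Table/column names are illustrative placeholders.
SELECT p.*,
       w.drybulb_fahrenheit   AS temp_at_report,
       wk.mean_temp_past_week,
       wk.total_precip_past_week
FROM potholes AS p
JOIN LATERAL (
    -- nearest prior observation in space and time
    SELECT o.wban_code, o.drybulb_fahrenheit
    FROM weather_observations AS o
    JOIN weather_stations AS s ON s.wban_code = o.wban_code
    WHERE o.observed_at <= p.reported_at
    ORDER BY s.location <-> p.location, p.reported_at - o.observed_at
    LIMIT 1
) AS w ON TRUE
JOIN LATERAL (
    -- summary stats from that station over the preceding week
    SELECT avg(o.drybulb_fahrenheit) AS mean_temp_past_week,
           sum(o.hourly_precip)      AS total_precip_past_week
    FROM weather_observations AS o
    WHERE o.wban_code = w.wban_code
      AND o.observed_at BETWEEN p.reported_at - INTERVAL '7 days'
                            AND p.reported_at
) AS wk ON TRUE;
```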

This is one specific use case (which, btw, would be incredibly insightful in terms of helping us predict pothole formation) but the methods should be extensible to basically all the datasets we have.

I realize this could be a lot of work to set up, but we think the implications of this functionality are huge. It will only get more insightful when we add shapefile-linked data. What do you all think?

Furthermore, I've pinged the NCDC to see if any weather coverage shapefiles exist and will let you all know if they do. It doesn't seem likely.

fgregg commented 10 years ago

Sounds like issues #4 and #21.

meemking commented 10 years ago

Just want to add that with any method for weather interpolation, I think we also have to take elevation into account. This isn't really relevant in Chicago, but it is probably quite relevant across all weather variables once we leave the Midwest.


jcgiuffrida commented 10 years ago

Another use case: when we launch, we want to use the following as an example query:

Give me all the potholes and weather on Lake Shore Drive in April 2014. Query: Lake Shore Drive, April 2014, weather + 311 potholes

The approximate spatial/temporal parameters are those recorded in this query.

jcgiuffrida commented 10 years ago

The most important thing for Alpha is to be able to return weather attributes (even if just the core QCLCD ones) alongside other selected points in downloaded data. For instance, we could add an option to download potholes/crimes/other point-level data with the closest weather attributes in time and space in additional columns. Everything else listed above can come after Alpha unless it would be easy to implement.

derekeder commented 10 years ago

Weather data and the weather API have been implemented in #60.
Closing and moving this discussion over to #93.