Conte-Ecology / conteStreamTemperature_web

Description and scripts for running the temperature model through the SHEDS web application
MIT License

Climate and Covariate Input Data #9

Closed walkerjeffd closed 9 years ago

walkerjeffd commented 9 years ago

I've been able to run through all of the scripts using example input RData files that Dan put on Dropbox (temperatureData, masterData, climateData, covariateData). So I wanted to see if we can start putting these datasets in the database.

Are the daymet and covariate datasets all finalized, or are you still working on them?

bletcher commented 9 years ago

Kyle has finished with the region-wide delineation and thinks the datasets should be ready next week. Would it make sense to work on a subset?

We have a new hard-drive for the server. Chris should be able to install it early next week prior to uploading the big datasets.


walkerjeffd commented 9 years ago

Great. Yeah, starting with a subset would make things a bit easier.

Tom-Bombadil commented 9 years ago

Right now the climate data is up in the air, so to speak. The dataset that results from assigning a time series to each individual catchment quickly became very large, especially as we expand the range and the catchments get smaller. I am looking into producing a database of climate data averaged over each HUC12; that record would then be assigned to all of the catchments in that HUC. I am aiming for early next week so we can make a decision. Then I will need to run it for the whole range and create the database. I think the agreed format for this was a SQLite database with separate tables for each variable in long format, keyed by a unique catchment (or HUC) ID. I can change this format if you have a preference.
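To make sure we're talking about the same thing, here is roughly the layout I have in mind for that SQLite database; the table and column names below are placeholders, not a final schema:

```r
# Rough sketch of the long-format layout: one table per Daymet variable,
# one row per catchment (or HUC12) ID per day. Names are placeholders.
library(DBI)
library(RSQLite)

con <- dbConnect(RSQLite::SQLite(), "climateData.sqlite")

dbExecute(con, "
  CREATE TABLE IF NOT EXISTS prcp (
    featureid INTEGER NOT NULL,   -- catchment or HUC12 ID
    date      TEXT    NOT NULL,   -- 'YYYY-MM-DD'
    prcp      REAL,               -- daily precipitation (mm)
    PRIMARY KEY (featureid, date)
  )")

# each chunk of aggregated records gets appended in long format
chunk <- data.frame(featureid = 1L, date = "2010-06-01", prcp = 3.2)
dbWriteTable(con, "prcp", chunk, append = TRUE)

dbDisconnect(con)
```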

Assuming we go with the HUC-averaged climate database, will it be acceptable to assign a climate record to the observed sites from this database? This will also have implications for the upstream average we were previously doing for precipitation. I’m not sure how we will want to handle this, particularly for the prediction sites.

The covariate data is nearly complete for the variables Dan said went into the model. The impounded area was calculated differently so that the process could be scripted: it is in the form of a percentage rather than an area, but can easily be converted back using the drainage area. I will talk to Dan about how this goes into the model.
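For what it's worth, the conversion is just a one-liner; the column names here are made up for illustration:

```r
# Hypothetical column names: impoundedPct is the impounded percentage,
# AreaSqKM is the drainage area. Converting back to an absolute area:
covariates <- data.frame(featureid = 1:2,
                         AreaSqKM = c(12.5, 48.0),   # drainage area (km^2)
                         impoundedPct = c(3.0, 0.5)) # impounded % of drainage area
covariates$impoundedAreaSqKM <- covariates$impoundedPct / 100 * covariates$AreaSqKM
```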

Spot-checking of the observed temperature sites on the high-resolution network needs to be done before assigning climate and, more importantly, covariate data. Is Chris’s tool ready for use? Will it need to account for different coordinate systems of source data?

The way I understand it right now is that I provide the climate database (either by HUC or by catchment) and the master set of covariates mapped to all of the catchment IDs. The observed temperature site location check and the pairing of climate and covariate data to temperature sites then happen on the web system. Is this correct? It feels to me like there are still a few questions to settle before things get finalized. Would it be helpful to have a call about all of this?

The question of aggregating hourly temperature data to daily values also came up yesterday, with regard to the raw Westbrook data. It is probably unrelated to this thread, but has anything been done on the web system for this?

walkerjeffd commented 9 years ago

Thanks for the update. We may want to have a call about this where I can explain how this will work on the web system. But I think I can simplify a good chunk of this for you.

First, neither the climate nor the covariate data should be linked to any temperature locations on your end. That join will be done by the database. We need to do it this way because when someone adds a new location in a catchment that previously didn't have any locations, we don't want to have to re-run all of your scripts just to add one catchment/location. Similarly, it will make it easier to change any single location and its associated catchment.

The climate data can be spatially aggregated in the database directly. All that you really need to provide is the raw Daymet data with each set of daily values tied to a latitude/longitude. The easiest thing would be a table with columns [date, latitude, longitude, dayl, srad, swe, tmin, tmax, prcp, vp]. We can then use the database to do a spatial join between latitude/longitude and each catchment, and then compute the average of each variable for each catchment. The database can probably do this a lot more quickly than R or ArcPy. However, I'm not sure how easy it is to extract the latitude/longitude for each pixel of each tile. Did you do this already? Or do you use a GIS tool to do an intersection between a raster layer (the Daymet tile) and a polygon vector layer (the catchments)? Does the NetCDF have lat/lon for each point?
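To give a sense of what I mean, the aggregation could be a single query along these lines; the table and column names are hypothetical and this assumes PostGIS is enabled:

```r
# Sketch only: assumes a 'daymet' point table [date, latitude, longitude,
# tmax, prcp, ...] and a 'catchments' polygon table [featureid, geom],
# with geometries stored in lat/lon (EPSG:4326).
library(RPostgreSQL)

con <- dbConnect(PostgreSQL(), dbname = "sheds")

daily_means <- dbGetQuery(con, "
  SELECT c.featureid,
         d.date,
         AVG(d.tmax) AS tmax,
         AVG(d.prcp) AS prcp
  FROM daymet d
  JOIN catchments c
    ON ST_Contains(c.geom,
                   ST_SetSRID(ST_MakePoint(d.longitude, d.latitude), 4326))
  GROUP BY c.featureid, d.date
")

dbDisconnect(con)
```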

The covariate data will need to be pre-computed for all catchments and then we'll have a table in the database with columns [featureid, huc4, huc8, huc12, Forest, Herbacious, Agriculture, etc.]. Then the database will create a table for the model that links the locations to the catchments via featureid.
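The model input table would then just be a join on featureid, something like this (again, hypothetical table and column names):

```r
# Sketch: pair each temperature location with its catchment's covariates
# via featureid. 'locations' and 'covariates' are assumed table names.
library(RPostgreSQL)

con <- dbConnect(PostgreSQL(), dbname = "sheds")

model_covs <- dbGetQuery(con, "
  SELECT l.location_id, l.latitude, l.longitude,
         c.featureid, c.huc8, c.huc12, c.forest, c.agriculture
  FROM locations l
  JOIN covariates c USING (featureid)
")

dbDisconnect(con)
```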

So we can get this all set up before doing spot-checking of the locations. We do have a tool for confirming the locations in the web application now, but we can do that checking after we get the covariate and climate datasets set up. Does that make things clearer? I'll try to create a diagram that explains how this works on the database, which might help too.

Tom-Bombadil commented 9 years ago

Jeff, thanks for the explanation. Last time we talked about the climate data, I thought we had decided to put the aggregated version onto the web system, which is what I have been working on getting to Dan for his model. I don’t think I was part of the conversation that decided to put the raw climate data on the system, though it makes the most sense if that amount of data can be handled. I think the subtle differences between what is needed for the web system and what is needed for the temperature model papers are causing the confusion.

The climate data I am preparing for Dan is different from the climate data Jeff is talking about here. Maybe we should talk more regularly as a group to avoid situations where two different versions of a task get treated as one. This might also help avoid duplicated effort and unnecessary temporary products, because right now we potentially have two different, but very similar, workflows for the climate data.

I’ll try to clarify everything regarding my work with the Daymet data:

The climate data is downloaded as a mosaic of the entire country in NetCDF format. This was changed from the tiles because it’s simpler to deal with as we expand the area of our analysis and because the former tile system is supposed to be discontinued.

Each raw NetCDF file now covers one variable over one year. Raw file sizes range from 300 MB to 3 GB (280 GB total). Each point has a lat/lon associated with it, with ~25,000,000 points for the whole country (~1,200,000 in New England/New York) per day.

The current script reads subsets of the NetCDF mosaic into R based on the spatial extent of some polygon (e.g. a state or regional boundary). This is slow because the amount of data R can comfortably hold in memory is small compared to the size we’re dealing with. The values then get spatially averaged over some polygon (e.g. catchment or HUC) and saved into a database with a unique ID. The output from doing this by catchment was huge, which is why I’ve been looking into averaging by HUC. Ben or Dan, can you clarify whether the aggregation still needs to happen in addition to the raw data going onto the web system? Is looking at the HUC averages still relevant?
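For reference, the current workflow is roughly along these lines; the file name is hypothetical, and in practice everything has to be in Daymet's Lambert conformal conic projection:

```r
# Rough sketch of the existing approach: read one variable/year of the
# NetCDF mosaic, crop to the polygons' extent, then average per polygon.
library(raster)   # reads NetCDF via the ncdf4 package
library(rgdal)

catchments <- readOGR("gis", "catchments")              # polygon layer
prcp <- brick("daymet_prcp_2010.nc", varname = "prcp")  # hypothetical file name

# reproject polygons to the Daymet grid and crop before extracting
catch_lcc <- spTransform(catchments, crs(prcp))
prcp_sub  <- crop(prcp, extent(catch_lcc))

# matrix of daily means: one row per polygon, one column per day
prcp_mean <- extract(prcp_sub, catch_lcc, fun = mean, na.rm = TRUE)
```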

I can alter the script to read the NetCDF and write each point directly into a database, one time series for each lat/lon. R might not be the best tool for this, but I haven’t looked into learning anything else. If we’re storing it on the web system, we’ll need to update this each year, usually around April or May. Is it possible to read the NetCDF directly into the web system? What region do we want to do this over? For the current NENY region this will be about 15 billion records per variable (365 days x 34 years x 1,200,000 points).
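The 15 billion is just back-of-the-envelope:

```r
# Record count for one variable over the current NENY region
days   <- 365     # Daymet uses a fixed 365-day calendar
years  <- 34
points <- 1.2e6   # ~grid cells covering New England/New York
days * years * points   # ~15 billion rows per variable
#> [1] 1.4892e+10
```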

I am going to wait for clarification on what to do with the climate data before going any further. The covariate data is straightforward and should be ready for the new delineation over the entire northeast on Monday or Tuesday. What subset do we want to start with?

djhocking commented 9 years ago

I think we need to have a call early next week if possible. My schedule is pretty free M-W (or Friday).



walkerjeffd commented 9 years ago

Yeah I agree, time for a call to regroup. I'm free any day/time next week.

After thinking about it more, I'm going to reverse what I said before: it might actually be better to store the Daymet data spatially aggregated by catchment rather than all of the raw data for each lat/lon. 15 billion records may exceed the abilities of felek. Either way, we definitely need all of the data for all of the catchments, but I don't see why we would need every lat/lon point. Part of this confusion is because we changed from MongoDB to PostgreSQL, which has a lot of GIS functionality built in, so we can do a lot more spatial work directly in the database.

Tom-Bombadil commented 9 years ago

A call sounds good. Want to shoot for Monday afternoon?

That makes sense. I forgot about the switch so it will be good to talk about that.

djhocking commented 9 years ago

I think additional confusion results from the prediction and display sides of things. We talked before and “decided” that it might not be feasible to do daily predictions for every catchment in the northeast, and that it would also be difficult/impossible to display results for every catchment, so we decided to aggregate to the HUC12 level. However, I’m not clear where/when/if that should happen. Is that just for predictions (i.e. the model would be run for each catchment, then predictions would be done just for the downstream HUC12 catchments and all upstream catchments would get those values by default, or the predictions would be done using the covariates averaged over the HUC12)? Is that only for the web app and not for the papers? Is it really just for purposes of displaying the full northeastern map?

In reality, what will be shown on the whole map? Just the derived parameters, I assume. That seems like it could be done by catchment, but would have to be done in chunks (i.e. run the model at the catchment level for all sites with data, then do predictions for a subset of the catchments within the region over the entire Daymet record, calculate the derived metrics for those sites and output the table of metrics, then repeat the predictions for the next subset of sites and append them to the table/database, and repeat until the entire region is covered).
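Roughly what I’m picturing for that chunked loop; the prediction and metric functions here are just placeholders standing in for the actual model code:

```r
# Illustrative only: predict_daily() and derive_metrics() are hypothetical
# stand-ins for the model code, and the catchment list is dummy data.
library(DBI)
library(RSQLite)

featureids <- 1:20000                              # placeholder catchment IDs
predict_daily <- function(ids) {                   # stub: daily predictions per catchment
  data.frame(featureid = rep(ids, each = 2),
             date = rep(as.Date(c("2010-07-01", "2010-07-02")), times = length(ids)),
             temp = runif(2 * length(ids), 10, 25))
}
derive_metrics <- function(daily) {                # stub: one row of metrics per catchment
  aggregate(temp ~ featureid, data = daily, FUN = mean)
}

out_db <- dbConnect(RSQLite::SQLite(), "derived_metrics.sqlite")
chunks <- split(featureids, ceiling(seq_along(featureids) / 5000))  # ~5000 catchments per pass

for (ids in chunks) {
  metrics <- derive_metrics(predict_daily(ids))
  dbWriteTable(out_db, "derived_metrics", metrics,
               append = dbExistsTable(out_db, "derived_metrics"))   # create once, then append
}

dbDisconnect(out_db)
```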

Then to show time series, we would do it just when the user drills down to a specific catchment (or HUC12 ???) and that could be done on the client side so the daily predictions don’t have to be stored for every day for every site for the entire region. Does that make sense? Am I missing something?

These don’t have to be answered via GitHub right now but I wanted to have it down for next week’s discussion.



walkerjeffd commented 9 years ago

Monday afternoon works for me.

djhocking commented 9 years ago

Let’s plan a call at 3pm today.



walkerjeffd commented 9 years ago

Works for me. Hangout?

djhocking commented 9 years ago

We can try a Google Hangout and be prepared to switch to the phone if the Conte internet is too slow today.

