After bumping the VM memory to 8GB and re-running, both jobs completed successfully.
There was one warning in the output (converting a masked element to nan), but no errors.
Starting job processing...
Processing SQS message for model CCSM4 scenario RCP85 year 2050
Writing debug locations shapefile to path: /opt/django/climate_change_api/nex2db-locations-debug/bd4da6f0-d395-4cb9-8160-73b52d2231e0.shp
Features written to: /opt/django/climate_change_api/nex2db-locations-debug/bd4da6f0-d395-4cb9-8160-73b52d2231e0.shp
Downloading file: s3://nasanex/LOCA/CCSM4/16th/rcp85/r6i1p1/tasmax/tasmax_day_CCSM4_rcp85_r6i1p1_20500101-20501231.LOCA_2016-04-02.16th.nc
Downloading file: s3://nasanex/LOCA/CCSM4/16th/rcp85/r6i1p1/pr/pr_day_CCSM4_rcp85_r6i1p1_20500101-20501231.LOCA_2016-04-02.16th.nc
Downloading file: s3://nasanex/LOCA/CCSM4/16th/rcp85/r6i1p1/tasmin/tasmin_day_CCSM4_rcp85_r6i1p1_20500101-20501231.LOCA_2016-04-02.16th.nc
/usr/local/lib/python3.5/site-packages/django/db/models/fields/__init__.py:1760: UserWarning: Warning: converting a masked element to nan.
return float(value)
ClimateDataCityCell update SKIPPED for 6. City wasn't modified.
ClimateDataCityCell update SKIPPED for 7. City wasn't modified.
ClimateDataCityCell update SKIPPED for 9. City wasn't modified.
nex2db processing done
bumping the VM memory to 8GB and re-running, both jobs completed successfully
We had done some work a while back to bring the memory requirements for an ingest down. If this is running out of memory, it means we've broken that optimization somewhere. Not great. I'll open a new issue for that, which might be worth addressing if it's straightforward to track down.
Staging and production both have a custom management command task that allocates 8 GB of memory instead of the usual 1-2 GB, so we're fine for those environments.
The Amsterdam query succeeds, but the query for Steamboat Springs, CO, returns:
{"detail":"No NEX-GDDP data available for point (40.516280, -106.866007)"}
Looks like it might not be straightforward to track down the memory regression issue now since the original optimization remains in the code -- it would take a bit of debugging work to find the new problem and we already have the workaround ECS tasks. I added a comment linking this PR to the original memory issue.
The Amsterdam query succeeds, but the query for Steamboat Springs, CO, returns
That is correct. Assuming you only ran the commands I showed above, you only imported LOCA data for the US airport locations. If you query for LOCA data instead, you should get results.
There's no LOCA scenario in the default local setup, it seems.
Doesn't the US ingest command above include RCP85? That's also what is in the test query above.
There's no LOCA scenario in the default local setup, it seems.
That's correct, but there should be some once you run the create_jobs + run_jobs commands above. Both the create jobs command and the test HTTP calls reference RCP85. You can see if you have any ClimateDataSource objects loaded for RCP85 and LOCA by running the following in a Django shell_plus:
loca = ClimateDataset.objects.get(name='LOCA')
rcp85 = Scenario.objects.get(name='RCP85')
ClimateDataSource.objects.filter(dataset=loca, scenario=rcp85).count()
On my box, where I rebuilt the VM and then ran the create_jobs commands above, the result of the three Django shell lines above is 3040 rows.
If you're using something other than HTTPie to make your test HTTP queries, make sure you're including the dataset=LOCA HTTP GET param.
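For example, a quick check from Python (a sketch using the requests library and a placeholder token, neither of which is part of the project tooling) might look like:
# Sketch only: query the local API for LOCA data at the Steamboat Springs point.
import requests

resp = requests.get(
    "http://localhost:8080/api/climate-data/40.516280/-106.866007/RCP85/",
    params={"dataset": "LOCA"},  # the error above suggests NEX-GDDP is the default when this is omitted
    headers={"Authorization": "Token <TOKEN>"},
)
print(resp.status_code, resp.json())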
It's possible that the OOM-killed jobs messed something up in the database state. As a drastic measure, you could delete all ClimateDataSource objects from your local VM and re-run the create jobs + run jobs commands above.
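If you do go that route, something like the following in shell_plus should clear them (a sketch, assuming dependent climate data rows are removed by Django's cascading deletes):
# Drastic reset, sketch only: wipe every ClimateDataSource so the next ingest starts clean.
# Assumes dependent climate data rows are cleaned up via Django's cascading deletes.
ClimateDataSource.objects.all().delete()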
If the above doesn't help, I'd be interested in seeing the exact series of commands you ran and what the output was. I'm trying to get at whether there's some bug in the importer that could cause trouble on a larger production import.
I also have 3040 records matching the above shell filter.
I was able to get results by modifying the query to:
curl -i "http://localhost:8080/api/climate-data/40.516280/-106.866007/RCP85/?dataset=LOCA" -H "Authorization: Token <TOKEN>"
Overview
Based on prior calculations, we're limited to roughly 15k new cells between the two datasets if we assume database storage scales linearly. So the original plan of using the NLCD "developed land" category is out: that category includes roads, which pushes the number of cells required for ingest upwards of 100k for very little added value.
Instead we pivoted to two new data sources, the GeoNames place dataset and the OurAirports global airport list. For the US, we're choosing to ingest points for each of the small|medium|large airports -- about 13k cells. Airports are more uniformly distributed across the country while remaining close enough to population centers relative to the resolution of the source climate data, and people are generally already familiar with using weather data and forecasts from their local airport, so this shouldn't be a cognitive stretch either. For the EU, we chose to ingest all population centers with > 100k population as an initial proxy for semi-uniform coverage to demo EU availability of data. This filter provides us with about 500 locations, putting us close to the 15k cell limit with a bit of buffer.
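Roughly the kind of filtering involved, as a sketch only -- the real parsing lives in location_sources, and the column names/positions below are assumptions about the raw OurAirports CSV and GeoNames dump, with EU country filtering omitted:
# Sketch of the selection criteria described above, not the code in location_sources.
import csv

AIRPORT_TYPES = {"small_airport", "medium_airport", "large_airport"}

def us_airport_points(path):
    # Yield (name, lon, lat) for US small/medium/large airports from an OurAirports-style CSV.
    with open(path, newline="") as f:
        for row in csv.DictReader(f):
            if row["iso_country"] == "US" and row["type"] in AIRPORT_TYPES:
                yield row["name"], float(row["longitude_deg"]), float(row["latitude_deg"])

def big_city_points(path, min_population=100000):
    # Yield (name, lon, lat) for GeoNames places at or above the population cutoff
    # (EU country filtering omitted for brevity).
    with open(path, newline="", encoding="utf-8") as f:
        for fields in csv.reader(f, delimiter="\t"):
            name, lat, lon, population = fields[1], fields[4], fields[5], fields[14]
            if int(population or 0) >= min_population:
                yield name, float(lon), float(lat)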
This PR does a few things:
- Adds location_sources to parse each of these new datasets into ClimateLocation tuples
- Adds an --import-geojson-url option to the nex2db importer to pair with the shapefile importer. This new option adds support for ingesting data based on Point FeatureCollections (a minimal sketch follows below).
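For reference, a minimal sketch of the kind of Point FeatureCollection the new option consumes and how it reduces to simple location tuples -- the property names here are illustrative assumptions, not the importer's actual schema:
# Minimal Point FeatureCollection of the general shape --import-geojson-url consumes.
feature_collection = {
    "type": "FeatureCollection",
    "features": [
        {
            "type": "Feature",
            "geometry": {"type": "Point", "coordinates": [-106.866007, 40.516280]},
            "properties": {"name": "Steamboat Springs"},
        },
    ],
}

# Reduce each Point feature to a (name, lon, lat) tuple, roughly what a location source yields.
locations = [
    (feat["properties"].get("name"), *feat["geometry"]["coordinates"])
    for feat in feature_collection["features"]
    if feat["geometry"]["type"] == "Point"
]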
Demo
Points for ingest
US Airports
EU Cities
Notes
A nice side effect of the methods in climate_data.nex2db.location_sources is that they can be used both to generate input files for the nex2db importer and by the importer itself to parse sources. The new GeoNames and OurAirports sources were used to generate the GeoJSON parsed by the GeoJsonUrlLocationSource powering the --import-geojson-url nex2db option. I generated the US airports and EU cities GeoJSON files and they're in our climate sandbox bucket on S3:
Testing Instructions
The easiest way to test the new files is to run a single local ingest for each of them. Since you're only pulling in one year + var + model + scenario, the ingest remains approximately constant-time as the number of locations increases.
To do an ingest with each file:
Once the ingest is complete, you should be able to query the data you've ingested for any of the locations in either file.
Steamboat Springs CO:
Amsterdam, Netherlands:
Checklist
Has the API documentation been updated, or does this PR not require it?

Connects #840