After bumping the VM memory to 8GB and re-running, both jobs completed successfully.
There was one warning in the output (converting a masked element to nan), but no errors.
Starting job processing...
Processing SQS message for model CCSM4 scenario RCP85 year 2050
Writing debug locations shapefile to path: /opt/django/climate_change_api/nex2db-locations-debug/bd4da6f0-d395-4cb9-8160-73b52d2231e0.shp
Features written to: /opt/django/climate_change_api/nex2db-locations-debug/bd4da6f0-d395-4cb9-8160-73b52d2231e0.shp
Downloading file: s3://nasanex/LOCA/CCSM4/16th/rcp85/r6i1p1/tasmax/tasmax_day_CCSM4_rcp85_r6i1p1_20500101-20501231.LOCA_2016-04-02.16th.nc
Downloading file: s3://nasanex/LOCA/CCSM4/16th/rcp85/r6i1p1/pr/pr_day_CCSM4_rcp85_r6i1p1_20500101-20501231.LOCA_2016-04-02.16th.nc
Downloading file: s3://nasanex/LOCA/CCSM4/16th/rcp85/r6i1p1/tasmin/tasmin_day_CCSM4_rcp85_r6i1p1_20500101-20501231.LOCA_2016-04-02.16th.nc
/usr/local/lib/python3.5/site-packages/django/db/models/fields/__init__.py:1760: UserWarning: Warning: converting a masked element to nan.
return float(value)
ClimateDataCityCell update SKIPPED for 6. City wasn't modified.
ClimateDataCityCell update SKIPPED for 7. City wasn't modified.
ClimateDataCityCell update SKIPPED for 9. City wasn't modified.
nex2db processing done
bumping the VM memory to 8GB and re-running, both jobs completed successfully
We had done some work a while back to bring the memory requirements for an ingest down. If this is running out of memory, it means we've broken that optimization somewhere. Not great. I'll open a new issue for that, which might be worth addressing if it's straightforward to track down.
Staging and production both have a custom management command task that allocates 8 GB of memory instead of the usual 1-2 GB, so we're fine for those environments.
The Amsterdam query succeeds, but the query for Steamboat Springs, CO, returns:
{"detail":"No NEX-GDDP data available for point (40.516280, -106.866007)"}
Looks like it might not be straightforward to track down the memory regression issue now since the original optimization remains in the code -- it would take a bit of debugging work to find the new problem and we already have the workaround ECS tasks. I added a comment linking this PR to the original memory issue.
The Amsterdam query succeeds, but the query for Steamboat Springs, CO, returns
That is correct. Assuming you only ran the commands I showed above, you only imported LOCA data for the US airport locations. If you query for LOCA data instead, you should get results.
There's no LOCA scenario in the default local setup, it seems.
Doesn't the US ingest command above include RCP85? That's also what is in the test query above.
There's no LOCA scenario in the default local setup, it seems.
That's correct, but there should be some once you run the create_jobs + run_jobs commands above. Both the create jobs command and the test HTTP calls reference RCP85. You can see if you have any ClimateDataSource objects loaded for RCP85 and LOCA by running the following in a Django shell_plus:
loca = ClimateDataset.objects.get(name='LOCA')
rcp85 = Scenario.objects.get(name='RCP85')
ClimateDataSource.objects.filter(dataset=loca, scenario=rcp85).count()
On my box, where I rebuilt the VM and then ran the create_jobs commands above, the result of the three Django shell lines above is 3040 rows.
If you're using something other than HTTPie to make your test HTTP queries, make sure you're including the dataset=LOCA HTTP GET param.
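For example, a quick check from Python (a sketch using the requests library and a placeholder token, neither of which is part of the project tooling) might look like:
# Sketch only: query the local API for LOCA data at the Steamboat Springs point.
import requests

resp = requests.get(
    "http://localhost:8080/api/climate-data/40.516280/-106.866007/RCP85/",
    params={"dataset": "LOCA"},  # the error above suggests NEX-GDDP is the default when this is omitted
    headers={"Authorization": "Token <TOKEN>"},
)
print(resp.status_code, resp.json())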
It's possible that the OOM-killed jobs messed something up in the database state. As a drastic measure, you could delete all ClimateDataSource objects from your local VM and re-run the create jobs + run jobs commands above.
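If you do go that route, something like the following in shell_plus should clear them (a sketch, assuming dependent climate data rows are removed by Django's cascading deletes):
# Drastic reset, sketch only: wipe every ClimateDataSource so the next ingest starts clean.
# Assumes dependent climate data rows are cleaned up via Django's cascading deletes.
ClimateDataSource.objects.all().delete()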
If the above doesn't help, I'd be interested in seeing the exact series of commands you ran and what the output was. I'm trying to get at whether there's some bug in the importer that could cause trouble on a larger production import.
I also have 3040 records matching the above shell filter.
I was able to get results by modifying the query to:
curl -i "http://localhost:8080/api/climate-data/40.516280/-106.866007/RCP85/?dataset=LOCA" -H "Authorization: Token <TOKEN>"
Overview
Based on prior calculations, we're limited to roughly 15k new cells between the two datasets if we assume database storage scales linearly. So the original plan of using the NLCD "developed land" category is out: that category includes roads, which pushes the number of cells required for ingest upwards of 100k for very little added value.
Instead we pivoted to two new data sources, the GeoNames place dataset and the OurAirports global airport list. For the US, we're choosing to ingest points for each of the small|medium|large airports -- about 13k cells. Airports are more uniformly distributed across the country while remaining close enough to population centers relative to the resolution of the source climate data, and people are generally already familiar with using weather data and forecasts from their local airport, so this shouldn't be a cognitive stretch either. For the EU, we chose to ingest all population centers with > 100k population as an initial proxy for semi-uniform coverage to demo EU availability of data. This filter provides us with about 500 locations, putting us close to the 15k cell limit with a bit of buffer.
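Roughly the kind of filtering involved, as a sketch only -- the real parsing lives in location_sources, and the column names/positions below are assumptions about the raw OurAirports CSV and GeoNames dump, with EU country filtering omitted:
# Sketch of the selection criteria described above, not the code in location_sources.
import csv

AIRPORT_TYPES = {"small_airport", "medium_airport", "large_airport"}

def us_airport_points(path):
    # Yield (name, lon, lat) for US small/medium/large airports from an OurAirports-style CSV.
    with open(path, newline="") as f:
        for row in csv.DictReader(f):
            if row["iso_country"] == "US" and row["type"] in AIRPORT_TYPES:
                yield row["name"], float(row["longitude_deg"]), float(row["latitude_deg"])

def big_city_points(path, min_population=100000):
    # Yield (name, lon, lat) for GeoNames places at or above the population cutoff
    # (EU country filtering omitted for brevity).
    with open(path, newline="", encoding="utf-8") as f:
        for fields in csv.reader(f, delimiter="\t"):
            name, lat, lon, population = fields[1], fields[4], fields[5], fields[14]
            if int(population or 0) >= min_population:
                yield name, float(lon), float(lat)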
This PR does a few things:
- Adds location_sources to parse each of these new datasets into ClimateLocation tuples
- Adds an --import-geojson-url option to the nex2db importer to pair with the shapefile importer. This new option adds support for ingesting data based on Point FeatureCollections (a minimal sketch follows below).
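For reference, a minimal sketch of the kind of Point FeatureCollection the new option consumes and how it reduces to simple location tuples -- the property names here are illustrative assumptions, not the importer's actual schema:
# Minimal Point FeatureCollection of the general shape --import-geojson-url consumes.
feature_collection = {
    "type": "FeatureCollection",
    "features": [
        {
            "type": "Feature",
            "geometry": {"type": "Point", "coordinates": [-106.866007, 40.516280]},
            "properties": {"name": "Steamboat Springs"},
        },
    ],
}

# Reduce each Point feature to a (name, lon, lat) tuple, roughly what a location source yields.
locations = [
    (feat["properties"].get("name"), *feat["geometry"]["coordinates"])
    for feat in feature_collection["features"]
    if feat["geometry"]["type"] == "Point"
]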
Demo
Points for ingest
US Airports
EU Cities
Notes
A nice side effect of the methods in climate_data.nex2db.location_sources is that they can be used both to generate input files for the nex2db importer and by the importer itself to parse sources. The new GeoNames and OurAirports sources were used to generate the GeoJSON parsed by the GeoJsonUrlLocationSource powering the --import-geojson-url nex2db option. I generated the US airports and EU cities GeoJSON files and they're in our climate sandbox bucket on S3:
Testing Instructions
The easiest way to test the new files is to run a single local ingest for each of them. Since you're only pulling in one year + var + model + scenario, the ingest remains approximately constant-time as the number of locations increases.
To do an ingest with each file:
Once the ingest is complete, you should be able to query the data you've ingested for any of the locations in either file.
Steamboat Springs CO:
Amsterdam, Netherlands:
Checklist
Has the API documentation been updated, or does this PR not require it?

Connects #840