CarsonPruitt-NOAA opened this issue 1 year ago
Most of the retrieval functionality will be within retrieve_and_reproject_3dep_for_huc() of acquire_and_preprocess_inputs.py.
You'll also want to take a look at the $dem_source environment variable and apply that to the relevant functionality in fim_pipeline.sh, likely the run_by_*.sh scripts and *.env files.
Initial results using 5m 3DEP data downscaled to 10m look promising -- there's a slight (~0.5%) increase in CSI
huc       version   benchmark  csi
12040101  4.3.9.0   BLE_100yr  0.630
12040101  3dep_5m   BLE_100yr  0.633
12040101  4.3.9.0   BLE_500yr  0.645
12040101  3dep_5m   BLE_500yr  0.648
Adding more results from a second HUC and also comparing 3dep 10m. The second HUC (12090301) showed a bit of a regression with 3DEP.
4.3.9.0: USGS 10m downloaded from https://prd-tnm.s3.amazonaws.com/StagedProducts/Elevation/13/TIFF/USGS_Seamless_DEM_13.vrt with data/usgs/acquire_and_preprocess_3dep_dems.py (/data/inputs/usgs/3dep_dems/10m_5070/fim_seamless_3dep_dem_5070.vrt)
3dep_10m: 3DEP data downloaded at 10m
3dep_5m: 3DEP data downloaded at 5m, downscaled to 10m

huc       version   benchmark  csi
12040101  4.3.9.0   BLE_100yr  0.630
12040101  3dep_10m  BLE_100yr  0.631
12040101  3dep_5m   BLE_100yr  0.633
12040101  4.3.9.0   BLE_500yr  0.645
12040101  3dep_10m  BLE_500yr  0.647
12040101  3dep_5m   BLE_500yr  0.648
12090301  4.3.9.0   BLE_100yr  0.697
12090301  3dep_10m  BLE_100yr  0.690
12090301  3dep_5m   BLE_100yr  0.690
12090301  4.3.9.0   BLE_500yr  0.719
12090301  3dep_10m  BLE_500yr  0.715
12090301  3dep_5m   BLE_500yr  0.714
First download attempt was bumpy: TONS of HTTPSConnectionPool(host='elevation.nationalmap.gov', port=443): Read timed out errors, but we have a retry system in it for up to 5 attempts. We tried it with only one HUC (06010203) at -d 10 and -v 10 with 80 jobs. It was able to download 326 of 3,242 tiled DEMs, but the rest failed.
Tried it again using the -r (retry) flag to see how it goes.
We are also trying to figure out duration and space, so we are using one HUC. Each mini tile DEM is very small (< 100 KB). Each .tif is used to create a mini tiled .gpkg, and those are usually ~ 100 KB, so that is good, but sheer volume could be a challenge: that's 3,242 files x 2 (gpkg / tif) for just one HUC.
I used the WBD_National.gpkg to save off all HUC8 polys into their own .gpkg, which is a requirement for this to run. I did this via QGIS and did not create a script to create the HUC8 polys. However, we do not use all of the 2,300+ HUC8s, as many are in Alaska, Canada, and the Pacific. When we run this at scale, we will use the inputs/huc_lists/included_huc8.lst to manually clean up the folder with all of the HUC8 WBD polys, so we don't bring down the 200 (ish) HUCs that we don't use. This program uses ALL WBD HUC8 polys as domain cuts when calling USGS for downloads; a scripted version of that cleanup is sketched below.
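In case it helps, here is a rough sketch of how that QGIS step could be scripted so the folder only ever contains the HUC8s from included_huc8.lst. The layer name ("WBDHU8"), the "HUC8" field, and the output folder are assumptions, not the actual schema.

```python
import geopandas as gpd

# HUC8s we actually run (one HUC code per line in the list file)
included = {line.strip() for line in open("/data/inputs/huc_lists/included_huc8.lst") if line.strip()}

# Layer and field names here are assumptions about the WBD_National.gpkg schema
wbd = gpd.read_file("/data/inputs/wbd/WBD_National.gpkg", layer="WBDHU8")
wbd = wbd[wbd["HUC8"].isin(included)]

# Write one .gpkg per HUC8, mirroring the files created manually in QGIS (output path is hypothetical)
for huc in sorted(wbd["HUC8"].unique()):
    wbd[wbd["HUC8"] == huc].to_file(f"/data/inputs/wbd/huc8_polys/{huc}.gpkg", driver="GPKG")
```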
A fix is required for the NHD tiles.
There is a line of code saying nhd_dem_fp = os.path.join(f'/data/inputs/nhdplus_rasters/HRNHDPlusRasters{huc}/elev_m.tif').
Two problems with it: we don't use any nhdplus_rasters anymore, and they are no longer even loaded in the AWS environments. We have now moved over to NWM. I wonder if this needs to be run against NWM, not NHD, and the pathing changed as well.
I understand that this is how Fernando did it, but he was working against a very old FIM4 version before we finished removing all NHD rasters. We might need to rethink this one.
And we likely have to change the hardcoding (somehow).
One noted piece of functionality that was lost along the way was in the first version of acquire_and_preprocess_3dep_dems.py (it was lost prior to this build). It had a piece of odd logic in it: when the -r (retry) flag was included in the args, it would scan through the current outputs and look for all files that were less than 10 MB, as USGS has a bug in their system where a download can start and then abort after writing up to 10 MB. We could not tell whether the download was complete for any one original 3DEP file, so we opted (ugly, but it worked) to check whether the file was less than 10 MB and delete it at retry start so it could attempt a re-download. It normally took 4 - 6 attempts to get all HUCs down.
It is fine that this functionality was dropped, but we need to keep in mind that we still don't have a way to know if a download was completed. Granted, these tiles are so small it doesn't really matter much, but we do need to figure out a way to see if all of the expected tiles came down.
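For reference, the old heuristic could be re-added as something like the following. This is a minimal sketch, not the original code from the first version of acquire_and_preprocess_3dep_dems.py; the folder layout and .tif naming are assumptions.

```python
from pathlib import Path

# 10 MB threshold from the old heuristic: anything smaller is a suspected partial download
PARTIAL_THRESHOLD_BYTES = 10 * 1024 * 1024

def purge_suspected_partials(tile_dir: str) -> list[str]:
    """Delete tile DEMs under the size threshold so a -r (retry) pass re-downloads them."""
    purged = []
    for tif in Path(tile_dir).glob("*.tif"):
        if tif.stat().st_size < PARTIAL_THRESHOLD_BYTES:
            tif.unlink()
            purged.append(tif.name)
    return purged

# Example: run at the start of a retry pass on one HUC's working folder (path is hypothetical)
# purged = purge_suspected_partials("/data/outputs/3dep_tiles/06010203")
# print(f"Deleted {len(purged)} suspected partial downloads before retrying")
```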
Speaking of missing tiles, I recommend a new feature be built into it.
See if we can download the tile list for a given HUC, which we could use to programmatically validate whether any tiles failed to download. We could also load the tile list vector layer into QGIS to visually analyze it. (Not sure if this idea will work, but I am guessing it will.)
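Sketching that idea (the tile index layer, its tile_id field, and the assumption that tile_id matches the downloaded .tif file names are all hypothetical here):

```python
from pathlib import Path
import geopandas as gpd

def find_missing_tiles(tile_index_gpkg: str, tile_dir: str) -> gpd.GeoDataFrame:
    """Compare an expected tile footprint layer against the .tifs that actually downloaded."""
    tiles = gpd.read_file(tile_index_gpkg)                # expected tiles for the HUC
    downloaded = {p.stem for p in Path(tile_dir).glob("*.tif")}
    missing = tiles[~tiles["tile_id"].isin(downloaded)]   # assumes tile_id matches the .tif name

    # Save the missing footprints so they can be pulled into QGIS for a visual check
    if not missing.empty:
        missing.to_file(Path(tile_dir) / "missing_tiles.gpkg", driver="GPKG")
    return missing
```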
I also wonder if we should not log so many of the HTTPS connection pool errors to screen. There are so many that they are burying true errors. Many of our systems now have a feature that prints the full logs BUT also writes a separate file with just filtered logs (aka maybe don't error-log every timeout, but only write an error entry if a tile didn't make it after 5 attempts). This will help bubble up true errors and truly missing tiles.
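Something along these lines could do it. A sketch of the split-log idea only, not an existing feature of the script; the file names and message matching are assumptions.

```python
import logging

# Full, chatty log (every attempt, every timeout)
full_log = logging.FileHandler("download_full.log")
full_log.setLevel(logging.DEBUG)

# Filtered log: errors only, skipping the per-attempt timeout noise
filtered_log = logging.FileHandler("download_errors.log")
filtered_log.setLevel(logging.ERROR)
filtered_log.addFilter(lambda record: "Read timed out" not in record.getMessage())

logger = logging.getLogger("acquire_3dep")
logger.setLevel(logging.DEBUG)
logger.addHandler(full_log)
logger.addHandler(filtered_log)

# Per-attempt timeouts land in the full log only:
logger.warning("HTTPSConnectionPool(host='elevation.nationalmap.gov', port=443): Read timed out (attempt 3 of 5)")
# A tile that exhausted all 5 attempts shows up in both files (tile name is hypothetical):
logger.error("Tile n30w095_27 failed after 5 attempts")
```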
It is perfectly fine to fix / upgrade Fernando's logic. We have had to do it before, and the logic in his code is pretty simple, so I think we are fine to upgrade it to our needs and also make it more efficient considering volume loads. I believe Fernando did not attempt large volumes, so we often have to adjust for overall volume.
I also worry about the total number of files in the folder. Can it pull down all of the tiles for one HUC, merge them, then delete all of its working .tifs? (Just an idea to keep the total file count down, as we could easily have over 6,000,000 files in the folder (~3,200 tiles x ~2,100 HUCs).) If you accidentally click on that folder, it will have trouble. Maybe we need a subfolder for each HUC in the outputs for manageability.
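A rough sketch of that merge-then-clean step, assuming rasterio is available and each HUC gets its own working subfolder (both assumptions on my part):

```python
from pathlib import Path
import rasterio
from rasterio.merge import merge

def merge_huc_tiles(huc_tile_dir: str, merged_path: str) -> None:
    """Mosaic all tile .tifs for one HUC into a single DEM, then remove the working tiles."""
    tile_paths = sorted(Path(huc_tile_dir).glob("*.tif"))
    sources = [rasterio.open(p) for p in tile_paths]
    mosaic, transform = merge(sources)

    profile = sources[0].profile
    profile.update(
        height=mosaic.shape[1],
        width=mosaic.shape[2],
        count=mosaic.shape[0],
        transform=transform,
    )
    with rasterio.open(merged_path, "w", **profile) as dst:
        dst.write(mosaic)

    for src in sources:
        src.close()
    for p in tile_paths:  # drop the working tiles once the merged DEM exists
        p.unlink()

# e.g. merge_huc_tiles("/data/outputs/3dep_tiles/06010203", "/data/outputs/3dep_tiles/06010203_dem.tif")
```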
Also, do we know if a tile can cross a HUC boundary? We need to understand what happens on both sides of a HUC if it does cross the boundary (it likely will, especially with the 5,000 buffer). So we might have one tile in two HUCs. Either way, we need to understand the impacts.
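One quick way to check the boundary question, assuming we have (or can build) a tile footprint layer. The layer names, the HUC8 field, and applying the 5,000 buffer in the tiles' projected CRS are all assumptions here.

```python
import geopandas as gpd

# Tiles intersecting more than one (buffered) HUC8 are the ones that straddle a boundary
tiles = gpd.read_file("tile_footprints.gpkg")  # hypothetical tile index for the run
hucs = gpd.read_file("/data/inputs/wbd/WBD_National.gpkg", layer="WBDHU8").to_crs(tiles.crs)
hucs["geometry"] = hucs.buffer(5000)           # the 5,000 buffer mentioned above, in CRS units

joined = gpd.sjoin(tiles, hucs[["HUC8", "geometry"]], predicate="intersects")
hits_per_tile = joined.groupby(joined.index)["HUC8"].nunique()
straddlers = tiles.loc[hits_per_tile[hits_per_tile > 1].index]
print(f"{len(straddlers)} tiles intersect more than one buffered HUC8")
```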
Yes, in my experience it works much better with the retry (-r) flag. The coverage is about 50% without it (single pass).
Tile size is a parameter, so the tiles could be made bigger (and fewer). But that might also mean longer download times per tile and more chance of timing out?
I was testing on a HUC8 poly because I wanted to evaluate a single HUC8, but I was anticipating that the "full" run would use the same HUC6 polys that were previously used to download from the VRT (/data/inputs/wbd/HUC6_5070).
Yes, that should be updated from NHD to /data/inputs/3dep_dems/10m_5070/fim_seamless_3dep_dem_10m_5070.vrt (or $input_DEM so that it's not hardcoded).
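For the non-hardcoded option, a minimal sketch in Python (treating the VRT above as the fallback default is my assumption):

```python
import os

# Resolve the DEM source from $input_DEM, falling back to the seamless 3DEP VRT noted above
dem_fp = os.getenv(
    "input_DEM",
    "/data/inputs/3dep_dems/10m_5070/fim_seamless_3dep_dem_10m_5070.vrt",
)
```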
Related to an earlier discussion about the source and availability of higher resolution 3DEP DEM data, this function would be useful:
py3dep.query_3dep_sources(bbox, crs=4326, res=None)
Query 3DEP's data sources within a bounding box.
This function queries the availability of the underlying data that 3DEP uses
at the following resolutions:
1 m, 3 m, 5 m, 10 m, 30 m, 60 m, and topobathy (integrated topobathymetry).
Parameters
----------
bbox : tuple
Bounding box as tuple of ``(min_x, min_y, max_x, max_y)``.
crs : str, int, or pyproj.CRS, optional
Spatial reference (CRS) of bbox, defaults to ``EPSG:4326``.
res : str, optional
Resolution to query, defaults to ``None``, i.e., all resolutions.
Returns
-------
geopandas.GeoDataFrame
Polygon(s) representing the 3DEP data sources at each resolution.
Resolutions are given in the ``dem_res`` column.
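A minimal usage sketch based on the docstring above (the bounding box is a made-up example, not a real HUC extent):

```python
import py3dep

# Hypothetical bounding box (min_x, min_y, max_x, max_y) in EPSG:4326
bbox = (-95.6, 29.5, -94.6, 30.4)

# Returns a GeoDataFrame of source polygons with their resolution in the dem_res column
sources = py3dep.query_3dep_sources(bbox, crs=4326, res=None)
print(sources["dem_res"].value_counts())

# Could be saved for a quick look in QGIS
sources.to_file("3dep_sources.gpkg", driver="GPKG")
```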
Moving to backlog because this isn't being actively worked on this Sprint.
@fernando-aristizabal has developed a protocol to query the best-available resolution DEMs from the 3DEP service using the py3dep repo and downsample the data to the desired FIM resolution (10m). His code needs to be retrieved from the dev-lidar branch and integrated into our current workflow for downloading our base DEMs. Please also work with @RobHanna-NOAA if you have any questions about how our DEM download currently works.
The dev-lidar branch needs to be integrated into data/usgs/acquire_and_preprocess_3dep_dems.py.