WikiWatershed / mmw-geoprocessing

A Spark Job Server job for Model My Watershed geoprocessing.
Apache License 2.0
6 stars 6 forks source link

Investigate NLCD Alignment Issues #102

Closed rajadain closed 2 years ago

rajadain commented 2 years ago

Following up from https://github.com/WikiWatershed/model-my-watershed/issues/3527, investigate why querying for this shape against NLCD 11 returns legitimate soil values:

echo '{"input": {"polygon": ["{\\"type\\": \\"Polygon\\", \\"coordinates\\": [[[-76.21638738492558, 40.05343943791407], [-76.21494770050049, 40.052650503920056], [-76.21627750074026, 40.050741910042674], [-76.21651822725751, 40.050675368146564], [-76.21679959319628, 40.05334062629735], [-76.21638738492558, 40.05343943791407]]]}"], "polygonCRS": "LatLng", "rasters": ["nlcd-2011-30m-epsg5070-512-int8", "ssurgo-hydro-groups-30m-epsg5070-512-int8"], "rasterCRS": "ConusAlbers", "operationType": "RasterGroupedCount", "zoom": 0}}' | http --body :8090/run
{
    "result": {
        "List(22, 2)": 1,
        "List(23, 2)": 4,
        "List(24, 2)": 2,
        "List(81, 2)": 7,
        "List(82, 2)": 14,
        "List(82, 3)": 1
    }
}

but running it against NLCD 19 returns NODATA values:

echo '{"input": {"polygon": ["{\\"type\\": \\"Polygon\\", \\"coordinates\\": [[[-76.21638738492558, 40.05343943791407], [-76.21494770050049, 40.052650503920056], [-76.21627750074026, 40.050741910042674], [-76.21651822725751, 40.050675368146564], [-76.21679959319628, 40.05334062629735], [-76.21638738492558, 40.05343943791407]]]}"], "polygonCRS": "LatLng", "rasters": ["nlcd-2019-30m-epsg5070-512-byte", "ssurgo-hydro-groups-30m-epsg5070-512-int8"], "rasterCRS": "ConusAlbers", "operationType": "RasterGroupedCount", "zoom": 0}}' | http --body :8090/run
{
    "result": {
        "List(22, -2147483648)": 1,
        "List(23, -2147483648)": 2,
        "List(24, -2147483648)": 4,
        "List(81, -2147483648)": 22
    }
}
rajadain commented 2 years ago

Spent some time today debugging this with @pomadchin and @jpolchlo. We observed that for the erring shape, we get a different tile for the Soil (orange) and NLCD 2011 (green) layers than the NLCD 2019 (purple) layer, because of the way the layout is set in the ingest:

image

The code assumes that all the layers have the same layout, and makes that choice explicit here:

https://github.com/WikiWatershed/mmw-geoprocessing/blob/88f0500d62f501f325f68c7eedcb2d8146d6db65/api/src/main/scala/Utils.scala#L220

where it uses the layout of the first layer for all the rest. In this case, when the spatial keys are offset, we get NODATA values. This only happens to shapes small enough that "fall through the cracks" in these gaps.

Since the assumption that all layers have the same layout is strongly present in our geoprocessing, we should re-ingest the NLCD layers to ensure they line up correctly.

rajadain commented 2 years ago

I reingested the NLCD 2019 dataset with these changes: https://github.com/azavea/civic-apps-etl/commit/a5f939b9c796052afcf1d82a0640b875d7573632

And I'm still seeing the mismatch:

image

rajadain commented 2 years ago

One question that we need answered is: When the same tile is slightly offset (e.g. new NLCD 2019 with old Soil), does that affect every pixel used in the combination or only those on the edges?

jpolchlo commented 2 years ago

I traced this through, and I can confirm that the layout shift is a problem.

rajadain commented 2 years ago

Fantastic work, thank you!

Do you have any easy recommendations for how this new layer can be ingested in a way that matches the existing ones? I tried this: https://github.com/azavea/civic-apps-etl/commit/a5f939b9c796052afcf1d82a0640b875d7573632 but it did not work, and the resulting test layers have the same issues as the original NLCD 2019.

The next step is to request additional funding to fix this. Even a rough estimate of the above would help us come up with a more accurate ask.

rajadain commented 2 years ago

Is there a way to know how big the offset is? Is it larger than 1km?

jpolchlo commented 2 years ago

Reading off the tile extents, it appears to be about 0.75km (assuming that ConusAlbers units are about equivalent to a meter).

rajadain commented 2 years ago

Here's the gdalinfo for the original NLCD file:

Driver: HFA/Erdas Imagine Images (.img)
Files: nlcd_2019_land_cover_l48_20210604.img
       nlcd_2019_land_cover_l48_20210604.ige
       nlcd_2019_land_cover_l48_20210604.rrd
       nlcd_2019_land_cover_l48_20210604.rde
Size is 161190, 104424
Coordinate System is:
PROJCRS["Albers Conical Equal Area",
    BASEGEOGCRS["WGS 84",
        DATUM["World Geodetic System 1984",
            ELLIPSOID["WGS 84",6378137,298.257223563,
                LENGTHUNIT["metre",1]]],
        PRIMEM["Greenwich",0,
            ANGLEUNIT["degree",0.0174532925199433]],
        ID["EPSG",4326]],
    CONVERSION["Albers Equal Area",
        METHOD["Albers Equal Area",
            ID["EPSG",9822]],
        PARAMETER["Latitude of false origin",23,
            ANGLEUNIT["degree",0.0174532925199433],
            ID["EPSG",8821]],
        PARAMETER["Longitude of false origin",-96,
            ANGLEUNIT["degree",0.0174532925199433],
            ID["EPSG",8822]],
        PARAMETER["Latitude of 1st standard parallel",29.5,
            ANGLEUNIT["degree",0.0174532925199433],
            ID["EPSG",8823]],
        PARAMETER["Latitude of 2nd standard parallel",45.5,
            ANGLEUNIT["degree",0.0174532925199433],
            ID["EPSG",8824]],
        PARAMETER["Easting at false origin",0,
            LENGTHUNIT["metre",1],
            ID["EPSG",8826]],
        PARAMETER["Northing at false origin",0,
            LENGTHUNIT["metre",1],
            ID["EPSG",8827]]],
    CS[Cartesian,2],
        AXIS["(E)",east,
            ORDER[1],
            LENGTHUNIT["meters",1]],
        AXIS["(N)",north,
            ORDER[2],
            LENGTHUNIT["meters",1]]]
Data axis to CRS axis mapping: 1,2
Origin = (-2493045.000000000000000,3310005.000000000000000)
Pixel Size = (30.000000000000000,-30.000000000000000)
Corner Coordinates:
Upper Left  (-2493045.000, 3310005.000) (130d13'58.18"W, 48d42'26.63"N)
Lower Left  (-2493045.000,  177285.000) (119d47' 9.98"W, 21d44'32.31"N)
Upper Right ( 2342655.000, 3310005.000) ( 63d40'19.89"W, 49d10'37.43"N)
Lower Right ( 2342655.000,  177285.000) ( 73d35'40.55"W, 22d 4'36.23"N)
Center      (  -75195.000, 1743645.000) ( 96d52'22.83"W, 38d43' 4.71"N)
Band 1 Block=512x512 Type=Byte, ColorInterp=Palette
  Description = Layer_1
  Min=0.000 Max=95.000
  Minimum=0.000, Maximum=95.000, Mean=30.412, StdDev=32.690
  Overviews: 80595x52212, 40298x26106, 20149x13053, 10075x6527, 5038x3264, 2519x1632, 1260x816, 630x408, 315x204, 158x102, 79x51
  Metadata:
    LAYER_TYPE=thematic
    OVERVIEWS_ALGORITHM=IMAGINE Nearest Neighbor Resampling
    STATISTICS_EXCLUDEDVALUES=
    STATISTICS_HISTOBINVALUES=7853863229|0|0|0|0|0|0|0|0|0|0|472399232|962418|0|0|0|0|0|0|0|0|240566180|153288747|92578072|33121466|0|0|0|0|0|0|87406005|0|0|0|0|0|0|0|0|0|833976610|1033039764|305029988|0|0|0|0|0|0|0|0|1961779404|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|1198000354|0|0|0|0|0|0|0|0|0|560647664|1464715609|0|0|0|0|0|0|0|403631293|0|0|0|0|137098525|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|
    STATISTICS_HISTOMAX=255
    STATISTICS_HISTOMIN=0
    STATISTICS_HISTONUMBINS=256
    STATISTICS_MAXIMUM=95
    STATISTICS_MEAN=30.412349783312
    STATISTICS_MEDIAN=21
    STATISTICS_MINIMUM=0
    STATISTICS_MODE=0
    STATISTICS_SKIPFACTORX=1
    STATISTICS_SKIPFACTORY=1
    STATISTICS_STDDEV=32.689899151583

As can be seen, it's not in Conus Albers, but in an Albers Conical Equal Area projection. To get it into EPSG:5070 Conus Albers, I use:

gdalwarp -t_srs 'EPSG:5070' nlcd_2019_land_cover_l48_20210604.img nlcd-2019-test-30m-epsg5070.tif

Once that completes, the gdalinfo is as follows:

gdalinfo nlcd-2019-test-30m-epsg5070.tif
Driver: GTiff/GeoTIFF
Files: nlcd-2019-test-30m-epsg5070.tif
       nlcd-2019-test-30m-epsg5070.tif.aux.xml
Size is 161190, 104424
Coordinate System is:
PROJCRS["NAD83 / Conus Albers",
    BASEGEOGCRS["NAD83",
        DATUM["North American Datum 1983",
            ELLIPSOID["GRS 1980",6378137,298.257222101,
                LENGTHUNIT["metre",1]]],
        PRIMEM["Greenwich",0,
            ANGLEUNIT["degree",0.0174532925199433]],
        ID["EPSG",4269]],
    CONVERSION["Conus Albers",
        METHOD["Albers Equal Area",
            ID["EPSG",9822]],
        PARAMETER["Latitude of false origin",23,
            ANGLEUNIT["degree",0.0174532925199433],
            ID["EPSG",8821]],
        PARAMETER["Longitude of false origin",-96,
            ANGLEUNIT["degree",0.0174532925199433],
            ID["EPSG",8822]],
        PARAMETER["Latitude of 1st standard parallel",29.5,
            ANGLEUNIT["degree",0.0174532925199433],
            ID["EPSG",8823]],
        PARAMETER["Latitude of 2nd standard parallel",45.5,
            ANGLEUNIT["degree",0.0174532925199433],
            ID["EPSG",8824]],
        PARAMETER["Easting at false origin",0,
            LENGTHUNIT["metre",1],
            ID["EPSG",8826]],
        PARAMETER["Northing at false origin",0,
            LENGTHUNIT["metre",1],
            ID["EPSG",8827]]],
    CS[Cartesian,2],
        AXIS["easting (X)",east,
            ORDER[1],
            LENGTHUNIT["metre",1]],
        AXIS["northing (Y)",north,
            ORDER[2],
            LENGTHUNIT["metre",1]],
    USAGE[
        SCOPE["Data analysis and small scale data presentation for contiguous lower 48 states."],
        AREA["United States (USA) - CONUS onshore - Alabama; Arizona; Arkansas; California; Colorado; Connecticut; Delaware; Florida; Georgia; Idaho; Illinois; Indiana; Iowa; Kansas; Kentucky; Louisiana; Maine; Maryland; Massachusetts; Michigan; Minnesota; Mississippi; Missouri; Montana; Nebraska; Nevada; New Hampshire; New Jersey; New Mexico; New York; North Carolina; North Dakota; Ohio; Oklahoma; Oregon; Pennsylvania; Rhode Island; South Carolina; South Dakota; Tennessee; Texas; Utah; Vermont; Virginia; Washington; West Virginia; Wisconsin; Wyoming."],
        BBOX[24.41,-124.79,49.38,-66.91]],
    ID["EPSG",5070]]
Data axis to CRS axis mapping: 1,2
Origin = (-2493045.000022565945983,3310004.999959179200232)
Pixel Size = (29.999999999992369,-29.999999999992369)
Metadata:
  AREA_OR_POINT=Area
Image Structure Metadata:
  INTERLEAVE=BAND
Corner Coordinates:
Upper Left  (-2493045.000, 3310005.000) (130d13'58.18"W, 48d42'26.63"N)
Lower Left  (-2493045.000,  177285.000) (119d47' 9.98"W, 21d44'32.31"N)
Upper Right ( 2342655.000, 3310005.000) ( 63d40'19.89"W, 49d10'37.43"N)
Lower Right ( 2342655.000,  177285.000) ( 73d35'40.55"W, 22d 4'36.23"N)
Center      (  -75195.000, 1743645.000) ( 96d52'22.83"W, 38d43' 4.71"N)
Band 1 Block=161190x1 Type=Byte, ColorInterp=Palette
  Description = Layer_1
  Metadata:
    LAYER_TYPE=thematic
    OVERVIEWS_ALGORITHM=IMAGINE Nearest Neighbor Resampling

As can be seen, the numbers don't line up exactly the same:

diff --git a/a.txt b/b.txt
index 9eeed60..bec34b6 100644
--- a/a.txt
+++ b/b.txt
@@ -1,19 +1,18 @@
-Driver: HFA/Erdas Imagine Images (.img)
-Files: nlcd_2019_land_cover_l48_20210604.img
-       nlcd_2019_land_cover_l48_20210604.ige
-       nlcd_2019_land_cover_l48_20210604.rrd
-       nlcd_2019_land_cover_l48_20210604.rde
+gdalinfo nlcd-2019-test-30m-epsg5070.tif
+Driver: GTiff/GeoTIFF
+Files: nlcd-2019-test-30m-epsg5070.tif
+       nlcd-2019-test-30m-epsg5070.tif.aux.xml
 Size is 161190, 104424
 Coordinate System is:
-PROJCRS["Albers Conical Equal Area",
-    BASEGEOGCRS["WGS 84",
-        DATUM["World Geodetic System 1984",
-            ELLIPSOID["WGS 84",6378137,298.257223563,
+PROJCRS["NAD83 / Conus Albers",
+    BASEGEOGCRS["NAD83",
+        DATUM["North American Datum 1983",
+            ELLIPSOID["GRS 1980",6378137,298.257222101,
                 LENGTHUNIT["metre",1]]],
         PRIMEM["Greenwich",0,
             ANGLEUNIT["degree",0.0174532925199433]],
-        ID["EPSG",4326]],
-    CONVERSION["Albers Equal Area",
+        ID["EPSG",4269]],
+    CONVERSION["Conus Albers",
         METHOD["Albers Equal Area",
             ID["EPSG",9822]],
         PARAMETER["Latitude of false origin",23,
@@ -35,39 +34,32 @@ PROJCRS["Albers Conical Equal Area",
             LENGTHUNIT["metre",1],
             ID["EPSG",8827]]],
     CS[Cartesian,2],
-        AXIS["(E)",east,
+        AXIS["easting (X)",east,
             ORDER[1],
-            LENGTHUNIT["meters",1]],
-        AXIS["(N)",north,
+            LENGTHUNIT["metre",1]],
+        AXIS["northing (Y)",north,
             ORDER[2],
-            LENGTHUNIT["meters",1]]]
+            LENGTHUNIT["metre",1]],
+    USAGE[
+        SCOPE["Data analysis and small scale data presentation for contiguous lower 48 states."],
+        AREA["United States (USA) - CONUS onshore - Alabama; Arizona; Arkansas; California; Colorado; Connecticut; Delaware; Florida; Georgia; Idaho; Illinois; Indiana; Iowa; Kansas; Kentucky; Louisiana; Maine; Maryland; Massachusetts; Michigan; Minnesota; Mississippi; Missouri; Montana; Nebraska; Nevada; New Hampshire; New Jersey; New Mexico; New York; North Carolina; North Dakota; Ohio; Oklahoma; Oregon; Pennsylvania; Rhode Island; South Carolina; South Dakota; Tennessee; Texas; Utah; Vermont; Virginia; Washington; West Virginia; Wisconsin; Wyoming."],
+        BBOX[24.41,-124.79,49.38,-66.91]],
+    ID["EPSG",5070]]
 Data axis to CRS axis mapping: 1,2
-Origin = (-2493045.000000000000000,3310005.000000000000000)
-Pixel Size = (30.000000000000000,-30.000000000000000)
+Origin = (-2493045.000022565945983,3310004.999959179200232)
+Pixel Size = (29.999999999992369,-29.999999999992369)
+Metadata:
+  AREA_OR_POINT=Area
+Image Structure Metadata:
+  INTERLEAVE=BAND
 Corner Coordinates:
 Upper Left  (-2493045.000, 3310005.000) (130d13'58.18"W, 48d42'26.63"N)
 Lower Left  (-2493045.000,  177285.000) (119d47' 9.98"W, 21d44'32.31"N)
 Upper Right ( 2342655.000, 3310005.000) ( 63d40'19.89"W, 49d10'37.43"N)
 Lower Right ( 2342655.000,  177285.000) ( 73d35'40.55"W, 22d 4'36.23"N)
 Center      (  -75195.000, 1743645.000) ( 96d52'22.83"W, 38d43' 4.71"N)
-Band 1 Block=512x512 Type=Byte, ColorInterp=Palette
+Band 1 Block=161190x1 Type=Byte, ColorInterp=Palette
   Description = Layer_1
-  Min=0.000 Max=95.000
-  Minimum=0.000, Maximum=95.000, Mean=30.412, StdDev=32.690
-  Overviews: 80595x52212, 40298x26106, 20149x13053, 10075x6527, 5038x3264, 2519x1632, 1260x816, 630x408, 315x204, 158x102, 79x51
   Metadata:
     LAYER_TYPE=thematic
     OVERVIEWS_ALGORITHM=IMAGINE Nearest Neighbor Resampling
-    STATISTICS_EXCLUDEDVALUES=
-    STATISTICS_HISTOBINVALUES=7853863229|0|0|0|0|0|0|0|0|0|0|472399232|962418|0|0|0|0|0|0|0|0|240566180|153288747|92578072|33121466|0|0|0|0|0|0|87406005|0|0|0|0|0|0|0|0|0|833976610|1033039764|305029988|0|0|0|0|0|0|0|0|1961779404|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|1198000354|0|0|0|0|0|0|0|0|0|560647664|1464715609|0|0|0|0|0|0|0|403631293|0|0|0|0|137098525|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|
-    STATISTICS_HISTOMAX=255
-    STATISTICS_HISTOMIN=0
-    STATISTICS_HISTONUMBINS=256
-    STATISTICS_MAXIMUM=95
-    STATISTICS_MEAN=30.412349783312
-    STATISTICS_MEDIAN=21
-    STATISTICS_MINIMUM=0
-    STATISTICS_MODE=0
-    STATISTICS_SKIPFACTORX=1
-    STATISTICS_SKIPFACTORY=1
-    STATISTICS_STDDEV=32.689899151583
rajadain commented 2 years ago

The coordinates are the same, but the Origin and Pixel Size are subtly different, which may be enough to cause the drift we observe.

rajadain commented 2 years ago

If I open up both layers in QGIS, and then set them to 50% visibility so I can see them overlaid, I see some unalignment at lower zoom levels:

image

You can see the not-perfectly-aligned, off-by-some-pixels effect when the screenshot above is enlarged:

image

However, when zooming in to the same spot in QGIS, we lose the lack of alignment, and it is perfectly overlaid:

image

I don't know what to make of this.

rajadain commented 2 years ago

The gdalinfo for the old 2011 file (as found in the fileshare):

gdalinfo nlcd_2011.arg
Driver: ARG/Azavea Raster Grid format
Files: nlcd_2011.arg
       ./nlcd_2011.json
Size is 161190, 104424
Coordinate System is:
GEOGCRS["NAD83",
    DATUM["North American Datum 1983",
        ELLIPSOID["GRS 1980",6378137,298.257222101,
            LENGTHUNIT["metre",1]]],
    PRIMEM["Greenwich",0,
        ANGLEUNIT["degree",0.0174532925199433]],
    CS[ellipsoidal,2],
        AXIS["geodetic latitude (Lat)",north,
            ORDER[1],
            ANGLEUNIT["degree",0.0174532925199433]],
        AXIS["geodetic longitude (Lon)",east,
            ORDER[2],
            ANGLEUNIT["degree",0.0174532925199433]],
    ID["EPSG",4269]]
Data axis to CRS axis mapping: 2,1
Origin = (-2493045.000000000000000,3310005.000000000000000)
Pixel Size = (30.000000000000000,-30.000000000000000)
Metadata:
  LAYER=nlcd_2011
Corner Coordinates:
Upper Left  (-2493045.000, 3310005.000) (Invalid angle,Invalid angle)
Lower Left  (-2493045.000,  177285.000) (Invalid angle,Invalid angle)
Upper Right ( 2342655.000, 3310005.000) (Invalid angle,Invalid angle)
Lower Right ( 2342655.000,  177285.000) (Invalid angle,Invalid angle)
Center      (  -75195.000, 1743645.000) (Invalid angle,Invalid angle)
Band 1 Block=161190x1 Type=Byte, ColorInterp=Undefined
  NoData Value=25
rajadain commented 2 years ago

I'm going to attempt to ingest the 2019 raster into GeoTrellis without reprojecting it with GDAL and will see how that goes. Just to test things out.

rajadain commented 2 years ago

I ingested the NLCD 2019 raster as is into an RDD, without reprojecting it into EPSG:5070, and found that it matched the NLCD 2019 EPSG:5070 RDD, and continues to be offset from the 2011 NLCD:

image

Here, the green nlcd-2019-30m-epsg5070-512-byte is the EPSG:5070 NLCD 2019, and the blue nlcd-2019-test-30m-512-byte is the un-translated NLCD 2019. The red nlcd-2011-30m-epsg5070-512-int8 is the original NLCD 2011.

This may indicate that the issue is in our ETL script, and can be fixed therein.

rajadain commented 2 years ago

I downloaded a tile of the nlcd-2019-30m-epsg5070-byte raster, and the un-reprojected nlcd-test-30m-byte raster, and diffed them using gdal_calc.py:

gdal_calc.py -A nlcd-2019-30m-epsg5070-512-byte.tiff -B nlcd-2019-test-30m-512-byte.tiff --outfile=diff.tif --calc="A-B"
0.. 2.. 5.. 8.. 11.. 14.. 17.. 20.. 22.. 25.. 28.. 31.. 34.. 37.. 40.. 42.. 45.. 48.. 51.. 54.. 57.. 60.. 62.. 65.. 68.. 71.. 74.. 77.. 80.. 82.. 85.. 88.. 91.. 94.. 97.. 100 - Done

Then I inspected the diff and found it had all 0 values:

gdalinfo -stats diff.tif | grep STATISTICS
    STATISTICS_MAXIMUM=0
    STATISTICS_MEAN=0
    STATISTICS_MINIMUM=0
    STATISTICS_STDDEV=0
    STATISTICS_VALID_PERCENT=100

I also diffed the other way B-A for the same result. This indicates that our reprojection step actually doesn't change the ingested output at all.

rajadain commented 2 years ago

The original 2011 data is available here: https://www.sciencebase.gov/catalog/item/5dfc2280e4b0b207a9fe8235

I'm now ingesting that via our ETL pipeline to see if we have the same tiling issue, or if it matches the old tiles.

rajadain commented 2 years ago

I ingested the above into RDD and it has the same tile misalignment as the NLCD 2019 tiles:

image

So the issue is quite possibly the code in https://github.com/azavea/civic-apps-etl/tree/tt/nlcd-2016-ingest

jpolchlo commented 2 years ago

I have written an ingest tool and posted it at https://github.com/jpolchlo/mmw-etl. This uses GT 3.6.2 and Spark 3.1.2. We have moved on from the ETL package, so this tool rolls a direct solution using vanilla GT and RasterSources. The documentation describes how to use it from SBT with the SBT Lighter plugin, which will create an EMR cluster and issue jobs as steps to that cluster. (Users will likely have to adjust the setup in the build.sbt file to use different PEM files.)

I've ingested the NLCD 2019 layer, and the tile boundaries now seem to match in my testing. Take a look at the nlcd-2019-test4-30m-epsg5070-512-byte layer in the datahub catalog.

Let me know if that test NLCD layer works as expected.

rajadain commented 2 years ago

Fantastic! Unfortunately, the layer was ingested using the systems+gt account instead of systems+datahub and I can't access it using Trelliscope or the MMW Geoprocessing service, both of which use the datahub account:

image
aws --profile=azavea-datahub s3api get-object-acl --bucket datahub-catalogs-us-east-1 --key _attributes/metadata__nlcd-2019-test4-30m-epsg5070-512-byte__0.json | jq .Owner.DisplayName

An error occurred (AccessDenied) when calling the GetObjectAcl operation: Access Denied

The datahub credentials are available in LastPass under MMW Azavea DataHub AWS. Could you please redo the process using that?

rajadain commented 2 years ago

Just reran the ingest with the correct account:

sbt:mmw-etl> run --source-image /home/rajadain/data/mmw/nlcd_2019_land_cover_l48_20210604.tif --reference-catalog s3://datahub-catalogs-us-east-1 --reference-layer ssurgo-hydro-groups-30m-epsg5070-512-int8 --output-catalog s3://datahub-catalogs-us-east-1 --output-layer nlcd-2019-test5-30m-epsg5070-512-byte

Took a little more than 30 minutes running it locally:

Total time: 2010 s (33:30), completed Jun 7, 2022 4:00:51 PM

Seeing promising results where the layers are tiled correctly:

image

Next going to check the GeoTiffs and ensure they line up

rajadain commented 2 years ago

The tiles line up!

image

This is great work.

This ingest was done using the sbt shell. Next I'm going to make an assembly and then spark-submit it, and see if that works. Once that is working, then I'll write a script to ingest all the NLCD 2019 product with this new tool. Thanks so much for all your work on this!

jpolchlo commented 2 years ago

For the record, I don't think you'll get any additional mileage from spark-submit over the sbt-mediated run. The improvement that is going to give the best return is using EMR, but it seems like even that is only a marginal improvement in terms of runtime, since the time to start and bootstrap a cluster is going to eat into performance gains.

It's worth having a conversation about whether this updated ingest tool should migrate out of my personal github and get incorporated into an Azavea-owned repo.