Process & display the Circum-Arctic permafrost and ground ice map

robyngit commented 1 year ago

It was requested that we display a permafrost layer as the base layer when someone visits the PDG for the first time. The permafrost layer could be combined with the ice-wedge polygon map, so two layers presented as base layers.

Feedback collected by Moss & Andrew (the K12 teacher in Bethel) indicated that anyone coming to the PDG does not see/understand (right away) that the PDG focus is on permafrost.

The ideal permafrost layer would be the Circum-Arctic Map of Permafrost and Ground-Ice Conditions, Version 2.

According to the metadata, it looks to be a 24 MB shapefile.

mbjones commented 1 year ago

This is a great idea. I'm curious how that relates to the two other general permafrost layers we have --

Y Ran 2021 Biophysical Permafrost Zones that is on the PDG already
Obo et al 2018 Permafrost probablility layers on Fluid Earth Viewer (see https://doi.org/10.1016/j.earscirev.2019.04.023)

It seems like we could also show the Obo layers (soil temp and permafrost probability) on our imagery viewer.

julietcohen commented 1 year ago

Initial Data Exploration

I agree that this layer looks like it would be very helpful for users to understand the IWP layer! The overall data directory consists of 3 shapefiles:

permaice.shp - 21 MB, 13,671 rows
- contains attributes:
- NUM_CODE - 0 NA values
- COMBO- 0 NA values
- RELICT - 13658 NA values
- EXTENT - 6480 NA values
- CONTENT - 6480 NA values
- LANDFORM - 6639 NA values
- I have yet to find definitions for each.
subsea.shp - 18 KB
treeline.shp - 16 KB Plus a few other small .byte and .hdr files.

Notes:

The README in the data dir points to a documentation page that no longer exists.
The abstract specifies: "The data set also contains the location of subsea and relict permafrost."
The temporal coverage is "Not Specified" in the surface level metadata but the XML specifies 2002-02-01 as the dataset release date

`permaice.shp`

`subsea.shp`:

`treeline.shp`:

julietcohen commented 1 year ago

Update:

Started staging the file permaice.shp yesterday with:

no parallelization (small input data size)
no deduplication
max z-level 11 (for the first run, to see how it looks)
palette of purple with transparency, similar to the way we use yellow transparency for the IWP dataset

Staging has been ongoing on Datateam for almost a full day and is at >604,000 staged files. I will continue to let it run while I take care of other higher priority tasks.

julietcohen commented 1 year ago

Update:

Within the same job mentioned in the previous comment, staging completed for permaice.shp with 1,653,964 gpkg files and the rasterization step has started.
plotted a few of these GeoPackages, looks like each has 1 to a few large polygons:

julietcohen commented 1 year ago

Workflow update

I set up a parsl job on Datateam to execute rasterization and web-tiling in parallel, since rasterization was going very slowly without parallelization. After running over the weekend, rasters were produced for the staged tiles for z-11 (the highest z-level I set in the config), but a parsl error occurred during rasterization for z-level 10: parsl.executors.high_throughput.errors.WorkerLost: Task failure due to loss of worker 5 on host datateam

julietcohen commented 1 year ago

I started a new parsl job to pick up where the workflow left off: restarting z-10, then the lower z-levels, then web-tiling.

julietcohen commented 1 year ago

Color palette based on an attribute

Visualized the web-tiles for just z-11 and z-10 (since those are the only z-levels that I've been able to produce on Datateam so far without running into an OOM error, even with parallelization) using "coverage" for the statistic since this is just a first pass. I think it makes sense to instead use an attribute of the data, such as "EXTENT". The documentation explains what the codes for this attribute mean:

c = continuous
d = discontinuous
s = sporadic
i = isolated patches

47% of the observations have nan for this attribute. These would have to be removed before staging for the rasterization step, which cannot gracefully handle nan values yet.

Could also use the attribute "RELICT" which only contains "yes" or nan. 99% of those observations are nan so probably not a good path forward. If we assume (or find in the documentation) that nan means "no" for this attribute, we could use it instead of removing all those observations, but it seems like it would result in a very uniform and uninteresting palette.

julietcohen commented 1 year ago

Update on dataset processing and assigning color palette to attribute

cleaned data to remove rows with nan for the "extent" attribute
- this is necesary for rasterization (cannot resample rasters with NA cell/pixel values)
- also helpfully reduces the amount of input data so the dataset can be processed on Datateam with parsl without running out of memory
assigned an integer code ranging from 1-4 to the categorical extent attribute with 4 categories (created a new integer attribute)
in config, made a statistic coverage_category with a pre-set range of values for that attribute: [1,4]

Over the weekend, the script staged and rasterized all tiles for all z-levels with no OOM errors. However, the web-tiling failed with: cannot convert float NaN to integer. for every raster.

Debugging revealed that the issue was within viz-raster/pdgraster/WebImage.py, when we convert a raster to a Python Imaging Library image. Within to_image(), we create a no_data_mask that represents the values within the image data that have no data (0 in this case) and we replace all those values with np.nan. See here. I was able to resolve the error by converting the image_data array to float right before we create the no_data_mask.

to_image()

```python def to_image(self, image_data): """ Create a PIL image from the pixel values. Parameters ---------- pixel_values : pandas.DataFrame A dataframe with a pixel_row, pixel_col, and values column. Returns ------- PIL.Image A PIL image. """ min_val = self.min_val max_val = self.max_val nodata_val = self.nodata_val image_data = image_data.copy() rgba_list = self.rgba_list height = self.height width = self.width image_data = image_data.astype(float) no_data_mask = image_data == nodata_val # set the nodata value to np.nan if(len(no_data_mask)): image_data[no_data_mask] = np.nan ... ```

There will be more aspects of the viz-workflow to tweak before this dataset is complete, such as the palette. Hopefully, this is one step closer to adjusting the workflow to enable styling the 3D tiles and web tiles based on an attribute of interest. See viz-workflow issue#9).

julietcohen commented 1 year ago

Using cesium to preview the web tiles, we see that the polygons that span certain Arctic latitudes are distorted to create rings around the North pole:

Perhaps the projection of the default TMS for the viz workflow distorts the geometries that cross the Arctic circle (66 degrees) as well as the geometries that cross the latitude of the thinner, more northern ring.

These rings are not present in the input data to the visualization workflow. When we plot both the raw data and the cleaned input data, with the palette applied to the "EXTENT" attribute, we see a map without distortion. Here's the cleaned data with a viridis palette:

These rings would certainly make the output data confusing for users on the PDG portal.

julietcohen commented 1 year ago

Rings in tileset result from reprojecting the geometries to the CRS of the TMS

After re-projecting the CRS of the cleaned geodataframe to the CRS of the TMS, the data looks like:

Because we need to use this TMS, a possible solution is to edit the input data of the viz workflow by converting the CRS of the geopackage beforehand, then masking it to only retain polygons that fall over land, then feeding that geodataframe into the viz workflow.

julietcohen commented 1 year ago

Removed Invalid Geometries

The invalid geometries that were displayed like rings around the Arctic Circle as shown above were resulting from the conversion to the CRS of the TMS during staging only from the geometries that intersected the antimeridian. There were 6 geometries that intersected, and removing them before starting the workflow resulted in the following tiles:

The following script was used to process the data before executing the workflow.

clean_data.py

``` # Clean permaice data from Brown et al. 2002 # Data source: https://nsidc.org/data/ggd318/versions/2 # Author: Juliet Cohen # Date: 2023-08-09 # Cleaning steps include: # 1. remove NA values from extent attribute # 2. add a dummy coding for the extent attribute # 3. remove polygons that intersect the antimeridian # conda env: viz_3-10_local import geopandas as gpd import numpy as np import morecantile input = "/home/jcohen/permafrost_ground_layer/data/permaice.shp" perm = gpd.read_file(input) # drop rows that have missing value for extent attribute perm.dropna(subset = ['EXTENT'], inplace = True) # add column that codes the categorical extent strings into numbers # in order to do stats with the workflow and assign palette to this # first, define the conditions and choices for new extent_code attribute conditions = [ (perm['EXTENT'] == "C"), (perm['EXTENT'] == "D"), (perm['EXTENT'] == "S"), (perm['EXTENT'] == "I") ] choices = [4, 3, 2, 1] # Use numpy.select() to assign extent_code based on the conditions and choices perm['extent_code'] = np.select(conditions, choices) # convert the CRS to WGS84 to distort the geometries that intersect # the antimeridian, so we can identify which polygons are problematic # pull the CRS from the TMS tms = morecantile.tms.get("WGS1984Quad") workflow_tms = tms.crs perm.to_crs(workflow_tms, inplace = True) # sort the geoms based on the largest longitudinal extent (wrap around the world) # subtract the minimum longitude value (minx) from the maximum longitude value (maxx) perm['longitude_extent'] = perm['geometry'].apply(lambda geom: geom.bounds[2] - geom.bounds[0]) sorted_gdf = perm.sort_values(by = 'longitude_extent', ascending = False) # define the problematic polygons as the polygons with a longitude extent greater than 180 antimeridian_polys = [index for index, row in sorted_gdf.iterrows() if row['longitude_extent'] > 180] # remove the indices from the cleaned_gdf to ensure we did not miss any problem polygons nonproblematic_polys = perm.drop(antimeridian_polys) # drop the column created for longitude_extent, no longer necessary nonproblematic_polys.drop(columns=['longitude_extent'], inplace = True) # save the nonproblematic polygons as a cleaned input file for the viz workflow output_file = "/home/jcohen/permafrost_ground_layer/cleaned_data/permaice_clean.gpkg" nonproblematic_polys.to_file(output_file, driver = "GPKG") print("Cleaning complete.") ```

Next Steps

The final details to work out are:

Split the 6 polygons at the antimeridian rather than removing them
- could write custom function to split / shift the polygon vertices or convert the data to a GeoJSON and use the antimeridan package
assign a palette (by adjusting the config and re-executing just web-tiling) to properly represent all the values of the attribute of interest (the extent dummy coding)
- using both "aggregation_method": "max" (shown in the screenshot), and "aggregation_method": "mean" show mostly web tiles of one color, and with slight variation:
- Using palette:
```
"palette": [
    "#e41ba866",
    "#af18cd66"
]
```
  Note: 66 = 40% alpha and looks good, considering we want users to overlay IWP on this layer and still see the terrain

There should be 4 values that are represented:

julietcohen commented 1 year ago

First draft ready for the PDG

The palette now displays all 4 colors. The data package on the ADC will have the one input file from Brown et al. 2022, the geopackages, and the rasters. The package must be published with a new DOI (A2MG7FX35) before we can move the layer onto production so that users can access the data/metadata. Per Matt's suggestion, the ADC metadata will include provenance triples indicating the derivation relationship with the National Snow and Ice Data Center.

Web tiles need to be re-processed with a blue palette, per Anna's request, because this dataset is usually visualized in blue by others.

julietcohen commented 1 year ago

Layer with blue palette

Layer with the default 40% opacity and IWP overlaid:

When this layer was first uploaded to demo, the incorrect legend order revealed a bug, described in this issue. A fix in the XML was made so the value of each legend item determines the order (1,2,3,4), rather than the string label.

The data for this layer are archived at the ADC at /var/data/10.18739/A2MG7FX35:

cleaned geopackage permaice_clean.gpkg that is fed into the viz workflow
script that cleaned the input data
geopackages
geotiffs
associated csv files for geopackages and geotiffs
the processing script to run the viz workflow
config

The web tiles are in: var/data/tiles/0.18739/A2MG7FX35/

To do:

[x] create metadata package on production
[x] upload a single geopackage and a geotiff to create spatial metadata for them
[x] split polygons at the antimeridian instead of removing them
[ ] convert these entities to ghost entities, remove the single geopackage and geotiff from the ADC package itself
[ ] point to the data archives at /var/data/10.18739/A2MG7FX35 in abstract so users know where to find the data
[ ] add provenance triples to point to the original data archived at the NSIDC
[ ] after publishing with DOI, transfer this layer to the production PDG portal

julietcohen commented 1 year ago

Update on splitting polygons that intersect the antimeridian

I created a script that:

identifies the polygons that cross the antimeridian without needing to transform them to WGS84 first
- the data comes in the Lambert Azimuthal projection, and WGS84 is the CRS of the TMS, and transforming the polygons deforms them to wrap around the world the opposite way
creates a line at (what seems like) the antimeridian in the CRS of the input data
splits the polygons at this line, and retains their other attributes
appends these polygons to the input data so they will be processed with the other polygons in the viz-workflow

Problem: After converting these spit polygons to WGS84, they are still deformed in the same way (wrap around the other side of the world). A potential reason for this could be that the way I split the polygons results in their split side touching or sharing the antimeridian, so the polygons are still not able to be transformed to WGS84 correctly. This may be resolved by:

buffering the antimeridian line to shift the border of the polygon away from the antimeridian in all split polygons
or similarly, creating a long strip-like polygon at the antimeridian instead of a line, and using that to split the polygons

Defining the antimeridian in the CRS of the input data is the most complicated part of this process. In WGS84, defining a line at the antimeridian is easy. However, the CRS of the input data is in units of meters. The CRS info is the following:

I tried converting the antimeridian line in WGS84 to the Lambert Azimuthal projection, but the values that are not 0 become infinite. The closest solution I found so far, after trail and error of changing the coordinates and mapping, was to manually define the LineString with 2 coordinate values: line_geometry = LineString([(0, -20000000), (0, 20000000)]). The +/- 20000000 values are derived from mapping the polygons in the Lambert Azimuthal projection and checking the values on the x-axis (this line looks like the antimeridian when plotted, and this line looks like it splits the polygons where I would visually want them split, based on where the antimeridian crosses them).

clean_data_v2.py

```python # Clean permaice data from Brown et al. 2002 # Data source: https://nsidc.org/data/ggd318/versions/2 # Author: Juliet Cohen # Date: 2023-09-13 # Cleaning steps include: # 1. remove NA values from extent attribute # 2. add a dummy coding for the extent attribute # 3. split polygons that intersect the antimeridian # conda env: viz_3-10_local import geopandas as gpd import pandas as pd import numpy as np from shapely.geometry import LineString from shapely.ops import split input = "/home/jcohen/permafrost_ground_layer/data/permaice.shp" perm = gpd.read_file(input) # drop rows that have missing value for extent attribute perm.dropna(subset = ['EXTENT'], inplace = True) # create new gdf of just the polygons that have a neg min longitude and a pos max longitude # this indicates that the polygon crosses the antimeridian or prime meridian (7 polygons) meridian_polys = (perm['geometry'].bounds['minx'] < 0 ) & (perm['geometry'].bounds['maxx'] > 0) # subset the data to just polygons that cross either of the merdidians # and retain all attributes by appending `.copy()` prob_polys = perm[meridian_polys].copy() # subset the prob_polys to those that cross antimeridian, not prime meridian # the only polygon that crosses the prime meridian is at index 60 polys_AM = prob_polys.drop([60], inplace = False) # remove the antimeridian polygons from the original gdf # so when the split ones are appeneded later, there won't be duplicates perm.drop(polys_AM.index, inplace = True) # SPLIT THE ANTIMERIDIAN POLYGONS: # first create a line gdf at the 180th longitude, # that's where the center is in this dataset according to metadata, # but the units are meters, not degrees, so use 20,000,000 instead of 180 line_geometry = LineString([(0, -20000000), (0, 20000000)]) am = gpd.GeoDataFrame(geometry = [line_geometry]) # set the CRS to that of the polygons that need to be split am.set_crs(polys_AM.crs, inplace = True) # create empty lists to store the split polygons and their attributes geoms_all = [] atts_all = [] # iterate over each geometry that crosses the antimeridian for index, row in polys_AM.iterrows(): # define the geometry and attributes separately geom = row['geometry'] atts = row.drop('geometry') # append the atts to atts_all so we can append them to geoms after loop atts_all.append(atts) # return a boolean series that fits both the conditions: # 1) the line and geom do intersect, # 2) the line and geom do not touch (share a boundary) line = am.loc[(am.geometry.intersects(geom)) & (~am.geometry.touches(geom))] # split the single geometry at the antimeridian, # outputing multiple geometries stored within a "geometrycollection" per row split_polys = split(geom, line.geometry.iloc[0]) # add split polygons to the output list geoms_all.append(split_polys) # create a GeoDataFrame from the split polygons geoms_all_gdf = gpd.GeoDataFrame(geometry = geoms_all) geoms_all_gdf.set_crs(polys_AM.crs, inplace = True) # create a GeoDataFrame from the attributes of the split polygons atts_all_gdf = pd.DataFrame(atts_all, index = [0, 1, 2, 3, 4, 5]) # combine the split polys with the attributes into one gdf geoms_atts = pd.concat([geoms_all_gdf, atts_all_gdf], axis = 1) # separate the polygons from "geometrycollection" to individual polygon geoms # 5 of the polygons are split once, and 1 polygon is split twice split_polys_all = geoms_atts.explode(ignore_index = True, index_parts = False) # append the split polygons to the same gdf as other polygons perm = pd.concat([perm, split_polys_all], ignore_index = True) # add column that codes the categorical extent strings into numbers # in order to do stats with the workflow and assign palette to this # first, define the conditions and choices for new extent_code attribute conditions = [ (perm['EXTENT'] == "C"), (perm['EXTENT'] == "D"), (perm['EXTENT'] == "S"), (perm['EXTENT'] == "I") ] choices = [4, 3, 2, 1] # Use numpy.select() to assign extent_code based on the conditions and choices perm['extent_code'] = np.select(conditions, choices) # save the nonproblematic polygons as a cleaned input file for the viz workflow output_file = "/home/jcohen/permafrost_ground_layer/cleaned_data/permaice_clean_split.gpkg" perm.to_file(output_file, driver = "GPKG") print("Cleaning complete.") ```

julietcohen commented 1 year ago

Visuals for antimeridian polygon splitting

The first part of the script reads in the input data, removes NA values from the attribute we are visualizing, and identifies the polygons that intersect the antimeridan.

Part 1 of clean_data_v2.py

```python input = "/home/jcohen/permafrost_ground_layer/data/permaice.shp" perm = gpd.read_file(input) # drop rows that have missing value for extent attribute perm.dropna(subset = ['EXTENT'], inplace = True) # create new gdf of just the polygons that have a neg min longitude and a pos max longitude # this indicates that the polygon crosses the antimeridian or prime meridian (7 polygons) meridian_polys = (perm['geometry'].bounds['minx'] < 0 ) & (perm['geometry'].bounds['maxx'] > 0) # subset the data to just polygons that cross either of the merdidians # and retain all attributes by appending `.copy()` prob_polys = perm[meridian_polys].copy() # subset the prob_polys to those that cross antimeridian, not prime meridian # the only polygon that crosses the prime meridian is at index 60 polys_AM = prob_polys.drop([60], inplace = False) ... ```

The polygons visualized at this stage:

The next steps in the cleaning script create a line that represents the antimeridian and splits the polygons there.

Part 2 of clean_data_v2.py

```python # remove the antimeridian polygons from the original gdf # so when the split ones are appeneded later, there won't be duplicates perm.drop(polys_AM.index, inplace = True) # SPLIT THE ANTIMERIDIAN POLYGONS: # first create a line gdf at the 180th longitude, # that's where the center is in this dataset according to metadata, # but the units are meters, not degrees, so use 20,000,000 instead of 180 line_geometry = LineString([(0, -20000000), (0, 20000000)]) am = gpd.GeoDataFrame(geometry = [line_geometry]) # set the CRS to that of the polygons that need to be split am.set_crs(polys_AM.crs, inplace = True) # create empty lists to store the split polygons and their attributes geoms_all = [] atts_all = [] # iterate over each geometry that crosses the antimeridian for index, row in polys_AM.iterrows(): # define the geometry and attributes separately geom = row['geometry'] atts = row.drop('geometry') # append the atts to atts_all so we can append them to geoms after loop atts_all.append(atts) # return a boolean series that fits both the conditions: # 1) the line and geom do intersect, # 2) the line and geom do not touch (share a boundary) line = am.loc[(am.geometry.intersects(geom)) & (~am.geometry.touches(geom))] # split the single geometry at the antimeridian, # outputing multiple geometries stored within a "geometrycollection" per row split_polys = split(geom, line.geometry.iloc[0]) # convert the split polygons to a gdf in order to concatenate with the attributes #split_polys_gdf = gpd.GeoDataFrame(geometry = split_polys) #polys_with_atts = pd.concat([split_polys, atts], axis = 1) # add split polygons to the output list geoms_all.append(split_polys) # create a GeoDataFrame from the split polygons geoms_all_gdf = gpd.GeoDataFrame(geometry = geoms_all) geoms_all_gdf.set_crs(polys_AM.crs, inplace = True) # create a GeoDataFrame from the attributes of the split polygons atts_all_gdf = pd.DataFrame(atts_all, index = [0, 1, 2, 3, 4, 5]) # combine the split polys with the attributes into one gdf geoms_atts = pd.concat([geoms_all_gdf, atts_all_gdf], axis = 1) # separate the polygons from "geometrycollection" to individual polygon geoms # 5 of the polygons are split once, and 1 polygon is split twice split_polys_all = geoms_atts.explode(ignore_index = True, index_parts = False) ... ```

The polygons visualized at this stage:

Then we can test if this splitting worked for our intended purposes by converting the polygons at this stage to WGS84 and plotting the result in the same way.

transform

```python # convert the polygons to WGS84 to see how they are still deformed split_polys_all.to_crs("EPSG:4326", inplace = True) ```

They still wrap around the world the opposite way when converted to WGS84.

robyngit commented 1 year ago

@julietcohen, just exploring the CRS from this data...

perm.crs.name
# 'Sphere_ARC_INFO_Lambert_Azimuthal_Equal_Area'

perm.crs.to_epsg()
# None

More CRS Info

```python perm.crs.to_json_dict() # output: {'$schema': 'https://proj.org/schemas/v0.7/projjson.schema.json', 'type': 'ProjectedCRS', 'name': 'Sphere_ARC_INFO_Lambert_Azimuthal_Equal_Area', 'base_crs': {'name': 'GCS_Sphere_ARC_INFO', 'datum': {'type': 'GeodeticReferenceFrame', 'name': 'D_Sphere_ARC_INFO', 'ellipsoid': {'name': 'Sphere_ARC_INFO', 'radius': 6370997}}, 'coordinate_system': {'subtype': 'ellipsoidal', 'axis': [{'name': 'Longitude', 'abbreviation': 'lon', 'direction': 'east', 'unit': {'type': 'AngularUnit', 'name': 'Degree', 'conversion_factor': 0.0174532925199433}}, {'name': 'Latitude', 'abbreviation': 'lat', 'direction': 'north', 'unit': {'type': 'AngularUnit', 'name': 'Degree', 'conversion_factor': 0.0174532925199433}}]}}, 'conversion': {'name': 'unnamed', 'method': {'name': 'Lambert Azimuthal Equal Area (Spherical)', 'id': {'authority': 'EPSG', 'code': 1027}}, 'parameters': [{'name': 'Latitude of natural origin', 'value': 90, 'unit': {'type': 'AngularUnit', 'name': 'Degree', 'conversion_factor': 0.0174532925199433}, 'id': {'authority': 'EPSG', 'code': 8801}}, {'name': 'Longitude of natural origin', 'value': 180, 'unit': {'type': 'AngularUnit', 'name': 'Degree', 'conversion_factor': 0.0174532925199433}, 'id': {'authority': 'EPSG', 'code': 8802}}, {'name': 'False easting', 'value': 0, 'unit': 'metre', 'id': {'authority': 'EPSG', 'code': 8806}}, {'name': 'False northing', 'value': 0, 'unit': 'metre', 'id': {'authority': 'EPSG', 'code': 8807}}]}, 'coordinate_system': {'subtype': 'Cartesian', 'axis': [{'name': 'Easting', 'abbreviation': '', 'direction': 'east', 'unit': 'metre'}, {'name': 'Northing', 'abbreviation': '', 'direction': 'north', 'unit': 'metre'}]}} ```

This seems to be a projection without a EPSG code. If you try setting it to EPSG:9820 (perm.set_crs(9820, allow_override=True)), you get an error from the Proj library: CRSError: Invalid projection: EPSG:9820: (Internal Proj Error: proj_create: crs not found). I'm not sure 9820 is the identifier for this projection anyways. In their paper, the authors say they used the the Lambert Azimuthal Equal Area Polar Projection. The only information that I could find out about this projection really is that it is identified as SR-ORG:80, and someone had the exact issue you did with it in geopandas (see: geopandas issue: "to_crs error for polygons crossing the -180 degree meridian"). They indicated that reprojecting in ArcMap works.

I'm not sure, but I have a feeling that some of the projection information needed is not handled properly by geopandas (or more accurately: by the PROJ library that geopandas uses under the hood). I wonder if it's possible to re-project in ARC first.

Also, I don't know if this might help in splitting the polygons, but there is some prime meridian info in the CRS object:

perm.crs.prime_meridian
# PRIMEM["Greenwich",0, ANGLEUNIT["Degree",0.0174532925199433]]

julietcohen commented 1 year ago

Thanks for the insight, @robyngit! That helps clarify the confusing outputs and info (and lack of info) that I found online about this projection as well. I think you're right that there's information that geopandas can't handle, because an error is also returned from reprojecting an antimeridian line in WGS84 to the CRS of the input data.

reproduce error

```python # raw input data input = "/home/jcohen/permafrost_ground_layer/data/permaice.shp" perm = gpd.read_file(input) # drop rows that have missing value for extent attribute # (subsets the data to only the polygons of interest) perm.dropna(subset = ['EXTENT'], inplace = True) # create new gdf of just the polygons that have a neg min longitude and a pos max longitude # this indicates that the polygon crosses the antimeridian or prime meridian (7 polygons) meridian_polys = (perm['geometry'].bounds['minx'] < 0 ) & (perm['geometry'].bounds['maxx'] > 0) # subset the data to just polygons that cross either of the merdidians prob_polys = perm[meridian_polys].copy() # subset the prob_polys to those that cross antimeridian, not prime meridian # the only polygon that crosses the prime meridian is at index 60 polys_AM = prob_polys.drop([60], inplace = False) # create a LineString from the north pole to south pole along antimeridian line = LineString([(180, 90), (180, -90)]) # Create a GeoDataFrame with the line line = gpd.GeoDataFrame({'geometry': [line]}, crs = 'EPSG:4326') # reproject to the crs of input data line.to_crs(polys_AM, inplace=True) output: ValueError: The truth value of a GeoDataFrame is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all(). ```

I will look into your suggestion to reproject this shapefile in ARCmap.

julietcohen commented 8 months ago

Update: splitting polygons to enable clean transformation to EPSG:4326

Converting the original data to EPSG:4326 in QGIS did not work because the uploaded file does not have a known EPSG. The Lambert projection is labeled as a custom CRS. QGIS is different than ArcMap (although the specific differences are unknown to me), so there is a chance that still may work in ArcMap, but I do not have access to ArcMap at this moment (maybe I can through a coworker at NCEAS).

Instead of doing that CRS conversion for the data file with that software, I continued to try to split the polygons with a buffer at the antimeridian.

Attempted programatic approaches

In order to split the polygons the cross the antimeridian, I tried to use geopandas and shapely. As noted earlier in this issue, simply splitting the polygons at a defined LineString for the antimeridian (in meters) did work in creating valid geometries that lied on either side of the meridian. However, when the geometries were then transformed into EPSG:4326 they were still deformed and wrapped the opposite way around the world likely because they were not buffered from the antimeridian, so the new split side of the polygon likely still intersected the antimeridian. It could be that my definition of the linestring for the antimeridian in meters was slightly off. I suggested that this may be able to be resolved by splitting the polygons with a buffered antimeridian linestring.

buffer AM line

``` python # read in polygons that cross AM polys = gpd.read_file("~/permafrost_ground_layer/split_AM_polys/AM_polygons.gpkg") # first create a line gdf at the 180th longitude, # that's where the center is in this dataset according to metadata, # but the units are meters, not degrees, so use 20,000,000 instead of 180 line_geometry = LineString([(0, -20000000), (0, 20000000)]) am = gpd.GeoDataFrame(geometry = [line_geometry]) # set the CRS to that of the polygons that need to be split am.set_crs(polys.crs, inplace = True) buffer = 600 am_buffered = am.buffer(buffer) # create empty lists to store the split polygons and their attributes geoms_all = [] atts_all = [] # iterate over each geometry that crosses the antimeridian for index, row in polys.iterrows(): # define the geometry and attributes separately geom = row['geometry'] atts = row.drop('geometry') # append the atts to atts_all so we can append them to geoms after loop atts_all.append(atts) # return a boolean series that fits both the conditions: # 1) the line and geom do intersect, # 2) the line and geom do not touch (share a boundary) line = am_buffered.loc[am_buffered.geometry.intersects(geom)] # split the single geometry at the buffered antimeridian, # outputing multiple geometries stored within a "geometrycollection" per row split_polys = split(geom, line.geometry.iloc[0]) # convert the split polygons to a gdf in order to concatenate with the attributes #split_polys_gdf = gpd.GeoDataFrame(geometry = split_polys) #polys_with_atts = pd.concat([split_polys, atts], axis = 1) # add split polygons to the output list geoms_all.append(split_polys) # create a GeoDataFrame from the split polygons geoms_all_gdf = gpd.GeoDataFrame(geometry = geoms_all) geoms_all_gdf.set_crs(polys.crs, inplace = True) geoms_all_gdf_exploded = geoms_all_gdf.explode().reset_index(drop = True) split_84 = geoms_all_gdf_exploded.to_crs(epsg="4326", inplace = False) ```

Unfortunately, an error resulted: GeometryTypeError: Splitting a Polygon with a Polygon is not supported I thought maybe QGIS can do this, but first wanted to try another programatic approach.

Next I tried to split the polygons with the original (not buffered) antimeridian LineString, then buffer the polygons's split side after they are split. But I could not figure out how to buffer one side of a polygon. I experimented with geopandas.buffer():

for split_geom in split_polys.geoms:
        if split_geom.intersects(line.geometry.iloc[0]):
            buffered_split = split_geom.buffer(buffer_distance, single_sided = True)
            geoms_all.append(buffered_split)

The output does not look buffered when mapped, even when the buffer distance is huge.

QGIS approach

While I have not 100% given up hope that geopandas.buffer() may be a programmatic solution, I tested if QGIS can split polygons with a buffered LineString, and it can! I uploaded 2 spatial files, the polygons that cross the antimeridian and the antimeridian in units of meters, and buffered the antimeridain 10,000 m. It can then split the polygons where the buffered line (polygon) is, and export the split polygons as a spatial file in the same "custom" Lambert projection.

Converting these geoms to EPSG:4326 shows no wrapping around the world:

julietcohen commented 8 months ago

I am not sure how to define an object or a LineString as PRIMEM["Greenwich",0, ANGLEUNIT["Degree",0.0174532925199433]], which Robyn found for the Lambert projection metadata. I tried a few approaches including line_geometry = polys.crs.prime_meridian but errors are returned like TypeError: Input must be valid geometry objects: Greenwich

julietcohen commented 7 months ago

I have come up with a new cleaning script that does the following steps programmatically:

ingests all the input data for the permafrost and ground ice dataset
remove NA values from extent attribute
defines the antimeridian in the projected CRS of the input data and buffers it by only 1 meter
splits the polygons that cross the antimeridian
appends those split polygons to the others with their attributes retained
add a dummy coding for the extent attribute
saves the GDF as a new file to use as input into the viz workflow

clean_perm_v3.py

```python # Clean permaice data from Brown et al. 2002 # Data source: https://nsidc.org/data/ggd318/versions/2 # Author: Juliet Cohen # Date: 2024-05-01 # Cleaning steps include: # 1. remove NA values from extent attribute # 2. add a dummy coding for the extent attribute # 3. split polygons that intersect the antimeridian # conda env: viz_3-10_local import geopandas as gpd import pandas as pd import numpy as np from shapely.geometry import LineString input = "/home/jcohen/permafrost_ground_layer/data/permaice.shp" perm = gpd.read_file(input) # drop rows that have missing value for extent attribute # since this is the attribute to visualize perm.dropna(subset = ['EXTENT'], inplace = True) # create new gdf of just the polygons that have a neg min longitude and a pos max longitude # this indicates that the polygon crosses the antimeridian or prime meridian (7 polygons) meridian_polys = (perm['geometry'].bounds['minx'] < 0 ) & (perm['geometry'].bounds['maxx'] > 0) # subset the data to just polygons that cross either of the merdidians # and retain all attributes by appending `.copy()` prob_polys = perm[meridian_polys].copy() # subset the prob_polys to those that cross antimeridian, not prime meridian # the only polygon that crosses the prime meridian is at index 60 polys_AM = prob_polys.drop([60], inplace = False) # remove the antimeridian polygons from the original gdf # so when the split ones are appeneded later, there won't be duplicates perm.drop(polys_AM.index, inplace = True) # Split polygons that cross the antimeridian # Step 1. create a line gdf at the 180th longitude, # that's where the center is in this dataset according to metadata, # the units are meters, not degrees, so use 20,000,000 instead of 180 am = gpd.GeoSeries(LineString([(0, -20000000), (0, 20000000)])) am.set_crs(perm.crs, inplace = True) # buffer the line with 1 meter (units of CRS) to convert it to a polygon am_buffered = am.buffer(distance = 1, cap_style = 2, join_style = 0, mitre_limit = 2) # create empty lists to store the split polygons and their attributes all_data = [] # iterate over each geometry that crosses the antimeridian for index, row in polys_AM.iterrows(): # define the geometry and attributes separately geom = row['geometry'] atts = gpd.GeoDataFrame(row.drop('geometry').to_frame().T) # split the single geometry with the buffered antimeridian GeoSeries, # outputing multiple geoms stored within a MultiPolygon split_geoms_MP = geom.difference(am_buffered) # make the index match the atts to concat correctly split_geoms_MP.index = atts.index # convert to GDF to define the geometry col split_geoms_MP_gdf = gpd.GeoDataFrame(geometry = split_geoms_MP) split_geoms_MP_gdf.set_crs(perm.crs, inplace = True) MP_with_atts = pd.concat([split_geoms_MP_gdf, atts], axis = 1) # MP_with_atts.reset_index(inplace = True) # not sure if i need this P_with_atts = MP_with_atts.explode(ignore_index = False, index_parts = False) # concatenate the exploded geometries with their attributes all_data.append(P_with_atts) # create empty gdf to store final result all_data_concat = gpd.GeoDataFrame() # iterate over each gdf in all_data, concatenate into single gdf for gdf in all_data: all_data_concat = pd.concat([all_data_concat, gdf], ignore_index = True) all_data_concat.reset_index(drop = True, inplace = True) # append the split polygons to the same gdf as other polygons perm = pd.concat([perm, all_data_concat], ignore_index = True) # add column that codes the categorical extent strings into numbers # in order to do stats with the workflow and assign palette to this # first, define the conditions and choices for new extent_code attribute conditions = [ (perm['EXTENT'] == "C"), (perm['EXTENT'] == "D"), (perm['EXTENT'] == "S"), (perm['EXTENT'] == "I") ] choices = [4, 3, 2, 1] # assign extent_code based on the conditions and choices perm['extent_code'] = np.select(conditions, choices) # save as a cleaned input file for the viz workflow output_file = "/home/jcohen/permafrost_ground_layer/split_AM_polys/cleaned_0501/permaice_clean_split.gpkg" perm.to_file(output_file, driver = "GPKG") print("Cleaning complete.") ```

In order to integrate this approach as a generalized step in viz-staging before staging any input data that contains geometries that cross the antimeridian, I'll need to find a way to generalize the way the antimeridian is defined. In this script, I define it as a LineString by explicitly defining the 2 coordinates. The LineString should ideally be defined by fetching the CRS of the input data and pulling the antimeridian from that. The following step will also buffer the LineString based on the minimum possible value for the distance argument of buffer(). I have not achieved this generalized approach yet, so this script will likely serve as the cleaning (pre-vizualization) for this dataset, and then for others we can integrate this functionality into viz-staging.

julietcohen commented 6 months ago

I visualized the dataset that was cleaned with the antimeridian polygons split as described above and uploaded it to the demo portal. I think the split polygons look great. The data was processed with max z-level set to 10. Processing the dataset for the final time will be with a higher z-level. The screenshot below is visualized with 65% opacity.

parsl_workflow.py

``` # Processing the permafrost and ground ice data layer # staging through web tiling with parsl # conda environment: viz_3-10_local, with local installs # for viz-staging and viz-raster import pdgstaging import pdgraster import config from datetime import datetime import json import logging import logging.handlers from pdgstaging import logging_config import os import parsl from parsl import python_app from parsl.config import Config from parsl.channels import LocalChannel from parsl.executors import HighThroughputExecutor from parsl.providers import LocalProvider import shutil import subprocess from subprocess import Popen user = subprocess.check_output("whoami").strip().decode("ascii") workflow_config = config.workflow_config # start with a fresh directory! print("Removing old directories and files...") base_dir = "/home/jcohen/permafrost_ground_layer/may_runs/" old_filepaths = [f"{base_dir}staging_summary.csv", f"{base_dir}raster_summary.csv", f"{base_dir}raster_events.csv", f"{base_dir}config__updated.json", f"{base_dir}log.log"] for old_file in old_filepaths: if os.path.exists(old_file): os.remove(old_file) # remove dirs from past run old_dirs = [f"{base_dir}staged", f"{base_dir}geotiff", f"{base_dir}web_tiles", f"{base_dir}runinfo"] for old_dir in old_dirs: if os.path.exists(old_dir) and os.path.isdir(old_dir): shutil.rmtree(old_dir) activate_conda = 'source /home/jcohen/.bashrc; conda activate viz_3-10_local' htex_local = Config( executors=[ HighThroughputExecutor( label="htex_local", worker_debug=False, #cores_per_worker=1, max_workers=10, provider=LocalProvider( #channel=LocalChannel(), init_blocks=1, max_blocks=1, worker_init=activate_conda ), ) ], ) parsl.clear() parsl.load(htex_local) def run_pdg_workflow( workflow_config, batch_size = 300 ): """ Run the main PDG workflow for the following steps: 1. staging 2. raster highest 3. raster lower 4. web tiling Parameters ---------- workflow_config : dict Configuration for the PDG staging workflow, tailored to rasterization and web tiling steps only. batch_size: int How many staged files, geotiffs, or web tiles should be included in a single creation task? (each task is run in parallel) Default: 300 """ start_time = datetime.now() logging.info("Staging initiated.") stager = pdgstaging.TileStager(workflow_config) tile_manager = stager.tiles config_manager = stager.config input_paths = stager.tiles.get_filenames_from_dir('input') input_batches = make_batch(input_paths, batch_size) # Stage all the input files (each batch in parallel) app_futures = [] for i, batch in enumerate(input_batches): app_future = stage(batch, workflow_config) app_futures.append(app_future) logging.info(f'Started job for batch {i} of {len(input_batches)}') # Don't continue to next step until all files have been staged [a.result() for a in app_futures] logging.info("Staging complete.") # ---------------------------------------------------------------- # Create highest geotiffs rasterizer = pdgraster.RasterTiler(workflow_config) # Process staged files in batches logging.info(f'Collecting staged file paths to process...') staged_paths = tile_manager.get_filenames_from_dir('staged') logging.info(f'Found {len(staged_paths)} staged files to process.') staged_batches = make_batch(staged_paths, batch_size) logging.info(f'Processing staged files in {len(staged_batches)} batches.') app_futures = [] for i, batch in enumerate(staged_batches): app_future = create_highest_geotiffs(batch, workflow_config) app_futures.append(app_future) logging.info(f'Started job for batch {i} of {len(staged_batches)}') # Don't move on to next step until all geotiffs have been created [a.result() for a in app_futures] logging.info("Rasterization highest complete. Rasterizing lower z-levels.") # ---------------------------------------------------------------- # Rasterize composite geotiffs min_z = config_manager.get_min_z() max_z = config_manager.get_max_z() parent_zs = range(max_z - 1, min_z - 1, -1) # Can't start lower z-level until higher z-level is complete. for z in parent_zs: # Determine which tiles we need to make for the next z-level based on the # path names of the geotiffs just created logging.info(f'Collecting highest geotiff paths to process...') child_paths = tile_manager.get_filenames_from_dir('geotiff', z = z + 1) logging.info(f'Found {len(child_paths)} highest geotiffs to process.') # create empty set for the following loop parent_tiles = set() for child_path in child_paths: parent_tile = tile_manager.get_parent_tile(child_path) parent_tiles.add(parent_tile) # convert the set into a list parent_tiles = list(parent_tiles) # Break all parent tiles at level z into batches parent_tile_batches = make_batch(parent_tiles, batch_size) logging.info(f'Processing highest geotiffs in {len(parent_tile_batches)} batches.') # Make the next level of parent tiles app_futures = [] for parent_tile_batch in parent_tile_batches: app_future = create_composite_geotiffs( parent_tile_batch, workflow_config) app_futures.append(app_future) # Don't start the next z-level, and don't move to web tiling, until the # current z-level is complete [a.result() for a in app_futures] logging.info("Composite rasterization complete. Creating web tiles.") # ---------------------------------------------------------------- # Process web tiles in batches logging.info(f'Collecting file paths of geotiffs to process...') geotiff_paths = tile_manager.get_filenames_from_dir('geotiff') logging.info(f'Found {len(geotiff_paths)} geotiffs to process.') geotiff_batches = make_batch(geotiff_paths, batch_size) logging.info(f'Processing geotiffs in {len(geotiff_batches)} batches.') app_futures = [] for i, batch in enumerate(geotiff_batches): app_future = create_web_tiles(batch, workflow_config) app_futures.append(app_future) logging.info(f'Started job for batch {i} of {len(geotiff_batches)}') # Don't record end time until all web tiles have been created [a.result() for a in app_futures] end_time = datetime.now() logging.info(f'⏰ Total time to create all z-level geotiffs and web tiles: ' f'{end_time - start_time}') # ---------------------------------------------------------------- # Define the parsl functions used in the workflow: @python_app def stage(paths, config): """ Stage a file """ from datetime import datetime import json import logging import logging.handlers import os import pdgstaging from pdgstaging import logging_config stager = pdgstaging.TileStager(config = config, check_footprints = False) for path in paths: stager.stage(path) return True # Create highest z-level geotiffs from staged files @python_app def create_highest_geotiffs(staged_paths, config): """ Create a batch of geotiffs from staged files """ from datetime import datetime import json import logging import logging.handlers import os import pdgraster from pdgraster import logging_config # rasterize the vectors, highest z-level only rasterizer = pdgraster.RasterTiler(config) return rasterizer.rasterize_vectors( staged_paths, make_parents = False) # ---------------------------------------------------------------- # Create composite geotiffs from highest z-level geotiffs @python_app def create_composite_geotiffs(tiles, config): """ Create a batch of composite geotiffs from highest geotiffs """ from datetime import datetime import json import logging import logging.handlers import os import pdgraster from pdgraster import logging_config rasterizer = pdgraster.RasterTiler(config) return rasterizer.parent_geotiffs_from_children( tiles, recursive = False) # ---------------------------------------------------------------- # Create a batch of webtiles from geotiffs @python_app def create_web_tiles(geotiff_paths, config): """ Create a batch of webtiles from geotiffs """ from datetime import datetime import json import logging import logging.handlers import os import pdgraster from pdgraster import logging_config rasterizer = pdgraster.RasterTiler(config) return rasterizer.webtiles_from_geotiffs( geotiff_paths, update_ranges = False) def make_batch(items, batch_size): """ Create batches of a given size from a list of items. """ return [items[i:i + batch_size] for i in range(0, len(items), batch_size)] # ---------------------------------------------------------------- # run the workflow # run_pdg_workflow(workflow_config) if __name__ == "__main__": run_pdg_workflow(workflow_config) # Shutdown and clear the parsl executor htex_local.executors[0].shutdown() parsl.clear() # transfer log from /tmp to user dir cmd = ['mv', '/tmp/log.log', f'/home/{user}/permafrost_ground_layer/may_runs/'] # initiate the process to run that command process = Popen(cmd) print("Script complete.") ```

config.py

``` workflow_config = { "deduplicate_clip_to_footprint": False, "dir_input": "/home/jcohen/permafrost_ground_layer/cleaned_v3/", "ext_input": ".gpkg", "dir_staged": "/home/jcohen/permafrost_ground_layer/may_runs/staged/", "dir_geotiff": "/home/jcohen/permafrost_ground_layer/may_runs/geotiff/", "dir_web_tiles": "/home/jcohen/permafrost_ground_layer/may_runs/web-tiles/", "filename_staging_summary": "/home/jcohen/permafrost_ground_layer/may_runs/staging_summary.csv", "filename_rasterization_events": "/home/jcohen/permafrost_ground_layer/may_runs/rasterization_events.csv", "filename_rasters_summary": "/home/jcohen/permafrost_ground_layer/may_runs/rasters_summary.csv", "tms_id": "WGS1984Quad", "z_range": [ 0, 10 ], "geometricError": 57, "z_coord": 0, "statistics": [ { "name": "coverage_category", "weight_by": "area", "property": "extent_code", "aggregation_method": "max", "resampling_method": "mode", "val_range": [1,4], "palette": [ "#17c0d3", "#179bd3", "#1775d3", "#154abc" ], "nodata_val": 0, "nodata_color": "#ffffff00" } ], "deduplicate_method": None, "deduplicate_at": None } ```

julietcohen commented 6 months ago

Polygons south of 45 degrees latitude are absent in tilesets

Brown et al. dataset in QGIS

Polygons with one of the 4 categories of the EXTENT value (not NA) are present below 45 degrees latitude in the original data from Brown et al. When I visualize the data in the viz workflow, the output tilesets seem to be cut off at 45 degrees south. I wonder why this is.

Here is the Brown et al. data visualized in QGIS, with a palette that shows dark blue as continuous extent (C), lighter blues for less continuous extents (D and I and S), and white for other (NA value).

Brown et al. dataset in tilesets

In the PDG demo portal, the polygons are cut off at 50 degrees latitude when zoomed out, and 45 degrees latitude when zoomed in, where the lower polygons appear only when you zoom in. No matter the zoom level, a clear latitudinal line serves as a cut off for the polygons.

Zooming in, we also see a few undesired vertical/longitudinal "stripes" that extend from 45 degrees more south, originating from the southernmost part of the polygons. These stripes very closely resemble the bands we saw stretch around the world horizontally/latitudinally when the polygons that crossed the antimeridian and became distorted when converted to WGS84. Note that these vertical stripes only appear when very zoomed in, they are not present when zoomed out. Northern Washington state is pictured here.

julietcohen commented 6 months ago

There is potential for the limiting factor to be in the bounding box created in viz-staging here. Perhaps the bottom of the bounding box is hard coded to a certain latitude somehow.

julietcohen commented 6 months ago

In viz-staging, we use geopandas total_bounds here to determine the bounding box of the TMS. In a notebook, I imported the input data for the viz workflow (the gpkg output from the cleaning script) and converted it to EPSG 4326, and ran gdf.total_bounds. The output: array([-179.99997917, 27.42921194, 179.99997917, 83.6228101 ]), which has a miny of ~27 so it should include polygons that extend south of 45 degrees latitude. I think a good next step would be to start the viz workflow again, and check these variables grid and bounds to determine that the bbox is what we want

julietcohen commented 6 months ago

I subset the cleaned data (output of the cleaning script) to just a handful of polygons (those that are ~30 degrees latitude) to prove that the viz workflow can process polygons that far south. Here they are in local cesium:

PermafrostDiscoveryGateway / pdg-portal