PermafrostDiscoveryGateway / pdg-portal

Design and mockup documents for the PDG portal
Apache License 2.0

Display Ingmar Nitze's lake change dataset #28

Open robyngit opened 2 years ago

robyngit commented 2 years ago

Sample data info

The files are not final, but the general structure will be the same.

General structure

* <UTM_zone>
  * 05_Lake_Dataset_Raster_02_final
    * [files]
robyngit commented 2 years ago

I processed the sample data from Ingmar using the same workflow we created for the IWP polygons, creating both PNG web tiles and 3D tiles. Everything ran very smoothly. The output is currently displayed on the demo portal:

[screenshots of the lake change layer on the demo portal, 2022-09-21]

Notes:

initze commented 1 year ago

New data Package

initze commented 1 year ago

Image visualization suggestions

  1. 3D tiles ('lake_change.gpkg')

This styling works nicely with a black background map (CartoDB Dark Matter, or similar)

[screenshots]

initze commented 1 year ago

Image visualization suggestions part 2

Raster

File: lake_change_grid_3000_netchange.tif

These are similar to Webb et al., 2022 (surface water index trend). Please use the raster as is; it is already designed to have aggregated statistics per pixel.

* Palette: RdBu
* Range: -2 to +2
* NoData = 0

[screenshots]
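
For anyone previewing the raster locally before it goes through the PDG workflow, here is a minimal sketch of how these display parameters could be applied with rasterio and matplotlib (the local file path and the transparent rendering of the NoData value are assumptions):

```python
# Hedged sketch: preview the net-change raster with the suggested RdBu palette,
# a -2..+2 range, and 0 treated as NoData (masked out so it renders transparent).
import numpy as np
import rasterio
import matplotlib.pyplot as plt

with rasterio.open("lake_change_grid_3000_netchange.tif") as src:  # assumed local copy
    band = src.read(1).astype(float)

band = np.ma.masked_equal(band, 0)  # NoData = 0

plt.imshow(band, cmap="RdBu", vmin=-2, vmax=2)
plt.colorbar(label="net change")
plt.show()
```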

julietcohen commented 1 year ago

New data added

Ingmar uploaded 5 zip files that contain lake change data to a Google Drive folder here.

Per our visualization meeting discussion on 4/3, the highest priority is to process the data in one of the 5 directories, taking Ingmar's color suggestions into consideration. Next, we will move on to processing the other 4 directories, and finally Ingmar's newer data, documented in issue #37.


Update: These 5 directories have been uploaded to the NCEAS datateam server: /home/pdg/data/nitze_lake_change/data_2022_11_04/lake_change_GD

julietcohen commented 1 year ago

Large quantity of staged tiles from 1 input lake change file

Initially I tried to stage all 6 lake_change.gpkg files within data_products_32635-32640, but staging was taking hours and was producing a surprisingly large number of files from just one of the input gpkg files: data_products_32635-32640/32640/lake_change.gpkg. I therefore restarted the process as a tmux script to stage only this one UTM zone.

script

```python
# Process smallest directory in Ingmar's Google Drive
# issue: https://github.com/PermafrostDiscoveryGateway/pdg-portal/issues/28
# follow color scheme suggested in issue
# data: https://drive.google.com/drive/folders/1JxSBRs2nikB_eYxtwEatZ06bhusAN5nL
# intent is to process all the data files in this drive,
# then move on to process Ingmar's higher temporal resolution data
# using venv arcade_layer with local installs for:
# viz-staging, viz-raster, & viz-3dtiles

# imports -----------------------------------------------------
# input data import
from pathlib import Path
# staging
import pdgstaging
from pdgstaging import TileStager
# rasterization
import pdgraster
from pdgraster import RasterTiler
# visual checks
import geopandas as gpd
# logging
from datetime import datetime
import logging
import logging.handlers
import os

# data ---------------------------------------------------------
base_dir = Path('/home/jcohen/lake_change_GD_workflow/lake_change_GD/data_products_32635-32640/32640')
filename = 'lake_change.gpkg'
input = [p.as_posix() for p in base_dir.glob('**/' + filename)]
print(f"Input file(s): {input}")

# logging config -----------------------------------------------
handler = logging.handlers.WatchedFileHandler(
    os.environ.get("LOGFILE", "/home/jcohen/lake_change_GD_workflow/log.log"))
formatter = logging.Formatter(logging.BASIC_FORMAT)
handler.setFormatter(formatter)
root = logging.getLogger()
root.setLevel(os.environ.get("LOGLEVEL", "INFO"))
root.addHandler(handler)

# Staging ------------------------------------------------------
print("Staging...")
stager = TileStager({
    "deduplicate_clip_to_footprint": False,
    "dir_input": "/home/jcohen/lake_change_GD_workflow/lake_change_GD/data_products_32635-32640/32640",
    "ext_input": ".gpkg",
    "dir_staged": "staged/",
    "dir_geotiff": "geotiff/",
    "dir_web_tiles": "web_tiles/",
    "filename_staging_summary": "staging_summary.csv",
    "filename_rasterization_events": "raster_events.csv",
    "filename_rasters_summary": "raster_summary.csv",
    "filename_config": "config",
    "simplify_tolerance": 0.1,
    "tms_id": "WGS1984Quad",
    "z_range": [0, 15],
    "geometricError": 57,
    "z_coord": 0,
    "statistics": [
        {
            "name": "coverage",
            "weight_by": "area",
            "property": "area_per_pixel_area",
            "aggregation_method": "sum",
            "resampling_method": "average",
            "val_range": [0, 1],
            "palette": [
                "#ff0000",  # red (check that these color codes are appropriate)
                "#0000ff"   # blue (source: http://web.simmons.edu/~grovesd/comm244/notes/week3/css-colors)
            ],
            "nodata_val": 0,
            "nodata_color": "#ffffff00"  # change?
        },
    ],
    "deduplicate_at": ["raster"],
    "deduplicate_keep_rules": [["Date", "larger"]],
    "deduplicate_method": "neighbor",
    # NOTE: duplicate key below; this second entry overrides the ["Date", "larger"] rule above
    "deduplicate_keep_rules": [["staging_filename", "larger"]],
    "deduplicate_overlap_tolerance": 0.1,
    "deduplicate_overlap_both": False,
    "deduplicate_centroid_tolerance": None
})

for file in input:
    print(f"Staging file {file}...")
    stager.stage(file)
    print(f"Completed staging file {file}.")

print("Staging complete.")
```

We love a suspenseful mystery!

julietcohen commented 1 year ago

Ray workflow on Delta server for 1 UTM zone

Staging

Raster highest

Raster lower

Web tiles

Web tile visualization on local Cesium:

[screenshots]
julietcohen commented 1 year ago

To do:

julietcohen commented 1 year ago

Progress towards adapting colors to attribute of interest

Ran the workflow through web-tiling with the statistic computed on a property of the data, rather than on coverage as is standard for the IWP data. I used the attribute ChangeRateGrowth_myr-1 because I was testing whether an attribute with all positive values would resolve the inf and -inf values shown in the raster_summary.csv pictured above. Unfortunately, many inf and -inf values were still produced, which resulted in failure to produce many lower resolution rasters from the z-11 rasters. I plotted anyway to get an idea of how the colors appear when mapped to this attribute. I used a color palette with 7 colors that range from red to yellow to blue depending on the lake growth:

[screenshot]

Config is here.

Notes:

julietcohen commented 1 year ago

Infinity and NaN values present in input data

I determined the source of those inf and -inf values: they are present in the input data. I traced them back from the raster summary to the staged tiles and finally to the 36 input lake_change.gpkg files, each from one UTM zone. There are also NaN values in each gpkg. These values might represent no data, or could be errors.

After discovering inf and NaN values within one file, I ran a script to check for them in every file, see below.

check_inf_nan_values.py

```python
# sum up inf and NaN values in lake change input data
import geopandas as gpd
import pandas as pd
import numpy as np
from pathlib import Path

# collect all lake_change.gpkg filepaths in Ingmar's data
base_dir = Path('/home/pdg/data/nitze_lake_change/data_2022-11-04/lake_change_GD/')
filename = 'lake_change.gpkg'
# To define each .gpkg file within each subdir as a string representation with forward slashes, use as_posix()
# The ** represents that any subdir string can be present between the base_dir and the filename
input = [p.as_posix() for p in base_dir.glob('**/' + filename)]
# 36 input files
print(f"Collected {len(input)} lake_change.gpkg filepaths.")

# For each filepath:
# import it as a gdf, convert to df by removing the geometry col
# (necessary because the numpy functions need a df as input),
# sum up all inf values and all NaN values in the df,
# and add the sums to the appropriate lists
inf_values = []
nan_values = []
for path in input:
    print(f"Checking file {path}.")
    gdf = gpd.read_file(path)
    # convert from gdf to df
    df = gdf.drop(['geometry'], axis = 1)
    # sum up inf values in df
    num_inf_values = np.isinf(df).values.sum()
    inf_values.append(num_inf_values)
    # sum up NaN values in df
    num_nan_values = df.isnull().sum().sum()
    nan_values.append(num_nan_values)

print(f"Checks complete.\nLength of inf value list is {len(inf_values)}.\nLength of Nan value list is {len(nan_values)}.")

# write lists to files
with open("/home/jcohen/lake_change_GD_workflow/LC_inf_values.txt", "w") as output:
    for inf_value in inf_values:
        output.write(f"{inf_value}\n")
with open("/home/jcohen/lake_change_GD_workflow/LC_nan_values.txt", "w") as output:
    for nan_value in nan_values:
        output.write(f"{nan_value}\n")
```
Output: LC_inf_values.txt (number of inf values in each file)

```
437 422 1300 357 225093 409 1075 104 1182 1369 239 321 105 48 216 60 382 406 948 266 331 421 991 805 438 570 465 230 467 391 399 478 269 431 227 365
```

Output: LC_nan_values.txt (number of NaN values in each file)

```
33 26 68 25 16350 23 61 8 40 69 7 23 5 2 2 0 12 16 22 8 23 17 13 35 26 20 19 6 13 11 17 24 7 9 7 13
```

These values are present in several columns, including but not necessarily limited to:

@initze : Would you recommend that I remove rows with inf and NaN values, or replace these values with something else? Alternatively, would you like to re-process the data to avoid calculating these values in the first place?

julietcohen commented 1 year ago

Ingmar is looking into the source of the inf and NaN values, and he will get back to us about how he wants to move forward.

initze commented 1 year ago

Hi @julietcohen . I checked a few of the files and could find some of the affected polygons. It seems that the files didn't run through a specific filter. The affected features/lakes are all very small, thus running into 0 divisions and other stuff.

Solution

Delete rows with NaN for now. I will apply a filter in the next version.

I hope that fixes your issue

Cheers Ingmar

To do:

* change the `property` used for the `statistic` to `ChangeRateNet_myr-1`

* changing the `property` will likely require us to adjust other config options as well:

  * potentially `weight_by` (depending on units of `ChangeRateNet_myr-1`), but likely keep this at `area` rather than changing to `count` when using `ChangeRateNet_myr-1`
  * `aggregation_method` to `mode` so the color of that pixel represents the most common rate change present in the pixel, rather than the average or sum of the rate changes in that pixel

    * updates:

      * `mode` is not an option. During the rasterization step for the highest z-level, the log reports: `ERROR:pdgraster.RasterTiler:'SeriesGroupBy' object has no attribute 'mode'`. See options [here](https://sparkbyexamples.com/pandas/pandas-aggregate-functions-with-examples/#:~:text=What%20are%20pandas%20aggregate%20functions,form%20a%20single%20summary%20value.).
      * `max` and `mean` are options, but while still rasterizing z-11, these 2 warnings were printed in the terminal:
/home/jcohen/anaconda3/envs/arcade_layer/lib/python3.9/site-packages/numpy/core/_methods.py:232: RuntimeWarning: invalid value encountered in subtract  x = asanyarray(arr - arrmean)

and

/home/jcohen/anaconda3/envs/arcade_layer/lib/python3.9/site-packages/numpy/core/_methods.py:48: RuntimeWarning: invalid value encountered in reduce return umr_sum(a, axis, dtype, out, keepdims, initial, where)

These statements likely result from rows of the raster_summary.csv that contain invalid cell values of inf and -inf, like so: [screenshot]

* potentially `resampling_method` to `mode` so that when we produce lower resolution rasters, the color for the pixel represents the most common rate change for that extent, rather than the average of the rate changes for all the lakes in that extent

* adjusting the config for a property of interest was described in [this issue](https://github.com/PermafrostDiscoveryGateway/viz-workflow/issues/9)

* adjust color palette: red and blue in the data above are not meaningful in the way we intend them to be; the blue likely represents 100% coverage, and the red likely represents <100% coverage because the red is used for pixels at the edge of the lake polygons. We should use a _gradient_ from red to yellow to blue rather than 2 discrete colors, which Ingmar suggested above [here](https://github.com/PermafrostDiscoveryGateway/pdg-portal/issues/28#issuecomment-1275936775). The color palette must be accepted by [colormaps](https://pratiman-91.github.io/colormaps).

* visualize 2 UTM zones to allow us to check if the neighbors deduplication is working well

* once the `palette` and `statistic` are figured out, change the z-level to 12 to see if that is better for the data resolution (~30m)
julietcohen commented 1 year ago

Thanks for looking into this, Ingmar. I'll remove all rows with NaN or inf values and move forward with processing this data with a statistic for the ChangeRateNet_myr-1 attribute.

robyngit commented 1 year ago

@tcnichol is ready to start moving this data to our datateam server. We need to discuss where he will store the data and how he will transfer it from Delta (via Globus). The transfer is estimated to be around 500 GB, including a lot of intermediary products.

mbjones commented 1 year ago

He should store it in the same ~pdg/data staging directory that we've been using for IWP. Juliet had trouble getting globus to write directly there, which is why the ~jscohen account was created. There is also a pdg account, and we should talk to Nick to clarify if that can be used to do a globus transfer directly into that location - I think it should work. Let's discuss on slack.

julietcohen commented 1 year ago

@mbjones: Todd is curious if there is an update on the Globus --> Datateam data transfer situation. If we have enabled this for users without needing to give them access to jscohen, please let us know how he can do so.

mbjones commented 1 year ago

I talked with Nick about moving the home directory of pdg to be /home/shares/pdg, which would mean its home directory is on the same filesystem that is used for our PDG data storage. That would enable globus logins with the pdg account shared by the project to transfer data directly to where it needs to be. That should eliminate the need for the jscohen account altogether. Check with Nick on the status of those changes.

julietcohen commented 1 year ago

Great, thank you

julietcohen commented 1 year ago

Update on cleaning lake change data before visualization:

Cleaning the lake change data provided in November 2022, located in: /home/pdg/data/nitze_lake_change/data_2022-11-04/lake_change_GD/...

Ingmar requested that the rows of each of the 36 lake_change.gpkg files that contain any NA or inf value be removed prior to visualization, since I noted that these values cause issues for the rasterization step in which we calculate stats on an attribute of the input data.

In order to document the rows that contain NA and/or inf values and therefore need to be removed, the following script writes a csv file for each lake_change.gpkg with the rows (and their indices) that contain NA or inf values. These csv files are saved within a directory hierarchy that matches Ingmar's, so it is clear which original lake_change.gpkg each csv documents.

clean_lake_change_data.py

```python
# Author: Juliet Cohen
# Overview:
# identify, document, and remove rows with an invalid value in any column (Na, inf, or -inf)
# in lake change input data from Ingmar Nitze,
# and save the cleaned files to a new directory
# using conda env perm_ground

import geopandas as gpd
import pandas as pd
import numpy as np
from pathlib import Path
import os

# collect all lake_change.gpkg filepaths in Ingmar's data
base_dir = Path('/home/pdg/data/nitze_lake_change/data_2022-11-04/lake_change_GD/')
filename = 'lake_change.gpkg'
# To define each .gpkg file within each subdir as a string representation with forward slashes,
# use as_posix()
# The ** represents that any subdir string can be present between the base_dir and the filename
input = [p.as_posix() for p in base_dir.glob('**/' + filename)]
print(f"Collected {len(input)} lake_change.gpkg filepaths.")

# Overview of loop:
# 1. import each filepath as a gdf
# 2. document which rows have invalid value in any column (Na, inf, or -inf)
#    as a separate csv for each input gpkg
# 3. drop any row with an invalid value
# 4. save as new lake change file
for path in input:
    print(f"Checking file {path}.")
    gdf = gpd.read_file(path)

    # first identify any rows with invalid values
    # to document which will be dropped for data visualization
    error_rows = []
    # first convert any infinite values to NA
    gdf.replace([np.inf, -np.inf], np.nan, inplace = True)
    for index, row in gdf.iterrows():
        if row.isna().any():
            error_rows.append(row)
    error_rows_df = pd.DataFrame(error_rows)

    # hard-code the start of the path to directory for the erroneous data
    filepath_start = "/home/jcohen/lake_change_GD_workflow/workflow_cleaned/error_data_documentation/"
    # next, pull the last couple parts of filepath to ID which lake_change.gpkg
    # is being processed, following Ingmar's directory hierarchy
    directory, filename = os.path.split(path)
    filepath_sections = directory.split(os.sep)
    relevant_sections = filepath_sections[-2:]
    partial_filepath = relevant_sections[0] + "/" + relevant_sections[1]
    full_filepath = filepath_start + partial_filepath + "/error_rows.csv"
    # make the subdirectories if they do not yet exist
    directory_path = os.path.dirname(full_filepath)
    if not os.path.exists(directory_path):
        os.makedirs(directory_path)
    # save the df of rows with invalid values as a csv
    # save the index because it communicates to Ingmar which rows in his original data
    # contain invalid values
    error_rows_df.to_csv(full_filepath, index = True)
    print(f"Saved rows with invalid values for lake change GDF:\n{path}\nto file:\n{full_filepath}")

    # drop the rows with NA values in any column
    gdf.dropna(axis = 0, inplace = True)

    # save cleaned lake change file to new directory
    # hard-code the start of the path to directory for the cleaned data
    filepath_start = "/home/jcohen/lake_change_GD_workflow/workflow_cleaned/cleaned_files/"
    # next, pull the last couple parts of filepath to ID which lake_change.gpkg
    # is being processed, following Ingmar's directory hierarchy
    directory, filename = os.path.split(path)
    filepath_sections = directory.split(os.sep)
    relevant_sections = filepath_sections[-2:] + ['lake_change_cleaned.gpkg']
    filepath_end = relevant_sections[0] + "/" + relevant_sections[1] + "/" + relevant_sections[2]
    full_filepath = filepath_start + filepath_end
    print(f"Saving file to {full_filepath}")
    # make the subdirectories if they do not yet exist
    directory_path = os.path.dirname(full_filepath)
    if not os.path.exists(directory_path):
        os.makedirs(directory_path)
    gdf.to_file(full_filepath, driver = "GPKG")

print(f"Cleaning complete.")
```
julietcohen commented 1 year ago

Visualizing 2 cleaned adjacent UTM zones

Ideally for the config, we would set both aggregation_method and resampling_method to mode when visualizing the attribute ChangeRateNet_myr-1 but we must use something else besides mode for aggregation_method until that functionality is built into the workflow. For testing, we can try max.
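
For context on why `mode` is unavailable: pandas provides no built-in `mode` aggregation on a `SeriesGroupBy` (hence the `'SeriesGroupBy' object has no attribute 'mode'` error logged earlier), though it can be emulated with `.agg()`. The snippet below only illustrates the pandas behavior with made-up data; it is not viz-raster code:

```python
import pandas as pd

df = pd.DataFrame({
    "pixel_id": ["a", "a", "b", "b", "b"],
    "ChangeRateNet_myr-1": [-0.5, -0.5, 0.2, 0.2, 1.0],
})

# df.groupby("pixel_id")["ChangeRateNet_myr-1"].mode()
# -> AttributeError: 'SeriesGroupBy' object has no attribute 'mode'

# How a per-pixel mode could be emulated if the workflow supported it:
per_pixel_mode = (
    df.groupby("pixel_id")["ChangeRateNet_myr-1"]
      .agg(lambda s: s.mode().iloc[0])  # take the first mode if there are ties
)
print(per_pixel_mode)
```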

This attribute ranges from approximately -10 to +5 with most values ranging from -2 to +2 (for the 2 UTM zones used for testing):

[screenshot]

Therefore we should also include a val_range of [-2, 2] in the config. This keeps the palette consistent across z-levels, so lakes or portions of lakes do not change color as you zoom in. More extreme values that fall outside that range are assigned the color for -2 or 2.

[screenshot]
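
A quick illustration of the clamping behavior described above, assuming the workflow effectively clips values to `val_range` before mapping them to the palette (my reading of the behavior, not confirmed from the code):

```python
import numpy as np

val_range = (-2, 2)
values = np.array([-10.0, -2.5, -1.0, 0.0, 1.5, 4.8])  # example ChangeRateNet_myr-1 values

# Values outside val_range receive the endpoint colors, equivalent to clipping first
clipped = np.clip(values, *val_range)
print(clipped)  # [-2.  -2.  -1.   0.   1.5  2. ]
```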

To do:

julietcohen commented 1 year ago

Adjusting config option nodata_value and removing deduplication

The last few rounds of lake change visualization for the same UTM zones tested the following two changes in the config:

  1. setting nodata_value to 999 and None (instead of 0 which is used for IWP data, because 0 is a possible value for the visualized attribute in this lake change data)
  2. not deduplicating (used same staged data with flagged duplicates using the neighbor deduplication method, but re-rasterized and web-tiled with config set to not deduplicate so flagged duplicates are not actually removed)
  3. using min instead of max for the aggregation_method
lake_change_config.py

```python
IWP_CONFIG = {
    "deduplicate_clip_to_footprint": False,
    "dir_output": OUTPUT,
    "dir_input": INPUT,
    "ext_input": ".gpkg",
    "dir_geotiff_remote": GEOTIFF_REMOTE,
    "dir_geotiff_local": GEOTIFF_LOCAL,
    "dir_web_tiles": WEBTILE_REMOTE,
    "dir_staged_remote": STAGING_REMOTE,
    "dir_staged_remote_merged": STAGING_REMOTE_MERGED,
    "dir_staged_local": STAGING_LOCAL,
    "filename_staging_summary": STAGING_REMOTE + "staging_summary.csv",
    "filename_rasterization_events": GEOTIFF_REMOTE + "raster_events.csv",
    "filename_rasters_summary": GEOTIFF_REMOTE + "raster_summary.csv",
    "version": datetime.now().strftime("%B%d,%Y"),
    "simplify_tolerance": 0.1,
    "tms_id": "WGS1984Quad",
    "z_range": [0, 11],
    "geometricError": 57,
    "z_coord": 0,
    "statistics": [
        {
            "name": "change_rate",
            "weight_by": "area",
            "property": "ChangeRateNet_myr-1",
            "aggregation_method": "min",
            "resampling_method": "mode",
            "val_range": [-2, 2],
            "palette": ["#ff0000",  # red
                        "#FF8C00",  # DarkOrange
                        "#FFA07A",  # LightSalmon
                        "#FFFF00",  # yellow
                        "#66CDAA",  # MediumAquaMarine
                        "#AFEEEE",  # PaleTurquoise
                        "#0000ff"], # blue
            "nodata_val": 999,
            "nodata_color": "#ffffff00"  # fully transparent white
        },
    ],
    #"deduplicate_at": ["raster"],
    "deduplicate_at": None,
    #"deduplicate_keep_rules": [["Date","larger"]],
    "deduplicate_method": None,
    #"deduplicate_overlap_tolerance": 0.1,
    #"deduplicate_overlap_both": False,
    #"deduplicate_centroid_tolerance": None
}
```

Results

1. setting nodata_value to 999 and None

Both of these nodata_values were problematic, resulting in a yellow layer over the tiles that spans from the prime meridian to the antimeridian:

[screenshot]

This may be related to the fact that these values 999 and None are not present in the data at all, so the nodata_mask is of length 0 when we execute to_image() in viz-raster here. Because the nodata_value is not set to 0 anymore, the 0 values in the data (yellow in our palette) are not made into the mask, so they are retained. Consider the IWP config, where we do set the nodata_value to 0. The cells that have 0% coverage or 0 polygons are set to transparent. However, this still does not explain why the yellow layer expands all the way around half of the world, and is not limited to the region of the 2 UTM zones. The same dir of staged tiles is used to produce both web tile sets that do and do not have the yellow layer, so I suppose this means that the yellow layer is only in the web-tiles, and not the staged tiles themselves. More testing is needed to narrow down what's happening here.
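
A standalone sketch of why the mask comes out empty when the configured nodata value never occurs in the data (this mirrors my understanding of the masking, not the actual to_image() implementation):

```python
import numpy as np

pixel_values = np.array([0.0, 0.0, 0.3, 0.7])  # 0 = "no lake change detected" here

# With nodata_val = 0, the zeros are masked and rendered transparent
mask_zero = (pixel_values == 0)
print(mask_zero.sum())  # 2 pixels masked

# With nodata_val = 999 (a value that never appears), nothing is masked,
# so 0-valued pixels keep the color assigned to 0 (yellow in this palette)
mask_999 = (pixel_values == 999)
print(mask_999.sum())  # 0 pixels masked
```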

2. Not removing flagged duplicates

In the web tiles, we still see the gap that I assume represents the border of the two UTM zones visualized:

[screenshot]

The next step to figure out what's going on here is to plot all the lake data for both UTM zones and see if they overlap, as sketched below. Ingmar expressed that there certainly is overlap, and that deduplication needs to be integrated either before or during the viz-workflow.
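
A minimal sketch of that overlap check with geopandas (the file paths are placeholders for the two cleaned, adjacent UTM zone files):

```python
import geopandas as gpd
import matplotlib.pyplot as plt

zone_a = gpd.read_file("lake_change_cleaned_zone_a.gpkg")  # placeholder paths
zone_b = gpd.read_file("lake_change_cleaned_zone_b.gpkg")

# Plot both zones in a common CRS to see whether lakes actually overlap at the border
fig, ax = plt.subplots(figsize=(10, 10))
zone_a.to_crs(epsg=4326).plot(ax=ax, color="blue", alpha=0.5)
zone_b.to_crs(epsg=4326).plot(ax=ax, color="red", alpha=0.5)
plt.show()
```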

3. using min instead of max for aggregation

This makes more sense considering that the most extreme values of the distribution for this attribute are negative, rather than positive. Hopefully this more correctly conveys the environmental trends shown in the data and shows more diverse values in the palette.

julietcohen commented 1 year ago

Update on deduplication

After looking into the code for the neighbors deduplication approach, as well as plotting the input data overlap, here are some observations:

julietcohen commented 1 year ago

Deduplication Successful

Changing the deduplicate_... config options corrected the deduplication:

lake_change_config.py

```python
IWP_CONFIG = {
    "deduplicate_clip_to_footprint": False,
    "dir_output": OUTPUT,
    "dir_input": INPUT,
    "ext_input": ".gpkg",
    "dir_geotiff_remote": GEOTIFF_REMOTE,
    "dir_geotiff_local": GEOTIFF_LOCAL,
    "dir_web_tiles": WEBTILE_REMOTE,
    "dir_staged_remote": STAGING_REMOTE,
    "dir_staged_remote_merged": STAGING_REMOTE_MERGED,
    "dir_staged_local": STAGING_LOCAL,
    "filename_staging_summary": STAGING_REMOTE + "staging_summary.csv",
    "filename_rasterization_events": GEOTIFF_REMOTE + "raster_events.csv",
    "filename_rasters_summary": GEOTIFF_REMOTE + "raster_summary.csv",
    "version": datetime.now().strftime("%B%d,%Y"),
    "simplify_tolerance": 0.1,
    "tms_id": "WGS1984Quad",
    "z_range": [0, 11],
    "geometricError": 57,
    "z_coord": 0,
    "statistics": [
        {
            "name": "change_rate",
            "weight_by": "area",
            "property": "ChangeRateNet_myr-1",
            "aggregation_method": "min",
            "resampling_method": "mode",
            "val_range": [-2, 2],
            "palette": ["#ff0000",  # red
                        "#FF8C00",  # DarkOrange
                        "#FFA07A",  # LightSalmon
                        "#FFFF00",  # yellow
                        "#66CDAA",  # MediumAquaMarine
                        "#AFEEEE",  # PaleTurquoise
                        "#0000ff"], # blue
            "nodata_val": 0,
            "nodata_color": "#ffffff00"  # fully transparent white
        },
    ],
    "deduplicate_at": ["raster"],
    "deduplicate_keep_rules": [["Perimeter_meter","larger"]],  # [property, operator], using property with all positive values
    "deduplicate_method": "neighbor",
    "deduplicate_overlap_tolerance": 0.5,  # default value
    "deduplicate_overlap_both": False,  # only 1 polygon must overlap beyond the deduplicate_overlap_tolerance threshold to be considered dups
    "deduplicate_centroid_tolerance": None  # only deduplicate_overlap_tolerance will be used to determine if polygons are dups
}
```
[screenshot]

Python indexing syntax for neighbor dedup method

Output in the terminal during staging brought my attention to some syntax that could potentially be improved in the neighbor method.

[screenshot]

The message might be referring to this syntax. The link to the pandas documentation that describes the better syntax is here.
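
For reference, this is the general pattern that pandas warning is about (an illustration with made-up data, not the actual viz-staging code): chained indexing assignment versus a single `.loc` assignment.

```python
import pandas as pd

df = pd.DataFrame({"duplicated": [False, False], "area": [1.0, 2.0]})

# Chained indexing: may operate on a copy and trigger SettingWithCopyWarning
df[df["area"] > 1.5]["duplicated"] = True

# Recommended: one .loc call that selects rows and the column together
df.loc[df["area"] > 1.5, "duplicated"] = True
```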

julietcohen commented 1 year ago

I re-processed the web tiles from the same correctly deduplicated geotiffs in the previous comment, but used np.nan for the nodata_val in the config (and made sure to import numpy as np at the top of the config).

lake_change_config.py

```python
IWP_CONFIG = {
    "deduplicate_clip_to_footprint": False,
    "dir_output": OUTPUT,
    "dir_input": INPUT,
    "ext_input": ".gpkg",
    "dir_geotiff_remote": GEOTIFF_REMOTE,
    "dir_geotiff_local": GEOTIFF_LOCAL,
    "dir_web_tiles": WEBTILE_REMOTE,
    "dir_staged_remote": STAGING_REMOTE,
    "dir_staged_remote_merged": STAGING_REMOTE_MERGED,
    "dir_staged_local": STAGING_LOCAL,
    "filename_staging_summary": STAGING_REMOTE + "staging_summary.csv",
    "filename_rasterization_events": GEOTIFF_REMOTE + "raster_events.csv",
    "filename_rasters_summary": GEOTIFF_REMOTE + "raster_summary.csv",
    "version": datetime.now().strftime("%B%d,%Y"),
    "simplify_tolerance": 0.1,
    "tms_id": "WGS1984Quad",
    "z_range": [0, 11],
    "geometricError": 57,
    "z_coord": 0,
    "statistics": [
        {
            "name": "change_rate",
            "weight_by": "area",
            "property": "ChangeRateNet_myr-1",
            "aggregation_method": "min",
            "resampling_method": "mode",
            "val_range": [-2, 2],
            "palette": ["#ff0000",  # red
                        "#FF8C00",  # DarkOrange
                        "#FFA07A",  # LightSalmon
                        "#FFFF00",  # yellow
                        "#66CDAA",  # MediumAquaMarine
                        "#AFEEEE",  # PaleTurquoise
                        "#0000ff"], # blue
            "nodata_val": np.nan,
            "nodata_color": "#ffffff00"  # fully transparent white
        },
    ],
    "deduplicate_at": ["raster"],
    "deduplicate_keep_rules": [["Perimeter_meter","larger"]],  # [property, operator], using property with all positive values
    "deduplicate_method": "neighbor",
    "deduplicate_overlap_tolerance": 0.5,  # default value
    "deduplicate_overlap_both": False,  # only 1 polygon must overlap beyond the deduplicate_overlap_tolerance threshold to be considered dups
    "deduplicate_centroid_tolerance": None  # only deduplicate_overlap_tolerance will be used to determine if polygons are dups
}
```

The result is still this strange yellow layer that also resulted from setting nodata_val to 999 and None:

[screenshot]

The ConfigManager.py specifies that the nodata_val should be able to be set to an integer, float, None, or np.nan.

If we do use 0 as the nodata_val here, which is the only value I have tried so far that doesn't result in that yellow layer, we would be using it the same way we do for the IWP data: in both datasets' palettes we are not differentiating between regions where no data was collected and regions that have data but where no polygons were identified or no lake size change was detected.

robyngit commented 1 year ago

> I re-processed the web tiles from the same correctly deduplicated geotiff's in the previous comment

Have you tried setting the nodata_val before rasterization? You could dig through the code to check if this is in fact how it works, but theoretically the workflow should set the pixels without any polygons to the nodata_val when first creating the highest level geotiffs.

julietcohen commented 1 year ago

Thanks for the feedback @robyngit! I'll re-start the workflow at the raster highest step rather than web-tiling, with nodata_val set to 999 or np.nan, and see if that works.

I didn't think that the nodata value would be used in the raster highest step because searching for nodata_val within viz-raster shows that the only occurrences are when web tiles are created. Within RasterTiler.py, nodata_val occurs when we execute webtile_from_geotiff(). However, this was just a quick string search on Github rather than a deep dive into how you use nodata_val throughout the package, so it's totally possible that it's used earlier in the raster highest step and I'm missing it.

julietcohen commented 1 year ago

Maybe the raster highest step gets the nodata_val from get_raster_config() defined in viz-staging, which is called within rasterize_vector() here.

robyngit commented 1 year ago

Looks like I made it so that no data pixels are always zero in the Raster class 😕, see: https://github.com/PermafrostDiscoveryGateway/viz-raster/blob/92924077ccf6083442c9c2aaf40a8f164818f2b9/pdgraster/Raster.py#L772-L773

Grid cells without any data will be filled with 0.

We would have to make it so that the 0 in the fill_value=0 line uses the nodata_val from the config, see: https://github.com/PermafrostDiscoveryGateway/viz-raster/blob/92924077ccf6083442c9c2aaf40a8f164818f2b9/pdgraster/Raster.py#L809
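
A generic sketch of the kind of change that implies (a hypothetical helper for illustration only, not the actual Raster.py code): the fill for empty grid cells would come from the configured nodata_val instead of a hard-coded 0.

```python
import numpy as np

def fill_empty_cells(grid, nodata_val=0):
    """Hypothetical helper: fill cells with no polygons using the configured
    nodata value rather than a hard-coded 0."""
    filled = np.array(grid, dtype=float)
    filled[np.isnan(filled)] = nodata_val
    return filled

# Cells with no data (NaN here) become 999 rather than 0, so 0 stays available
# as a real ChangeRateNet_myr-1 value
grid = [[np.nan, 0.0], [1.2, np.nan]]
print(fill_empty_cells(grid, nodata_val=999))
```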

julietcohen commented 1 year ago

Update on nodata_val

In line with Robyn's observation, which I did not see until after I tried this, I re-rasterized the same deduplicated staged tiles for these 2 UTM zones (so I started the workflow at the raster highest step) with the nodata_val set to 999. The resulting web tiles still have the yellow haze (see the demo portal). I will open a branch to integrate that change to enable non-0 nodata_vals.

julietcohen commented 8 months ago

Ingmar helpfully created a new issue to document his data cleaning to remove seawater and rivers: https://github.com/PermafrostDiscoveryGateway/landsattrend-pipeline/issues/8

initze commented 6 months ago

Temporary pan-arctic dataset

julietcohen commented 6 months ago

Thank you, @initze ! Well done.

This data has been uploaded to Datateam at: /var/data/submission/pdg/nitze_lake_change/filtered_parquet_2024-04-03 along with a README file. I included Ingmar's notes above, as well as my notes and links to the relevant github issues.

julietcohen commented 5 months ago

I did an initial check of the new, filtered version of the lake change dataset: /var/data/submission/pdg/nitze_lake_change/filtered_parquet_2024-04-03/Lakes_global-001.parquet

There are NA values present in the attributes ChangeRateNet_myr-1 and ChangeRateGrowth_myr-1. As I did for the previous version of the lake change data that contained NA values for certain attributes, I can clean this data by removing the entire row if it contains NA values for an attribute.
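
A minimal sketch of that cleaning step, assuming the file reads cleanly with geopandas.read_parquet (the output path is a placeholder):

```python
import geopandas as gpd

path = "/var/data/submission/pdg/nitze_lake_change/filtered_parquet_2024-04-03/Lakes_global-001.parquet"
gdf = gpd.read_parquet(path)

# Count and then drop rows with NA in either change-rate attribute
cols = ["ChangeRateNet_myr-1", "ChangeRateGrowth_myr-1"]
print(gdf[cols].isna().sum())
gdf_clean = gdf.dropna(subset=cols)

gdf_clean.to_parquet("Lakes_global-001_cleaned.parquet")  # placeholder output path
```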

I focused on this dataset today because it makes sense to use this data for 4 different but related goals for the PDG project, which are of varying degrees of priority:

julietcohen commented 5 months ago

Visualized Sample of Parquet Data (filtered for seawater and rivers)

I used a subset of 10,000 polygons of the Lake Change data in parquet format as input into the visualization workflow with 3 stats:

They are up on the demo portal.

config

```python
workflow_config = {
    "dir_input": "/home/jcohen/lake_change_parquet",
    "ext_input": ".gpkg",
    "dir_staged": "staged/",
    "dir_geotiff": "geotiff/",
    "dir_web_tiles": "web_tiles/",
    "filename_staging_summary": "staging_summary.csv",
    "filename_rasterization_events": "raster_events.csv",
    "filename_rasters_summary": "raster_summary.csv",
    "filename_config": "config",
    "simplify_tolerance": 0.1,
    "tms_id": "WGS1984Quad",
    "z_range": [0, 13],
    "geometricError": 57,
    "z_coord": 0,
    "statistics": [
        {
            "name": "change_rate",
            "weight_by": "area",
            "property": "ChangeRateNet_myr-1",
            "aggregation_method": "mean",
            "resampling_method": "mode",
            "val_range": [0, 1],
            "palette": ["#ff0000",  # red
                        "#FF8C00",  # DarkOrange
                        "#FFA07A",  # LightSalmon
                        "#FFFF00",  # yellow
                        "#66CDAA",  # MediumAquaMarine
                        "#AFEEEE",  # PaleTurquoise
                        "#0000ff"], # blue
            "nodata_val": 0,
            "nodata_color": "#ffffff00"
        },
        {
            "name": "lake_coverage",
            "weight_by": "area",
            "property": "area_per_pixel_area",
            "aggregation_method": "sum",
            "resampling_method": "average",
            "val_range": [0, 1],
            "palette": ["#20f8c6",  # light turquoise
                        "#2f20f8"], # dark blue
            "nodata_val": 0,
            "nodata_color": "#ffffff00"
        },
        {
            "name": "lake_count",
            "weight_by": "count",
            "property": "centroids_per_pixel",
            "aggregation_method": "sum",
            "resampling_method": "sum",
            "palette": ["#20f8c6",  # light turquoise
                        "#2f20f8"], # dark blue
            "nodata_val": 0,
            "nodata_color": "#ffffff00"
        }
    ],
    "deduplicate_clip_to_footprint": False,
    "deduplicate_method": None
}
```

ChangeRateNet_myr-1

[screenshots]

area_per_pixel_area (% coverage)

[screenshots]

centroids_per_pixel (number of lakes)

[screenshot]

Zooming in, the pixels that represent the center of each lake are so high res (and therefore so small) that they are hard to see.

Notes

[screenshot]
julietcohen commented 5 months ago

I created a larger sample of 20,000 polygons and reprocessed the data with a much smaller simplify_tolerance = 0.0001. This succeeded in retaining the lake shapes of the polygons.
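
For illustration, the effect of the tolerance can be previewed outside the workflow with geopandas' simplify; the tolerance is in the units of the layer's CRS, so for EPSG:4326 these values are degrees (the input path is a placeholder):

```python
import geopandas as gpd

gdf = gpd.read_file("lake_change_sample.gpkg").to_crs(epsg=4326)  # placeholder path

# A coarse tolerance flattens small lakes into blocky shapes...
coarse = gdf.geometry.simplify(tolerance=0.1)

# ...while a much smaller tolerance retains the lake outlines
fine = gdf.geometry.simplify(tolerance=0.0001)

print(coarse.iloc[0].wkt[:80])
print(fine.iloc[0].wkt[:80])
```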

[screenshot]

The larger sample size highlighted that there are other polygons that appear to be seawater in addition to the island one pictured above in red in the last comment. So maybe the filtering could be refined.

[screenshot of apparent seawater polygons]
julietcohen commented 3 months ago

Update on deduplication and filtering seawater

Since I resolved this issue, here's the path forward for processing the lake change data:

Updates on dataset metrics

Ingmar helpfully provided some big picture dataset info:

| Question | Answer |
| --- | --- |
| How many lakes are mapped? | About 6-6.2 million. |
| Total dataset size? | The uncleaned, merged dataset is 4.13 GB. |
| Total area of the lakes? | The uncleaned, merged dataset covers ~950,000 km², but it will become smaller once deduplication is applied. |
| For the derived lake drainage dataset, would we include a binary attribute for simply "this lake is present or drained", or would it be an ordered categorical attribute including "partially drained", "totally drained", and "not drained"? | "That's a tricky one. Drained must typically refer to some reference date, as lakes are often polycyclic. Ideally we would have a drainage year. We have an attribute "NetChange_perc" which actually provides that information. In a paper I am working on we have >25% to <=75% loss as partially drained, >75% as completely drained. However, change must be > 1 ha as small lakes/changes are quite uncertain." |
| For the derived lake drainage dataset, would the geometry be in the form of polygons (the perimeter of the lake) or a point? | Polygons. For other analysis or visualization we can still reduce to centroids/points or something similar. |
julietcohen commented 2 months ago

Deduplication update

After some mapping in QGIS, and without doing systematic checks to confirm, polygons that border or intersect the antimeridian appear to have been flagged as duplicates in Ingmar's lake change dataset when the neighbors deduplication approach was applied to 2 adjacent UTM zones, 32601-32602. UTM zone 32601 borders the antimeridian. The neighbors deduplication approach transforms the geometries from their original CRS into the one specified as distance_crs, so the antimeridian polygons are distorted and wrap the opposite way around the world. See this screenshot of the distorted 32601 (green) and 32602 (yellow):

[screenshot]

This results in polygons in zone 32602 overlapping spatially with polygons from 32601 that are not actually in the region of overlap. Here's a screenshot from Jonas (a collaborator with Ingmar and Todd), who mapped the polygons that were identified as duplicates in red: [screenshot]

Mapping the flagged duplicate polygons on top of the distorted zone 32601 shows suspicious overlap: [screenshot]

Next steps:

If intersection with the antimeridian is indeed the cause of the incorrectly flagged duplicates (those that lie outside of the overlap between 2 adjacent UTM zones), then the neighbor deduplication should work well for all zones that do not intersect the antimeridian. Todd is looking into applying the approach to all UTM zones, and into how to get around this for zone 32601. My recommendation is to identify which polygons cross the antimeridian, split them, and buffer them slightly away from the line, as sketched below. Code for this was applied to the permafrost and ground ice dataset here.
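
A rough Python sketch of that recommendation (the R version applied to the permafrost and ground ice dataset is linked above); the file path, the assumption that the layer already carries EPSG:32601, and the buffer distance are all placeholders:

```python
import geopandas as gpd
from shapely.geometry import LineString

gdf = gpd.read_file("lake_change_32601.gpkg")  # assumed to already be in EPSG:32601

# Build a buffered 180th-longitude line in EPSG:4326, then project it into the
# data's UTM CRS; splitting in the UTM CRS avoids the wrap-around distortion
bounds_4326 = gdf.to_crs(epsg=4326).total_bounds  # [minx, miny, maxx, maxy]
am = gpd.GeoSeries([LineString([(180, bounds_4326[1]), (180, bounds_4326[3])])], crs=4326)
am_buffered = am.buffer(0.001).to_crs(gdf.crs).iloc[0]  # buffer distance is a guess

# Remove the thin strip along the antimeridian, splitting any crossing polygons,
# then drop geometries that fell entirely inside the strip
gdf["geometry"] = gdf.geometry.difference(am_buffered)
gdf = gdf[~gdf.geometry.is_empty]
```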

iona5 commented 2 months ago

Hi, this is Jonas from AWI.

I was initially asked by Todd regarding the files, but I was not sure about the problem. Your post motivated me to look into this a bit further. I am not familiar with the details of the de-duplication, but I understand that you reproject the UTM layers (32601, 32602) to EPSG:3857 to find the spatial overlaps, which causes the vertices on the other side of the antimeridian to wrap around?

I guess you should try to project both datasets into a CRS where the coordinates do not wrap around at the antimeridian for these areas. I, for example, used EPSG:3832 to create the very screenshot you showed above, to get rid of the polygon distortion. I didn't realize that this was actually the problem, so I didn't even mention it to Todd. I am optimistic that if you project into EPSG:3832 you'll get the correct results for the UTM zones bordering the antimeridian.

Another option would be to just declare the datasets to be in a non-critical UTM zone (for example 32603 and 32604), basically translating all polygons to the east before reprojection, so the longitude coordinates do not wrap around in EPSG:3857. I guess the inaccuracy is negligible. After de-duplication, project back to 32603/04 and then translate back to 32601/02. But I guess the first option is preferable.

Sorry if I misunderstood the problem or if I am totally off track here.

julietcohen commented 2 months ago

Hi Jonas, thanks for your suggestions! You are correct that the CRS is configurable for the deduplication step, and I simply used the default projected CRS in my example script that I passed off to Todd. Todd and Ingmar wanted an example of how to execute the deduplication with adjacent UTM zones before the data is input into the visualization workflow. Given my example script and explicit parameters, your team can change the parameters as you see fit.

The transformation to the new CRS, EPSG:3857, is indeed what causes the geometries to become deformed. However, this CRS is not the only one that causes this issue, and the deduplication step is not the only time we transform the data in the workflow. I have encountered similarly deformed geometries in other datasets when converting to EPSG:4326 (not projected), which we do during the initial standardization of the data in the "staging" step. That unprojected CRS is required for the Tile Matrix Set of our viz workflow's output geopackage and geotiff tilesets. This means that you may be able to deduplicate the lake change data prior to the visualization workflow with a different CRS and retain the original lake geometries that border the antimeridian, but they will likely be problematic in the same way when we create tilesets from them, requiring buffering away from the antimeridian as a "cleaning"/preprocessing step before staging anyway.

Plots in python

Original CRS

```python
import geopandas as gpd
import matplotlib.pyplot as plt

gdf = gpd.read_file("~/check_dedup_forIN/check_for_todd/check_0726/lake_change_32601.gpkg")

fig, ax = plt.subplots(figsize=(10, 10))
gdf.plot(cmap = 'viridis', linewidth = 0.8, ax = ax)
```

[plot]

4326 causes deformed geometries

```python
gdf_4326 = gdf.to_crs(epsg = 4326, inplace = False)

fig, ax = plt.subplots(figsize=(10, 10))
gdf_4326.plot(cmap = 'viridis', linewidth = 0.8, ax = ax)
```

[plot]

3857 deforms geometries

```python
gdf_3857 = gdf.to_crs(epsg = 3857, inplace = False)

fig, ax = plt.subplots(figsize=(10, 10))
gdf_3857.plot(cmap = 'viridis', linewidth = 0.8, ax = ax)
```

[plot]

3832

```python
gdf_3832 = gdf.to_crs(epsg = 3832, inplace = False)

fig, ax = plt.subplots(figsize=(10, 10))
gdf_3832.plot(cmap = 'viridis', linewidth = 0.8, ax = ax)
```

[plot]

This last plot shows the data transformed into your suggested CRS, and as you said, there is no wrapping around the world. However, the description of that CRS makes me wonder whether the transformation would slightly deform the geometries, since it appears that the suggested use area for that CRS does not contain UTM zone 32601 (please correct me if I am wrong). That amount of deformation may well be negligible, though.

I don't follow your last suggestion fully, but it does sound like a function I know exists in R, ST_ShiftLongitude.

Resources:

Documentation for the neighbor deduplication approach can be found here. The source code is here.

iona5 commented 2 months ago

Your concerns make sense (but I would personally argue that 3857 isn't really precise either; that's another can of worms 😉).

I might find the time to look into this further, but for now i guess in the end it boils down to the origin of the reference system. From what i remember from my GIS lectures:

In most predefined reference systems with roughly global coverage the longitude/x-origin is located at the 0° meridian with an extent of -180° to +180° in the lon-axis. If you project a polygon overlapping the antimeridian into these CRS, some vertices of those polygons will be transformed to having an x-coordinate close to -180 and others close to +180 (if in degrees). That is what happens here.

However, this is just the way the coordinate extent is defined, and that can be changed: one can simply set the origin close to the area of interest. You can observe this in the CRS definitions (I think it's most convenient and insightful to look at the PROJ4 string):

```
EPSG:4326: +proj=longlat +datum=WGS84 +no_defs +type=crs
EPSG:3857: +proj=merc +a=6378137 +b=6378137 +lat_ts=0 +lon_0=0 +x_0=0 +y_0=0 +k=1 +units=m +nadgrids=@null +wktext +no_defs +type=crs
EPSG:3832: +proj=merc +lon_0=150 +k=1 +x_0=0 +y_0=0 +datum=WGS84 +units=m +no_defs +type=crs
```

Note that 4326 and 3857 use a +lon_0 value of 0, while 3832 uses a +lon_0 value of 150, meaning it sets its origin to 150° east (see https://proj.org/en/9.4/usage/projections.html). In that case the x-coordinates of the vertices of the polygons in the lake_change dataset all fall around the +30° value, which corresponds to the antimeridian.

So if we want to stick to EPSG:3857 we can use it as a base:

For example, in QGIS you can define this custom CRS (in Settings -> Custom Projections): `+proj=merc +a=6378137 +b=6378137 +lat_ts=0 +lon_0=180 +x_0=0 +y_0=0 +k=1 +units=m +nadgrids=@null +wktext +no_defs`, which is just EPSG:3857 with the origin rotated (+lon_0=180), and it displays both files just fine.

[screenshot]

So if you transform both GeoDataFrames into that custom CRS before getting the intersection, I think it might work, and you are technically still in 3857, just with a different origin:

```python
import pyproj
import geopandas as gpd

gdf = gpd.read_file("lake_change_32601.gpkg")

crs_3857_rotated = pyproj.CRS.from_proj4("+proj=merc +a=6378137 +b=6378137 "
   "+lat_ts=0 +lon_0=180 +x_0=0 +y_0=0 +k=1 "
   "+units=m +nadgrids=@null +wktext +no_defs")

gdf_3857r = gdf.to_crs(crs_3857_rotated, inplace = False)

gdf_3857r.plot(cmap = 'viridis', linewidth = 0.8)
```

[plot]

Note that for production one might consider defining the custom CRS not with the proj4 string but with the WKT string, which is presumably more precise: https://proj.org/en/9.4/faq.html#what-is-the-best-format-for-describing-coordinate-reference-systems. The Proj library provides conversion tools; the setting corresponding to lon_0 is, imho:

```
...
        PARAMETER["Longitude of natural origin",0,
            ANGLEUNIT["degree",0.0174532925199433],
            ID["EPSG",8802]],
...
```

but I didn't test that.

I looked in your code to test the deduplication with this, but it seems like the projection is done beforehand (i.e. outside of deduplicate_neighbors()) and I haven't had the time to mock something up. But I guess if you use this custom CRS for data in the UTM zones close to the antimeridian, it will work.

julietcohen commented 2 months ago

Thanks so much for your thorough explanation of your idea, that's an interesting approach! Since the plan is for this deduplication to be completed by @tcnichol prior to passing the data off to me for visualization and standardization, maybe he can read through our suggestions and choose how to move forward.

I'd like to clarify when the CRS transformations take place. Since deduplication will occur before staging for this dataset, the transformation within the duplicate flagging process is the first time. Within deduplicate_neighbors, the parameter distance_crs defines the CRS that the data will be transformed into. The transformation occurs here, and note that it is done on the geometry points to check the distance between the centroids and identify the duplicates. The next time the data is transformed is here during staging, after you pass off the data to me.

Lastly, I'd like to highlight that flagging the duplicates and removing the duplicates are different steps. When we flag the duplicates, we create a boolean column, which I set to be called staged_duplicated via the parameter prop_duplicated in deduplicate_neighbors, to indicate whether each polygon is a duplicate (True) or not (False). Your team is familiar with this, because the example script I provided outputs the file 32617-32618_dups_flagged.gpkg. So as a final step before passing the data to me, simply remove the rows that are True in each GDF, as sketched below.
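
That final step could be as simple as the following sketch (using the staged_duplicated column name from the example script; the output path is a placeholder, and the column is assumed to read back as a boolean/0-1 flag):

```python
import geopandas as gpd

gdf = gpd.read_file("32617-32618_dups_flagged.gpkg")

# Keep only rows NOT flagged as duplicates, then write the deduplicated file
gdf_deduped = gdf[~gdf["staged_duplicated"].astype(bool)]
gdf_deduped.to_file("32617-32618_deduped.gpkg", driver="GPKG")  # placeholder output path
```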

julietcohen commented 2 months ago

Identify which polygons intersect the 180th longitude line and split them

Before we determine which approach to take to deal with the polygons that cross the antimeridian, either by splitting them along a buffered 180th-degree longitude line or by creating a custom CRS with a different central meridian, I want to first answer the question: are the polygons that intersect the antimeridian lake detections, or simply seawater and river polygons that are already supposed to be filtered out before deduplication and visualization anyway?

I did this exploration in R. The plots show that all but 1 of the polygons that intersect the antimeridian are seawater or rivers, which should be removed by Todd and Ingmar's filtering prior to deduplication and visualization.

explore_32601.R

```r
# Author: Juliet Cohen
# Date: 2024-08-05
# Explore the relationship between lakes in 32601 and 180th degree longitude
# PART 1) example of how to split polygons with buffered antimeridian
# PART 2) subset and plot the polygons to just those that intersect the
#         antimeridian, helping determine if most or all of these are seawater
#         and river polys, rather than lake polys which will be the actual input
#         data to the viz-workflow

library(sf)
library(ggplot2)
library(leaflet) # interactive mapping
library(mapview)

# PART 1 -----------------------------------------------------------------------
# Split polygons at the buffered antimeridian

# UTM zone 32601, which has NOT been filtered for seawater or rivers yet
fp = "~/check_dedup_forIN/check_for_todd/check_0726/lake_change_32601.gpkg"
gdf_32601 <- st_read(fp) %>%
  st_set_crs(32601)

# transform data to EPSG:4326, WGS84
gdf_4326 <- st_transform(gdf_32601, 4326)

# define the max and min y values to limit antimeridian line to relevant area
bbox <- st_bbox(gdf_4326)
ymin = bbox["ymin"]
ymax = bbox["ymax"]

# create a line of longitude at 180 degrees (antimeridian) in EPSG:4326
AM <- st_linestring(matrix(c(180, 180, ymin, ymax), ncol = 2)) %>%
  st_sfc(crs = 4326)

# plot polygons (in 32601) with AM line (in 4326)
ggplot() +
  geom_sf(data = gdf_32601) +
  geom_sf(data = AM, color = "red", size = 1) +
  # adjust the background grid to either CRS, 4326 or 32601
  # coord_sf(crs = st_crs(32601), datum = st_crs(32601)) +
  labs(title = "Lakes in UTM 32601 with 180° longitude line") +
  theme_minimal()

# buffer the antimeridian line
buffered_AM <- st_buffer(AM, dist = 0.1)
buffered_AM_32601 <- st_transform(buffered_AM, 32601)

# split the polygons with the buffered AM
split_polys <- st_difference(gdf_32601, buffered_AM_32601)

# convert split polygons to 4326 because this is required by leaflet
split_polys_4326 <- st_transform(split_polys, 4326)
map <- leaflet(split_polys_4326) %>%
  addTiles() %>% # add default OpenStreetMap map tiles
  addPolygons()
map

# PART 2 -----------------------------------------------------------------------
# Determine which polygons actually cross the antimeridian

# Both geospatial objects MUST be within the same CRS,
# so use the unbuffered antimeridian, but convert it to CRS 32601.
# NOTE: cannot go the other way (transform the polys to 4326),
# because they wrap the other way around the world,
# since they have not yet been split
AM_32601 <- st_transform(AM, 32601)
intersect_polys <- gdf_32601[st_intersects(gdf_32601, AM_32601, sparse = FALSE), ]

# split the intersecting polygons with the buffered AM defined earlier in script,
# because the buffer is needed to move the polygon side away from the antimeridian
intersect_split_polys <- st_difference(intersect_polys, buffered_AM_32601)
intersect_split_polys_4326 <- st_transform(intersect_split_polys, 4326)
intersect_map <- leaflet(intersect_split_polys_4326) %>%
  addTiles() %>%
  addPolygons()
intersect_map
```
julietcohen commented 2 months ago

Note that the same exploration should be done for the UTM zone on the other side of the 180th degree longitude, and any zones further south that also touch the antimeridian may be included in the analysis as well.

julietcohen commented 1 month ago

Todd will first apply the seawater and river filtering to the lake change data (including the final step of removing the rows where the polygon has been flagged as False for within_land), then apply the antimeridian polygon splitting to the relevant UTM zones (for the few polygons that need it), then apply the deduplication. After that, it is my understanding that the data can be passed to me for visualization.

julietcohen commented 1 month ago

Notes from a meeting today, Aug 16th, are in the PDG meeting notes. They outline the next steps for Ingmar and Todd to process all UTM zones with filtering, deduplication, and merging, then validate the data with the geohash IDs. Regarding the lake polygons that intersect the antimeridian, Todd is not sure how many are in the dataset. Ingmar suggested that if they stick to polar projections for all data-processing steps prior to the visualization workflow, then intersection with the antimeridian will not be a problem at all. While this is true, it will still be a problem when the data is input into the viz workflow (as noted above here as well) because we use EPSG:4326. I emphasize this because if there are lake polygons that intersect the antimeridian, then whoever runs the viz workflow on this data will need to split those polygons prior to processing with one of the methods I documented above.