PermafrostDiscoveryGateway / pdg-portal

Design and mockup documents for the PDG portal
Apache License 2.0

Display Ingmar Nitze's lake change dataset #28

Open robyngit opened 2 years ago

robyngit commented 2 years ago

Sample data info

The files are not final, but the general structure will be the same.

General structure

* <UTM_zone>
  * 05_Lake_Dataset_Raster_02_final
    * [files]
robyngit commented 2 years ago

I processed the sample data from Ingmar using the same workflow we created for the IWP polygons, creating both PNG web tiles and 3D tiles. Everything ran very smoothly. The output is currently displayed on the demo portal:

[screenshots of the lake change layer on the demo portal, 2022-09-21]

Notes:

initze commented 1 year ago

New data Package

initze commented 1 year ago

Image visualization suggestions

  1. 3D tiles ('lake_change.gpkg')

This styling works nicely with a black background map (CartoDB Dark Matter, or similar)

[screenshots]

initze commented 1 year ago

Image visualization suggestions part 2

Raster

File: lake_change_grid_3000_netchange.tif

These are similar to Webb et al., 2022 (surface water index trend). Please use the raster as is; it is already designed to have aggregated statistics per pixel.

* Palette: RdBu
* Range: -2 to +2
* NoData = 0

[screenshots]
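
For anyone previewing the raster locally before it goes through the PDG workflow, here is a minimal sketch of how these display parameters could be applied with rasterio and matplotlib (the local file path and the transparent rendering of the NoData value are assumptions):

```python
# Hedged sketch: preview the net-change raster with the suggested RdBu palette,
# a -2..+2 range, and 0 treated as NoData (masked out so it renders transparent).
import numpy as np
import rasterio
import matplotlib.pyplot as plt

with rasterio.open("lake_change_grid_3000_netchange.tif") as src:  # assumed local copy
    band = src.read(1).astype(float)

band = np.ma.masked_equal(band, 0)  # NoData = 0

plt.imshow(band, cmap="RdBu", vmin=-2, vmax=2)
plt.colorbar(label="net change")
plt.show()
```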

julietcohen commented 1 year ago

New data added

Ingmar uploaded 5 zip files that contain lake change data to a Google Drive folder here.

Per our visualization meeting discussion on 4/3, the highest priority is to process the data in one of the 5 directories, taking Ingmar's color suggestions into consideration. Next, we will move on to processing the other 4 directories, and finally Ingmar's newer data, documented in issue #37.


Update: These 5 directories have been uploaded to the NCEAS datateam server: /home/pdg/data/nitze_lake_change/data_2022_11_04/lake_change_GD

julietcohen commented 1 year ago

Large quantity of staged tiles from 1 input lake change file

Initially I tried to stage all 6 lake_change.gpkg files within data_products_32635-32640, but staging was taking hours and was producing a surprisingly large number of files from just one of the input gpkg files: data_products_32635-32640/32640/lake_change.gpkg. I therefore restarted the process as a tmux script to stage only this one UTM zone.

script

```python
# Process smallest directory in Ingmar's Google Drive
# issue: https://github.com/PermafrostDiscoveryGateway/pdg-portal/issues/28
# follow color scheme suggested in issue
# data: https://drive.google.com/drive/folders/1JxSBRs2nikB_eYxtwEatZ06bhusAN5nL
# intent is to process all the data files in this drive,
# then move on to process Ingmar's higher temporal resolution data
# using venv arcade_layer with local installs for:
# viz-staging, viz-raster, & viz-3dtiles

# imports -----------------------------------------------------
# input data import
from pathlib import Path
# staging
import pdgstaging
from pdgstaging import TileStager
# rasterization
import pdgraster
from pdgraster import RasterTiler
# visual checks
import geopandas as gpd
# logging
from datetime import datetime
import logging
import logging.handlers
import os

# data ---------------------------------------------------------
base_dir = Path('/home/jcohen/lake_change_GD_workflow/lake_change_GD/data_products_32635-32640/32640')
filename = 'lake_change.gpkg'
input = [p.as_posix() for p in base_dir.glob('**/' + filename)]
print(f"Input file(s): {input}")

# logging config -----------------------------------------------
handler = logging.handlers.WatchedFileHandler(
    os.environ.get("LOGFILE", "/home/jcohen/lake_change_GD_workflow/log.log"))
formatter = logging.Formatter(logging.BASIC_FORMAT)
handler.setFormatter(formatter)
root = logging.getLogger()
root.setLevel(os.environ.get("LOGLEVEL", "INFO"))
root.addHandler(handler)

# Staging ------------------------------------------------------
print("Staging...")
stager = TileStager({
    "deduplicate_clip_to_footprint": False,
    "dir_input": "/home/jcohen/lake_change_GD_workflow/lake_change_GD/data_products_32635-32640/32640",
    "ext_input": ".gpkg",
    "dir_staged": "staged/",
    "dir_geotiff": "geotiff/",
    "dir_web_tiles": "web_tiles/",
    "filename_staging_summary": "staging_summary.csv",
    "filename_rasterization_events": "raster_events.csv",
    "filename_rasters_summary": "raster_summary.csv",
    "filename_config": "config",
    "simplify_tolerance": 0.1,
    "tms_id": "WGS1984Quad",
    "z_range": [0, 15],
    "geometricError": 57,
    "z_coord": 0,
    "statistics": [
        {
            "name": "coverage",
            "weight_by": "area",
            "property": "area_per_pixel_area",
            "aggregation_method": "sum",
            "resampling_method": "average",
            "val_range": [0, 1],
            "palette": [
                "#ff0000",  # red (check that these color codes are appropriate)
                "#0000ff"   # blue (source: http://web.simmons.edu/~grovesd/comm244/notes/week3/css-colors)
            ],
            "nodata_val": 0,
            "nodata_color": "#ffffff00"  # change?
        },
    ],
    "deduplicate_at": ["raster"],
    "deduplicate_keep_rules": [["Date", "larger"]],
    "deduplicate_method": "neighbor",
    # NOTE: duplicate key below; this second entry overrides the ["Date", "larger"] rule above
    "deduplicate_keep_rules": [["staging_filename", "larger"]],
    "deduplicate_overlap_tolerance": 0.1,
    "deduplicate_overlap_both": False,
    "deduplicate_centroid_tolerance": None
})

for file in input:
    print(f"Staging file {file}...")
    stager.stage(file)
    print(f"Completed staging file {file}.")

print("Staging complete.")
```

We love a suspenseful mystery!

julietcohen commented 1 year ago

Ray workflow on Delta server for 1 UTM zone

Staging

Raster highest

Raster lower

Web tiles

Web tile visualization on local Cesium:

[screenshots]
julietcohen commented 1 year ago

To do:

julietcohen commented 1 year ago

Progress towards adapting colors to attribute of interest

Ran the workflow through web-tiling with the statistic computed on a property of the data, rather than on coverage as is standard for the IWP data. I used the attribute ChangeRateGrowth_myr-1 because I was testing whether an attribute with all positive values would resolve the inf and -inf values shown in the raster_summary.csv pictured above. Unfortunately, many inf and -inf values were still produced, which resulted in failure to produce many lower resolution rasters from the z-11 rasters. I plotted anyway to get an idea of how the colors appear when mapped to this attribute. I used a color palette with 7 colors that range from red to yellow to blue depending on the lake growth:

[screenshot]

Config is here.

Notes:

julietcohen commented 1 year ago

Infinity and NaN values present in input data

I determined the source of those inf and -inf values: they are present in the input data. I traced them back from the raster summary to the staged tiles and finally to the 36 input lake_change.gpkg files, each from one UTM zone. There are also NaN values in each gpkg. These values might represent no data, or could be errors.

After discovering inf and NaN values within one file, I ran a script to check for them in every file, see below.

check_inf_nan_values.py

```python
# sum up inf and NaN values in lake change input data
import geopandas as gpd
import pandas as pd
import numpy as np
from pathlib import Path

# collect all lake_change.gpkg filepaths in Ingmar's data
base_dir = Path('/home/pdg/data/nitze_lake_change/data_2022-11-04/lake_change_GD/')
filename = 'lake_change.gpkg'
# To define each .gpkg file within each subdir as a string representation with forward slashes, use as_posix()
# The ** represents that any subdir string can be present between the base_dir and the filename
input = [p.as_posix() for p in base_dir.glob('**/' + filename)]
# 36 input files
print(f"Collected {len(input)} lake_change.gpkg filepaths.")

# For each filepath:
# import it as a gdf, convert to df by removing the geometry col
# (necessary because the numpy functions need a df as input),
# sum up all inf values and all NaN values in the df,
# and add the sums to the appropriate lists
inf_values = []
nan_values = []
for path in input:
    print(f"Checking file {path}.")
    gdf = gpd.read_file(path)
    # convert from gdf to df
    df = gdf.drop(['geometry'], axis = 1)
    # sum up inf values in df
    num_inf_values = np.isinf(df).values.sum()
    inf_values.append(num_inf_values)
    # sum up NaN values in df
    num_nan_values = df.isnull().sum().sum()
    nan_values.append(num_nan_values)

print(f"Checks complete.\nLength of inf value list is {len(inf_values)}.\nLength of Nan value list is {len(nan_values)}.")

# write lists to files
with open("/home/jcohen/lake_change_GD_workflow/LC_inf_values.txt", "w") as output:
    for inf_value in inf_values:
        output.write(f"{inf_value}\n")
with open("/home/jcohen/lake_change_GD_workflow/LC_nan_values.txt", "w") as output:
    for nan_value in nan_values:
        output.write(f"{nan_value}\n")
```
Output: LC_inf_values.txt (number of inf values in each file)

```
437 422 1300 357 225093 409 1075 104 1182 1369 239 321 105 48 216 60 382 406 948 266 331 421 991 805 438 570 465 230 467 391 399 478 269 431 227 365
```

Output: LC_nan_values.txt (number of NaN values in each file)

```
33 26 68 25 16350 23 61 8 40 69 7 23 5 2 2 0 12 16 22 8 23 17 13 35 26 20 19 6 13 11 17 24 7 9 7 13
```

These values are present in several columns, including but not necessarily limited to:

@initze : Would you recommend that I remove rows with inf and NaN values, or replace these values with something else? Alternatively, would you like to re-process the data to avoid calculating these values in the first place?

julietcohen commented 1 year ago

Ingmar is looking into the source of the inf and NaN values, and he will get back to us about how he wants to move forward.

initze commented 1 year ago

Hi @julietcohen . I checked a few of the files and could find some of the affected polygons. It seems that the files didn't run through a specific filter. The affected features/lakes are all very small, thus running into 0 divisions and other stuff.

Solution

Delete rows with NaN for now. I will apply a filter in the next version.

I hope that fixes your issue

Cheers Ingmar

To do:

* change the `property` used for the `statistic` to `ChangeRateNet_myr-1`

* changing the `property` will likely require us to adjust other config options as well:

  * potentially `weight_by` (depending on units of `ChangeRateNet_myr-1`), but likely keep this at `area` rather than changing to `count` when using `ChangeRateNet_myr-1`
  * `aggregation_method` to `mode` so the color of that pixel represents the most common rate change present in the pixel, rather than the average or sum of the rate changes in that pixel

    * updates:

      * `mode` is not an option. During the rasterization step for the highest z-level, the log reports: `ERROR:pdgraster.RasterTiler:'SeriesGroupBy' object has no attribute 'mode'`. See options [here](https://sparkbyexamples.com/pandas/pandas-aggregate-functions-with-examples/#:~:text=What%20are%20pandas%20aggregate%20functions,form%20a%20single%20summary%20value.).
      * `max` and `mean` are options, but while still rasterizing z-11, these 2 warnings were printed in the terminal:
/home/jcohen/anaconda3/envs/arcade_layer/lib/python3.9/site-packages/numpy/core/_methods.py:232: RuntimeWarning: invalid value encountered in subtract  x = asanyarray(arr - arrmean)

and

/home/jcohen/anaconda3/envs/arcade_layer/lib/python3.9/site-packages/numpy/core/_methods.py:48: RuntimeWarning: invalid value encountered in reduce return umr_sum(a, axis, dtype, out, keepdims, initial, where)

These statements likely result from rows of the raster_summary.csv that contain invalid cell values of inf and -inf, like so: [screenshot]

* potentially `resampling_method` to `mode` so that when we produce lower resolution rasters, the color for the pixel represents the most common rate change for that extent, rather than the average of the rate changes for all the lakes in that extent

* adjusting the config for a property of interest was described in [this issue](https://github.com/PermafrostDiscoveryGateway/viz-workflow/issues/9)

* adjust color palette: red and blue in the data above are not meaningful in the way we intend them to be; the blue likely represents 100% coverage, and the red likely represents <100% coverage because the red is used for pixels at the edge of the lake polygons. We should use a _gradient_ from red to yellow to blue rather than 2 discrete colors, which Ingmar suggested above [here](https://github.com/PermafrostDiscoveryGateway/pdg-portal/issues/28#issuecomment-1275936775). The color palette must be accepted by [colormaps](https://pratiman-91.github.io/colormaps).

* visualize 2 UTM zones to allow us to check if the neighbors deduplication is working well

* once the `palette` and `statistic` are figured out, change the z-level to 12 to see if that is better for the data resolution (~30m)
julietcohen commented 1 year ago

Thanks for looking into this, Ingmar. I'll remove all rows with NaN or inf values and move forward with processing this data with a statistic for the ChangeRateNet_myr-1 attribute.

robyngit commented 1 year ago

@tcnichol is ready to start moving this data to our datateam server. We need to discuss where he will store the data and how he will transfer it from Delta (via Globus). The transfer is estimated to be around 500 GB, including a lot of intermediary products.

mbjones commented 1 year ago

He should store it in the same ~pdg/data staging directory that we've been using for IWP. Juliet had trouble getting globus to write directly there, which is why the ~jscohen account was created. There is also a pdg account, and we should talk to Nick to clarify if that can be used to do a globus transfer directly into that location - I think it should work. Let's discuss on slack.

julietcohen commented 1 year ago

@mbjones: Todd is curious if there is an update on the Globus --> Datateam data transfer situation. If we have enabled this for users without needing to give them access to jscohen, please let us know how he can do so.

mbjones commented 1 year ago

I talked with Nick about moving the home directory of pdg to be /home/shares/pdg, which would mean its home directory is on the same filesystem that is used for our PDG data storage. That would enable globus logins with the pdg account shared by the project to transfer data directly to where it needs to be. That should eliminate the need for the jscohen account altogether. Check with Nick on the status of those changes.

julietcohen commented 1 year ago

Great, thank you

julietcohen commented 1 year ago

Update on cleaning lake change data before visualization:

Cleaning the lake change data provided in November 2022, located in: /home/pdg/data/nitze_lake_change/data_2022-11-04/lake_change_GD/...

Ingmar requested that the rows of each of the 36 lake_change.gpkg files that contain any NA or inf value be removed prior to visualization, since I noted that these values cause issues for the rasterization step in which we calculate stats on an attribute of the input data.

In order to document the rows that contain NA and/or inf values and therefore need to be removed, the following script writes a csv file for each lake_change.gpkg with the rows (and their indices) that contain NA or inf values. These csv files are saved within a directory hierarchy that matches Ingmar's, so it is clear which original lake_change.gpkg each csv documents.

clean_lake_change_data.py

```python
# Author: Juliet Cohen
# Overview:
# identify, document, and remove rows with an invalid value in any column (Na, inf, or -inf)
# in lake change input data from Ingmar Nitze,
# and save the cleaned files to a new directory
# using conda env perm_ground

import geopandas as gpd
import pandas as pd
import numpy as np
from pathlib import Path
import os

# collect all lake_change.gpkg filepaths in Ingmar's data
base_dir = Path('/home/pdg/data/nitze_lake_change/data_2022-11-04/lake_change_GD/')
filename = 'lake_change.gpkg'
# To define each .gpkg file within each subdir as a string representation with forward slashes,
# use as_posix()
# The ** represents that any subdir string can be present between the base_dir and the filename
input = [p.as_posix() for p in base_dir.glob('**/' + filename)]
print(f"Collected {len(input)} lake_change.gpkg filepaths.")

# Overview of loop:
# 1. import each filepath as a gdf
# 2. document which rows have invalid value in any column (Na, inf, or -inf)
#    as a separate csv for each input gpkg
# 3. drop any row with an invalid value
# 4. save as new lake change file
for path in input:
    print(f"Checking file {path}.")
    gdf = gpd.read_file(path)

    # first identify any rows with invalid values
    # to document which will be dropped for data visualization
    error_rows = []
    # first convert any infinite values to NA
    gdf.replace([np.inf, -np.inf], np.nan, inplace = True)
    for index, row in gdf.iterrows():
        if row.isna().any():
            error_rows.append(row)
    error_rows_df = pd.DataFrame(error_rows)

    # hard-code the start of the path to directory for the erroneous data
    filepath_start = "/home/jcohen/lake_change_GD_workflow/workflow_cleaned/error_data_documentation/"
    # next, pull the last couple parts of filepath to ID which lake_change.gpkg
    # is being processed, following Ingmar's directory hierarchy
    directory, filename = os.path.split(path)
    filepath_sections = directory.split(os.sep)
    relevant_sections = filepath_sections[-2:]
    partial_filepath = relevant_sections[0] + "/" + relevant_sections[1]
    full_filepath = filepath_start + partial_filepath + "/error_rows.csv"
    # make the subdirectories if they do not yet exist
    directory_path = os.path.dirname(full_filepath)
    if not os.path.exists(directory_path):
        os.makedirs(directory_path)
    # save the df of rows with invalid values as a csv
    # save the index because it communicates to Ingmar which rows in his original data
    # contain invalid values
    error_rows_df.to_csv(full_filepath, index = True)
    print(f"Saved rows with invalid values for lake change GDF:\n{path}\nto file:\n{full_filepath}")

    # drop the rows with NA values in any column
    gdf.dropna(axis = 0, inplace = True)

    # save cleaned lake change file to new directory
    # hard-code the start of the path to directory for the cleaned data
    filepath_start = "/home/jcohen/lake_change_GD_workflow/workflow_cleaned/cleaned_files/"
    # next, pull the last couple parts of filepath to ID which lake_change.gpkg
    # is being processed, following Ingmar's directory hierarchy
    directory, filename = os.path.split(path)
    filepath_sections = directory.split(os.sep)
    relevant_sections = filepath_sections[-2:] + ['lake_change_cleaned.gpkg']
    filepath_end = relevant_sections[0] + "/" + relevant_sections[1] + "/" + relevant_sections[2]
    full_filepath = filepath_start + filepath_end
    print(f"Saving file to {full_filepath}")
    # make the subdirectories if they do not yet exist
    directory_path = os.path.dirname(full_filepath)
    if not os.path.exists(directory_path):
        os.makedirs(directory_path)
    gdf.to_file(full_filepath, driver = "GPKG")

print(f"Cleaning complete.")
```
julietcohen commented 1 year ago

Visualizing 2 cleaned adjacent UTM zones

Ideally for the config, we would set both aggregation_method and resampling_method to mode when visualizing the attribute ChangeRateNet_myr-1 but we must use something else besides mode for aggregation_method until that functionality is built into the workflow. For testing, we can try max.
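
For context on why `mode` is unavailable: pandas provides no built-in `mode` aggregation on a `SeriesGroupBy` (hence the `'SeriesGroupBy' object has no attribute 'mode'` error logged earlier), though it can be emulated with `.agg()`. The snippet below only illustrates the pandas behavior with made-up data; it is not viz-raster code:

```python
import pandas as pd

df = pd.DataFrame({
    "pixel_id": ["a", "a", "b", "b", "b"],
    "ChangeRateNet_myr-1": [-0.5, -0.5, 0.2, 0.2, 1.0],
})

# df.groupby("pixel_id")["ChangeRateNet_myr-1"].mode()
# -> AttributeError: 'SeriesGroupBy' object has no attribute 'mode'

# How a per-pixel mode could be emulated if the workflow supported it:
per_pixel_mode = (
    df.groupby("pixel_id")["ChangeRateNet_myr-1"]
      .agg(lambda s: s.mode().iloc[0])  # take the first mode if there are ties
)
print(per_pixel_mode)
```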

This attribute ranges from approximately -10 to +5 with most values ranging from -2 to +2 (for the 2 UTM zones used for testing):

[screenshot]

Therefore we should also include a val_range of [-2, 2] in the config. This keeps the palette consistent across z-levels, so lakes or portions of lakes do not change color as you zoom in. More extreme values that fall outside that range are assigned the color for -2 or 2.

[screenshot]
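
A quick illustration of the clamping behavior described above, assuming the workflow effectively clips values to `val_range` before mapping them to the palette (my reading of the behavior, not confirmed from the code):

```python
import numpy as np

val_range = (-2, 2)
values = np.array([-10.0, -2.5, -1.0, 0.0, 1.5, 4.8])  # example ChangeRateNet_myr-1 values

# Values outside val_range receive the endpoint colors, equivalent to clipping first
clipped = np.clip(values, *val_range)
print(clipped)  # [-2.  -2.  -1.   0.   1.5  2. ]
```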

To do:

julietcohen commented 1 year ago

Adjusting config option nodata_value and removing deduplication

The last few rounds of lake change visualization for the same UTM zones tested the following two changes in the config:

  1. setting nodata_value to 999 and None (instead of 0 which is used for IWP data, because 0 is a possible value for the visualized attribute in this lake change data)
  2. not deduplicating (used same staged data with flagged duplicates using the neighbor deduplication method, but re-rasterized and web-tiled with config set to not deduplicate so flagged duplicates are not actually removed)
  3. using min instead of max for the aggregation_method
lake_change_config.py

```python
IWP_CONFIG = {
    "deduplicate_clip_to_footprint": False,
    "dir_output": OUTPUT,
    "dir_input": INPUT,
    "ext_input": ".gpkg",
    "dir_geotiff_remote": GEOTIFF_REMOTE,
    "dir_geotiff_local": GEOTIFF_LOCAL,
    "dir_web_tiles": WEBTILE_REMOTE,
    "dir_staged_remote": STAGING_REMOTE,
    "dir_staged_remote_merged": STAGING_REMOTE_MERGED,
    "dir_staged_local": STAGING_LOCAL,
    "filename_staging_summary": STAGING_REMOTE + "staging_summary.csv",
    "filename_rasterization_events": GEOTIFF_REMOTE + "raster_events.csv",
    "filename_rasters_summary": GEOTIFF_REMOTE + "raster_summary.csv",
    "version": datetime.now().strftime("%B%d,%Y"),
    "simplify_tolerance": 0.1,
    "tms_id": "WGS1984Quad",
    "z_range": [0, 11],
    "geometricError": 57,
    "z_coord": 0,
    "statistics": [
        {
            "name": "change_rate",
            "weight_by": "area",
            "property": "ChangeRateNet_myr-1",
            "aggregation_method": "min",
            "resampling_method": "mode",
            "val_range": [-2, 2],
            "palette": ["#ff0000",  # red
                        "#FF8C00",  # DarkOrange
                        "#FFA07A",  # LightSalmon
                        "#FFFF00",  # yellow
                        "#66CDAA",  # MediumAquaMarine
                        "#AFEEEE",  # PaleTurquoise
                        "#0000ff"], # blue
            "nodata_val": 999,
            "nodata_color": "#ffffff00"  # fully transparent white
        },
    ],
    #"deduplicate_at": ["raster"],
    "deduplicate_at": None,
    #"deduplicate_keep_rules": [["Date","larger"]],
    "deduplicate_method": None,
    #"deduplicate_overlap_tolerance": 0.1,
    #"deduplicate_overlap_both": False,
    #"deduplicate_centroid_tolerance": None
}
```

Results

1. setting nodata_value to 999 and None

Both of these nodata_values were problematic, resulting in a yellow layer over the tiles that spans from the prime meridian to the antimeridian:

[screenshot]

This may be related to the fact that these values 999 and None are not present in the data at all, so the nodata_mask is of length 0 when we execute to_image() in viz-raster here. Because the nodata_value is not set to 0 anymore, the 0 values in the data (yellow in our palette) are not made into the mask, so they are retained. Consider the IWP config, where we do set the nodata_value to 0. The cells that have 0% coverage or 0 polygons are set to transparent. However, this still does not explain why the yellow layer expands all the way around half of the world, and is not limited to the region of the 2 UTM zones. The same dir of staged tiles is used to produce both web tile sets that do and do not have the yellow layer, so I suppose this means that the yellow layer is only in the web-tiles, and not the staged tiles themselves. More testing is needed to narrow down what's happening here.
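
A standalone sketch of why the mask comes out empty when the configured nodata value never occurs in the data (this mirrors my understanding of the masking, not the actual to_image() implementation):

```python
import numpy as np

pixel_values = np.array([0.0, 0.0, 0.3, 0.7])  # 0 = "no lake change detected" here

# With nodata_val = 0, the zeros are masked and rendered transparent
mask_zero = (pixel_values == 0)
print(mask_zero.sum())  # 2 pixels masked

# With nodata_val = 999 (a value that never appears), nothing is masked,
# so 0-valued pixels keep the color assigned to 0 (yellow in this palette)
mask_999 = (pixel_values == 999)
print(mask_999.sum())  # 0 pixels masked
```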

2. Not removing flagged duplicates

In the web tiles, we still see the gap that I assume represents the border of the two UTM zones visualized:

[screenshot]

The next step to figure out what's going on here is to plot all the lake data for both UTM zones and see if they overlap, as sketched below. Ingmar expressed that there certainly is overlap, and that deduplication needs to be integrated either before or during the viz-workflow.
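
A minimal sketch of that overlap check with geopandas (the file paths are placeholders for the two cleaned, adjacent UTM zone files):

```python
import geopandas as gpd
import matplotlib.pyplot as plt

zone_a = gpd.read_file("lake_change_cleaned_zone_a.gpkg")  # placeholder paths
zone_b = gpd.read_file("lake_change_cleaned_zone_b.gpkg")

# Plot both zones in a common CRS to see whether lakes actually overlap at the border
fig, ax = plt.subplots(figsize=(10, 10))
zone_a.to_crs(epsg=4326).plot(ax=ax, color="blue", alpha=0.5)
zone_b.to_crs(epsg=4326).plot(ax=ax, color="red", alpha=0.5)
plt.show()
```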

3. using min instead of max for aggregation

This makes more sense considering that the most extreme values of the distribution for this attribute are negative, rather than positive. Hopefully this more correctly conveys the environmental trends shown in the data and shows more diverse values in the palette.

julietcohen commented 1 year ago

Update on deduplication

After looking into the code for the neighbors deduplication approach, as well as plotting the input data overlap, here are some observations:

julietcohen commented 1 year ago

Deduplication Successful

Changing the deduplicate_... config options corrected the deduplication:

lake_change_config.py

```python
IWP_CONFIG = {
    "deduplicate_clip_to_footprint": False,
    "dir_output": OUTPUT,
    "dir_input": INPUT,
    "ext_input": ".gpkg",
    "dir_geotiff_remote": GEOTIFF_REMOTE,
    "dir_geotiff_local": GEOTIFF_LOCAL,
    "dir_web_tiles": WEBTILE_REMOTE,
    "dir_staged_remote": STAGING_REMOTE,
    "dir_staged_remote_merged": STAGING_REMOTE_MERGED,
    "dir_staged_local": STAGING_LOCAL,
    "filename_staging_summary": STAGING_REMOTE + "staging_summary.csv",
    "filename_rasterization_events": GEOTIFF_REMOTE + "raster_events.csv",
    "filename_rasters_summary": GEOTIFF_REMOTE + "raster_summary.csv",
    "version": datetime.now().strftime("%B%d,%Y"),
    "simplify_tolerance": 0.1,
    "tms_id": "WGS1984Quad",
    "z_range": [0, 11],
    "geometricError": 57,
    "z_coord": 0,
    "statistics": [
        {
            "name": "change_rate",
            "weight_by": "area",
            "property": "ChangeRateNet_myr-1",
            "aggregation_method": "min",
            "resampling_method": "mode",
            "val_range": [-2, 2],
            "palette": ["#ff0000",  # red
                        "#FF8C00",  # DarkOrange
                        "#FFA07A",  # LightSalmon
                        "#FFFF00",  # yellow
                        "#66CDAA",  # MediumAquaMarine
                        "#AFEEEE",  # PaleTurquoise
                        "#0000ff"], # blue
            "nodata_val": 0,
            "nodata_color": "#ffffff00"  # fully transparent white
        },
    ],
    "deduplicate_at": ["raster"],
    "deduplicate_keep_rules": [["Perimeter_meter","larger"]],  # [property, operator], using property with all positive values
    "deduplicate_method": "neighbor",
    "deduplicate_overlap_tolerance": 0.5,  # default value
    "deduplicate_overlap_both": False,  # only 1 polygon must overlap beyond the deduplicate_overlap_tolerance threshold to be considered dups
    "deduplicate_centroid_tolerance": None  # only deduplicate_overlap_tolerance will be used to determine if polygons are dups
}
```
[screenshot]

Python indexing syntax for neighbor dedup method

Output in the terminal during staging brought my attention to some syntax that could potentially be improved in the neighbor method.

[screenshot]

The message might be referring to this syntax. The link to the pandas documentation that describes the better syntax is here.
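
For reference, this is the general pattern that pandas warning is about (an illustration with made-up data, not the actual viz-staging code): chained indexing assignment versus a single `.loc` assignment.

```python
import pandas as pd

df = pd.DataFrame({"duplicated": [False, False], "area": [1.0, 2.0]})

# Chained indexing: may operate on a copy and trigger SettingWithCopyWarning
df[df["area"] > 1.5]["duplicated"] = True

# Recommended: one .loc call that selects rows and the column together
df.loc[df["area"] > 1.5, "duplicated"] = True
```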

julietcohen commented 1 year ago

I re-processed the web tiles from the same correctly deduplicated geotiffs in the previous comment, but used np.nan for the nodata_val in the config (and made sure to import numpy as np at the top of the config).

lake_change_config.py

```python
IWP_CONFIG = {
    "deduplicate_clip_to_footprint": False,
    "dir_output": OUTPUT,
    "dir_input": INPUT,
    "ext_input": ".gpkg",
    "dir_geotiff_remote": GEOTIFF_REMOTE,
    "dir_geotiff_local": GEOTIFF_LOCAL,
    "dir_web_tiles": WEBTILE_REMOTE,
    "dir_staged_remote": STAGING_REMOTE,
    "dir_staged_remote_merged": STAGING_REMOTE_MERGED,
    "dir_staged_local": STAGING_LOCAL,
    "filename_staging_summary": STAGING_REMOTE + "staging_summary.csv",
    "filename_rasterization_events": GEOTIFF_REMOTE + "raster_events.csv",
    "filename_rasters_summary": GEOTIFF_REMOTE + "raster_summary.csv",
    "version": datetime.now().strftime("%B%d,%Y"),
    "simplify_tolerance": 0.1,
    "tms_id": "WGS1984Quad",
    "z_range": [0, 11],
    "geometricError": 57,
    "z_coord": 0,
    "statistics": [
        {
            "name": "change_rate",
            "weight_by": "area",
            "property": "ChangeRateNet_myr-1",
            "aggregation_method": "min",
            "resampling_method": "mode",
            "val_range": [-2, 2],
            "palette": ["#ff0000",  # red
                        "#FF8C00",  # DarkOrange
                        "#FFA07A",  # LightSalmon
                        "#FFFF00",  # yellow
                        "#66CDAA",  # MediumAquaMarine
                        "#AFEEEE",  # PaleTurquoise
                        "#0000ff"], # blue
            "nodata_val": np.nan,
            "nodata_color": "#ffffff00"  # fully transparent white
        },
    ],
    "deduplicate_at": ["raster"],
    "deduplicate_keep_rules": [["Perimeter_meter","larger"]],  # [property, operator], using property with all positive values
    "deduplicate_method": "neighbor",
    "deduplicate_overlap_tolerance": 0.5,  # default value
    "deduplicate_overlap_both": False,  # only 1 polygon must overlap beyond the deduplicate_overlap_tolerance threshold to be considered dups
    "deduplicate_centroid_tolerance": None  # only deduplicate_overlap_tolerance will be used to determine if polygons are dups
}
```

The result is still this strange yellow layer that also resulted from setting nodata_val to 999 and None:

[screenshot]

The ConfigManager.py specifies that the nodata_val should be able to be set to an integer, float, None, or np.nan.

If we do use 0 as the nodata_val here, which is the only value I have tried so far that doesn't result in that yellow layer, we would be using it the same way we do for the IWP data: in both datasets' palettes we are not differentiating between regions where no data was collected and regions that have data but where no polygons were identified or no lake size change was detected.

robyngit commented 1 year ago

> I re-processed the web tiles from the same correctly deduplicated geotiff's in the previous comment

Have you tried setting the nodata_val before rasterization? You could dig through the code to check if this is in fact how it works, but theoretically the workflow should set the pixels without any polygons to the nodata_val when first creating the highest level geotiffs.

julietcohen commented 1 year ago

Thanks for the feedback @robyngit! I'll re-start the workflow at the raster highest step rather than web-tiling, with nodata_val set to 999 or np.nan, and see if that works.

I didn't think that the nodata value would be used in the raster highest step because searching for nodata_val within viz-raster shows that the only occurrences are when web tiles are created. Within RasterTiler.py, nodata_val occurs when we execute webtile_from_geotiff(). However, this was just a quick string search on Github rather than a deep dive into how you use nodata_val throughout the package, so it's totally possible that it's used earlier in the raster highest step and I'm missing it.

julietcohen commented 1 year ago

Maybe the raster highest step gets the nodata_val from get_raster_config() defined in viz-staging, which is called within rasterize_vector() here.

robyngit commented 1 year ago

Looks like I made it so that no data pixels are always zero in the Raster class 😕, see: https://github.com/PermafrostDiscoveryGateway/viz-raster/blob/92924077ccf6083442c9c2aaf40a8f164818f2b9/pdgraster/Raster.py#L772-L773

Grid cells without any data will be filled with 0.

We would have to make it so that the 0 in the fill_value=0 line uses the nodata_val from the config, see: https://github.com/PermafrostDiscoveryGateway/viz-raster/blob/92924077ccf6083442c9c2aaf40a8f164818f2b9/pdgraster/Raster.py#L809
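
A generic sketch of the kind of change that implies (a hypothetical helper for illustration only, not the actual Raster.py code): the fill for empty grid cells would come from the configured nodata_val instead of a hard-coded 0.

```python
import numpy as np

def fill_empty_cells(grid, nodata_val=0):
    """Hypothetical helper: fill cells with no polygons using the configured
    nodata value rather than a hard-coded 0."""
    filled = np.array(grid, dtype=float)
    filled[np.isnan(filled)] = nodata_val
    return filled

# Cells with no data (NaN here) become 999 rather than 0, so 0 stays available
# as a real ChangeRateNet_myr-1 value
grid = [[np.nan, 0.0], [1.2, np.nan]]
print(fill_empty_cells(grid, nodata_val=999))
```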

julietcohen commented 1 year ago

Update on nodata_val

In line with Robyn's observation, which I did not see until after I tried this, I re-rasterized the same deduplicated staged tiles for these 2 UTM zones (so I started the workflow at the raster highest step) with the nodata_val set to 999. The resulting web tiles still have the yellow haze (see the demo portal). I will open a branch to integrate that change to enable non-0 nodata_vals.

julietcohen commented 8 months ago

Ingmar helpfully created a new issue to document his data cleaning to remove seawater and rivers: https://github.com/PermafrostDiscoveryGateway/landsattrend-pipeline/issues/8

initze commented 6 months ago

Temporary pan-arctic dataset

julietcohen commented 6 months ago

Thank you, @initze ! Well done.

This data has been uploaded to Datateam at: /var/data/submission/pdg/nitze_lake_change/filtered_parquet_2024-04-03 along with a README file. I included Ingmar's notes above, as well as my notes and links to the relevant github issues.

julietcohen commented 5 months ago

I did an initial check of the new, filtered version of the lake change dataset: /var/data/submission/pdg/nitze_lake_change/filtered_parquet_2024-04-03/Lakes_global-001.parquet

There are NA values present in the attributes ChangeRateNet_myr-1 and ChangeRateGrowth_myr-1. As I did for the previous version of the lake change data that contained NA values for certain attributes, I can clean this data by removing the entire row if it contains NA values for an attribute.
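
A minimal sketch of that cleaning step, assuming the file reads cleanly with geopandas.read_parquet (the output path is a placeholder):

```python
import geopandas as gpd

path = "/var/data/submission/pdg/nitze_lake_change/filtered_parquet_2024-04-03/Lakes_global-001.parquet"
gdf = gpd.read_parquet(path)

# Count and then drop rows with NA in either change-rate attribute
cols = ["ChangeRateNet_myr-1", "ChangeRateGrowth_myr-1"]
print(gdf[cols].isna().sum())
gdf_clean = gdf.dropna(subset=cols)

gdf_clean.to_parquet("Lakes_global-001_cleaned.parquet")  # placeholder output path
```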

I focused on this dataset today because it makes sense to use this data for 4 different but related goals for the PDG project, which are of varying degrees of priority:

julietcohen commented 5 months ago

Visualized Sample of Parquet Data (filtered for seawater and rivers)

I used a subset of 10,000 polygons of the Lake Change data in parquet format as input into the visualization workflow with 3 stats:

They are up on the demo portal.

config

```python
workflow_config = {
    "dir_input": "/home/jcohen/lake_change_parquet",
    "ext_input": ".gpkg",
    "dir_staged": "staged/",
    "dir_geotiff": "geotiff/",
    "dir_web_tiles": "web_tiles/",
    "filename_staging_summary": "staging_summary.csv",
    "filename_rasterization_events": "raster_events.csv",
    "filename_rasters_summary": "raster_summary.csv",
    "filename_config": "config",
    "simplify_tolerance": 0.1,
    "tms_id": "WGS1984Quad",
    "z_range": [0, 13],
    "geometricError": 57,
    "z_coord": 0,
    "statistics": [
        {
            "name": "change_rate",
            "weight_by": "area",
            "property": "ChangeRateNet_myr-1",
            "aggregation_method": "mean",
            "resampling_method": "mode",
            "val_range": [0, 1],
            "palette": ["#ff0000",  # red
                        "#FF8C00",  # DarkOrange
                        "#FFA07A",  # LightSalmon
                        "#FFFF00",  # yellow
                        "#66CDAA",  # MediumAquaMarine
                        "#AFEEEE",  # PaleTurquoise
                        "#0000ff"], # blue
            "nodata_val": 0,
            "nodata_color": "#ffffff00"
        },
        {
            "name": "lake_coverage",
            "weight_by": "area",
            "property": "area_per_pixel_area",
            "aggregation_method": "sum",
            "resampling_method": "average",
            "val_range": [0, 1],
            "palette": ["#20f8c6",  # light turquoise
                        "#2f20f8"], # dark blue
            "nodata_val": 0,
            "nodata_color": "#ffffff00"
        },
        {
            "name": "lake_count",
            "weight_by": "count",
            "property": "centroids_per_pixel",
            "aggregation_method": "sum",
            "resampling_method": "sum",
            "palette": ["#20f8c6",  # light turquoise
                        "#2f20f8"], # dark blue
            "nodata_val": 0,
            "nodata_color": "#ffffff00"
        }
    ],
    "deduplicate_clip_to_footprint": False,
    "deduplicate_method": None
}
```

ChangeRateNet_myr-1

[screenshots]

area_per_pixel_area (% coverage)

[screenshots]

centroids_per_pixel (number of lakes)

[screenshot]

Zooming in, the pixels that represent the center of each lake are so high res (and therefore so small) that they are hard to see.

Notes

[screenshot]
julietcohen commented 5 months ago

I created a larger sample of 20,000 polygons and reprocessed the data with a much smaller simplify_tolerance = 0.0001. This succeeded in retaining the lake shapes of the polygons.
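
For illustration, the effect of the tolerance can be previewed outside the workflow with geopandas' simplify; the tolerance is in the units of the layer's CRS, so for EPSG:4326 these values are degrees (the input path is a placeholder):

```python
import geopandas as gpd

gdf = gpd.read_file("lake_change_sample.gpkg").to_crs(epsg=4326)  # placeholder path

# A coarse tolerance flattens small lakes into blocky shapes...
coarse = gdf.geometry.simplify(tolerance=0.1)

# ...while a much smaller tolerance retains the lake outlines
fine = gdf.geometry.simplify(tolerance=0.0001)

print(coarse.iloc[0].wkt[:80])
print(fine.iloc[0].wkt[:80])
```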

[screenshot]

The larger sample size highlighted that there are other polygons that appear to be seawater in addition to the island one pictured above in red in the last comment. So maybe the filtering could be refined.

[screenshot of apparent seawater polygons]
julietcohen commented 3 months ago

Update on deduplication and filtering seawater

Since I resolved this issue, here's the path forward for processing the lake change data:

Updates on dataset metrics

Ingmar helpfully provided some big picture dataset info:

| Question | Answer |
| --- | --- |
| How many lakes are mapped? | About 6-6.2 million. |
| Total dataset size? | The uncleaned, merged dataset is 4.13 GB. |
| Total area of the lakes? | The uncleaned, merged dataset covers ~950,000 km², but it will become smaller once deduplication is applied. |
| For the derived lake drainage dataset, would we include a binary attribute for simply "this lake is present or drained", or would it be an ordered categorical attribute including "partially drained", "totally drained", and "not drained"? | "That's a tricky one. Drained must typically refer to some reference date, as lakes are often polycyclic. Ideally we would have a drainage year. We have an attribute "NetChange_perc" which actually provides that information. In a paper I am working on we have >25% to <=75% loss as partially drained, >75% as completely drained. However, change must be > 1 ha as small lakes/changes are quite uncertain." |
| For the derived lake drainage dataset, would the geometry be in the form of polygons (the perimeter of the lake) or a point? | Polygons. For other analysis or visualization we can still reduce to centroids/points or something similar. |
julietcohen commented 2 months ago

Deduplication update

After some mapping in QGIS, and without doing systematic checks to confirm, polygons that border or intersect the antimeridian appear to have been flagged as duplicates in Ingmar's lake change dataset when the neighbors deduplication approach was applied to 2 adjacent UTM zones, 32601-32602. UTM zone 32601 borders the antimeridian. The neighbors deduplication approach transforms the geometries from their original CRS into the one specified as distance_crs, so the antimeridian polygons are distorted and wrap the opposite way around the world. See this screenshot of the distorted 32601 (green) and 32602 (yellow):

[screenshot]

This results in polygons in zone 32602 overlapping spatially with polygons from 32601 that are not actually in the region of overlap. Here's a screenshot from Jonas (a collaborator with Ingmar and Todd), who mapped the polygons that were identified as duplicates in red: [screenshot]

Mapping the flagged duplicate polygons on top of the distorted zone 32601 shows suspicious overlap: [screenshot]

Next steps:

If intersection with the antimeridian is indeed the cause of the incorrectly flagged duplicates (those that lie outside of the overlap between 2 adjacent UTM zones), then the neighbor deduplication should work well for all zones that do not intersect the antimeridian. Todd is looking into applying the approach to all UTM zones, and into how to get around this for zone 32601. My recommendation is to identify which polygons cross the antimeridian, split them, and buffer them slightly away from the line, as sketched below. Code for this was applied to the permafrost and ground ice dataset here.
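
A rough Python sketch of that recommendation (the R version applied to the permafrost and ground ice dataset is linked above); the file path, the assumption that the layer already carries EPSG:32601, and the buffer distance are all placeholders:

```python
import geopandas as gpd
from shapely.geometry import LineString

gdf = gpd.read_file("lake_change_32601.gpkg")  # assumed to already be in EPSG:32601

# Build a buffered 180th-longitude line in EPSG:4326, then project it into the
# data's UTM CRS; splitting in the UTM CRS avoids the wrap-around distortion
bounds_4326 = gdf.to_crs(epsg=4326).total_bounds  # [minx, miny, maxx, maxy]
am = gpd.GeoSeries([LineString([(180, bounds_4326[1]), (180, bounds_4326[3])])], crs=4326)
am_buffered = am.buffer(0.001).to_crs(gdf.crs).iloc[0]  # buffer distance is a guess

# Remove the thin strip along the antimeridian, splitting any crossing polygons,
# then drop geometries that fell entirely inside the strip
gdf["geometry"] = gdf.geometry.difference(am_buffered)
gdf = gdf[~gdf.geometry.is_empty]
```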

iona5 commented 2 months ago

Hi, this is Jonas from AWI.

I was initially asked by Todd regarding the files, but I was not sure about the problem. Your post motivated me to look into this a bit further. I am not familiar with the details of the de-duplication, but I understand that you reproject the UTM layers (32601, 32602) to EPSG:3857 to find the spatial overlaps, which causes the vertices on the other side of the antimeridian to wrap around?

I guess you should try to project both datasets into a CRS where the coordinates do not wrap around at the antimeridian for these areas. I, for example, used EPSG:3832 to create the very screenshot you showed above, to get rid of the polygon distortion. I didn't realize that this was actually the problem, so I didn't even mention it to Todd. I am optimistic that if you project into EPSG:3832 you'll get the correct results for the UTM zones bordering the antimeridian.

Another option would be to just declare the datasets to be in a non-critical UTM zone (for example 32603 and 32604), basically translating all polygons to the east before reprojection, so the longitude coordinates do not wrap around in EPSG:3857. I guess the inaccuracy is negligible. After de-duplication, project back to 32603/04 and then translate back to 32601/02. But I guess the first option is preferable.

Sorry if I misunderstood the problem or if I am totally off track here.

julietcohen commented 2 months ago

Hi Jonas, thanks for your suggestions! You are correct that the CRS is configurable for the deduplication step, and I simply used the default projected CRS in my example script that I passed off to Todd. Todd and Ingmar wanted an example of how to execute the deduplication with adjacent UTM zones before the data is input into the visualization workflow. Given my example script and explicit parameters, your team can change the parameters as you see fit.

The transformation to the new CRS, EPSG:3857, is indeed what causes the geometries to become deformed. However, this CRS is not the only one that causes this issue, and the deduplication step is not the only time we transform the data in the workflow. I have encountered similarly deformed geometries in other datasets when converting to EPSG:4326 (not projected), which we do during the initial standardization of the data in the "staging" step. That unprojected CRS is required for the Tile Matrix Set of our viz workflow's output geopackage and geotiff tilesets. This means that you may be able to deduplicate the lake change data prior to the visualization workflow with a different CRS and retain the original lake geometries that border the antimeridian, but they will likely be problematic in the same way when we create tilesets from them, requiring buffering away from the antimeridian as a "cleaning"/preprocessing step before staging anyway.

Plots in python

Original CRS

```python
import geopandas as gpd
import matplotlib.pyplot as plt

gdf = gpd.read_file("~/check_dedup_forIN/check_for_todd/check_0726/lake_change_32601.gpkg")

fig, ax = plt.subplots(figsize=(10, 10))
gdf.plot(cmap = 'viridis', linewidth = 0.8, ax = ax)
```

[plot]

4326 causes deformed geometries

```python
gdf_4326 = gdf.to_crs(epsg = 4326, inplace = False)

fig, ax = plt.subplots(figsize=(10, 10))
gdf_4326.plot(cmap = 'viridis', linewidth = 0.8, ax = ax)
```

[plot]

3857 deforms geometries

```python
gdf_3857 = gdf.to_crs(epsg = 3857, inplace = False)

fig, ax = plt.subplots(figsize=(10, 10))
gdf_3857.plot(cmap = 'viridis', linewidth = 0.8, ax = ax)
```

[plot]

3832

```python
gdf_3832 = gdf.to_crs(epsg = 3832, inplace = False)

fig, ax = plt.subplots(figsize=(10, 10))
gdf_3832.plot(cmap = 'viridis', linewidth = 0.8, ax = ax)
```

[plot]

This last plot shows the data transformed into your suggested CRS, and as you said, there is no wrapping around the world. However, the description of that CRS makes me wonder whether the transformation would slightly deform the geometries, since it appears that the suggested use area for that CRS does not contain UTM zone 32601 (please correct me if I am wrong). That amount of deformation may well be negligible, though.

I don't follow your last suggestion fully, but it does sound like a function I know exists in R, ST_ShiftLongitude.

Resources:

Documentation for the neighbor deduplication approach can be found here. The source code is here.

iona5 commented 2 months ago

Your concerns make sense (but I would personally argue that 3857 isn't really precise either; that's another can of worms 😉).

I might find the time to look into this further, but for now i guess in the end it boils down to the origin of the reference system. From what i remember from my GIS lectures:

In most predefined reference systems with roughly global coverage the longitude/x-origin is located at the 0° meridian with an extent of -180° to +180° in the lon-axis. If you project a polygon overlapping the antimeridian into these CRS, some vertices of those polygons will be transformed to having an x-coordinate close to -180 and others close to +180 (if in degrees). That is what happens here.

However, this is just the way the coordinate extent is defined, and that can be changed: one can simply set the origin close to the area of interest. You can observe this in the CRS definitions (I think it's most convenient and insightful to look at the PROJ4 string):

```
EPSG:4326: +proj=longlat +datum=WGS84 +no_defs +type=crs
EPSG:3857: +proj=merc +a=6378137 +b=6378137 +lat_ts=0 +lon_0=0 +x_0=0 +y_0=0 +k=1 +units=m +nadgrids=@null +wktext +no_defs +type=crs
EPSG:3832: +proj=merc +lon_0=150 +k=1 +x_0=0 +y_0=0 +datum=WGS84 +units=m +no_defs +type=crs
```

Note that 4326 and 3857 use a +lon_0 value of 0, while 3832 uses a +lon_0 value of 150, meaning it sets its origin to 150° east (see https://proj.org/en/9.4/usage/projections.html). In that case the x-coordinates of the vertices of the polygons in the lake_change dataset all fall around the +30° value, which corresponds to the antimeridian.

So if we want to stick to EPSG:3857 we can use it as a base:

For example, in QGIS you can define this custom CRS (in Settings -> Custom Projections): `+proj=merc +a=6378137 +b=6378137 +lat_ts=0 +lon_0=180 +x_0=0 +y_0=0 +k=1 +units=m +nadgrids=@null +wktext +no_defs`, which is just EPSG:3857 with the origin rotated (+lon_0=180), and it displays both files just fine.

[screenshot]

So if you transform both GeoDataFrames into that custom CRS before getting the intersection, I think it might work, and you are technically still in 3857, just with a different origin:

```python
import pyproj
import geopandas as gpd

gdf = gpd.read_file("lake_change_32601.gpkg")

crs_3857_rotated = pyproj.CRS.from_proj4("+proj=merc +a=6378137 +b=6378137 "
   "+lat_ts=0 +lon_0=180 +x_0=0 +y_0=0 +k=1 "
   "+units=m +nadgrids=@null +wktext +no_defs")

gdf_3857r = gdf.to_crs(crs_3857_rotated, inplace = False)

gdf_3857r.plot(cmap = 'viridis', linewidth = 0.8)
```

[plot]

Note that for production one might consider defining the custom CRS not with the proj4 string but with the WKT string, which is presumably more precise: https://proj.org/en/9.4/faq.html#what-is-the-best-format-for-describing-coordinate-reference-systems. The Proj library provides conversion tools; the setting corresponding to lon_0 is, imho:

```
...
        PARAMETER["Longitude of natural origin",0,
            ANGLEUNIT["degree",0.0174532925199433],
            ID["EPSG",8802]],
...
```

but I didn't test that.

I looked in your code to test the deduplication with this, but it seems like the projection is done beforehand (i.e. outside of deduplicate_neighbors()) and I haven't had the time to mock something up. But I guess if you use this custom CRS for data in the UTM zones close to the antimeridian, it will work.

julietcohen commented 2 months ago

Thanks so much for your thorough explanation of your idea, that's an interesting approach! Since the plan is for this deduplication to be completed by @tcnichol prior to passing the data off to me for visualization and standardization, maybe he can read through our suggestions and choose how to move forward.

I'd like to clarify when the CRS transformations take place. Since deduplication will occur before staging for this dataset, the transformation within the duplicate flagging process is the first time. Within deduplicate_neighbors, the parameter distance_crs defines the CRS that the data will be transformed into. The transformation occurs here, and note that it is done on the geometry points to check the distance between the centroids and identify the duplicates. The next time the data is transformed is here during staging, after you pass off the data to me.

Lastly, I'd like to highlight that flagging the duplicates and removing the duplicates are different steps. When we flag the duplicates, we create a boolean column, which I set to be called staged_duplicated via the parameter prop_duplicated in deduplicate_neighbors, to indicate whether each polygon is a duplicate (True) or not (False). Your team is familiar with this, because the example script I provided outputs the file 32617-32618_dups_flagged.gpkg. So as a final step before passing the data to me, simply remove the rows that are True in each GDF, as sketched below.
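
That final step could be as simple as the following sketch (using the staged_duplicated column name from the example script; the output path is a placeholder, and the column is assumed to read back as a boolean/0-1 flag):

```python
import geopandas as gpd

gdf = gpd.read_file("32617-32618_dups_flagged.gpkg")

# Keep only rows NOT flagged as duplicates, then write the deduplicated file
gdf_deduped = gdf[~gdf["staged_duplicated"].astype(bool)]
gdf_deduped.to_file("32617-32618_deduped.gpkg", driver="GPKG")  # placeholder output path
```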

julietcohen commented 2 months ago

Identify which polygons intersect the 180th longitude line and split them

Before we determine which approach to take to deal with the polygons that cross the antimeridian, either by splitting them along a buffered 180th-degree longitude line or by creating a custom CRS with a different central meridian, I want to first answer the question: are the polygons that intersect the antimeridian lake detections, or simply seawater and river polygons that are already supposed to be filtered out before deduplication and visualization anyway?

I did this exploration in R. The plots show that all but 1 of the polygons that intersect the antimeridian are seawater or rivers, which should be removed by Todd and Ingmar's filtering prior to deduplication and visualization.

explore_32601.R

```r
# Author: Juliet Cohen
# Date: 2024-08-05
# Explore the relationship between lakes in 32601 and 180th degree longitude
# PART 1) example of how to split polygons with buffered antimeridian
# PART 2) subset and plot the polygons to just those that intersect the
#         antimeridian, helping determine if most or all of these are seawater
#         and river polys, rather than lake polys which will be the actual input
#         data to the viz-workflow

library(sf)
library(ggplot2)
library(leaflet) # interactive mapping
library(mapview)

# PART 1 -----------------------------------------------------------------------
# Split polygons at the buffered antimeridian

# UTM zone 32601, which has NOT been filtered for seawater or rivers yet
fp = "~/check_dedup_forIN/check_for_todd/check_0726/lake_change_32601.gpkg"
gdf_32601 <- st_read(fp) %>%
  st_set_crs(32601)

# transform data to EPSG:4326, WGS84
gdf_4326 <- st_transform(gdf_32601, 4326)

# define the max and min y values to limit antimeridian line to relevant area
bbox <- st_bbox(gdf_4326)
ymin = bbox["ymin"]
ymax = bbox["ymax"]

# create a line of longitude at 180 degrees (antimeridian) in EPSG:4326
AM <- st_linestring(matrix(c(180, 180, ymin, ymax), ncol = 2)) %>%
  st_sfc(crs = 4326)

# plot polygons (in 32601) with AM line (in 4326)
ggplot() +
  geom_sf(data = gdf_32601) +
  geom_sf(data = AM, color = "red", size = 1) +
  # adjust the background grid to either CRS, 4326 or 32601
  # coord_sf(crs = st_crs(32601), datum = st_crs(32601)) +
  labs(title = "Lakes in UTM 32601 with 180° longitude line") +
  theme_minimal()

# buffer the antimeridian line
buffered_AM <- st_buffer(AM, dist = 0.1)
buffered_AM_32601 <- st_transform(buffered_AM, 32601)

# split the polygons with the buffered AM
split_polys <- st_difference(gdf_32601, buffered_AM_32601)

# convert split polygons to 4326 because this is required by leaflet
split_polys_4326 <- st_transform(split_polys, 4326)
map <- leaflet(split_polys_4326) %>%
  addTiles() %>% # add default OpenStreetMap map tiles
  addPolygons()
map

# PART 2 -----------------------------------------------------------------------
# Determine which polygons actually cross the antimeridian

# Both geospatial objects MUST be within the same CRS,
# so use the unbuffered antimeridian, but convert it to CRS 32601.
# NOTE: cannot go the other way (transform the polys to 4326),
# because they wrap the other way around the world,
# since they have not yet been split
AM_32601 <- st_transform(AM, 32601)
intersect_polys <- gdf_32601[st_intersects(gdf_32601, AM_32601, sparse = FALSE), ]

# split the intersecting polygons with the buffered AM defined earlier in script,
# because the buffer is needed to move the polygon side away from the antimeridian
intersect_split_polys <- st_difference(intersect_polys, buffered_AM_32601)
intersect_split_polys_4326 <- st_transform(intersect_split_polys, 4326)
intersect_map <- leaflet(intersect_split_polys_4326) %>%
  addTiles() %>%
  addPolygons()
intersect_map
```
julietcohen commented 2 months ago

Note that the same exploration should be done for the UTM zone on the other side of the 180th degree longitude, and any zones further south that also touch the antimeridian may be included in the analysis as well.

julietcohen commented 1 month ago

Todd will first apply the seawater and river filtering to the lake change data (including the final step of removing the rows where the polygon has been flagged as False for within_land), then apply the antimeridian polygon splitting to the relevant UTM zones (for the few polygons that need it), then apply the deduplication. After that, it is my understanding that the data can be passed to me for visualization.

julietcohen commented 1 month ago

Notes from a meeting today, Aug 16th, are in the PDG meeting notes. They outline the next steps for Ingmar and Todd to process all UTM zones with filtering, deduplication, and merging, then validate the data with the geohash IDs. Regarding the lake polygons that intersect the antimeridian, Todd is not sure how many are in the dataset. Ingmar suggested that if they stick to polar projections for all data-processing steps prior to the visualization workflow, then intersection with the antimeridian will not be a problem at all. While this is true, it will still be a problem when the data is input into the viz workflow (as noted above here as well) because we use EPSG:4326. I emphasize this because if there are lake polygons that intersect the antimeridian, then whoever runs the viz workflow on this data will need to split those polygons prior to processing with one of the methods I documented above.