PermafrostDiscoveryGateway / pdg-portal

Design and mockup documents for the PDG portal
Apache License 2.0

Add aggregation datasets for lake time series data: Permafrost Database for Northern Alaska and Circumpolar Thermokarst Landscapes #37

Open initze opened 1 year ago

initze commented 1 year ago

Datasets for lake statistics aggregation

Regional Dataset

Jorgenson: Ecological Mapping and Permafrost Database for Northern Alaska (Final Ecological Mapping Update (2014))

Data Download Link (zipped shp)

Archive/Dataset Link: https://catalog.northslopescience.org/dataset/2236

For data aggregation, I would propose starting with the fields "ECOREGION" and "LITHOLOGY"; once that is set up, we could add others.

ECOREGION

LITHOLOGY

Pan-Arctic Dataset

Olefeldt: Circumpolar Thermokarst Landscapes, 2015, Circum-Arctic

Data Download Link (GPKG), with fixed geometries (I had some issues with the original file): https://1drv.ms/u/s!AobXXrP933xWh8lqz5It06Zf8AT9JA?e=yBQaDY

Archive/Dataset Link: https://apgc.awi.de/dataset/ctl-arctic

For data aggregation, I would propose using the field "TKThLP".

image
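
As a rough illustration of the kind of aggregation proposed here (the file names and the lake_area_ha column below are placeholders, not actual fields from either dataset):

```python
# Hypothetical sketch: aggregate lake statistics by a regional attribute
# such as ECOREGION (or LITHOLOGY / TKThLP) via a spatial join.
import geopandas as gpd

regions = gpd.read_file("jorgenson_ecological_mapping.shp")  # assumed file name
lakes = gpd.read_file("lakes.gpkg")                          # assumed file name

# reproject the lakes to match the regional polygons
lakes = lakes.to_crs(regions.crs)

# join on lake centroids so each lake falls in exactly one region
lake_points = lakes.copy()
lake_points["geometry"] = lake_points.geometry.centroid
joined = gpd.sjoin(lake_points, regions[["ECOREGION", "geometry"]],
                   how="inner", predicate="within")

# aggregate a lake statistic per ecoregion (column name is assumed)
summary = joined.groupby("ECOREGION")["lake_area_ha"].agg(["count", "sum", "mean"])
print(summary)
```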

initze commented 1 year ago

Google slides with some simple sketches for data visualizations.

In progress; will be updated from time to time.

https://docs.google.com/presentation/d/1egijizjw9boYxuuMBpfGwE7tX4M4KmYiJq0wbocO3Zc/edit#slide=id.g22b5ba41013_0_0

initze commented 1 year ago

Lake area time-series

Here we have 2 datasets, available at this GDrive link: https://drive.google.com/drive/folders/1RH3G-u6mKRSZ82Fok9I1CPUpK8vj8EtH?usp=share_link

Both datasets can be joined via the ID_merged attribute

1. INitze_Lakes_merged_v3_PDG_Testset.gpkg
   - GeoPackage file with polygon geometries and static lake change statistics for the period 1999-2014, from Nitze et al. 2018
   - published dataset with some cleaning; merged 4 transects
2. Lakes_IngmarPDG_annual_area.nc

julietcohen commented 1 year ago

Thank you for this dataset, metadata, and suggestions for processing, Ingmar! Very much looking forward to visualizing this on the PDG.

julietcohen commented 1 year ago

The 2 files for lake area time series have been uploaded to the NCEAS datateam server: /home/pdg/data/nitze_lake_change/time_series_2023-04-05

julietcohen commented 12 months ago

Update: After rearranging Datateam directories, the data is now on Datateam at: /var/data/submission/pdg/nitze_lake_change/time_series_2023-04-05

julietcohen commented 11 months ago

Initial exploration

GeoPackage:

NetCDF:
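
For reference, a minimal exploration sketch along these lines, using the file paths from the earlier comments (the printed attributes are whatever the files actually contain):

```python
# Quick look at the two input files
import geopandas as gpd
import xarray as xr

base = "/var/data/submission/pdg/nitze_lake_change/time_series_2023-04-05"

# GeoPackage: polygon geometries with static lake change statistics
gdf = gpd.read_file(f"{base}/INitze_Lakes_merged_v3_PDG_Testset.gpkg")
print(gdf.crs)
print(gdf.columns)
print(gdf.head())

# NetCDF: annual water area values per lake
ds = xr.open_dataset(f"{base}/Lakes_IngmarPDG_annual_area.nc")
print(ds)            # dimensions, coordinates, and data variables
print(ds.data_vars)  # e.g. permanent_water, seasonal_water
```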

julietcohen commented 11 months ago

Combined data: Lake Area Time Series

Converted the NetCDF file to a dataframe:

image

Combined it with the GeoPackage via an inner join, retaining only the ID_merged values present in both files. The merged data lives in: /var/data/submission/pdg/nitze_lake_change/time_series_2023-04-05/merged_lakes_area.gpkg

merge.py

```python
# Merge the two lake change datasets from Ingmar Nitze
# Author: Juliet Cohen
# Date: 07/26/23
# use conda environment "geospatial"

import geopandas as gpd
import xarray as xar
import pandas as pd

# Lake area time series data GeoPackage
gdf = gpd.read_file("/var/data/submission/pdg/nitze_lake_change/time_series_2023-04-05/INitze_Lakes_merged_v3_PDG_Testset.gpkg")

# Lake area time series data NetCDF
area = xar.open_dataset("/var/data/submission/pdg/nitze_lake_change/time_series_2023-04-05/Lakes_IngmarPDG_annual_area.nc")

# convert the NetCDF file into a dataframe, with columns for:
# ID_merged, year, permanent_water, seasonal_water
area_df = area.to_dataframe().reset_index()

# merge the dataframe, retaining only ID_merged values (lakes) that exist in both files
# the spatial dataframe must be the left argument to retain the geometries
merged_data = gdf.merge(right = area_df,
                        how = 'inner', # only retain lakes with data for the measurements for permanent water and seasonal water
                        on = 'ID_merged')

# Save as a gpkg for input into the viz-workflow
merged_data.to_file("/var/data/submission/pdg/nitze_lake_change/time_series_2023-04-05/merged_lakes_area.gpkg", driver = "GPKG")
```

Next step: set up a parsl run to process this file for initial testing, with no deduplication and a max z-level of 11. The gpkg is quite large, so reading it in takes a long time. When working with the entire lake change time series dataset (rather than just these few UTM zones), it would be a good idea to separate the merged gpkg files by UTM zone so they are easier to parallelize in batches during staging.
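
A rough sketch of how such a split could work, deriving the UTM zone from each lake's centroid longitude (the output file naming is hypothetical; the dataset may instead carry its own zone or transect attribute that could be used directly):

```python
# Split the merged GeoPackage into one file per UTM zone for parallel staging
import geopandas as gpd

base = "/var/data/submission/pdg/nitze_lake_change/time_series_2023-04-05"
merged = gpd.read_file(f"{base}/merged_lakes_area.gpkg")

# centroid longitude in WGS84, then the standard UTM zone number
lons = merged.geometry.centroid.to_crs(4326).x
merged["utm_zone"] = ((lons + 180) // 6 + 1).astype(int)

# write one GeoPackage per zone
for zone in sorted(merged["utm_zone"].unique()):
    subset = merged[merged["utm_zone"] == zone].drop(columns="utm_zone")
    subset.to_file(f"{base}/merged_lakes_area_utm{zone}.gpkg", driver="GPKG")
```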

julietcohen commented 7 months ago

Update on lake area time series visualization

julietcohen commented 7 months ago

First draft of lake size time series dataset

image
config

```json
{
  "deduplicate_clip_to_footprint": false,
  "deduplicate_method": "None",
  "deduplicate_at": "None",
  "dir_output": "/home/jcohen/lake_change_time_series/parsl_workflow/",
  "dir_input": "/var/data/submission/pdg/nitze_lake_change/time_series_2023-04-05/exploded/",
  "ext_input": ".gpkg",
  "dir_staged": "/home/jcohen/lake_change_time_series/parsl_workflow/staged/",
  "dir_geotiff": "/home/jcohen/lake_change_time_series/parsl_workflow/geotiff/",
  "dir_web_tiles": "/home/jcohen/lake_change_time_series/parsl_workflow/web_tiles/",
  "filename_staging_summary": "/home/jcohen/lake_change_time_series/parsl_workflow/staging_summary.csv",
  "filename_rasterization_events": "/home/jcohen/lake_change_time_series/parsl_workflow/raster_events.csv",
  "filename_rasters_summary": "/home/jcohen/lake_change_time_series/parsl_workflow/raster_summary.csv",
  "simplify_tolerance": 0.1,
  "tms_id": "WGS1984Quad",
  "z_range": [0, 11],
  "geometricError": 57,
  "z_coord": 0,
  "statistics": [
    {
      "name": "permanent_water",
      "weight_by": "area",
      "property": "permanent_water",
      "aggregation_method": "max",
      "resampling_method": "mode",
      "val_range": [0, 4000],
      "palette": ["#fae500", "#00d8db", "#2100db", "#9a00fa"],
      "nodata_val": 0,
      "nodata_color": "#ffffff00"
    }
  ]
}
```
parsl workflow

```python
# Processing the merged lake change time series data sample from Ingmar
# see issue: https://github.com/PermafrostDiscoveryGateway/pdg-portal/issues/37
# staging through web tiling with parsl
# conda environment: viz-local, with local installs
# for viz-staging and viz-raster, main branch
# and pip install for parsl==2023.11.27

from datetime import datetime
import json
import logging
import logging.handlers
import os
import pdgstaging
import pdgraster
import parsl
from parsl import python_app
from parsl.config import Config
from parsl.channels import LocalChannel
from parsl.executors import HighThroughputExecutor
from parsl.providers import LocalProvider
import subprocess
from subprocess import Popen

user = subprocess.check_output("whoami").strip().decode("ascii")

# configure logger
logger = logging.getLogger("logger")
# Remove any existing handlers from the logger
for handler in logger.handlers[:]:
    logger.removeHandler(handler)
# prevent logging statements from being printed to terminal
logger.propagate = False
# set up new handler
handler = logging.FileHandler("/tmp/log.log")
formatter = logging.Formatter(logging.BASIC_FORMAT)
handler.setFormatter(formatter)
logger.addHandler(handler)
logger.setLevel(logging.INFO)

activate_conda = 'source /home/jcohen/.bashrc; conda activate viz-local'

htex_local = Config(
    executors=[
        HighThroughputExecutor(
            label="htex_local",
            worker_debug=False,
            #cores_per_worker=1,
            max_workers=10,
            provider=LocalProvider(
                #channel=LocalChannel(),
                init_blocks=1,
                max_blocks=1,
                worker_init=activate_conda
            ),
        )
    ],
)
parsl.clear()
parsl.load(htex_local)


def run_pdg_workflow(
    workflow_config,
    batch_size = 300
):
    """
    Run the main PDG workflow for the following steps:
    1. staging
    2. raster highest
    3. raster lower
    4. web tiling

    Parameters
    ----------
    workflow_config : dict
        Configuration for the PDG staging workflow, tailored to rasterization
        and web tiling steps only.
    batch_size: int
        How many staged files, geotiffs, or web tiles should be included in a
        single creation task? (each task is run in parallel)
        Default: 300
    """
    start_time = datetime.now()

    logging.info("Staging initiated.")

    stager = pdgstaging.TileStager(workflow_config)
    #tile_manager = rasterizer.tiles
    tile_manager = stager.tiles
    config_manager = stager.config

    input_paths = stager.tiles.get_filenames_from_dir('input')
    input_batches = make_batch(input_paths, batch_size)

    # Stage all the input files (each batch in parallel)
    app_futures = []
    for i, batch in enumerate(input_batches):
        app_future = stage(batch, workflow_config)
        app_futures.append(app_future)
        logging.info(f'Started job for batch {i} of {len(input_batches)}')

    # Don't continue to next step until all files have been staged
    [a.result() for a in app_futures]

    logging.info("Staging complete.")

    # ----------------------------------------------------------------

    # Create highest geotiffs
    rasterizer = pdgraster.RasterTiler(workflow_config)

    # Process staged files in batches
    logging.info(f'Collecting staged file paths to process...')
    staged_paths = tile_manager.get_filenames_from_dir('staged')
    logging.info(f'Found {len(staged_paths)} staged files to process.')
    staged_batches = make_batch(staged_paths, batch_size)
    logging.info(f'Processing staged files in {len(staged_batches)} batches.')

    app_futures = []
    for i, batch in enumerate(staged_batches):
        app_future = create_highest_geotiffs(batch, workflow_config)
        app_futures.append(app_future)
        logging.info(f'Started job for batch {i} of {len(staged_batches)}')

    # Don't move on to next step until all geotiffs have been created
    [a.result() for a in app_futures]

    logging.info("Rasterization highest complete. Rasterizing lower z-levels.")

    # ----------------------------------------------------------------

    # Rasterize composite geotiffs
    min_z = config_manager.get_min_z()
    max_z = config_manager.get_max_z()
    parent_zs = range(max_z - 1, min_z - 1, -1)

    # Can't start lower z-level until higher z-level is complete.
    for z in parent_zs:

        # Determine which tiles we need to make for the next z-level based on the
        # path names of the geotiffs just created
        logging.info(f'Collecting highest geotiff paths to process...')
        child_paths = tile_manager.get_filenames_from_dir('geotiff', z = z + 1)
        logging.info(f'Found {len(child_paths)} highest geotiffs to process.')
        # create empty set for the following loop
        parent_tiles = set()
        for child_path in child_paths:
            parent_tile = tile_manager.get_parent_tile(child_path)
            parent_tiles.add(parent_tile)
        # convert the set into a list
        parent_tiles = list(parent_tiles)

        # Break all parent tiles at level z into batches
        parent_tile_batches = make_batch(parent_tiles, batch_size)
        logging.info(f'Processing highest geotiffs in {len(parent_tile_batches)} batches.')

        # Make the next level of parent tiles
        app_futures = []
        for parent_tile_batch in parent_tile_batches:
            app_future = create_composite_geotiffs(
                parent_tile_batch, workflow_config)
            app_futures.append(app_future)

        # Don't start the next z-level, and don't move to web tiling, until the
        # current z-level is complete
        [a.result() for a in app_futures]

    logging.info("Composite rasterization complete. Creating web tiles.")

    # ----------------------------------------------------------------

    # Process web tiles in batches
    logging.info(f'Collecting file paths of geotiffs to process...')
    geotiff_paths = tile_manager.get_filenames_from_dir('geotiff')
    logging.info(f'Found {len(geotiff_paths)} geotiffs to process.')
    geotiff_batches = make_batch(geotiff_paths, batch_size)
    logging.info(f'Processing geotiffs in {len(geotiff_batches)} batches.')

    app_futures = []
    for i, batch in enumerate(geotiff_batches):
        app_future = create_web_tiles(batch, workflow_config)
        app_futures.append(app_future)
        logging.info(f'Started job for batch {i} of {len(geotiff_batches)}')

    # Don't record end time until all web tiles have been created
    [a.result() for a in app_futures]

    end_time = datetime.now()
    logging.info(f'⏰ Total time to create all z-level geotiffs and web tiles: '
                 f'{end_time - start_time}')

# ----------------------------------------------------------------

# Define the parsl functions used in the workflow:

@python_app
def stage(paths, config):
    """
    Stage a file
    """
    from datetime import datetime
    import json
    import logging
    import logging.handlers
    import os
    import pdgstaging

    # configure logger:
    logger = logging.getLogger("logger")
    # Remove any existing handlers from the logger
    for handler in logger.handlers[:]:
        logger.removeHandler(handler)
    # prevent logging statements from being printed to terminal
    logger.propagate = False
    # set up new handler
    handler = logging.FileHandler("/tmp/log.log")
    formatter = logging.Formatter(logging.BASIC_FORMAT)
    handler.setFormatter(formatter)
    logger.addHandler(handler)
    logger.setLevel(logging.INFO)

    stager = pdgstaging.TileStager(config)
    for path in paths:
        stager.stage(path)
    return True

# Create highest z-level geotiffs from staged files
@python_app
def create_highest_geotiffs(staged_paths, config):
    """
    Create a batch of geotiffs from staged files
    """
    from datetime import datetime
    import json
    import logging
    import logging.handlers
    import os
    import pdgraster

    # configure logger:
    logger = logging.getLogger("logger")
    # Remove any existing handlers from the logger
    for handler in logger.handlers[:]:
        logger.removeHandler(handler)
    # prevent logging statements from being printed to terminal
    logger.propagate = False
    # set up new handler
    handler = logging.FileHandler("/tmp/log.log")
    formatter = logging.Formatter(logging.BASIC_FORMAT)
    handler.setFormatter(formatter)
    logger.addHandler(handler)
    logger.setLevel(logging.INFO)

    # rasterize the vectors, highest z-level only
    rasterizer = pdgraster.RasterTiler(config)
    return rasterizer.rasterize_vectors(
        staged_paths, make_parents = False)
    # no need to update ranges because only using coverage statistic, which ranges 0-1

# ----------------------------------------------------------------

# Create composite geotiffs from highest z-level geotiffs
@python_app
def create_composite_geotiffs(tiles, config):
    """
    Create a batch of composite geotiffs from highest geotiffs
    """
    from datetime import datetime
    import json
    import logging
    import logging.handlers
    import os
    import pdgraster

    # configure logger:
    logger = logging.getLogger("logger")
    # Remove any existing handlers from the logger
    for handler in logger.handlers[:]:
        logger.removeHandler(handler)
    # prevent logging statements from being printed to terminal
    logger.propagate = False
    # set up new handler
    handler = logging.FileHandler("/tmp/log.log")
    formatter = logging.Formatter(logging.BASIC_FORMAT)
    handler.setFormatter(formatter)
    logger.addHandler(handler)
    logger.setLevel(logging.INFO)

    rasterizer = pdgraster.RasterTiler(config)
    return rasterizer.parent_geotiffs_from_children(
        tiles, recursive = False)

# ----------------------------------------------------------------

# Create a batch of webtiles from geotiffs
@python_app
def create_web_tiles(geotiff_paths, config):
    """
    Create a batch of webtiles from geotiffs
    """
    from datetime import datetime
    import json
    import logging
    import logging.handlers
    import os
    import pdgraster

    # configure logger:
    logger = logging.getLogger("logger")
    # Remove any existing handlers from the logger
    for handler in logger.handlers[:]:
        logger.removeHandler(handler)
    # prevent logging statements from being printed to terminal
    logger.propagate = False
    # set up new handler
    handler = logging.FileHandler("/tmp/log.log")
    formatter = logging.Formatter(logging.BASIC_FORMAT)
    handler.setFormatter(formatter)
    logger.addHandler(handler)
    logger.setLevel(logging.INFO)

    rasterizer = pdgraster.RasterTiler(config)
    return rasterizer.webtiles_from_geotiffs(
        geotiff_paths, update_ranges = False)
    # no need to update ranges because the range is [1,4] for
    # all z-levels, and is defined in the config

def make_batch(items, batch_size):
    """
    Create batches of a given size from a list of items.
    """
    return [items[i:i + batch_size] for i in range(0, len(items), batch_size)]

# ----------------------------------------------------------------

# run the workflow
config_file = '/home/jcohen/lake_change_time_series/parsl_workflow/config.json'
logger.info(f'🗂 Workflow configuration loaded from {config_file}')
print("Loaded config. Running workflow.")
logger.info(f'Starting PDG workflow: staging, rasterization, and web tiling')
run_pdg_workflow(config_file)

# Shutdown and clear the parsl executor
htex_local.executors[0].shutdown()
parsl.clear()

# transfer log from /tmp to user dir
cmd = ['mv', '/tmp/log.log', f'/home/{user}/lake_change_time_series/parsl_workflow/']
# initiate the process to run that command
process = Popen(cmd)

print("Script complete.")
```
julietcohen commented 7 months ago

Producing annual layers for static lake size: permanent_water, 2017-2021

Anna requested that, in anticipation of the Google fellows starting on the team in January, our top priority should be to create annual layers for lake size, for at least permanent_water and potentially also seasonal_water. Ingmar suggested doing this for the past 5 years of data, because those years have the best quality observations and the fewest NA values for these attributes. I merged the data just as before, with an added step to remove NA values for these attributes. I then subset the merged data (all transects) into 5 annual files.

At our meeting, Ingmar helpfully suggested using a mathematical way to distribute the color palette so that all colors are represented, even with skewed data. Previously, we plotted the input data attribute values in a binned histogram and set the range in the config to the min and max values that encompass the vast majority of the data; outliers are assigned the same color as the min or max value (depending on which side of the range they fall on), which makes the data much easier to interpret on a map. To introduce a more mathematical approach, I calculated the 95th percentile for each of the annual files. The values are similar for each attribute across the 5 years.

annual percentiles

```
2017 permanent water 95th percentile: 48.29246735806339
2017 seasonal water 95th percentile: 2.6606105447148294
2018 permanent water 95th percentile: 48.476413012662455
2018 seasonal water 95th percentile: 2.471091755819858
2019 permanent water 95th percentile: 48.25663980988141
2019 seasonal water 95th percentile: 2.6432749690066695
2020 permanent water 95th percentile: 47.909318479470706
2020 seasonal water 95th percentile: 3.0058684271837874
2021 permanent water 95th percentile: 47.84887662863641
2021 seasonal water 95th percentile: 2.8591087319313626
```
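
For reference, a minimal sketch of how these per-year percentiles could be computed, assuming one GeoPackage per year with permanent_water and seasonal_water columns (the file naming here is hypothetical):

```python
# Compute the 95th percentile of water area per attribute and year
import geopandas as gpd

base = "/var/data/submission/pdg/nitze_lake_change/time_series_2023-04-05"

for year in range(2017, 2022):
    gdf = gpd.read_file(f"{base}/lakes_{year}.gpkg")  # assumed file name
    for attr in ["permanent_water", "seasonal_water"]:
        p95 = gdf[attr].quantile(0.95)
        print(f"{year} {attr} 95th percentile: {p95}")
```
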
julietcohen commented 7 months ago

First version of the 2017 lake time series data

There are 2 new layers: "permanent water" and "seasonal water". Units are hectares.

image
initze commented 7 months ago

@julietcohen Could you add the link to the dev version? I just lost it

> For Ingmar: I did not use deduplication because of the lack of overlap I observed when plotting the data. Please correct me if I misinterpreted this and I actually should use deduplication.

In your dataset, the four regions are separate from each other. Technically, they were merged and deduplicated in a previous step, so there should be no need for that here.

> The same blue palette was used for both statistics, but the numerical ranges of the data are different, since the permanent water values are generally larger than the seasonal water values.

Maybe we can increase the max value (upper limit) of the visualization. For my taste, the differences are not very pronounced.

A cool idea:

We could calculate the diff of each year versus the previous one (or some kind of aggregate over several years), e.g. 2019-2018. Then we should be able to visualize major changes such as lake drainage 😄

julietcohen commented 7 months ago

@initze Thank you for your feedback. Here is the link to the PDG demo portal: https://demo.arcticdata.io/portals/permafrost

I'll increase the upper limit of the range in my config file and see if that helps create more color diversity among polygons of different sizes. For example, I could increase the quantile that I am using to set the max value: I was using the 95th quantile, which resulted in a max range value for permanent water of ~49, while the 99.99th quantile would give a max value of ~6,088.

Calculating the diff of each year would be very interesting! I do not think it would be difficult either, since I have already produced the annual files. I can ensure that each file has the same geometries, and we can take the difference for the variables of interest for each geometry. I can work on that after I visualize the 5 years independently.
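
A rough sketch of that difference calculation for one pair of years, assuming annual GeoPackages that share ID_merged values (the file names here are hypothetical):

```python
# Year-over-year change in permanent water area per lake
import geopandas as gpd

base = "/var/data/submission/pdg/nitze_lake_change/time_series_2023-04-05"
prev = gpd.read_file(f"{base}/lakes_2018.gpkg")
curr = gpd.read_file(f"{base}/lakes_2019.gpkg")

# align the two years on the shared lake ID, keeping the current geometries
diff = curr.merge(
    prev[["ID_merged", "permanent_water"]],
    on="ID_merged", suffixes=("_2019", "_2018"))

# positive values = lake growth, negative = shrinkage (e.g. drainage)
diff["permanent_water_change"] = (
    diff["permanent_water_2019"] - diff["permanent_water_2018"])

diff.to_file(f"{base}/lakes_diff_2019_2018.gpkg", driver="GPKG")
```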

Also, I think a purple palette would be nice for seasonal_water, to contrast with the blue we can keep for permanent_water, considering that the IWP layer is yellow and the Arctic communities layer is green. Using red for a layer that represents a static attribute such as permanent or seasonal water would seem to imply a decrease, so we should reserve red for the end of the spectrum of a variable that represents loss of area.

julietcohen commented 7 months ago

Another note: I have increased the maximum z-level for this data to z-12 (just 1 level higher) after consulting the metadata for the TMS, because we want the resolution of this data to be ~30 m (correct me if that's wrong).
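
For context, a back-of-the-envelope check of ground resolution per z-level, assuming the WGS1984Quad layout of 2x1 tiles at z0 with 256-pixel tiles and ~111.32 km per degree at the equator:

```python
# Approximate ground resolution per z-level for a WGS84 quad tiling
for z in (10, 11, 12):
    deg_per_pixel = 180 / (256 * 2 ** z)
    m_per_pixel = deg_per_pixel * 111_320
    print(f"z={z}: ~{m_per_pixel:.1f} m per pixel at the equator")
# z=11 is ~38 m and z=12 is ~19 m, so z=12 is the first level finer than ~30 m
```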

julietcohen commented 6 months ago

2017 Lake Data Update

I created web tiles for 2017 with different palettes for permanent water and seasonal water, and used the 99.99th percentile for the highest value in the palette range to show more diversity of colors in the lower values. These are now on the demo portal.

image image

Polygon color diversity

Regarding the polygon colors: changing the range in the workflow config to use the 99.99th percentile for the max value did succeed in showing more diversity for permanent water. In my opinion, the pink & purple palette for seasonal water does not show enough diversity in the smaller lakes, and I should re-process the web tiles with a different percentile for the max value of that attribute. Maybe @initze has feedback on this. We should also keep in mind that the more tailoring we have to do to find the config range that shows the most color diversity, the further this workflow gets from being automated. Ideally, we would mathematically determine the best range of values for the config to optimize the color diversity of the polygons, without having to guess and check each time. I would appreciate input from anyone who has a good way to achieve this.
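
One possible approach, sketched below: compute a chosen percentile from the input data and write it into the config's val_range before running the workflow. The input path, file name, and percentile choices here are assumptions; the statistics layout mirrors the config above.

```python
# Set the config val_range max from a data-driven percentile
import json
import geopandas as gpd

gdf = gpd.read_file(
    "/var/data/submission/pdg/nitze_lake_change/time_series_2023-04-05/exploded/lakes_2017.gpkg")  # assumed file name

percentiles = {"permanent_water": 0.9999, "seasonal_water": 0.95}

config_path = "/home/jcohen/lake_change_time_series/parsl_workflow/config.json"
with open(config_path) as f:
    config = json.load(f)

# update the max of val_range for each statistic from its percentile
for stat in config["statistics"]:
    attr = stat["property"]
    if attr in percentiles:
        stat["val_range"][1] = float(gdf[attr].quantile(percentiles[attr]))

with open(config_path, "w") as f:
    json.dump(config, f, indent=2)
```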

Legends

The legends for these 2 layers are accurate: the max value shown for each layer is indeed the max value for that attribute, not the 99.99th percentile, and both layers have a min value of 0, which is the min value shown in the legend. However, the value range for permanent water is so large that we encounter the same issue we ran into with the Arctic communities layer: when you hover over the legend, it shows scientific notation, which is accurate but not ideal. This was acceptable for the communities layer, so I assume it is acceptable for this layer, too, for now.

Processing other lake data years

I have moved forward with processing 2018-2021 and have already completed 2018 (update: the 2018 geotiff data was corrupted during a directory transfer that was cancelled midway when VS Code lost connection; the geotiffs need to be re-created). However, there is no point in processing the web tiles for these years until we determine the best way to set the range for each attribute, based on our guess-and-check for 2017. I will continue to produce staged tiles and geotiffs for all years, but it would be very time consuming to guess and check for the optimal range of values to best represent the polygons.

julietcohen commented 6 months ago

2017 Lake Data Update

We agreed that the 99.99th percentile for the blue palette for permanent water looks good. Increasing the percentile used for the max value of the range in the config reduced the amount of the darkest blue and increased the amount of lighter shades in the polygons.

For the pink & purple palette for seasonal water, we had the opposite problem: too much of the lighter shades and not enough of the darker shades, so the approach is to reduce the percentile. I tried the 92nd percentile, but that over-corrected: it resulted in too many polygons with the darkest purple color. See the same region as in the last comment for comparison:

image

I then created web tiles with the 95th percentile as the highest value in the palette range. There is still too much purple:

image

But importantly, if we zoom in further on that same area pictured above, or on a different area such as the North Slope of AK, we can see that the 95th percentile works pretty well:

image

Comparing percentiles with the seasonal water data distribution

We can visualize 5 different percentiles with the data distribution here:

image
julietcohen commented 6 months ago

I have simplified and added documentation to the script used to clean, merge, and parse Ingmar's 2 input files into the 5 most recent years (merge_explode_annual.py). I ran it to re-produce the annual input files for the visualization workflow. This script has been uploaded to the new ADC package I created for this data. Now that the input files for the viz workflow have been re-produced, I will start again with 2017, running staging through web tiling in one sweep, with the max value of the range for permanent water set to the 99.99th percentile and the max value for seasonal water set to the 95th percentile. I will do the same for the following 4 years. Keeping these percentiles the same for all years would be ideal.
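
Not the actual merge_explode_annual.py, but a sketch of the same idea under assumed file names: after merging the GeoPackage and NetCDF on ID_merged (as in merge.py above), drop lakes with missing water values and write one input file per year for the viz workflow.

```python
# Clean the merged data and split it into annual input files
import geopandas as gpd

base = "/var/data/submission/pdg/nitze_lake_change/time_series_2023-04-05"
merged = gpd.read_file(f"{base}/merged_lakes_area.gpkg")

# remove lakes with missing water measurements
merged = merged.dropna(subset=["permanent_water", "seasonal_water"])

# one GeoPackage per year, 2017-2021, as input to staging
for year in range(2017, 2022):
    annual = merged[merged["year"] == year]
    annual.to_file(f"{base}/exploded/lakes_{year}.gpkg", driver="GPKG")  # assumed naming
```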

julietcohen commented 6 months ago

All years 2017-2021 have been processed into staged tiles, geotiffs, and web tiles for permanent water and seasonal water. All processing was done on Delta with the ray workflow. The 2017-2018 files have already been transferred to Datateam with the pre-issued DOI A28G8FK10, and they are visualized on the demo portal. Years 2019-2021 have not yet fully transferred: Globus has been experiencing NCSA endpoint authentication issues for the past week, making large transfers take more than 3 days, and sometimes the transfers fail altogether. They will be up on the demo portal when those make it through!