PermafrostDiscoveryGateway / pdg-portal

Design and mockup documents for the PDG portal
Apache License 2.0

Display the infrastructure layer #27

Closed: robyngit closed this issue 2 months ago

robyngit commented 1 year ago

The data are associated with the following two papers:

The data are archived with Restricted Access on Zenodo

robyngit commented 1 year ago

I did a test run of this layer. Here are some notes on processing it:

I used the following config options, but we will probably want to create tiles at a higher resolution than z-level 10 (perhaps 13):

{
  "z_range": [0, 10],
  "statistics": [
    {
      "name": "coverage",
      "weight_by": "area",
      "property": "area_per_pixel_area",
      "aggregation_method": "sum",
      "resampling_method": "average",
      "val_range": [
        0,
        1
      ],
      "nodata_val": 0,
      "palette": "oryel",
      "nodata_color": "#ffffff00"
    }
  ]
}

Preview in Cesium:

[screenshot: Cesium preview, 2022-09-28]
robyngit commented 1 year ago

Based on feedback from Annett (below), we should do the following with the next run:

We should also remember to re-create the layer when updated data is available next year.

Details:

If displaying as a raster, I would suggest not using a resampling method that averages, since this is discrete information (yes/no classes). Otherwise you introduce new information about size (bright if the polygon is narrow, dark if it is larger), which could be misunderstood as thematic content. I would suggest 'nearest neighbour' instead of 'bilinear'. In that case you could also differentiate types (road, building, other).

Note that we are just about to complete an updated version. It covers a slightly larger area and has additional classes (three road types, plus airstrips and reservoirs as additional objects). But I assume that quality control will not be finished before the end of the year.
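As context for this advice, here is a minimal standalone sketch (not part of the PDG workflow; the file path and decimation factor are hypothetical) of why 'nearest' preserves class codes while an averaging method invents intermediate values. In the viz workflow this choice is made through the `resampling_method` config option.

```python
import rasterio
from rasterio.enums import Resampling

# hypothetical categorical raster of infrastructure class codes
with rasterio.open("infrastructure_classes.tif") as src:
    half = (src.height // 2, src.width // 2)

    # 'nearest' keeps the original discrete codes (e.g. 11, 12, 13, ...)
    classes = src.read(1, out_shape=half, resampling=Resampling.nearest)

    # 'average' blends neighbouring codes into values that are not valid
    # classes, which can read as false thematic content when mapped
    blended = src.read(1, out_shape=half, resampling=Resampling.average)
```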

julietcohen commented 6 months ago

Annett has produced a new version of this dataset, now located on Datateam at: /var/data/submission/pdg/bartsch_infrastructure/SACHI_v2/

The old version of the data is now in /var/data/submission/pdg/bartsch_infrastructure/old_version/

Initial notes:

julietcohen commented 6 months ago

Notes:

julietcohen commented 6 months ago

Data after converting to EPSG 4326:

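For reference, a minimal sketch of that conversion, assuming the local copy of the shapefile used in the cleaning script later in this thread:

```python
import geopandas as gpd

data = gpd.read_file("/home/jcohen/infrastructure/data/SACHI_v2.shp")
print(data.crs)                     # CRS the data ships with
data_4326 = data.to_crs(epsg=4326)  # reproject for the viz workflow / Cesium preview
```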
julietcohen commented 6 months ago

I was able to process a first draft of outputs with the visualization workflow. Importantly, in order to produce the web tiles I needed the same modification to to_image() in this pull request that I originally used to successfully produce web tiles for the permafrost and ground ice layer (see this comment in that issue). Both datasets gave the same error and failed to write any web tiles until I added a line in to_image() that converts the image data to float, which gives me more confidence that this PR should be merged. A similarity between these two datasets is that both visualize categorical variables: the permafrost dataset visualizes 4 categories of permafrost coverage, and this infrastructure dataset visualizes 7 types of infrastructure.
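For illustration only (the actual change is the one-line fix in the viz-raster pull request linked above), the kind of cast involved looks like this, with `pixel_values` standing in for the array handled inside `to_image()`:

```python
import numpy as np

def as_float_pixels(pixel_values):
    # casting the integer (categorical) pixel values to float is what allowed
    # web tiles to be written for both categorical layers
    return np.asarray(pixel_values).astype(float)
```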

julietcohen commented 6 months ago

On Zenodo, I was granted access to the older version of the dataset, which includes a README with attribute descriptions. I was hoping it would explain the infrastructure codes we see in this newer version of the dataset. Unfortunately, the attribute DN is not described in the older README.

First draft of the data layer at low resolution (max z-level set to 10):


To do:

 
script:

```python
# process the infrastructure data
# start at lower z-level initially,
# then find the real res of the data and
# apply the appropriate max z-level

# filepaths
from pathlib import Path
import os

# visual checks & vector data wrangling
import geopandas as gpd

# staging
import pdgstaging
from pdgstaging import TileStager

# rasterization & web-tiling
import pdgraster
from pdgraster import RasterTiler

# logging
from datetime import datetime
import logging
import logging.handlers
from pdgstaging import logging_config

# for transferring the log to workdir
import subprocess
from subprocess import Popen

# --------------------------------------------------------------

data_dir = '/home/jcohen/infrastructure/data/'

# define workflow configuration
config = {
    "dir_input": data_dir,
    "ext_input": ".shp",
    "ext_footprints": ".shp",
    "dir_staged": "staged/",
    "dir_geotiff": "geotiff/",
    "dir_web_tiles": "web_tiles/",
    "filename_staging_summary": "staging_summary.csv",
    "filename_rasterization_events": "raster_events.csv",
    "filename_rasters_summary": "raster_summary.csv",
    "filename_config": "config",
    "simplify_tolerance": 0.1,
    "tms_id": "WGS1984Quad",
    "z_range": [0, 10],
    "geometricError": 57,
    "z_coord": 0,
    "statistics": [
        {
            "name": "infrastructure_code",
            "weight_by": "area",
            "property": "DN",
            "aggregation_method": "max",  # TODO: need to think about this one more, maybe should be "sum" or something else
            "resampling_method": "nearest",
            "palette": [
                "#f48525",  # orange
                "#f4e625",  # yellow
                "#47f425",  # green
                "#25f4e2",  # turquoise
                "#2525f4",  # blue
                "#f425c3",  # pink
                "#f42525"   # red
            ],
            "nodata_val": 0,
            "nodata_color": "#ffffff00"
        },
    ],
    "deduplicate_at": None,
    "deduplicate_keep_rules": None,
    "deduplicate_method": None
}

# --------------------------------------------------------------

print("Staging initiated.")

# stage the tiles
stager = TileStager(config)
stager.stage_all()

# --------------------------------------------------------------

print("Rasterizing and Web-tiling initiated.")

# rasterize all staged tiles, resample to lower resolutions,
# and produce web tiles
RasterTiler(config).rasterize_all()

# transfer log from /tmp to user dir
# add subdirectories as needed
user = subprocess.check_output("whoami").strip().decode("ascii")
cmd = ['mv', '/tmp/log.log', f'/home/{user}/infrastructure/']
# initiate the process to run that command
process = Popen(cmd)

print("Script complete.")
```
julietcohen commented 6 months ago
clean_infrastructure.py:

```python
# Arctic infrastructure data from Annett B.
# clean by removing rows with NA geometries
# and split data into smaller files for processing
# in parallel

import geopandas as gpd
import numpy as np

data_path = "/home/jcohen/infrastructure/data/SACHI_v2.shp"
data = gpd.read_file(data_path)

# remove NA values from the geometry column
data_clean = data[data['geometry'].notna()]

# split the cleaned data into 6 subfiles
# each file will have an equal number of rows
split_gdfs = np.array_split(data_clean, 6)

for i, split_gdf in enumerate(split_gdfs):
    split_gdf.reset_index(inplace = True)
    # create the filename with a different number for each subfile
    filename = f'/home/jcohen/infrastructure/data_cleaned_split/SACHI_v2_clean_{i + 1}.gpkg'
    split_gdf.to_file(filename, driver = "GPKG", index = True)
    print(f'Saved {filename}')

print("Script complete.")
```
parsl script:

```python
# Processing the infrastructure data from Annett
# see issue: https://github.com/PermafrostDiscoveryGateway/pdg-portal/issues/27
# staging through web tiling with parsl
# conda environment: viz-local, with local installs
# for viz-staging and viz-raster, but with modified viz-raster
# because need to_image() fix (see PR),
# and pip install for parsl==2023.11.27

import pdgstaging
import pdgraster
from datetime import datetime
import json
import logging
import logging.handlers
from pdgstaging import logging_config
import os
import parsl
from parsl import python_app
from parsl.config import Config
from parsl.channels import LocalChannel
from parsl.executors import HighThroughputExecutor
from parsl.providers import LocalProvider
import shutil
import subprocess
from subprocess import Popen

user = subprocess.check_output("whoami").strip().decode("ascii")

# start with a fresh directory!
print("Removing old directories and files...")
base_dir = "/home/jcohen/infrastructure/parsl_workflow/"
old_filepaths = [f"{base_dir}staging_summary.csv",
                 f"{base_dir}raster_summary.csv",
                 f"{base_dir}raster_events.csv",
                 f"{base_dir}config__updated.json",
                 f"{base_dir}log.log"]
for old_file in old_filepaths:
    if os.path.exists(old_file):
        os.remove(old_file)

# remove dirs from past run
old_dirs = [f"{base_dir}staged",
            f"{base_dir}geotiff",
            f"{base_dir}web_tiles",
            f"{base_dir}runinfo"]
for old_dir in old_dirs:
    if os.path.exists(old_dir) and os.path.isdir(old_dir):
        shutil.rmtree(old_dir)

activate_conda = 'source /home/jcohen/.bashrc; conda activate viz-local'
htex_local = Config(
    executors=[
        HighThroughputExecutor(
            label="htex_local",
            worker_debug=False,
            #cores_per_worker=1,
            max_workers=6,
            provider=LocalProvider(
                #channel=LocalChannel(),
                init_blocks=1,
                max_blocks=1,
                worker_init=activate_conda
            ),
        )
    ],
)
parsl.clear()
parsl.load(htex_local)


def run_pdg_workflow(
    workflow_config,
    batch_size = 300
):
    """
    Run the main PDG workflow for the following steps:
    1. staging
    2. raster highest
    3. raster lower
    4. web tiling

    Parameters
    ----------
    workflow_config : dict
        Configuration for the PDG staging workflow, tailored to rasterization
        and web tiling steps only.
    batch_size: int
        How many staged files, geotiffs, or web tiles should be included in a
        single creation task? (each task is run in parallel) Default: 300
    """
    start_time = datetime.now()

    logging.info("Staging initiated.")

    stager = pdgstaging.TileStager(workflow_config)
    tile_manager = stager.tiles
    config_manager = stager.config

    input_paths = stager.tiles.get_filenames_from_dir('input')
    input_batches = make_batch(input_paths, batch_size)

    # Stage all the input files (each batch in parallel)
    app_futures = []
    for i, batch in enumerate(input_batches):
        app_future = stage(batch, workflow_config)
        app_futures.append(app_future)
        logging.info(f'Started job for batch {i} of {len(input_batches)}')

    # Don't continue to next step until all files have been staged
    [a.result() for a in app_futures]

    logging.info("Staging complete.")

    # ----------------------------------------------------------------
    # Create highest geotiffs
    rasterizer = pdgraster.RasterTiler(workflow_config)

    # Process staged files in batches
    logging.info(f'Collecting staged file paths to process...')
    staged_paths = tile_manager.get_filenames_from_dir('staged')
    logging.info(f'Found {len(staged_paths)} staged files to process.')
    staged_batches = make_batch(staged_paths, batch_size)
    logging.info(f'Processing staged files in {len(staged_batches)} batches.')

    app_futures = []
    for i, batch in enumerate(staged_batches):
        app_future = create_highest_geotiffs(batch, workflow_config)
        app_futures.append(app_future)
        logging.info(f'Started job for batch {i} of {len(staged_batches)}')

    # Don't move on to next step until all geotiffs have been created
    [a.result() for a in app_futures]

    logging.info("Rasterization highest complete. Rasterizing lower z-levels.")

    # ----------------------------------------------------------------
    # Rasterize composite geotiffs
    min_z = config_manager.get_min_z()
    max_z = config_manager.get_max_z()
    parent_zs = range(max_z - 1, min_z - 1, -1)

    # Can't start lower z-level until higher z-level is complete.
    for z in parent_zs:

        # Determine which tiles we need to make for the next z-level based on
        # the path names of the geotiffs just created
        logging.info(f'Collecting highest geotiff paths to process...')
        child_paths = tile_manager.get_filenames_from_dir('geotiff', z = z + 1)
        logging.info(f'Found {len(child_paths)} highest geotiffs to process.')
        # create empty set for the following loop
        parent_tiles = set()
        for child_path in child_paths:
            parent_tile = tile_manager.get_parent_tile(child_path)
            parent_tiles.add(parent_tile)
        # convert the set into a list
        parent_tiles = list(parent_tiles)

        # Break all parent tiles at level z into batches
        parent_tile_batches = make_batch(parent_tiles, batch_size)
        logging.info(f'Processing highest geotiffs in {len(parent_tile_batches)} batches.')

        # Make the next level of parent tiles
        app_futures = []
        for parent_tile_batch in parent_tile_batches:
            app_future = create_composite_geotiffs(
                parent_tile_batch, workflow_config)
            app_futures.append(app_future)

        # Don't start the next z-level, and don't move to web tiling, until the
        # current z-level is complete
        [a.result() for a in app_futures]

    logging.info("Composite rasterization complete. Creating web tiles.")

    # ----------------------------------------------------------------
    # Process web tiles in batches
    logging.info(f'Collecting file paths of geotiffs to process...')
    geotiff_paths = tile_manager.get_filenames_from_dir('geotiff')
    logging.info(f'Found {len(geotiff_paths)} geotiffs to process.')
    geotiff_batches = make_batch(geotiff_paths, batch_size)
    logging.info(f'Processing geotiffs in {len(geotiff_batches)} batches.')

    app_futures = []
    for i, batch in enumerate(geotiff_batches):
        app_future = create_web_tiles(batch, workflow_config)
        app_futures.append(app_future)
        logging.info(f'Started job for batch {i} of {len(geotiff_batches)}')

    # Don't record end time until all web tiles have been created
    [a.result() for a in app_futures]

    end_time = datetime.now()
    logging.info(f'⏰ Total time to create all z-level geotiffs and web tiles: '
                 f'{end_time - start_time}')

# ----------------------------------------------------------------
# Define the parsl functions used in the workflow:

@python_app
def stage(paths, config):
    """
    Stage a file
    """
    from datetime import datetime
    import json
    import logging
    import logging.handlers
    import os
    import pdgstaging
    from pdgstaging import logging_config

    stager = pdgstaging.TileStager(config = config, check_footprints = False)
    for path in paths:
        stager.stage(path)
    return True

# Create highest z-level geotiffs from staged files
@python_app
def create_highest_geotiffs(staged_paths, config):
    """
    Create a batch of geotiffs from staged files
    """
    from datetime import datetime
    import json
    import logging
    import logging.handlers
    import os
    import pdgraster
    from pdgraster import logging_config

    # rasterize the vectors, highest z-level only
    rasterizer = pdgraster.RasterTiler(config)
    return rasterizer.rasterize_vectors(
        staged_paths, make_parents = False)
    # no need to update ranges because manually set val_range in config

# ----------------------------------------------------------------
# Create composite geotiffs from highest z-level geotiffs
@python_app
def create_composite_geotiffs(tiles, config):
    """
    Create a batch of composite geotiffs from highest geotiffs
    """
    from datetime import datetime
    import json
    import logging
    import logging.handlers
    import os
    import pdgraster
    from pdgraster import logging_config

    rasterizer = pdgraster.RasterTiler(config)
    return rasterizer.parent_geotiffs_from_children(
        tiles, recursive = False)

# ----------------------------------------------------------------
# Create a batch of webtiles from geotiffs
@python_app
def create_web_tiles(geotiff_paths, config):
    """
    Create a batch of webtiles from geotiffs
    """
    from datetime import datetime
    import json
    import logging
    import logging.handlers
    import os
    import pdgraster
    from pdgraster import logging_config

    rasterizer = pdgraster.RasterTiler(config)
    return rasterizer.webtiles_from_geotiffs(
        geotiff_paths, update_ranges = False)
    # no need to update ranges because val_range is
    # defined in the config


def make_batch(items, batch_size):
    """
    Create batches of a given size from a list of items.
    """
    return [items[i:i + batch_size] for i in range(0, len(items), batch_size)]

# ----------------------------------------------------------------
# run the workflow
workflow_config = '/home/jcohen/infrastructure/parsl_workflow/config.json'
print("Loaded config. Running workflow.")
run_pdg_workflow(workflow_config)

# Shutdown and clear the parsl executor
htex_local.executors[0].shutdown()
parsl.clear()

# transfer log from /tmp to user dir
cmd = ['mv', '/tmp/log.log', f'/home/{user}/infrastructure/parsl_workflow/']
# initiate the process to run that command
process = Popen(cmd)

print("Script complete.")
```
config:

```
{
  "dir_input": "/home/jcohen/infrastructure/data_cleaned_split",
  "ext_input": ".gpkg",
  "dir_staged": "staged/",
  "dir_geotiff": "geotiff/",
  "dir_web_tiles": "web_tiles/",
  "filename_staging_summary": "staging_summary.csv",
  "filename_rasterization_events": "raster_events.csv",
  "filename_rasters_summary": "raster_summary.csv",
  "filename_config": "config",
  "simplify_tolerance": 0.1,
  "tms_id": "WGS1984Quad",
  "z_range": [0, 12],
  "geometricError": 57,
  "z_coord": 0,
  "statistics": [
    {
      "name": "infrastructure_code",
      "weight_by": "area",
      "property": "DN",
      "aggregation_method": "max",
      "resampling_method": "nearest",
      "val_range": [11, 50],
      "palette": ["#f48525", "#f4e625", "#47f425", "#25f4e2", "#2525f4", "#f425c3", "#f42525"],
      "nodata_val": 0,
      "nodata_color": "#ffffff00"
    }
  ],
  "deduplicate_at": null,
  "deduplicate_keep_rules": null,
  "deduplicate_method": null,
  "clip_to_footprint": false
}
```
elongano commented 5 months ago

Category: Infrastructure

julietcohen commented 5 months ago

Annett has provided the README for this dataset. It provides a code for each infrastructure type and is located in /var/data/submission/pdg/bartsch_infrastructure/

julietcohen commented 4 months ago

The dataset package on the ADC: https://arcticdata.io/catalog/view/urn%3Auuid%3A4e1ea0af-6f7c-4a7a-b69f-4e818e113c43

The final dataset has been processed and is archived on datateam:

julietcohen commented 4 months ago

viz-workflow v0.9.2, which was used to produce this data layer, has been released. The only thing blocking this layer from moving to production is that the ADC data package needs to be finished and published. I have been adding more metadata to this package over the past week, so it is on its way to being assigned its pre-issued DOI!

julietcohen commented 4 months ago

Justin from the ADC has helpfully taken over the rest of the metadata documentation for this dataset, starting this week.

julietcohen commented 3 months ago

Annett has requested that I change the palette of this dataset so that buildings are in red and water is in blue. I also noticed that there were MultiPolygons (only 2% of the geometries) in the input data from Annett. I re-cleaned the input data (still removing NA geometries like last time, with the additional step of exploding the MultiPolygon geometries) and will re-process the data with the viz workflow, applying the requested palette change as well. This is not a time-consuming operation on Delta.
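A minimal sketch of that re-cleaning step (the output path is hypothetical; the actual script is the one uploaded to the data package):

```python
import geopandas as gpd

data = gpd.read_file("/home/jcohen/infrastructure/data/SACHI_v2.shp")

# drop rows with NA geometries, as in the original cleaning script
data = data[data["geometry"].notna()]

# new step: explode MultiPolygons into single Polygon geometries
data = data.explode(index_parts=False).reset_index(drop=True)

data.to_file("/home/jcohen/infrastructure/data_cleaned/SACHI_v2_clean.gpkg", driver="GPKG")
```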

This new cleaning script will replace the old one that is uploaded to the data package and the staged and geotiff tilesets will replace the ones currently in the pre-assigned DOI dir. I already discussed this with Justin.

For convenience, I am pasting the values for infrastructure code here (they are from the readme):

julietcohen commented 3 months ago

Annett also mentioned that she is going to update the dataset on Zenodo with the new version this week, and it already has a DOI there. Justin let her know we will still use our pre-assigned DOI since it is a new dataset, and I went into detail about our derived products. We will mention the Zenodo DOI in the dataset documentation on the ADC.

julietcohen commented 3 months ago

Delta decided that Python doesn't exist anymore in any of my environments, so I have been troubleshooting that for the past hour. The next step would be to uninstall and reinstall VS Code and remove the known hosts for Delta. I already tried uninstalling and reinstalling all Python extensions.

julietcohen commented 3 months ago

Annett has updated the dataset based on my feedback regarding the geometries. She uploaded the new version of SACHI_v2 to Zenodo. She included the following info:

DOI 10.5281/zenodo.10160636. I made a number of changes based on your feedback:
1) I had a closer look at the geometry properties. The version you got had duplicates, and the overlap areas of the different Sentinel-2 source granules were not yet merged. This is now solved, and there are also no features without attributes any more.
2) I extended the readme file for the metadata: S2_date1 to S2_date3 - dates of the individual Sentinel-2 images used for averaging; S1_winter - year(s) of the Sentinel-1 images used for averaging (December and/or January).

Justin and I made it clear that we would be giving the dataset a new DOI (the one I pre-created for this dataset) because our ADC version of the data package differs from the one she has on Zenodo, considering all our derived products.

Since she made changes to SACHI_v2, I will upload the new version to Datateam and reprocess the dataset with the viz workflow.

julietcohen commented 3 months ago

I reprocessed the infrastructure dataset with Annett's new version of the SACHI_v2 data as input to the viz workflow. Since she fixed the geometries, I only had to split the MultiPolygons before inputting the data into the viz workflow. I also updated the palette to represent buildings in red and water in blue as Annett requested, and removed yellow from the palette since we visualize the ice-wedge polygon layer in yellow and anticipate that users will explore these layers together. All input and output files and the new pre-viz cleaning script have been uploaded to /var/data/10.18739/A21J97929. The output web tiles have been uploaded to /var/data/tiles/10.18739/A21J97929/

I updated the demo portal with this layer.

julietcohen commented 3 months ago

I was able to refine the assignment of the categorical palette so that exactly one color is attributed to each infrastructure type. I did this in a similar way to the permafrost and ground ice dataset. Instead of using the attribute DN for the visualized tiles, I created a new attribute, palette_code, that maps the DN values to the numbers 1 through 7, because the values used for palette assignment seem to need to be evenly spaced in order for one color to correspond to one categorical value. The DN values are numerical but not evenly spaced.

# add attribute that codes the categorical DN attribute
# into evenly spaced numbers in order to assign the
# palette correctly to the categories in the web tiles
# (data is the GeoDataFrame of cleaned input geometries)
import numpy as np

conditions = [
    (data['DN'] == 11),
    (data['DN'] == 12),
    (data['DN'] == 13),
    (data['DN'] == 20),
    (data['DN'] == 30),
    (data['DN'] == 40),
    (data['DN'] == 50)
]
choices = [1, 2, 3, 4, 5, 6, 7]
data['palette_code'] = np.select(conditions, choices)

As a result, I see more diversity in the colors of the geometries, as Annett suggested they should look, and I integrated grey for the DN value 30 per her suggestion:


Taking this approach, I realize that in order for Annett's metadata for the DN values to match the numbers in the rasters, I need to include both attributes (DN and palette_code) as bands in the rasters. We will use the palette_code web tiles as the layer visualized on the portal, but users will want the option to use either band when they download the rasters for their own analysis.
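A sketch of how the statistics section of the config could carry both attributes, assuming (as I understand the workflow) that each entry in statistics becomes a band in the output GeoTIFFs; the palette hex values below are placeholders, not the final colors:

```python
statistics = [
    {   # band used for the web tiles: evenly spaced palette codes
        "name": "palette_code",
        "weight_by": "area",
        "property": "palette_code",
        "aggregation_method": "max",
        "resampling_method": "nearest",
        "val_range": [1, 7],
        "palette": ["#aaaaaa"] * 7,  # placeholder: one color per class
        "nodata_val": 0,
        "nodata_color": "#ffffff00"
    },
    {   # band kept so downloaded rasters match Annett's DN metadata
        "name": "infrastructure_code",
        "weight_by": "area",
        "property": "DN",
        "aggregation_method": "max",
        "resampling_method": "nearest",
        "val_range": [11, 50],
        "palette": ["#aaaaaa"] * 7,  # placeholder
        "nodata_val": 0,
        "nodata_color": "#ffffff00"
    }
]
```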

One note is that I have only tried this dataset with a custom palette in which each of the 7 hex codes is specifically assigned, rather than a pre-made named palette, because we wanted to include certain colors in a specific order and exclude colors that would too closely match other datasets on the PDG.

julietcohen commented 3 months ago

One more consideration for this dataset: the resolution of the data is not clear on the Zenodo page (this new Zenodo link is important because it points to version 2 of the dataset that Annett updated, not version 1, which was originally linked above when this ticket was first created). Sentinel data resolution can vary depending on the bands used. One way to find this may be to dive deeper into the methods, such as a paper associated with this dataset, or to ask Annett. I have been using z-level 12 as the max, because that is ~32 m resolution at latitude ±31.
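For reference, a back-of-the-envelope calculation that reproduces a figure close to the quoted ~32 m, assuming 256-pixel tiles and a single 360°-wide tile at z 0 (the exact per-pixel size depends on the TMS definition, so the max z-level should ultimately be checked against the source data resolution):

```python
import math

def approx_pixel_size_m(z, lat_deg, tile_px=256, z0_extent_deg=360.0):
    """Approximate ground size of one pixel at zoom level z and latitude lat_deg."""
    deg_per_px = z0_extent_deg / (2 ** z) / tile_px
    m_per_deg = 111_320  # ~meters per degree of longitude at the equator
    return deg_per_px * m_per_deg * math.cos(math.radians(lat_deg))

print(round(approx_pixel_size_m(12, 31), 1))  # ~32.8
```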

julietcohen commented 3 months ago

Annett has approved the layer to be moved to production. The remaining to-do items, in order:

mbjones commented 2 months ago

@julietcohen In reading through your correspondence on this ticket, I saw that it originated as a vector dataset. The most accurate way for us to present it would be as a vector layer rather than a raster. Let's please discuss this, as I suspect it would provide a much better result. We have a number of vector layers in the portals now, mainly via GeoJSON conversion. I'm not sure whether the dataset size would be prohibitive there, but let's discuss.
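For reference, a minimal sketch of the kind of GeoJSON conversion used for other vector layers (output path hypothetical; file size would need to be checked, since GeoJSON is verbose for large polygon sets):

```python
import geopandas as gpd

data = gpd.read_file("/home/jcohen/infrastructure/data/SACHI_v2.shp")

# GeoJSON layers on the portal are conventionally served in EPSG:4326
data.to_crs(epsg=4326).to_file("SACHI_v2.geojson", driver="GeoJSON")
```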