PermafrostDiscoveryGateway / pdg-portal

Design and mockup documents for the PDG portal
Apache License 2.0

Display the entire IWP layer #24

Open robyngit opened 2 years ago

robyngit commented 2 years ago

This issue is to track the progress of generating web tiles & 3dtiles for the entire Ice Wedge Polygon dataset

robyngit commented 2 years ago

Reminders:

julietcohen commented 1 year ago

High Ice IWP run 2/21: out-of-memory error on Delta cancelled staging

Staging


11,954 staged files were transferred to /scratch before the job was cancelled; the run lasted 4.5 hours.

julietcohen commented 1 year ago

High Ice IWP run 2/22

Staging


Merging Staged

Merging went well overall. One error was output, but it was the only one I saw:

(screenshot: error output)

By the time merging concluded, the head node contained 15,113 files (2.03 GB).

Raster Highest

Raster Lower

Web Tiling

Done!

julietcohen commented 1 year ago

Investigating the scarcity of IWP in new web tiles

The web tiles produced by the new batch of IWP data are far sparser than those produced by the last batch.

| | Old IWP Data (2022) | New IWP Data (2023) |
| --- | --- | --- |
| shp files (Alaska, Canada, Russia) | 22,319 | 17,039 |
| Alaska only | 4,267 | 1,169 |
| Canada only | 9,606 | 7,198 |
| Russia only | 8,446 | 8,172 |
| web tiles | 5,356,353 | 82,373 |
| web tiles per shp file | ~240 | ~5 |
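As a quick sanity check on the ratios in the table, the arithmetic can be reproduced directly:

```python
# Web-tiles-per-shp-file ratio for each run, from the counts above.
old_ratio = 5_356_353 / 22_319   # 2022 run, ~240 tiles per shp file
new_ratio = 82_373 / 17_039      # 2023 run, ~4.8 tiles per shp file

# The new run produced roughly 1/50th as many web tiles per input file.
drop_factor = old_ratio / new_ratio
```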

Feb 22, 2023 workflow run:

  1. staged 17,039 shp files
  2. merging resulted in 15,113 gpkg files
    • The point of merging is to combine all staged tiles from all worker nodes onto the head node. We do not simply copy over staged files from worker nodes when a file with the same name is already present on the head node, because that would overwrite the file that already exists there. We do not want to overwrite the staged file on the head node because, even though the two files share the same name (tile ID), they may contain different polygons. Instead:
    • a) If a tile does not yet exist in the head, we simply move it there.
    • b) If a tile does exist in the head node, we check if the files in the nodes are identical
    • c) If the tiles are identical, we just skip copying the file to the head node.
    • d) If the tiles are not identical, we append the polygons into one gdf and save that file to the head node.
    • This results in the number of staged tiles in the head node being the total sum of all staged tiles in all nodes minus the tiles that were already present in the head node.
  3. raster highest produced 15,114 tif files at z-level 15
  4. raster lower produced 82,272 tif files (sum of all z-levels)
  5. web tiling produced 82,272 png files (sum of all z-levels)
julietcohen commented 1 year ago

Update on high, med, and low ice processing

High ice has been processed completely and is up on the production portal, with a link to the published high ice package archived on the ADC.

Low and medium ice are in progress on Delta. We upgraded the allocation to a higher tier (Explore --> Discover) and exchanged enough ACCESS credits into storage and GPU hours to process all of low and medium together, through all steps, without having to transfer files in between steps from Delta to Datateam and then remove them from Delta to save space.

Staging medium and low went smoothly. Used 20 nodes and transferred all 8,247,460 staged files to /scratch.

Merging is going somewhat smoothly, but running into the same errors as documented before, which are rare compared to the successful merges. As a reminder, if a tile is not present in the head node but is present in a worker node, the file is copied from a worker node to the head node. If the tile is in the head node but is different than the same tile in the worker node, the tile is merged (deduplicated) from the worker node into the head node. Some of the errors printed in the terminal are below:

(screenshots: merge errors printed in the terminal)

When I investigated these errors before, I was not able to find the source. Finding out why certain files are corrupted during the merge is a high priority for improving the workflow.

0 errors were documented during staging.

julietcohen commented 1 year ago

Update on IWP for all regions

IWP for all regions (high, low, medium) has been processed through staging, merging, rasterization, and web-tiling. The high region was processed separately from low and medium (which were processed together) on Delta because of memory limitations (especially during the merging step) and job time limitations: we can only process so many files within the max hours allowed per job, and each step needs to complete within 1 job because checkpoints were not built into the ray workflow.

The IWP tiles are therefore split across 2 layers, displayed on the demo portal:

(screenshot: the two IWP layers on the demo portal)

The deduplication within each of the two workflow runs went well. Because the merging occurred within each run and has not been executed for all 3 regions together, there are strips of duplicated tiles where the high region overlaps with either the low region or the medium region. An example from northern Alaska:

(screenshot: strip of duplicated polygons where regions overlap, northern Alaska)

We discussed different approaches to combine them. I would have to obtain more credits (easy to do) to merge them together on Delta, where merging is the step that takes the longest even when the regions were processed separately, but I would likely hit the memory and job time limits (which are not related to credits). The other option is to run the merge on an NCEAS machine, which removes the time limit and potentially the memory limit as well, but I would need to adjust the code to run in parallel on that machine.

Anna's comment: I would wait to publish the new data once you have it merged with the already published data and then call it v2 (and a new DOI). The dataset that is published and that is up on the PDG is enough for people to understand what the data is about.

elongano commented 8 months ago

Category: Permafrost Subsurface Features

julietcohen commented 2 months ago

IWP dataset on Google Kubernetes Engine

With the successful execution of a small run of the kubernetes & parsl workflow on the Google Kubernetes Engine (GKE) (nice work @shishichen! 🎉), we have an updated game plan for processing the entire IWP workflow (high, med, and low) within 1 run (with deduplication between all regions and adjacent files).

  1. I will follow Shishi's documentation to execute my own viz workflow run with the few IWP tiles, which will allow me to accept her pull request into the viz-workflow repo develop branch
  2. Shishi and I will meet to discuss any other workflow parameters outside of the viz config (there may be certain decisions that are specific to GKE, such as how many workers to use, if we can fit all steps into 1 run or not if there is a time restraint, etc.)
  3. Either of us or both of us together will run the GKE workflow on a larger subset of data, like just the high ice subset in Alaska, and closely monitor the job to ensure processing is running in parallel, no files are lost, and tracking how many credits are burned for just that region
  4. do some math to make sure we will have enough GKE credits for the full run
  5. execute the full run
mbjones commented 2 months ago

@julietcohen @shishichen a quick thought as we're preparing for this layer integration - this is probably obvious to you, but I thought I'd throw it out there just in case. As the high, medium, and low images have been tiled and deduplicated separately, we need to combine the two output datasets, dealing with duplicate polygons. I think the main issue is that we need to deduplicate the regions where High data overlaps with Med/Low data. This is not the whole dataset, and should primarily be on the boundaries of where the datasets overlap. If we query to find the list of tiles/images that overlap at the boundaries of those datasets, that list should be much smaller than the full list of all dataset images and tiles, and would save a huge amount of processing time, at the cost of a more complicated selection process for images and then a merging process of old and new tiles.

As an example, I made up the following scenario with High (grey) and Med/Low (salmon) images. In this case, only images H1, H2, ML1, and ML2 need to be reprocessed, and they only affect the tiles in rows 3 and 4 -- the tiles in rows 1, 2, 5, and 6 can be copied across straight to the output dataset without any reprocessing. All of this can be determined ahead of time via calculations on the image footprints, which should be very fast. Does that make sense to you? One thing I wondered about was whether the images like H3 that overlap H1 in row 3 would have an impact on tile row 3. Need to think about that.

(diagram: High (grey) and Med/Low (salmon) image footprints over tile rows 1-6)
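The selection Matt describes can be sketched with plain bounding-box tests. This is a minimal sketch under invented names and axis-aligned footprints, not the actual workflow code: find only the images whose footprints cross the High/Med-Low boundary, then the tiles those images touch; everything else copies straight across.

```python
def intersects(a, b):
    """Axis-aligned boxes (xmin, ymin, xmax, ymax) overlap (interiors only)."""
    return a[0] < b[2] and b[0] < a[2] and a[1] < b[3] and b[1] < a[3]

def plan_reprocessing(high_imgs, medlow_imgs, tiles):
    """Return (images to reprocess, tiles affected) from footprint overlap.

    Each argument maps a name to a footprint box. Only images that overlap
    an image from the *other* dataset need reprocessing, and only tiles
    touched by those images are affected.
    """
    redo = {name for name, hbox in high_imgs.items()
            for mbox in medlow_imgs.values() if intersects(hbox, mbox)}
    redo |= {name for name, mbox in medlow_imgs.items()
             for hbox in high_imgs.values() if intersects(mbox, hbox)}
    affected = {t for t, tbox in tiles.items()
                if any(intersects(tbox, high_imgs.get(n) or medlow_imgs[n])
                       for n in redo)}
    return redo, affected
```

On a toy version of the scenario in the diagram, only the boundary images (H1, ML1) and the tile rows they touch (rows 3 and 4) come back for reprocessing; all other tiles can be copied across untouched. With real footprints, a spatial index (e.g. an STRtree) would replace the nested loops.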

julietcohen commented 1 month ago

Thanks for the description and the visual, Matt! That all aligns with my understanding as well.

Reminders for where the data is stored and published:

DOI: A2KW57K57

This DOI is associated with the published metadata package that will be updated with the tiles that have all been deduplicated between high, med, and low.


Datateam:/var/data/10.18739/A2KW57K57/ contains all regions of the IWP detections and footprints (high, med, low)

Datateam:/var/data/10.18739/A2KW57K57/iwp_geopackage_high contains only the high ice output of staged tiles from the viz workflow

Datateam:/var/data/10.18739/A2KW57K57/iwp_geotiff_high contains only the high ice output of geotiff tiles from the viz workflow

DOI: A24F1MK7Q

This DOI is not associated with a metadata package. This DOI only exists as a subdirectory within Datateam:/var/data/10.18739/ in order to organize the output for low and medium regions.


Datateam:/var/data/10.18739/A24F1MK7Q/iwp_geopackage_low_medium contains only the low and medium ice output of staged tiles from the viz workflow

Datateam:/var/data/10.18739/A24F1MK7Q/iwp_geotiff_low_medium contains only the low and medium ice output of geotiff tiles from the viz workflow