PermafrostDiscoveryGateway / pdg-portal

Design and mockup documents for the PDG portal
Apache License 2.0

Display the entire IWP layer #24

Open robyngit opened 2 years ago

robyngit commented 2 years ago

This issue is to track the progress of generating web tiles & 3dtiles for the entire Ice Wedge Polygon dataset

robyngit commented 2 years ago

Reminders:

julietcohen commented 1 year ago

High Ice IWP run 2/21: out-of-memory error on Delta cancelled staging

Staging


11,954 staged files were transferred to /scratch before the job was cancelled; the run lasted 4.5 hours.

julietcohen commented 1 year ago

High Ice IWP run 2/22

Staging


Merging Staged

Merging went well overall. One error was output, but it was the only one I saw:

(screenshot: error output)

By the time merging concluded, the head node contained 15,113 files (2.03 GB).

Raster Highest

Raster Lower

Web Tiling

Done!

julietcohen commented 1 year ago

Investigating the scarcity of IWP in new web tiles

The web tiles produced by the new batch of IWP data are far sparser than those produced by the last batch.

| | Old IWP Data (2022) | New IWP Data (2023) |
| --- | --- | --- |
| shp files (Alaska, Canada, Russia) | 22,319 | 17,039 |
| Alaska only | 4,267 | 1,169 |
| Canada only | 9,606 | 7,198 |
| Russia only | 8,446 | 8,172 |
| web tiles | 5,356,353 | 82,373 |
| web tiles per shp file | ~240 | ~5 |
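As a quick sanity check on the ratios in the table, the arithmetic can be reproduced directly:

```python
# Web-tiles-per-shp-file ratio for each run, from the counts above.
old_ratio = 5_356_353 / 22_319   # 2022 run, ~240 tiles per shp file
new_ratio = 82_373 / 17_039      # 2023 run, ~4.8 tiles per shp file

# The new run produced roughly 1/50th as many web tiles per input file.
drop_factor = old_ratio / new_ratio
```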

Feb 22, 2023 workflow run:

  1. staged 17,039 shp files
  2. merging resulted in 15,113 gpkg files
    • The point of merging is to combine all staged tiles from all worker nodes onto the head node. We do not simply copy over staged files from worker nodes when a file with the same name is already present on the head node, because that would overwrite the file that already exists there. We do not want to overwrite the staged file on the head node because, even though the two files share the same name (tile ID), they may contain different polygons. Instead:
    • a) If a tile does not yet exist in the head, we simply move it there.
    • b) If a tile does exist in the head node, we check if the files in the nodes are identical
    • c) If the tiles are identical, we just skip copying the file to the head node.
    • d) If the tiles are not identical, we append the polygons into one gdf and save that file to the head node.
    • This results in the number of staged tiles in the head node being the total sum of all staged tiles in all nodes minus the tiles that were already present in the head node.
  3. raster highest produced 15,114 tif files at z-level 15
  4. raster lower produced 82,272 tif files (sum of all z-levels)
  5. web tiling produced 82,272 png files (sum of all z-levels)
julietcohen commented 1 year ago

Update on high, med, and low ice processing

High ice has been processed completely and is up on the production portal, with a link to the published high ice package archived on the ADC.

Low and medium ice are in progress on Delta. We upgraded the allocation to a higher tier (Explore --> Discover) and exchanged enough ACCESS credits into storage and GPU hours to process all of low and medium together, through all steps, without having to transfer files in between steps from Delta to Datateam and then remove them from Delta to save space.

Staging medium and low went smoothly. Used 20 nodes and transferred all 8,247,460 staged files to /scratch.

Merging is going somewhat smoothly, but running into the same errors as documented before, which are rare compared to the successful merges. As a reminder, if a tile is not present in the head node but is present in a worker node, the file is copied from a worker node to the head node. If the tile is in the head node but is different than the same tile in the worker node, the tile is merged (deduplicated) from the worker node into the head node. Some of the errors printed in the terminal are below:

(screenshots: merge errors printed in the terminal)

When I investigated these errors before, I was not able to find the source. Finding out why certain files are corrupted during the merge is a high priority for improving the workflow.

0 errors were documented during staging.

julietcohen commented 1 year ago

Update on IWP for all regions

IWP for all regions (high, low, medium) has been processed through staging, merging, rasterization, and web-tiling. The high region was processed separately from low and medium (which were processed together) on Delta because of memory limitations (especially during the merging step) and job time limitations: we can only process so many files within the max hours allowed per job, and each step needs to complete within 1 job because checkpoints were not built into the ray workflow.

The IWP tiles are therefore split across 2 layers, displayed on the demo portal:

(screenshot: the two IWP layers on the demo portal)

The deduplication within each of the two workflow runs went well. Because the merging occurred within each run and has not been executed for all 3 regions together, there are strips of duplicated tiles where the high region overlaps with either the low region or the medium region. An example from northern Alaska:

(screenshot: strip of duplicated polygons where regions overlap, northern Alaska)

We discussed different approaches to combine them. I would have to obtain more credits (easy to do) to merge them together on Delta, where merging is the step that takes the longest even when the regions were processed separately, but I would likely hit the memory and job time limits (which are not related to credits). The other option is to run the merge on an NCEAS machine, which removes the time limit and potentially the memory limit as well, but I would need to adjust the code to run in parallel on that machine.

Anna's comment: I would wait to publish the new data once you have it merged with the already published data and then call it v2 (and a new DOI). The dataset that is published and that is up on the PDG is enough for people to understand what the data is about.

elongano commented 8 months ago

Category: Permafrost Subsurface Features

julietcohen commented 2 months ago

IWP dataset on Google Kubernetes Engine

With the successful execution of a small run of the kubernetes & parsl workflow on the Google Kubernetes Engine (GKE) (nice work @shishichen! 🎉), we have an updated game plan for processing the entire IWP workflow (high, med, and low) within 1 run (with deduplication between all regions and adjacent files).

  1. I will follow Shishi's documentation to execute my own viz workflow run with the few IWP tiles, which will allow me to accept her pull request into the viz-workflow repo develop branch
  2. Shishi and I will meet to discuss any other workflow parameters outside of the viz config (there may be certain decisions that are specific to GKE, such as how many workers to use, if we can fit all steps into 1 run or not if there is a time restraint, etc.)
  3. Either of us or both of us together will run the GKE workflow on a larger subset of data, like just the high ice subset in Alaska, and closely monitor the job to ensure processing is running in parallel, no files are lost, and tracking how many credits are burned for just that region
  4. do some math to make sure we will have enough GKE credits for the full run
  5. execute the full run
mbjones commented 2 months ago

@julietcohen @shishichen a quick thought as we're preparing for this layer integration - this is probably obvious to you, but I thought I'd throw it out there just in case. As the high, medium, and low images have been tiled and deduplicated separately, we need to combine the two output datasets, dealing with duplicate polygons. I think the main issue is that we need to deduplicate the regions where High data overlaps with Med/Low data. This is not the whole dataset, and should primarily be on the boundaries of where the datasets overlap. If we query to find the list of tiles/images that overlap at the boundaries of those datasets, that list should be much smaller than the full list of all dataset images and tiles, and would save a huge amount of processing time, at the cost of a more complicated selection process for images and then a merging process of old and new tiles.

As an example, I made up the following scenario with High (grey) and Med/Low (salmon) images. In this case, only images H1, H2, ML1, and ML2 need to be reprocessed, and they only affect the tiles in rows 3 and 4 -- the tiles in rows 1, 2, 5, and 6 can be copied across straight to the output dataset without any reprocessing. All of this can be determined ahead of time via calculations on the image footprints, which should be very fast. Does that make sense to you? One thing I wondered about was whether the images like H3 that overlap H1 in row 3 would have an impact on tile row 3. Need to think about that.

(diagram: High (grey) and Med/Low (salmon) image footprints over tile rows 1-6)
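The selection Matt describes can be sketched with plain bounding-box tests. This is a minimal sketch under invented names and axis-aligned footprints, not the actual workflow code: find only the images whose footprints cross the High/Med-Low boundary, then the tiles those images touch; everything else copies straight across.

```python
def intersects(a, b):
    """Axis-aligned boxes (xmin, ymin, xmax, ymax) overlap (interiors only)."""
    return a[0] < b[2] and b[0] < a[2] and a[1] < b[3] and b[1] < a[3]

def plan_reprocessing(high_imgs, medlow_imgs, tiles):
    """Return (images to reprocess, tiles affected) from footprint overlap.

    Each argument maps a name to a footprint box. Only images that overlap
    an image from the *other* dataset need reprocessing, and only tiles
    touched by those images are affected.
    """
    redo = {name for name, hbox in high_imgs.items()
            for mbox in medlow_imgs.values() if intersects(hbox, mbox)}
    redo |= {name for name, mbox in medlow_imgs.items()
             for hbox in high_imgs.values() if intersects(mbox, hbox)}
    affected = {t for t, tbox in tiles.items()
                if any(intersects(tbox, high_imgs.get(n) or medlow_imgs[n])
                       for n in redo)}
    return redo, affected
```

On a toy version of the scenario in the diagram, only the boundary images (H1, ML1) and the tile rows they touch (rows 3 and 4) come back for reprocessing; all other tiles can be copied across untouched. With real footprints, a spatial index (e.g. an STRtree) would replace the nested loops.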

julietcohen commented 1 month ago

Thanks for the description and the visual, Matt! That all aligns with my understanding as well.

Reminders for where the data is stored and published:

DOI: A2KW57K57

This DOI is associated with the published metadata package that will be updated with the tiles that have all been deduplicated between high, med, and low.


Datateam:/var/data/10.18739/A2KW57K57/ contains all regions of the IWP detections and footprints (high, med, low)

Datateam:/var/data/10.18739/A2KW57K57/iwp_geopackage_high contains only the high ice output of staged tiles from the viz workflow

Datateam:/var/data/10.18739/A2KW57K57/iwp_geotiff_high contains only the high ice output of geotiff tiles from the viz workflow

DOI: A24F1MK7Q

This DOI is not associated with a metadata package. This DOI only exists as a subdirectory within Datateam:/var/data/10.18739/ in order to organize the output for low and medium regions.


Datateam:/var/data/10.18739/A24F1MK7Q/iwp_geopackage_low_medium contains only the low and medium ice output of staged tiles from the viz workflow

Datateam:/var/data/10.18739/A24F1MK7Q/iwp_geotiff_low_medium contains only the low and medium ice output of geotiff tiles from the viz workflow