robyngit opened this issue 2 years ago
Reminders:
- update viz-raster to avoid the ray error and still be able to write raster_summary.csv
- update the ranges in the web tiling step; files are written to /tmp on each node.
CPU usage looked good at the start, fluctuating between 40-60% on the head and worker nodes:
Memory usage crept up to ~93% after staging about half of the high ice files and stabilized there. I/O wait also increased (staged files were written to /tmp more slowly as time went on).
11,954 staged files were transferred to /scratch, and the job was cancelled after this; it ran for 4.5 hours.
The run went well overall. I did get an error output, but this is the only one I saw:
By the time merging concluded, the head node contained 15,113 files (2.03 GB). The worker nodes also created /tmp/geotiff dirs this time, while in all my practice runs only the head node ever created a /tmp/geotiff dir 🤷🏻‍♀️

Done!
The web tiles produced by the new batch of IWP data are far fewer than those in the last batch of web tiles.
| Old IWP Data (2022) | New IWP Data (2023) |
|---|---|
| 22,319 shp files for Alaska, Canada, Russia | 17,039 shp files for Alaska, Canada, Russia |
| 4,267 for Alaska only | 1,169 for Alaska only |
| 9,606 for Canada only | 7,198 for Canada only |
| 8,446 for Russia only | 8,172 for Russia only |
| 5,356,353 web tiles | 82,373 web tiles |
| ratio = ~240 web tiles created for every shp file | ratio = ~5 web tiles created for every shp file |
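The ratios in the last row follow directly from the tile and shp-file counts above; a minimal arithmetic check:

```python
# Reproduce the web-tiles-per-shp-file ratios from the table's counts.
old_shp, old_tiles = 22_319, 5_356_353   # 2022 batch
new_shp, new_tiles = 17_039, 82_373     # 2023 batch

old_ratio = old_tiles / old_shp  # ~240
new_ratio = new_tiles / new_shp  # ~4.8, i.e. ~5
print(round(old_ratio), round(new_ratio))  # → 240 5
```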
Feb 22, 2023 workflow run:
High ice has been processed completely and is up on the production portal, with a link to the published high ice package archived on the ADC.
Low and medium ice are in progress on Delta. We upgraded the allocation to a higher tier (Explore --> Discover) and exchanged enough ACCESS credits into storage and GPU hours to process all of low and medium together, through all steps, without having to transfer files in between steps from Delta to Datateam and then remove them from Delta to save space.
Staging medium and low went smoothly. Used 20 nodes and transferred all 8,247,460 staged files to /scratch.
Merging is going somewhat smoothly, but we are running into the same errors as documented before, though they are rare compared to the successful merges. As a reminder, if a tile is not present on the head node but is present on a worker node, the file is copied from the worker node to the head node. If the tile is on the head node but differs from the same tile on the worker node, the tile is merged (deduplicated) from the worker node into the head node. Some of the errors printed in the terminal are below:
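The merge rule described above can be sketched as a small decision function. This is a hypothetical illustration, not the actual viz-workflow code: the `deduplicate` helper is a placeholder for the real polygon-level merge step, and the byte comparison stands in for however the workflow detects differing tiles.

```python
import filecmp
import shutil
from pathlib import Path

def deduplicate(worker_tile: Path, head_tile: Path) -> None:
    """Placeholder for the real deduplication step, which merges the
    polygons from the worker's tile into the head node's copy."""
    # In the actual workflow this reads both vector tiles, drops duplicate
    # polygons, and writes the merged result over head_tile.
    pass

def merge_tile(worker_tile: Path, head_dir: Path) -> str:
    """Apply the head/worker merge rule for one staged tile."""
    head_tile = head_dir / worker_tile.name
    if not head_tile.exists():
        # Tile only exists on the worker: copy it to the head node.
        shutil.copy2(worker_tile, head_tile)
        return "copied"
    if filecmp.cmp(worker_tile, head_tile, shallow=False):
        # Same bytes on both nodes: nothing to do.
        return "identical"
    # Same tile path, different contents: merge (deduplicate) into head.
    deduplicate(worker_tile, head_tile)
    return "merged"
```

The rare corrupted-file errors would surface inside the merge branch, which is why pinning down their source is a high priority.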
When I investigated these errors before, I was not able to find the source. Finding out why certain files are corrupted during the merge is a high priority for improving the workflow.
0 errors were documented during staging.
IWP for all regions (high, low, medium) has been processed through staging, merging, rasterization, and web-tiling. The high region was processed separately from low and medium (which were processed together) on Delta because of memory limitations (especially during the merging step) and job time limitations: we can only process so many files within the max hours of processing allowed per job, and each step needs to complete within one job because checkpoints were not built into the ray workflow.
The IWP tiles are therefore split across 2 layers, displayed on the demo portal:
The deduplication within each of the two workflow runs went well. Because the merging occurred within each run and has not been executed for all 3 regions together, there are strips of duplicated tiles where the high region overlaps with either the low region or the medium region. An example from northern Alaska:
We discussed different approaches to combine them. I would have to obtain more credits (easy to do) to merge them together on Delta (merging is the step that takes the longest even when the regions were processed separately), but I would likely hit the memory and job time limitations (not related to credits). The other solution is to do it on an NCEAS machine, which would remove the job time limitation and potentially the memory limitation as well, but I would need to adjust the code to work in parallel on that machine.
Anna's comment: I would wait to publish the new data once you have it merged with the already published data and then call it v2 (and a new DOI). The dataset that is published and that is up on the PDG is enough for people to understand what the data is about.
Category: Permafrost Subsurface Features
With the successful execution of a small run of the kubernetes & parsl workflow on the Google Kubernetes Engine (GKE) (nice work @shishichen! 🎉), we have an updated game plan for processing the entire IWP workflow (high, med, and low) within 1 run (with deduplication between all regions and adjacent files).
viz-workflow repo, develop branch

@julietcohen @shishichen a quick thought as we're preparing for this layer integration - this is probably obvious to you, but I thought I'd throw it out there just in case. As the high, medium, and low images have been tiled and deduplicated separately, we need to combine the two output datasets, dealing with duplicate polygons. I think the main issue is that we need to deduplicate the regions where High data overlaps with Med/Low data. This is not the whole dataset, and should primarily be on the boundaries of where the datasets overlap. If we query to find the list of tiles/images that overlap at the boundaries of those datasets, that list should be much smaller than the full list of all dataset images and tiles, and would save a huge amount of processing time, at the cost of a more complicated selection process for images and then a merging process of old and new tiles.
As an example, I made up the following scenario with High (grey) and Med/Low (salmon) images. In this case, only images H1, H2, ML1, and ML2 need to be reprocessed, and they only affect the tiles in rows 3 and 4 -- the tiles in rows 1, 2, 5, and 6 can be copied across straight to the output dataset without any reprocessing. All of this can be determined ahead of time via calculations on the image footprints, which should be very fast. Does that make sense to you? One thing I wondered about was whether the images like H3 that overlap H1 in row 3 would have an impact on tile row 3. Need to think about that.
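The footprint query Matt describes could be done with a pairwise intersection test. The sketch below is hypothetical: it simplifies footprints to axis-aligned bounding boxes (real footprints would be polygons intersected with shapely or geopandas), and the H*/ML* footprint coordinates are made up to echo the scenario above.

```python
from itertools import product

# Footprints as (minx, miny, maxx, maxy) bounding boxes -- a simplification;
# the real check would intersect the actual polygon footprints.
def overlaps(a, b):
    """True if two bounding boxes intersect with positive area."""
    return a[0] < b[2] and b[0] < a[2] and a[1] < b[3] and b[1] < a[3]

def images_to_reprocess(high, med_low):
    """Return IDs of images whose footprints cross the other dataset."""
    redo = set()
    for (hid, hbox), (mid, mbox) in product(high.items(), med_low.items()):
        if overlaps(hbox, mbox):
            redo.update({hid, mid})
    return redo

# Made-up footprints: H1/H2 overlap ML1/ML2 at the boundary; H3 and ML3 do not.
high = {"H1": (0, 2, 2, 4), "H2": (2, 2, 4, 4), "H3": (0, 4, 2, 6)}
med_low = {"ML1": (0, 1, 2, 3), "ML2": (2, 1, 4, 3), "ML3": (0, -1, 2, 1)}
print(sorted(images_to_reprocess(high, med_low)))  # → ['H1', 'H2', 'ML1', 'ML2']
```

Everything outside the returned set could be copied straight to the output dataset, which is the processing-time saving Matt points out.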
Thanks for the description and the visual, Matt! That all aligns with my understanding as well.
Reminders for where the data is stored and published:
This DOI is associated with the published metadata package that will be updated with the tiles that have all been deduplicated between high, med, and low.
Datateam:/var/data/10.18739/A2KW57K57/
contains all regions of the IWP detections and footprints (high, med, low)
Datateam:/var/data/10.18739/A2KW57K57/iwp_geopackage_high
contains only the high ice output of staged tiles from the viz workflow
Datateam:/var/data/10.18739/A2KW57K57/iwp_geotiff_high
contains only the high ice output of geotiff tiles from the viz workflow
This DOI is not associated with a metadata package. This DOI only exists as a subdirectory within Datateam:/var/data/10.18739/
in order to organize the output for low and medium regions.
Datateam:/var/data/10.18739/A24F1MK7Q/iwp_geopackage_low_medium
contains only the low and medium ice output of staged tiles from the viz workflow
Datateam:/var/data/10.18739/A24F1MK7Q/iwp_geotiff_low_medium
contains only the low and medium ice output of geotiff tiles from the viz workflow
This issue is to track the progress of generating web tiles & 3dtiles for the entire Ice Wedge Polygon dataset