initze opened this issue 1 year ago
In progress --> will be updated from time to time.
Here we have 2 datasets (GDrive link: https://drive.google.com/drive/folders/1RH3G-u6mKRSZ82Fok9I1CPUpK8vj8EtH?usp=share_link):

- INitze_Lakes_merged_v3_PDG_Testset.gpkg
- Lakes_IngmarPDG_annual_area.nc

Both datasets can be joined via the `ID_merged` attribute.
Thank you for this dataset, metadata, and suggestions for processing, Ingmar! Very much looking forward to visualizing this on the PDG.
The 2 files for lake area time series have been uploaded to the NCEAS datateam server: /home/pdg/data/nitze_lake_change/time_series_2023-04-05
Update: After rearranging Datateam directories, the data now is on Datateam at: /var/data/submission/pdg/nitze_lake_change/time_series_2023-04-05
GeoPackage:

- geometries that represent lake data in non-adjacent UTM zones
- projected CRS: EPSG 3995, WGS 84 Arctic Polar Stereographic
- 601,486 unique values for `ID_merged`
- 114,821 NA values in each of the columns `raten_cm_y` & `rateg_cm_y`, across 114,821 unique ID values
NetCDF:

- contains the same `ID_merged` attribute for joining
- NA values for both attributes
- the NA values span all 38 years, but are only present for 100 ID values (38 × 100 = 3,800)

Converted the NetCDF file to a dataframe, then combined it with the GeoPackage via an inner join, retaining only the `ID_merged` values that are within both files. The merged data lives in:
/var/data/submission/pdg/nitze_lake_change/time_series_2023-04-05/merged_lakes_area.gpkg
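The join step can be sketched with pandas. In the real workflow the NetCDF would be read with xarray (`ds.to_dataframe()`) and the GeoPackage with geopandas; the tiny frames below are stand-ins just to show the inner-join behavior on `ID_merged`:

```python
import pandas as pd

# Hypothetical stand-ins for the two files (column names other than
# ID_merged are illustrative only)
nc_df = pd.DataFrame({"ID_merged": [1, 2, 3], "permanent_water": [10.5, 3.2, 7.8]})
gpkg_df = pd.DataFrame({"ID_merged": [2, 3, 4], "area_ha": [0.4, 1.1, 2.0]})

# Inner join: keep only ID_merged values present in BOTH files
merged = nc_df.merge(gpkg_df, on="ID_merged", how="inner")
print(merged["ID_merged"].tolist())  # [2, 3]
```

IDs 1 and 4 drop out because they appear in only one of the two inputs, which mirrors how the merged gpkg keeps only lakes present in both files.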
Next step: Set up a parsl run to process this file for initial testing, with no deduplication and a max z-level of 11. The gpkg is quite large, so reading it in takes a long time. When working with the entire lake change time series dataset (rather than just these few UTM zones), it would be a good idea to separate the merged gpkg files by UTM zone so they are easier to parallelize in batches during staging.
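One way to do that zone split, sketched with the standard longitude-to-UTM-zone formula (how zone membership is derived for the real gpkg, e.g. from lake centroids, is an assumption here; the polar and Norway zone exceptions are ignored):

```python
import math

def utm_zone(lon_deg: float) -> int:
    """Standard UTM zone number (1-60) from a longitude in degrees.
    Ignores the polar and Norway/Svalbard special cases."""
    return int(math.floor((lon_deg + 180.0) / 6.0)) % 60 + 1

# e.g. tag each lake centroid with its zone, then write one gpkg per zone
print(utm_zone(0.5))     # 31
print(utm_zone(-150.0))  # 6 (Alaska)
```

With a zone column in place, a `groupby` over it would yield one smaller file per zone for batched staging.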
`explode` does not error in this case and works as expected. `nodata_val` must be 0 until I merge the branch that allows `nodata_val` to be a different value, so it is best to choose an attribute to visualize that has all positive values for now: `permanent_water` (need units from Ingmar).

`permanent_water`, 2017-2021

Anna requested that, in anticipation of the Google fellows starting on the team in January, our top priority should be to create annual layers for lake size for at least `permanent_water` and potentially also `seasonal_water`. Ingmar suggested that this be done for the past 5 years of data, because those years have the best quality observations and the fewest NA values for these attributes. I merged the data just as before, plus a step to remove NA values from these attributes that was not present before. I then subset the merged data (all transects) into 5 annual files.
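The NA-removal and annual-subset steps might look like this in pandas (toy data standing in for the merged all-years table, one row per lake per year; the real workflow would use geopandas and write one gpkg per year):

```python
import pandas as pd

# Toy stand-in for the merged table; real data comes from the merged gpkg
merged = pd.DataFrame({
    "ID_merged":       [1,    1,    2,    2],
    "year":            [2020, 2021, 2020, 2021],
    "permanent_water": [5.0,  None, 3.0,  4.0],
    "seasonal_water":  [1.0,  2.0,  None, 0.5],
})

# Drop rows with NA in the attributes to be visualized, then split by year
clean = merged.dropna(subset=["permanent_water", "seasonal_water"])
for yr, grp in clean.groupby("year"):
    # real workflow: grp.to_file(f"lakes_{yr}.gpkg") via geopandas
    print(yr, len(grp))
```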
At our meeting, Ingmar helpfully suggested using a mathematical way to distribute the color palette so that all colors are represented, even with skewed data. Previously, we plotted the input data attribute values in a binned histogram and set the range in the config to the min and max values that encompass the vast majority of the data. Outliers are then assigned the same color as the min or max value (depending on which side of the range they fall on), which makes the data much easier to interpret in a map. As a first step toward a more mathematical approach, I calculated the 95th quantile for each of the annual files. The values are similar for each attribute across the 5 years.
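A minimal sketch of that quantile-based cap, using a synthetic skewed distribution in place of the real attribute values:

```python
import numpy as np

# Synthetic skewed distribution standing in for one year's attribute values
rng = np.random.default_rng(0)
areas = rng.lognormal(sigma=2.0, size=10_000)

# Cap the palette range at the 95th quantile: outliers above it all share
# the top color, so the palette's spread is spent on the bulk of the data
vmax = float(np.quantile(areas, 0.95))
print(round(vmax, 2))
```

The resulting `vmax` would go into the config as the max of the color range, with everything above it clipped to the top color.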
There are 2 new layers. One is the "permanent water", and the other is "seasonal water". Units are hectares.
@julietcohen Could you add the link to the dev version? I just lost it
for Ingmar: I did not use deduplication because of the lack of overlap I observed when plotting the data. Please correct me if I misinterpreted this, and I actually should use deduplication.
In your dataset, the four regions are separate from each other. Technically they were merged and deduplicated in a previous step. So here there should be no need for that.
the same blue palette was used for both statistics, but the numerical ranges of the data are different, since generally the permanent water values are larger than the seasonal water values
Maybe we can increase the max value (upper limit) of the visualization. To my taste, the differences are not very pronounced.
we could calculate the diff of each year versus the previous (or some kind of aggregated number of years), e.g. 2019-2018. Then we should be able to visualize major changes such as lake drainage 😄
@initze Thank you for your feedback. Here is the link to the PDG demo portal: https://demo.arcticdata.io/portals/permafrost
I'll increase the upper limit of the range in my config file and see if that helps create more diversity in colors associated with polygons of different sizes. For example, I could increase the quantile that I am using to set the max value. I was using the 95th quantile, which resulted in a max range value for permanent water of ~49. Increasing to the 99.99th quantile would be a max value of ~6,088.
Calculating the diff of each year would be very interesting! I do not think it would be difficult either, since I have already produced the annual files. I can ensure that each file has the same geometries, and we can take the difference for the variables of interest for each geometry. I can work on that after I visualize the 5 years independently.
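The per-geometry diff could be as simple as an index-aligned subtraction, assuming the annual files share the same `ID_merged` set (toy values here):

```python
import pandas as pd

# Hypothetical values for the same two lakes in consecutive annual files
y2018 = pd.DataFrame({"ID_merged": [1, 2], "permanent_water": [10.0, 5.0]})
y2019 = pd.DataFrame({"ID_merged": [1, 2], "permanent_water": [8.0, 6.5]})

# Align on ID_merged and subtract; negative = shrinkage (e.g. drainage)
diff = (y2019.set_index("ID_merged")["permanent_water"]
        - y2018.set_index("ID_merged")["permanent_water"])
print(diff[1], diff[2])  # -2.0 1.5
```

Lake 1 lost 2 units of permanent water between the two years, which is the kind of signal that would flag a drainage event.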
Also, I think a purple palette would be nice for `seasonal_water`, to contrast with the blue that we can keep for `permanent_water`, considering that the IWP layer is yellow and the Arctic communities layer is green. Using red for any layer that represents a static attribute such as permanent or seasonal water seems like it would imply decrease, so we should reserve red for the end of the spectrum of a variable that represents loss of size.
Another note: I have increased the maximum z-level for this data to z-12 (just 1 level higher) after consulting the metadata for the TMS, because we want the resolution of this data to be ~30m (correct me if that's wrong)
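As a rough sanity check on the z-level choice: assuming a Web Mercator-style TMS with 256-pixel tiles (the PDG's actual TMS may differ, so treat these numbers as a ballpark), the per-zoom ground resolution works out as:

```python
import math

EARTH_CIRCUMFERENCE_M = 2 * math.pi * 6378137  # WGS 84 equatorial circumference

def resolution_m_per_px(z: int, tile_size: int = 256) -> float:
    """Approximate ground resolution at the equator for zoom level z."""
    return EARTH_CIRCUMFERENCE_M / (tile_size * 2 ** z)

print(round(resolution_m_per_px(11), 1))  # 76.4 m/px
print(round(resolution_m_per_px(12), 1))  # 38.2 m/px, closer to the ~30 m target
```

At high latitudes the effective resolution is finer still (scaled by cos(latitude)), so z-12 seems like a reasonable match for ~30 m data.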
I created web tiles for 2017 with different palettes for permanent water and seasonal water, and used the 99.99th percentile for the highest value in the palette range to show more diversity of colors in the lower values. These are now on the demo portal.
Speaking about the visual colors of the polygons, we see that changing the range in the workflow config to the 99.99th percentile for the max value did succeed in showing more diversity for permanent water. In my opinion, the pink & purple palette for seasonal water does not show enough diversity in the smaller lakes, and I should re-process the web tiles with a different percentile for the max value for that attribute. Maybe @initze has feedback on this. We should also keep in mind that the more tailoring we have to do in order to find the best range for the config values to show the most color diversity, the further this workflow gets from being automated. Ideally, we would mathematically determine the best range of values for the config to optimize the color diversity in the polygons without having to guess and check each time. I would appreciate input from anyone who has a good way to achieve this.
The legends for these 2 layers are accurate, as the max value shown for each layer is indeed the max value for that attribute, not the 99.99th percentile. Additionally, both layers have a min value of 0, and this was the min value in the legend. However, the value range for permanent water is so large that we encounter the same issue we ran into with the Arctic Communities layer: when you hover over the legend, it shows scientific notation, which is accurate but not ideal. This was acceptable for the communities layer, so I assume this is acceptable for this layer, too, for now.
I have moved forward to process 2018-2021, and have already completed 2018 (update: the 2018 geotiff data was corrupted during a directory transfer that was cancelled midway when VS Code lost connection, so the geotiffs need to be re-created). However, there's no point in processing the web tiles for these years until we determine the best way to set the range for each attribute, based on our guess-and-checks for 2017. I will continue to process staged tiles and geotiffs for all years, but it would be very time consuming to guess and check for the optimal range of values to best represent the polygons.
We agreed that the 99.99th percentile for the blue palette for permanent water looks good. Increasing the percentile used for the max value of the range for the config reduced the amount of darkest blue color, and increased the amount of lighter shades in the polygons.
For the pink & purple palette for the seasonal water, we had the opposite problem: too much of the lighter shades, and not enough darker shades, so the approach is to reduce the percentile. I tried 92, but that was too much! This resulted in too many polygons with the darkest purple color. See the same region as in the last comment for comparison:
Then I created web tiles with the 95th percentile for the highest value in the palette range. There is still too much purple:
But importantly, if we zoom in more to that same area pictured above, or in a different area such as the north slope of AK, we can see the 95th percentile works pretty well:
We can visualize 5 different percentiles with the data distribution here:
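A sketch of how such a percentile plot could be generated, with a synthetic skewed distribution standing in for the real attribute values (the percentile choices match the ones discussed above; everything else is illustrative):

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # headless backend so this runs without a display
import matplotlib.pyplot as plt

# Synthetic skewed "lake area" values; the real input would be one year's
# seasonal_water or permanent_water column
rng = np.random.default_rng(42)
areas = rng.lognormal(sigma=2.0, size=5_000)

pcts = [90, 92, 95, 99, 99.99]        # candidate caps for the palette range
cutoffs = np.percentile(areas, pcts)  # one cutoff value per percentile

fig, ax = plt.subplots()
ax.hist(areas, bins=200)
for p, c in zip(pcts, cutoffs):
    ax.axvline(c, linestyle="--", label=f"{p}th = {c:.1f}")
ax.set_xlabel("lake area (ha)")
ax.set_ylabel("count")
ax.legend()
fig.savefig("percentile_cutoffs.png")
```

Overlaying the candidate cutoffs on the histogram makes it easy to see how much of the distribution each percentile choice clips into the top color.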
I have simplified and added documentation to the script used to clean, merge, and parse Ingmar's 2 input files into the 5 most recent years (merge_explode_annual.py). I ran it to re-produce the annual input files for the visualization workflow. The script has been uploaded to the new ADC package I created for this data. Now that the input files for the viz workflow have been re-produced, I will start again with 2017, running staging through web tiling in one sweep, with the max value of the range for permanent water being the 99.99th percentile and the max value for seasonal water being the 95th percentile. I will do the same for the following 4 years. Keeping these percentiles the same for all years would be ideal.
All years 2017-2021 have been processed into staged tiles, geotiffs, and web tiles for permanent water and seasonal water. All processing was done on Delta with the ray workflow. The 2017-2018 files have already completed the transfer to Datateam with the pre-issued DOI A28G8FK10, and they are visualized on the demo portal. Years 2019-2021 have not yet fully transferred: Globus has been experiencing NCSA endpoint authentication issues for the past week, making large transfers take >3 days, and sometimes the transfers fail altogether. They will be up on the demo portal when they make it through!
Datasets for lake statistics aggregation
Regional Dataset
Jorgenson: Ecological Mapping and Permafrost Database for Northern Alaska (Final Ecological Mapping Update (2014))
Data Download Link (zipped shp)
Archive/Dataset Link: https://catalog.northslopescience.org/dataset/2236
For data aggregation I would propose to use the fields "ECOREGION" and "LITHOLOGY" to start with; I guess once set up we could add others.
Pan-Arctic Dataset
Olefeldt: Circumpolar Thermokarst Landscapes, 2015, Circum-Arctic
[Data Download Link (GPKG), with fixed geometries (I had some issues with original file)](https://1drv.ms/u/s!AobXXrP933xWh8lqz5It06Zf8AT9JA?e=yBQaDY)
Archive/Dataset Link: https://apgc.awi.de/dataset/ctl-arctic
For data aggregation I would propose to use the field "TKThLP"