robyngit opened 2 years ago
I processed the sample data from Ingmar using the same workflow we created for the IWP polygons, creating both PNG web tiles and 3D tiles. Everything ran very smoothly. The output is currently displayed on the demo portal:
Notes:
New data package
* This styling works nicely with a black background map (CartoDB Dark Matter, or similar)
* File: `lake_change_grid_3000_netchange.tif`
* They are similar to Webb et al., 2022 (Surface Water Index Trend). Please just use the raster as is; it's already designed to have aggregated statistics per pixel.
* Palette: RdBu, Range: -2 to +2, NoData = 0
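A minimal sketch of how that styling could be previewed outside the portal, assuming `rasterio` and `matplotlib` are available (only the filename, palette, value range, and NoData value come from the notes above; everything else is illustrative):

```python
import numpy as np
import rasterio
import matplotlib.pyplot as plt

with rasterio.open("lake_change_grid_3000_netchange.tif") as src:
    band = src.read(1)

# 0 is the NoData value for this raster, so mask it out rather than coloring it
masked = np.ma.masked_equal(band, 0)

fig, ax = plt.subplots(figsize=(8, 8), facecolor="black")
im = ax.imshow(masked, cmap="RdBu", vmin=-2, vmax=2)  # red = loss, blue = growth
fig.colorbar(im, ax=ax, label="net lake change")
ax.set_axis_off()
plt.show()
```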
Ingmar uploaded 5 zip files that contain lake change data to a Google Drive folder here.
data_products_32635-32640
Per our visualization meeting discussion on 4/3, the highest priority is to process the data in one of the 5 directories, taking Ingmar's color suggestions into consideration. Next, we will move on to processing the other 4 directories and finally Ingmar's newer data, documented in issue#37.
Update: These 5 directories have been uploaded to the NCEAS datateam server: /home/pdg/data/nitze_lake_change/data_2022_11_04/lake_change_GD
Initially tried to stage all 6 `lake_change.gpkg` files within `data_products_32635-32640`, but staging was taking hours and was producing a surprising number of files from just one of the input gpkg files: `data_products_32635-32640/32640/lake_change.gpkg`, so I restarted the process as a tmux script to only stage this one UTM zone.
viz-staging
We love a suspenseful mystery!
`IN_PROGRESS_VIZ_WORKFLOW.py` and other processing scripts with adjustments. Web tile visualization on local Cesium:
To do:
* Change the `property` used for the `statistic` to `ChangeRateNet_myr-1`
* Changing the `property` will likely require us to adjust other config options as well:
  * potentially `weight_by` (depending on units of `ChangeRateNet_myr-1`), but likely keep this at `area` rather than changing to `count` when using `ChangeRateNet_myr-1`
  * `aggregation_method` to `mode`, so the color of that pixel represents the most common rate change present in the pixel, rather than the average or sum of the rate changes in that pixel
    * updates:
      * `mode` is not an option. During the rasterization step for the highest z-level, the log reports: `ERROR:pdgraster.RasterTiler:'SeriesGroupBy' object has no attribute 'mode'`. See options [here](https://sparkbyexamples.com/pandas/pandas-aggregate-functions-with-examples/#:~:text=What%20are%20pandas%20aggregate%20functions,form%20a%20single%20summary%20value.).
      * `max` and `mean` are options, but while still rasterizing z-11, these 2 statements printed in the terminal: `/home/jcohen/anaconda3/envs/arcade_layer/lib/python3.9/site-packages/numpy/core/_methods.py:232: RuntimeWarning: invalid value encountered in subtract x = asanyarray(arr - arrmean)` and `/home/jcohen/anaconda3/envs/arcade_layer/lib/python3.9/site-packages/numpy/core/_methods.py:48: RuntimeWarning: invalid value encountered in reduce return umr_sum(a, axis, dtype, out, keepdims, initial, where)`. These statements are likely resulting from errors in rows of the `raster_summary.csv` with invalid cell values of inf and -inf, represented like so:
  * potentially `resampling_method` to `mode`, so that when we produce lower resolution rasters, the color for the pixel represents the most common rate change for that extent, rather than the average of the rate changes for all the lakes in that extent
* Adjusting the config for a property of interest was described in [this issue](https://github.com/PermafrostDiscoveryGateway/viz-workflow/issues/9)
* Adjust the color palette: red and blue in the data above are not meaningful in the way we intend them to be; the blue likely represents 100% coverage, and the red likely represents <100% coverage because the red is used for pixels at the edge of the lake polygons. We should use a _gradient_ from red to yellow to blue rather than 2 discrete colors, which Ingmar suggested above [here](https://github.com/PermafrostDiscoveryGateway/pdg-portal/issues/28#issuecomment-1275936775). The color palette must be accepted by [colormaps](https://pratiman-91.github.io/colormaps).
* Visualize 2 UTM zones to allow us to check if the neighbors deduplication is working well
* Once the `palette` and `statistic` are figured out, change the z-level to 12 to see if that is better for the data resolution (~30m)

Ran the workflow through web-tiling with the `statistic` run on a `property` of the data, rather than the `coverage` as normal for IWP data. Used the attribute `ChangeRateGrowth_myr-1` because I was testing if using an attribute with all positive values would resolve the inf and -inf values shown in the `raster_summary.csv` pictured above. Unfortunately, there were still many inf and -inf values produced, which resulted in failure to produce many lower resolution rasters from the z-11 rasters. I plotted anyway to get an idea of how the colors appear when mapped to this attribute. I used a color palette with 7 colors that range from red to yellow to blue depending on the lake growth:

Config is here.
Notes:
* I determined the source of those `inf` and `-inf` values: they are present in the input data. I back-traced them from the raster summary to the staged tiles and finally to the 36 input `lake_change.gpkg` files, each from one UTM zone. There are also `NaN` values in each gpkg. These values might represent no data, or could be errors.
* After discovering `inf` and `NaN` values within one file, I ran a script to check for them in every file, see below.
* These values are present in several columns, including but not necessarily limited to:
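A rough sketch of this kind of scan with `geopandas` (the base directory is the server path noted above; the column handling and output are assumptions, not the original script):

```python
from pathlib import Path

import geopandas as gpd
import numpy as np

base_dir = Path("/home/pdg/data/nitze_lake_change/data_2022_11_04/lake_change_GD")

for gpkg in sorted(base_dir.rglob("lake_change.gpkg")):
    gdf = gpd.read_file(gpkg)
    numeric = gdf.select_dtypes(include=[np.number])
    n_nan = int(numeric.isna().sum().sum())
    n_inf = int(np.isinf(numeric.to_numpy()).sum())
    if n_nan or n_inf:
        bad_cols = [c for c in numeric.columns
                    if numeric[c].isna().any() or np.isinf(numeric[c]).any()]
        print(f"{gpkg}: {n_nan} NaN and {n_inf} inf/-inf values in {bad_cols}")
```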
@initze: Would you recommend that I remove rows with `inf` and `NaN` values, or replace these values with something else? Alternatively, would you like to re-process the data to avoid calculating these values in the first place?
Ingmar is looking into the source of the `inf` and `NaN` values, and he will get back to us about how he wants to move forward.
Hi @julietcohen . I checked a few of the files and could find some of the affected polygons. It seems that the files didn't run through a specific filter. The affected features/lakes are all very small, thus running into 0 divisions and other stuff.
Delete rows with NaN for now; I will apply a filter in the next version.
I hope that fixes your issue
Cheers Ingmar
Thanks for looking into this, Ingmar. I'll remove all rows with `NaN` or `inf` values and move forward with processing this data with a statistic for the `ChangeRateNet_myr-1` attribute.
@tcnichol is ready to start moving this data to our datateam server. We need to discuss where he will store the data and how he will transfer it from Delta (Globus). Estimated to be around 500 GB, including a lot of intermediary products.
He should store it in the same ~pdg/data staging directory that we've been using for IWP. Juliet had trouble getting globus to write directly there, which is why the ~jscohen account was created. There is also a `pdg` account, and we should talk to Nick to clarify if that can be used to do a globus transfer directly into that location - I think it should work. Let's discuss on slack.
@mbjones: Todd is curious if there is an update on the Globus --> Datateam data transfer situation. If we have enabled this for users without needing to give them access to `jscohen`, please let us know how he can do so.
I talked with Nick about moving the home directory of `pdg` to be `/home/shares/pdg`, which would mean its home directory is on the same filesystem that is used for our PDG data storage. That would enable globus logins with the `pdg` account shared by the project to transfer data directly to where it needs to be. That should eliminate the need for the `jscohen` account altogether. Check with Nick on the status of those changes.
Great, thank you
Update on cleaning lake change data before visualization:

Cleaning the lake change data provided in November 2022, located in: `/home/pdg/data/nitze_lake_change/data_2022-11-04/lake_change_GD/...`

Ingmar requested that the rows of each of the 46 `lake_change.gpkg` files that contain any `NA` or `inf` value be removed prior to visualization, since I noted that this causes issues for the rasterization step, in which we calculate stats on an attribute of the input data.

In order to document the rows that contain `NA` and/or `inf` values and therefore need to be removed, the following script writes a `csv` file for each `lake_change.gpkg` with the rows (and their indices) that contain `NA` or `inf` values. These are saved within a directory with a hierarchy that matches Ingmar's hierarchy, so it's clear which `csv` documents the invalid rows of which original `lake_change.gpkg`.
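A minimal sketch of that per-file documentation step (the output directory name and the rule that any numeric `NA`/`inf` marks a row as invalid are assumptions, not the original script):

```python
from pathlib import Path

import geopandas as gpd
import numpy as np

in_dir = Path("/home/pdg/data/nitze_lake_change/data_2022-11-04/lake_change_GD")
out_dir = Path("invalid_rows")  # hypothetical output location

for gpkg in sorted(in_dir.rglob("lake_change.gpkg")):
    gdf = gpd.read_file(gpkg)
    numeric = gdf.select_dtypes(include=[np.number])
    # a row is "invalid" if any numeric attribute is NA or +/-inf
    invalid = numeric.isna().any(axis=1) | np.isinf(numeric).any(axis=1)
    if invalid.any():
        # mirror the input directory hierarchy under out_dir
        dest = out_dir / gpkg.relative_to(in_dir).parent
        dest.mkdir(parents=True, exist_ok=True)
        # keep the original row index so the rows can be located again later
        gdf.loc[invalid].drop(columns="geometry").to_csv(dest / "invalid_rows.csv")
```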
Ideally for the config, we would set both `aggregation_method` and `resampling_method` to `mode` when visualizing the attribute `ChangeRateNet_myr-1`, but we must use something else besides `mode` for `aggregation_method` until that functionality is built into the workflow. For testing, we can try `max`.

This attribute ranges from approximately -10 to +5, with most values ranging from -2 to +2 (for the 2 UTM zones used for testing):

Therefore, we should also include a `val_range` of `[-2,2]` in the config. This keeps the palette consistent across z-levels, so lakes or portions of lakes do not change color as you zoom in. More extreme values that fall outside that range are assigned the color for -2 or 2.
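Putting those pieces together, the relevant part of the config might look roughly like this (the key layout and stat name are assumptions; see the `ConfigManager` documentation for the exact schema):

```python
# Illustrative config fragment only, not the exact schema used by the workflow
config = {
    # ... other staging / rasterization options ...
    "statistics": [
        {
            "name": "change_rate_net",              # hypothetical stat name
            "property": "ChangeRateNet_myr-1",      # attribute to visualize
            "weight_by": "area",
            "aggregation_method": "max",            # "mode" is not yet supported
            "resampling_method": "mode",
            "val_range": [-2, 2],                   # keep palette consistent across z-levels
            "palette": ["red", "yellow", "blue"],   # placeholder gradient
        }
    ],
}
```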
To do:
* `viz-staging` `nodata_value` and removing deduplication

The last few rounds of lake change visualization for the same UTM zones tested the following two changes in the config:
* `nodata_value` set to `999` and `None` (instead of 0, which is used for IWP data, because 0 is a possible value for the visualized attribute in this lake change data)
* `min` instead of `max` for the `aggregation_method`

`nodata_value` set to `999` and `None`

Both of these `nodata_value`s were problematic, resulting in a yellow layer over the tiles that spans from the prime meridian to the antimeridian:
This may be related to the fact that these values, `999` and `None`, are not present in the data at all, so the `nodata_mask` is of length 0 when we execute `to_image()` in `viz-raster` here. Because the `nodata_value` is not set to 0 anymore, the 0 values in the data (yellow in our palette) are not made into the mask, so they are retained. Consider the IWP config, where we do set the `nodata_value` to 0: the cells that have 0% coverage or 0 polygons are set to transparent. However, this still does not explain why the yellow layer expands all the way around half of the world, and is not limited to the region of the 2 UTM zones. The same dir of staged tiles is used to produce both web tile sets that do and do not have the yellow layer, so I suppose this means that the yellow layer is only in the web tiles, and not the staged tiles themselves. More testing is needed to narrow down what's happening here.
In the web tiles, we still see the gap that I assume represents the border of the two UTM zones visualized:
The next step to figure out what's going on here is to plot all the lake data for both UTM zones and see if they overlap. Ingmar expressed that there certainly is overlap and there needs to be deduplication integrated either before the viz-workflow or during the viz-workflow.
`min` instead of `max` for aggregation

This makes more sense considering that the most extreme values of the distribution for this attribute are negative, rather than positive. Hopefully this more correctly conveys the environmental trends shown in the data and shows more diverse values in the palette.
After looking into the code for the neighbors deduplication approach, as well as plotting the input data overlap, here are some observations:
* There are no `True` values for the `staging_duplicated` attribute in the `staged` tiles.
* There is no `Date` attribute in the input data, yet I used the same `deduplicate_keep_rules` as the IWP dataset (`[["Date","larger"]]`) when I should use a property that is in the data.
* The property in `deduplicate_keep_rules` should not be the property I'm visualizing, because I can choose between `larger` and `smaller` (see here), and this property has both positive and negative values, so it would be more appropriate to use `[["Perimeter_meter","larger"]]`, which has all positive values.
* An error should probably be raised when the property in `deduplicate_keep_rules` is not present in the data (to be determined where this error should go exactly, and if the workflow is already set up to raise an error in this scenario, why it was not raised in the ray workflow specifically).

Changing the `deduplicate_...` config options corrected the deduplication:
`neighbor` dedup method

Output in the terminal during staging brought my attention to some syntax that could potentially be improved in the neighbor method. The message might be referring to this syntax. The link to the pandas documentation that describes the better syntax is here.
I re-processed the web tiles from the same correctly deduplicated geotiffs in the previous comment, but used `np.nan` for the `nodata_val` in the config (and made sure to `import numpy as np` at the top of the config).

The result is still this strange yellow layer that also resulted from setting `nodata_val` to `999` and `None`:
The `ConfigManager.py` specifies that the `nodata_val` should be able to be set to an integer, float, `None`, or `np.nan`.

If we do use 0 as the `nodata_val` here, which is the only value I have tried so far that doesn't result in that yellow layer, we would be using it the same way we do for the IWP data, in the sense that in both datasets' palettes we are not differentiating between regions where no data was collected and regions that have data but where no polygons were identified / no lakes changed size.
> I re-processed the web tiles from the same correctly deduplicated geotiffs in the previous comment

Have you tried setting the `nodata_val` before rasterization? You could dig through the code to check if this is in fact how it works, but theoretically the workflow should set the pixels without any polygons to the `nodata_val` when first creating the highest level geotiffs.
Thanks for the feedback @robyngit! I'll re-start the workflow at the raster highest step rather than web-tiling, with the `nodata_val` set to `999` or `np.nan`, and see if that works.
I didn't think that the nodata value would be used in the raster highest step, because searching for `nodata_val` within `viz-raster` shows that the only occurrences are when web tiles are created. Within `RasterTiler.py`, `nodata_val` occurs when we execute `webtile_from_geotiff()`. However, this was just a quick string search on GitHub rather than a deep dive into how you use `nodata_val` throughout the package, so it's totally possible that it's used earlier in the raster highest step and I'm missing it.
Maybe the raster highest step gets the `nodata_val` from `get_raster_config()` defined in `viz-staging`, which is called within `rasterize_vector()` here.
Looks like I made it so that no data pixels are always zero in the `Raster` class 😕, see: https://github.com/PermafrostDiscoveryGateway/viz-raster/blob/92924077ccf6083442c9c2aaf40a8f164818f2b9/pdgraster/Raster.py#L772-L773

> Grid cells without any data will be filled with 0.

We would have to make it so that the `0` in the `fill_value=0` line uses the `nodata_val` from the config, see: https://github.com/PermafrostDiscoveryGateway/viz-raster/blob/92924077ccf6083442c9c2aaf40a8f164818f2b9/pdgraster/Raster.py#L809
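A self-contained illustration of why that hard-coded fill matters (this is not the viz-raster code, just an analogy with plain pandas): cells that contain no polygons become 0 and are then indistinguishable from a genuine 0 in the statistic, whereas leaving them as the configured nodata value keeps them separable.

```python
import pandas as pd

# pretend these are per-cell aggregated statistics for a 3x3 tile;
# only three cells actually contain polygons, and one real value is 0.0
cells = pd.DataFrame({
    "row": [0, 0, 2],
    "col": [0, 1, 2],
    "stat": [0.0, -1.5, 1.2],
})

grid = cells.pivot(index="row", columns="col", values="stat")

grid_zero_fill = grid.fillna(0)  # analogous to fill_value=0: empty cells look like real zeros
grid_nan_fill = grid             # empty cells stay NaN and can be masked as "no data"

print(grid_zero_fill)
print(grid_nan_fill)
```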
`nodata_val`

In line with Robyn's observation, which I did not see until after I tried this, I re-rasterized the same deduplicated `staged` tiles for these 2 UTM zones, so started the workflow at raster highest, with the `nodata_val` set to `999`. The resulting web tiles still have the yellow haze (see demo portal). I will open a branch to integrate that change to enable non-0 `nodata_val`s.
Ingmar helpfully created a new issue to document his data cleaning to remove seawater and rivers: https://github.com/PermafrostDiscoveryGateway/landsattrend-pipeline/issues/8
https://drive.google.com/drive/folders/18pC-FW9Nibmkcv7DPlzzT3YW4Aim0k7C?usp=sharing
Thank you, @initze ! Well done.
This data has been uploaded to Datateam at:
/var/data/submission/pdg/nitze_lake_change/filtered_parquet_2024-04-03
along with a README file. I included Ingmar's notes above, as well as my notes and links to the relevant github issues.
I did an initial check of the new, filtered version of the lake change dataset: `/var/data/submission/pdg/nitze_lake_change/filtered_parquet_2024-04-03/Lakes_global-001.parquet`

There are `NA` values present in the attributes `ChangeRateNet_myr-1` and `ChangeRateGrowth_myr-1`. As I did for the previous version of the lake change data that contained NA values for certain attributes, I can clean this data by removing the entire row if it contains NA values for an attribute.
I focused on this dataset today because it makes sense to use this data for 4 different but related goals for the PDG project, which are of varying degrees of priority (see `viz-staging` issue#36).

I used a subset of 10,000 polygons of the Lake Change data in parquet format as input into the visualization workflow with 3 stats:
* `ChangeRateNet_myr-1`, which is an attribute in Ingmar's data
* `area_per_pixel_area`, a custom stat in the viz workflow
* `centroids_per_pixel`, a custom stat in the viz workflow

They are up on the demo portal:
* `ChangeRateNet_myr-1`
* `area_per_pixel_area` (% coverage)
* `centroids_per_pixel` (number of lakes)

Zooming in, the pixels that represent the center of each lake are so high res (and therefore so small) that they are hard to see.

For `ChangeRateNet_myr-1` and `area_per_pixel_area`, the polygons appear to be blocky and low resolution even though the max z-level is 13. This may be due to the `simplify_tolerance`, which I set to 0.1. This can be made much finer, such as 0.0001, to see if that resolves it. If the simplification tolerance does fix this issue, it would be worth investigating the minimum or maximum tolerance that should be allowed based on the max zoom level specified in the config. For example, considering the max zoom level set as the `z_range`, the workflow should give a warning, an error, or automatically adjust the tolerance based on what it should be (a rough sketch of such a check is included below).
* `nodata_val` is set to 0 in the viz workflow (see `viz-raster` issue#16)
* `ChangeRateNet_myr-1` works well

I created a larger sample of 20,000 polygons and reprocessed the data with a much smaller `simplify_tolerance` of 0.0001. This succeeded in retaining the lake shapes of the polygons.
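A rough sketch of the tolerance-vs-zoom-level sanity check proposed above. The tile matrix set (WGS84 / WorldCRS84Quad with 256-pixel tiles) and the two-pixel threshold are assumptions, not part of the current workflow:

```python
import warnings

def check_simplify_tolerance(simplify_tolerance: float, max_z: int,
                             tile_size: int = 256) -> None:
    """Warn if the simplification tolerance is coarser than the pixels at max_z."""
    # WorldCRS84Quad has 2 x 1 tiles at level 0, so one tile spans 180 degrees
    # of longitude at level 0 and 180 / 2**z degrees at level z.
    pixel_size_deg = 180.0 / (2 ** max_z * tile_size)
    # heuristic: allow a tolerance of up to ~2 pixels at the max zoom level
    if simplify_tolerance > 2 * pixel_size_deg:
        warnings.warn(
            f"simplify_tolerance={simplify_tolerance} is larger than ~2 pixels "
            f"({2 * pixel_size_deg:.6f} deg) at z={max_z}; polygons may look blocky."
        )

check_simplify_tolerance(0.1, 13)     # warns: far coarser than a z-13 pixel
check_simplify_tolerance(0.0001, 13)  # passes: close to the z-13 pixel size
```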
The larger sample size highlighted that there are other polygons that appear to be seawater in addition to the island one pictured above in red in the last comment. So maybe the filtering could be refined.
Since I resolved this issue, here's the path forward for processing the lake change data:
* Remove the rows that are `True` for `staging_duplicated` before passing the data to me.
* Visualize `ChangeRateNet_myr-1` as well as any other attributes that Ingmar or Anna wants visualized from this dataset.
* If the data contains multipolygons, one option is to `explode` those polygons into singular ones with copied attributes for the new rows as part of the pre-viz processing. This is because if we input geometries of any kind besides single polygons, the workflow will not process those geometries. However, this dataset contains attributes related to area and perimeter of the polygons, which means that those attributes will not be accurate for the exploded polygons, so in my opinion that would not be good practice for this dataset. This is noted here.
* Alternatively, Ingmar could `explode` the multipolygon geometries himself, then make the values `NA` in those new rows for the attributes that cannot be accurately copied over. For example, for a multipolygon the attribute `ChangeRateNet_myr-1` would translate correctly to the new rows of exploded singular polygons, but the area attribute should not be copied over, so it should be converted to NA (sketched below).
* Visualize `ChangeRateNet_myr-1` for this first version of the dataset.
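A hedged sketch of that second option with `geopandas` (the file path and the per-feature column names such as `Area_meter` are assumptions; only `ChangeRateNet_myr-1` and `Perimeter_meter` are named elsewhere in this thread):

```python
import geopandas as gpd
import numpy as np

gdf = gpd.read_file("lake_change.gpkg")  # hypothetical input

# rows whose geometry is a MultiPolygon
is_multi = gdf.geometry.geom_type == "MultiPolygon"

# explode to single polygons; the original index is repeated for each part
exploded = gdf.explode(index_parts=False)

# attributes that describe the whole original lake and should not be copied
# onto its parts (column names are assumptions)
per_feature_cols = ["Area_meter", "Perimeter_meter"]
from_multi = exploded.index.isin(gdf.index[is_multi])
exploded.loc[from_multi, per_feature_cols] = np.nan

# per-lake rates such as ChangeRateNet_myr-1 are kept as-is on every part
exploded = exploded.reset_index(drop=True)
```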
Ingmar helpfully provided some big picture dataset info:
Question | Answer |
---|---|
How many lakes are mapped? | about 6-6.2 million. |
Total dataset size? | the uncleaned, merged dataset is 4.13GB |
Total area of the lakes? | the uncleaned, merged dataset is ~950,000 km² but it will become smaller once deduplication is applied |
For the derived lake drainage dataset, would we include a binary attribute data for simply "this lake is present or drained", or would it be an ordered categorical attribute including "partially drained", "totally drained", and "not drained"? | "Thats a tricky one. drained must typically refer to some reference date, as lakes are often polycyclic. Ideally we would have a drainage year. We have an attribute "NetChange_perc" which actually provides that information. In a paper I am working on we have >25% <= 75% loss as partially drained, >75% as completely drained. However change must be > 1 ha as small lakes/changes are quite uncertain" |
For the derived lake drainage dataset, would the geometry be in the form of polygons (the perimeter of the lake) or a point? | Polygons. For other analysis or visualization we can still reduce to centroid/points or something similar |
After some mapping in QGIS, and without doing systematic checks to confirm for sure, polygons that border or intersect the antimeridian seem to have been identified in Ingmar's lake change dataset when the neighbors deduplication approach was applied to 2 adjacent UTM zones, 32601-32602. UTM zone 32601 borders the antimeridian. The neighbors deduplication approach transforms the geometries from their original CRS into the one specified as `distance_crs`, so the antimeridian polygons are distorted and wrap the opposite way around the world. See this screenshot of the distorted 32601 (green) and 32602 (yellow):
This results in polygons in zone 32602 overlapping spatially with polygons from 32601 that are not actually in the region of overlap. Here's a screenshot from Jonas (a collaborator with Ingmar and Todd) who mapped the polygons that were identified as duplicates in red:
Mapping the flagged duplicate polygons on top of the distorted zone 32601 shows suspicious overlap:
If this intersection with the antimeridian is indeed the cause of the incorrectly flagged duplicates (those that lie outside of the overlap between 2 adjacent UTM zones), then the neighbor deduplication should work well for all zones that do not intersect the antimeridian. Todd is looking into applying the approach to all UTM zones, and figuring out how to get around it for zone 32601. My recommendation is to identify which polygons do cross the antimeridian, split them, and buffer them slightly away from the line. Code for this was applied to the permafrost and ground ice dataset here.
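This is not the code referenced above, but a rough sketch of the split-and-buffer idea in `geopandas`/`shapely`, assuming the layer has already been reprojected to EPSG:4326 (the file path, the gap width, and the detection heuristic are all assumptions):

```python
import geopandas as gpd
from shapely.geometry import LineString
from shapely.ops import transform

gdf = gpd.read_file("lake_change_32601_4326.gpkg")  # hypothetical input in EPSG:4326

def to_0_360(geom):
    # shift longitudes into 0-360 space so the antimeridian is a plain vertical line at x=180
    return transform(lambda x, y: ((x + 360) % 360, y), geom)

def to_pm180(geom):
    # shift back to the usual -180..180 longitude space
    return transform(lambda x, y: (((x + 180) % 360) - 180, y), geom)

# a crude flag for antimeridian-crossing polygons: their -180..180 bounds span most of the globe
crosses = gdf.geometry.apply(lambda g: g.bounds[2] - g.bounds[0] > 180)

# a thin buffer around the 180 degree line; subtracting it splits each crossing
# polygon into parts that sit slightly away from the antimeridian
gap = LineString([(180, -90), (180, 90)]).buffer(1e-6)

gdf.loc[crosses, "geometry"] = gdf.loc[crosses, "geometry"].apply(
    lambda g: to_pm180(to_0_360(g).difference(gap))
)
```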
Hi, this is Jonas from AWI.
I was initially asked by Todd regarding the files but i was not sure about the problem. But your post motivated me to look into this a bit further. I am not familiar with the details of the de-duplication, but i understand that you reproject the UTM layers (32601, 32602) to EPSG:3857 to find the spatial overlaps, which causes the vertices on the other side of the antimeridian to wrap around?

I guess you should try to project both datasets into a CRS where the coordinates do not wrap around at the antimeridian for these areas. I for example used EPSG:3832 to create the very screenshot you showed above to get rid of the polygon distortion. I didn't realize that this was actually the problem, so i didn't even mention this to Todd. I am optimistic that if you project into EPSG:3832 you'll get the correct results for the UTM zones bordering the antimeridian.
Another option would be to just declare the datasets to be a non-critical UTM zone (for example 32603 and 32604) basically translating all polygons to the east before reprojection, so the longitude coordinates do not wrap around in EPSG:3857. I guess the inaccuracy is negligible. After de-duplication project back to 32603/04 and then translate back to 32601/02. But i guess the first option is preferable.
Sorry if I misunderstood the problem or if i am totally off track here.
Hi Jonas, thanks for your suggestions! You are correct that the CRS is configurable for the deduplication step, and I simply used the default projected CRS in my example script that I passed off to Todd. Todd and Ingmar wanted an example of how to execute the deduplication with adjacent UTM zones before the data is input into the visualization workflow. Given my example script and explicit parameters, your team can change the parameters as you see fit.
The transformation to the new CRS, EPSG 3857, is indeed what causes the geometries to become deformed. However, this CRS is not the only CRS that causes this issue, and the deduplication step is not the only time we transform the data in the workflow. I have encountered similarly deformed geometries in other datasets when converting to EPSG 4326 (not projected), which we do during the initial standardization of the data during the "staging" step. That unprojected CRS is required for the Tile Matrix Set of our viz workflow's output geopackage and geotiff tilesets. This means that you may be able to deduplicate the lake change data prior to the visualization workflow with a different CRS and retain the original lake geometries that border the antimeridian, but they will likely be problematic in the same way when we create tilesets from them, requiring buffering from the antimeridian as a "cleaning"/preprocessing step before staging anyway.
```python
import geopandas as gpd
import matplotlib.pyplot as plt

# read the UTM zone 32601 lake change polygons in their original CRS
gdf = gpd.read_file("~/check_dedup_forIN/check_for_todd/check_0726/lake_change_32601.gpkg")
fig, ax = plt.subplots(figsize=(10, 10))
gdf.plot(cmap = 'viridis', linewidth = 0.8, ax = ax)

# unprojected EPSG:4326, used by the viz workflow's tile matrix set
gdf_4326 = gdf.to_crs(epsg = 4326, inplace = False)
fig, ax = plt.subplots(figsize=(10, 10))
gdf_4326.plot(cmap = 'viridis', linewidth = 0.8, ax = ax)

# EPSG:3857, the default distance_crs used in the deduplication example script
gdf_3857 = gdf.to_crs(epsg = 3857, inplace = False)
fig, ax = plt.subplots(figsize=(10, 10))
gdf_3857.plot(cmap = 'viridis', linewidth = 0.8, ax = ax)

# EPSG:3832, the Pacific-centered Mercator suggested by Jonas
gdf_3832 = gdf.to_crs(epsg = 3832, inplace = False)
fig, ax = plt.subplots(figsize=(10, 10))
gdf_3832.plot(cmap = 'viridis', linewidth = 0.8, ax = ax)
```
This last plot shows the data transformed into your suggested CRS, and as you suggested it shows no wrapping around the world. However, the description of that CRS makes me wonder whether the transformation would slightly deform the geometries, since it appears that the suggested use area for that CRS does not contain UTM zone 32601 (but please correct me if I am wrong). This amount of deformation is likely negligible, though.
I don't follow your last suggestion fully, but it does sound like a function I know exists in R, ST_ShiftLongitude.
Documentation for the neighbor deduplication approach can be found here. The source code is here.
Your concerns make sense (but i would personally argue the 3857 isn't really precise either but that's another can of worms 😉 ).
I might find the time to look into this further, but for now i guess in the end it boils down to the origin of the reference system. From what i remember from my GIS lectures:
In most predefined reference systems with roughly global coverage the longitude/x-origin is located at the 0° meridian with an extent of -180° to +180° in the lon-axis. If you project a polygon overlapping the antimeridian into these CRS, some vertices of those polygons will be transformed to having an x-coordinate close to -180 and others close to +180 (if in degrees). That is what happens here.
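A small illustration of that wrap-around (not from the original discussion; the `pyproj` usage is an assumption): two points just either side of the antimeridian land on opposite edges of the map in EPSG:3857, but stay close together in EPSG:3832 because its origin sits at 150° E.

```python
from pyproj import Transformer

to_3857 = Transformer.from_crs("EPSG:4326", "EPSG:3857", always_xy=True)
to_3832 = Transformer.from_crs("EPSG:4326", "EPSG:3832", always_xy=True)

west_of_am = (179.9, 65.0)   # lon, lat just west of the antimeridian
east_of_am = (-179.9, 65.0)  # lon, lat just east of the antimeridian

for name, tr in [("EPSG:3857", to_3857), ("EPSG:3832", to_3832)]:
    x1, _ = tr.transform(*west_of_am)
    x2, _ = tr.transform(*east_of_am)
    print(f"{name}: x separation = {abs(x1 - x2) / 1000:,.0f} km")
# EPSG:3857 reports roughly 40,000 km; EPSG:3832 only a few tens of km
```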
However, this is just a way the coordinate extent is defined, and that can be changed. One can simply modify it to set the origin close to the area of interest. You can observe this in the CRS config (I think it's most convenient and insightful to look at the PROJ4 string):

```
EPSG:4326: +proj=longlat +datum=WGS84 +no_defs +type=crs
EPSG:3857: +proj=merc +a=6378137 +b=6378137 +lat_ts=0 +lon_0=0 +x_0=0 +y_0=0 +k=1 +units=m +nadgrids=@null +wktext +no_defs +type=crs
EPSG:3832: +proj=merc +lon_0=150 +k=1 +x_0=0 +y_0=0 +datum=WGS84 +units=m +no_defs +type=crs
```
Note that 4326 and 3857 use a `+lon_0` value of `0`, while 3832 uses a `+lon_0` value of `150`. That means it sets its origin to 150° east (see https://proj.org/en/9.4/usage/projections.html). So in that case the x-coordinates of the vertices of the polygons in the lake_change dataset are all set around the +30° value, which corresponds to the antimeridian.
So if we want to stick to EPSG:3857, we can use it as a base. For example, in QGIS you can define this custom CRS (in Settings -> Custom projections):

```
+proj=merc +a=6378137 +b=6378137 +lat_ts=0 +lon_0=180 +x_0=0 +y_0=0 +k=1 +units=m +nadgrids=@null +wktext +no_defs
```

which is just EPSG:3857 with the origin rotated (`+lon_0=180`), and it displays both files just fine.
So if you transform both GeoDataFrames into that custom CRS before getting the intersection i think it might work and you are technically still in 3857, just the origin is different:
```python
import pyproj
import geopandas as gpd

gdf = gpd.read_file("lake_change_32601.gpkg")

# EPSG:3857 with the origin rotated to the antimeridian (+lon_0=180)
crs_3857_rotated = pyproj.CRS.from_proj4("+proj=merc +a=6378137 +b=6378137 "
                                         "+lat_ts=0 +lon_0=180 +x_0=0 +y_0=0 +k=1 "
                                         "+units=m +nadgrids=@null +wktext +no_defs")

gdf_3857r = gdf.to_crs(crs_3857_rotated, inplace = False)
gdf_3857r.plot(cmap = 'viridis', linewidth = 0.8)
```
Note that for production one might consider not using the proj4 string for the definition of the custom CRS, but the WKT string, which is presumably more precise: https://proj.org/en/9.4/faq.html#what-is-the-best-format-for-describing-coordinate-reference-systems . The Proj library provides conversion tools; the setting corresponding to `lon_0` is imho

```
...
PARAMETER["Longitude of natural origin",0,
    ANGLEUNIT["degree",0.0174532925199433],
    ID["EPSG",8802]],
...
```

but i didn't test that.
I looked in your code to test the deduplication with this, but it seems like the projection is done beforehand (i.e. outside of deduplicate_neighbors() ) and i hadn't had the time to mock something up. But i guess if you use this custom CRS for data in the UTM zones close to the antimeridian, it will work.
Thanks so much for your thorough explanation of your idea, that's an interesting approach! Since the plan is for this deduplication to be completed by @tcnichol prior to passing the data off to me for visualization and standardization, maybe he can read through our suggestions and choose how to move forward.
I'd like to clarify when the CRS transformations take place. Since deduplication will occur before staging for this dataset, the transformation within the duplicate flagging process is the first time. Within `deduplicate_neighbors`, the parameter `distance_crs` defines the CRS that the data will be transformed into. The transformation occurs here, and note that it is done on the geometry points to check the distance between the centroids and identify the duplicates. The next time the data is transformed is here during staging, after you pass off the data to me.

Lastly, I'd like to highlight that flagging the duplicates and removing the duplicates are different steps. When we flag the duplicates, we create a boolean column, which I set to be called `staged_duplicated` via the parameter `prop_duplicated` in `deduplicate_neighbors`, to indicate if each polygon is a duplicate (True) or not (False). Your team is familiar with this, because the example script I provided outputs the file `32617-32618_dups_flagged.gpkg`. So as a final step before passing the data to me, simply remove the rows that are True in each GDF.
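A minimal sketch of that final step (the flag column name follows the `prop_duplicated` setting described above; the output filename is hypothetical):

```python
import geopandas as gpd

gdf = gpd.read_file("32617-32618_dups_flagged.gpkg")

# drop every polygon flagged as a duplicate, keep the rest
gdf = gdf[~gdf["staged_duplicated"].astype(bool)]

gdf.to_file("32617-32618_deduplicated.gpkg", driver="GPKG")
```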
Before we determine which approach to take to deal with these polygons that cross the antimeridian, either by splitting them along a buffered 180th-degree longitude line or by creating a custom CRS with a different central meridian, I want to first answer the question: are the polygons that intersect the antimeridian lake detections, or simply seawater and river polygons that are already supposed to be filtered out before deduplication and visualization anyway?

I did this exploration in R. The plots show that all but 1 of the polygons that intersect the antimeridian are seawater or rivers, which should be removed by Todd and Ingmar's filtering prior to deduplication and visualization.

Note that the same exploration should be done for the UTM zone on the other side of the 180th degree longitude, and any zones further south that also touch the antimeridian may be included in the analysis.
Todd will first apply the seawater and river filtering to the lake change data (including the final step of removing the rows where the polygon has been flagged as False for `within_land`), then will apply the antimeridian polygon splitting to the relevant UTM zones (for the few polygons that need it), then will apply the deduplication; then it is my understanding that the data can be passed to me for visualization.
Notes from a meeting today, Aug 16th, are in the PDG meeting notes. They outline the next steps for Ingmar and Todd to process all UTM zones with filtering, deduplication, and merging, then validate the data with the geohash IDs. Regarding the lake polygons that intersect the antimeridian, Todd is not sure how many are in the dataset. Ingmar suggested that if they stick to polar projections for all data processing steps prior to the visualization workflow, then the intersection with the antimeridian will not be a problem at all. While this is true, it will still be a problem when the data is input into the viz workflow (as noted above here as well) because we use EPSG:4326. I emphasize this because if there are lake polygons that intersect the antimeridian, then whoever runs the viz workflow on this data will need to split those polygons prior to processing with one of the methods I documented above.
`datateam.nceas.ucsb.edu:/home/pdg/data/nitze_lake_change/data_sample_2022-09-09`

Sample data info

The files are not final, but the general structure will be the same.

General structure:
* The raster dataset (`5_Lake_Dataset_Raster_02_final`)
* Style files (`.qml`) for a nice visualization
* `lake_change_rates_net_10cl_v3tight` shows absolute changes: negative values (red) for loss, positive values (blue) for growth.