Got the run on Delta to work. It seems that the `ray.init()` issue I experienced last week was random (on Delta's end), because I did not change the Slurm script or the way I initialized ray, and this time it worked as expected! Wonderful.
`viz-staging` branch: `bug-17-clippingToFP`

Web tiles visualized:
In a way, it's great that the tiles were not deduplicated/clipped correctly with the ray workflow but were with the non-parallelized run on Datateam, because it helps us narrow down what triggers deduplication!
Differences between the runs were:

- `viz-raster` branch: the Datateam run without parallelization, where the tiles looked deduplicated/clipped, used the `main` branch of `viz-raster`, while the Delta run with ray used the `bug-ID-notRecognized` branch of `viz-raster`. That branch was created specifically for the ray workflow to get around an error where the ID object created by `__start_tracking` and used by `__end_tracking` could not be read. I think it is unlikely this is the cause, because all I changed for that branch (see commit here) was replacing that ID object with another string so we could still execute `__end_tracking`.
Ran a script that opens all files in the `staged` directory, checks for any True values in the column we use to identify duplicate polygons in the staging step (`staging_duplicated`), and sums the number of True values. It was run on both the Datateam `staged` dir and the Delta `staged` dir that processed the same 3 adjacent & overlapping input files (see comments above).

In both cases, the number of True values (duplicate polygons) was 469. This is great, because the workflow identifies the same number of duplicates with and without parallelization.
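For reference, a minimal sketch of that kind of check looks like the following; the staged directory path is a placeholder, and the column name is the one used in the staging step:

```python
import glob
import geopandas as gpd

# Placeholder: path to one of the staged directories being checked
staged_dir = "/path/to/staged"

total_dups = 0
# Staged tiles are GeoPackage files; sum the True values in the
# `staging_duplicated` column across every file
for path in glob.glob(f"{staged_dir}/**/*.gpkg", recursive=True):
    gdf = gpd.read_file(path)
    total_dups += int(gdf["staging_duplicated"].sum())

print(f"Total polygons flagged as duplicates: {total_dups}")
```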
Notes:

- Adjusted `clip_gdf()` to report the number of rows it identified as `outside` and `within` the footprint (see here).
- Adjusted `deduplicate_by_footprint()`, after `clip_gdf()` is executed, to report the length of 'keep' and 'remove' (which correspond to 'within' and 'outside', respectively) (see here).

| statement | # of occurrences |
| --- | --- |
| "Clipping to footprint is being executed." | 938 |
| "outside is: Empty GeoDataFrame" | 728 |
| "within is: Empty GeoDataFrame" | 41 |
| "Length of clip_results['keep'] = 0" | 41 |
| "Length of clip_results['removed'] = 0" | 68 |
The most confusing of these counts: how could there be 41 tiles in which 0 polygons fall within the footprint?
It might not be surprising: some tiles could contain just a small portion of a shapefile, and that portion may lie entirely outside the footprint.
If I understand your `check_if_any_true_dups.py` code correctly, it looks like it counts whether there are any True values for each tile. The next step might be comparing how many True values there are for all tiles. Something like...
```python
import geopandas as gpd
import pandas as pd
from pdgstaging import TilePathManager

# The name of the column in the staged files that indicates whether a row is a duplicate
dup_col_name = "staging_duplicated"

# Names for the two directories to compare (just for convenience)
dir1 = "delta"
dir1_path = "/home/jcohen/compare_staged_dt_delta/delta_data/staged/"
dir2 = "datateam"
dir2_path = "/home/jcohen/viz-staging/pre-fix_3AdjFile_run/staged/"

# Create a TilePathManager object in order to easily get the tile indices
# and staged file paths for a given tile.
tpm = TilePathManager(
    tms_id="WorldCRS84Quad",
    base_dirs={
        dir1: {"path": dir1_path, "ext": ".gpkg"},
        dir2: {"path": dir2_path, "ext": ".gpkg"},
    },
)

def get_tile_info(path):
    tile = tpm.tile_from_path(path)
    gdf = gpd.read_file(path)
    dup_col = gdf[dup_col_name]
    return {
        "x": tile.x,
        "y": tile.y,
        "z": tile.z,
        "num_dups": dup_col.sum(),
        "num_rows": len(dup_col),
    }

files1 = tpm.get_filenames_from_dir(dir1)
files2 = tpm.get_filenames_from_dir(dir2)
df1 = pd.DataFrame([get_tile_info(path) for path in files1])
df2 = pd.DataFrame([get_tile_info(path) for path in files2])

df = pd.merge(df1, df2, on=["x", "y", "z"], suffixes=[f"_{dir1}", f"_{dir2}"])
df["num_dups_diff"] = df[f"num_dups_{dir1}"] - df[f"num_dups_{dir2}"]
df["num_rows_diff"] = df[f"num_rows_{dir1}"] - df[f"num_rows_{dir2}"]

print(f'Number of tiles with different number of duplicates: {len(df[df["num_dups_diff"] != 0])}')
print(f'Number of tiles with different number of rows: {len(df[df["num_rows_diff"] != 0])}')
```
If this shows that there are no differences in the number of duplicates identified, maybe the next step would be to look for differences in the max-z GeoTiff tiles.
Thank you, Robyn! I appreciate the insight and script. I'll run it and post the results. You are correct that my script only determined whether a tile contained at least 1 True value; it did not count the number of True values.

I see what you mean about the possibility that a tile covers only a small portion of a shapefile, and that portion falls entirely outside the footprint.
Output:

```
Number of tiles with different number of duplicates: 235
Number of tiles with different number of rows: 0
```
Wow!
It looks like neither Delta nor Datateam consistently identified more or fewer duplicates than the other (there are both positive and negative non-zero values in the difference column): 116 times Datateam identified more duplicates than Delta, and 119 times Delta identified more duplicates than Datateam.
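Those counts can be pulled straight from the merged comparison dataframe; a small sketch, assuming the merged dataframe from the script above was saved as `dups_summary.csv` (the filename read back in the plotting snippet below):

```python
import pandas as pd

# Summary produced by the comparison script above (assumed filename)
df = pd.read_csv("dups_summary.csv")

# num_dups_diff = num_dups_delta - num_dups_datateam
delta_more = (df["num_dups_diff"] > 0).sum()     # Delta flagged more duplicates
datateam_more = (df["num_dups_diff"] < 0).sum()  # Datateam flagged more duplicates

print(f"Delta flagged more duplicates in {delta_more} tiles")
print(f"Datateam flagged more duplicates in {datateam_more} tiles")
```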
Did a quick little plot, and it looks like the differences are indeed concentrated around where there is probably footprint overlap:
```python
import pandas as pd
import geopandas as gpd
from shapely.geometry import box
from pdgstaging import TilePathManager
from matplotlib import pyplot as plt

path = "dups_summary.csv"
df = pd.read_csv(path)

tpm = TilePathManager(tms_id="WorldCRS84Quad")

def make_geom(row):
    tile = tpm.tile(row["x"], row["y"], row["z"])
    bb = tpm.get_bounding_box(tile)
    n, s, e, w = bb["top"], bb["bottom"], bb["right"], bb["left"]
    return box(w, s, e, n)

df["geometry"] = df.apply(make_geom, axis=1)
gdf = gpd.GeoDataFrame(df, geometry="geometry", crs=tpm.crs)
gdf.plot(column="num_dups_diff", legend=True)
plt.savefig("dups_summary.png")
plt.show()
```
Amazing! I was just wrangling the dataframe in R to figure out a trend in the duplicate flagging; plotting it right away was much more clever. Very informative.
`viz-staging`: which files are identified as duplicates depends on the order the files are staged?

Solution:

- Execute `label_duplicates()` in `deduplicate_by_footprint()` after `clip_gdf()` (there is no point in labeling a polygon as a duplicate if we are going to clip it anyway), and execute these functions in the same order regardless of whether one or more input files produced polygons in the geodataframe (basically remove this code?).
- Execute `combine_and_deduplicate()` regardless of whether the tile is a new file or the polygons are being added to an existing file (code to change is here; remove the line `if os.path.isfile(tile_path):`).

Here are details of my suggestion of an easier way to do this: the "clip" step could go in the main `stage` method, here-ish: https://github.com/PermafrostDiscoveryGateway/viz-staging/blob/4f31e951600d54c128f76b48a47ec390261fb548/pdgstaging/TileStager.py#L129-L130
So you would have something like...

```python
def stage(self, path):
    # ...
    gdf = self.set_crs(gdf)
    gdf = self.clip_to_footprint(gdf)  # <--- new step
    self.grid = self.make_tms_grid(gdf)
    # ...

# Add a new method to the class
def clip_to_footprint(self, gdf):
    # - check the config to see if we should clip at all, if not, return the gdf
    # - then find the associated footprint file, you could use
    #   config.footprint_path_from_input(path, check_exists=True)
    # - then run deduplicator.clip_gdf and deduplicator.label_duplicates on the gdf
    # - return the clipped gdf
```
Then you would need to make sure that the workflow doesn't try to clip again at the save tiles stage.
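As a concrete reading of that sketch, the new method might be filled in roughly like the following. Only `config.footprint_path_from_input()`, `clip_gdf`, and `label_duplicates` are named above; everything else here is an assumption, including the extra `path` argument, the config flag accessor, the import location of `deduplicator`, and the shape of `clip_gdf`'s return value (guessed from the `clip_results['keep']` / `clip_results['removed']` log messages earlier):

```python
import geopandas as gpd
from pdgstaging import deduplicator  # assumed import location for the dedup helpers

def clip_to_footprint(self, gdf, path):
    """Illustrative sketch only, not the actual viz-staging implementation."""
    # Skip clipping entirely if the config says not to clip (hypothetical accessor)
    if not self.config.get("deduplicate_clip_to_footprint"):
        return gdf

    # Find the footprint file associated with this input file
    footprint_path = self.config.footprint_path_from_input(path, check_exists=True)
    footprint = gpd.read_file(footprint_path)

    # Split polygons into those inside vs. outside the footprint
    # (assumed: a dict with 'keep' and 'removed' GeoDataFrames is returned)
    clip_results = deduplicator.clip_gdf(gdf, footprint)

    # Flag the polygons that fell outside the footprint as duplicates so a
    # later step removes them (assumed signature for label_duplicates)
    return deduplicator.label_duplicates(gdf, clip_results["removed"])
```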
Thank you, Robyn! I'm making these suggested changes in my branch.
Ran test of the new clipping and deduplication approach on Datateam.
Number of staged tiles: 3214. Number of geotiffs: 4376. Number of web tiles: 4376.

Deduplication where footprints overlap is working. Clipping to footprint was executed according to the log, but it seems those polygons were not removed.
I believe the polygons flagged as duplicates were not removed because, after we clip to footprint just after reading in the file and its footprint, we then execute `add_properties()` before we deduplicate by footprint, and that function still retained a step from the previous deduplication approach:
```python
# Add the column to flag duplicate polygons. This will be set to True
# later if duplicates are found.
dedup_method = self.config.get_deduplication_method()
if dedup_method is not None:
    # Mark all the polygons as not duplicated
    gdf[self.config.polygon_prop('duplicated')] = False
```
This step overwrites the boolean column of the same name that identifies the polygons already labeled as duplicated because they were clipped to the footprint.
Removing the above code chunk did not resolve the issue: the polygons flagged as duplicates because they fell outside the footprint were still not removed. So there must be multiple places in the deduplication approach that overwrite those labels in the 'duplicated' column of the tile with just the duplicates identified from overlapping footprint areas. I adjusted `deduplicate_by_footprint()` so that the `to_remove` list created at the start is immediately populated with the subset of the input gdf representing the polygons that were labeled as duplicates earlier, when we executed `clip_gdf` and then `label_duplicates` the first time.
```python
gdf = gdf.copy()

# `to_remove` list will hold the polygons that fit either of the criteria:
# 1. were previously labeled True for `duplicated` col because poly
#    fell outside footprint
# 2. are labeled as True for `duplicated` col within this function
#    based on overlap of footprints
to_remove = []

# First, add the polygons that were already labeled as duplicates because
# fell outside footprint
known_dups = gdf[gdf['duplicated'] == True]
to_remove.append(known_dups)
logger.info(f"After initially adding the previously identified dups df to list to_remove, length is {len(to_remove)}\nand some values in list are {to_remove[0:10]}.")

# Will hold the polygons that defined the footprint intersections
intersections = []
```
The output web tiles looked the same. The search continues!

See viz-staging commit.
After successfully clipping to footprint and removing duplicates where the footprints overlap, the next step is to change the geopandas `sjoin` predicate used for clipping to footprint, in order to eliminate the light strips we see between some adjacent shapefiles when you zoom in and look really closely. Instead of using 'within', which only retains the polygons that fall completely inside the footprint boundary, we want to retain both the polygons that fall completely within the footprint boundary and those that only partially fall within it but hang over the edge. Changing the predicate to 'intersects' did the trick!
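A toy illustration of the difference between the two predicates (a standalone example with made-up geometries, not the viz-staging code; it uses the `predicate` argument from newer geopandas releases):

```python
import geopandas as gpd
from shapely.geometry import box

# A square "footprint" and two polygons: one fully inside, one hanging over the edge
footprint = gpd.GeoDataFrame(geometry=[box(0, 0, 10, 10)])
polys = gpd.GeoDataFrame(geometry=[box(2, 2, 4, 4), box(9, 9, 11, 11)])

# 'within' keeps only the polygon that falls completely inside the footprint
within = gpd.sjoin(polys, footprint, predicate="within", how="inner")

# 'intersects' also keeps the polygon that overlaps the footprint boundary
intersects = gpd.sjoin(polys, footprint, predicate="intersects", how="inner")

print(len(within), len(intersects))  # 1 2
```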
Wow, that looks fantastic @julietcohen !!!
Thanks Robyn!
Note: there is a noticeable light strip along the border of the bottom 2 tiles (the kind of strip we thought we were no longer seeing after switching the `clip_gdf` predicate from 'within' to 'intersects'). I wonder why the border of the upper 2 tiles does not show this while the bottom one does. Regardless, this is a big improvement in the deduplication using the Ray workflow.
Next run on Delta:

- added statements to state when clipping is happening and which predicate is used (`logging.info()` is still largely not working)

Setup:

- input: the `high/russia/226_227_iwp` dir
- palette (gray-red to bright red): `["#a15e5e", "#aa5555", "#b44b4b", "#be4141", "#c93636", "#d42b2b", "#e31c1c", "#f50a0a", "#ff0000"]`
- `viz-staging` branch `bug-17-clippingToFP`
- `viz-raster` branch `bug-ID-notRecognized`

staging:

- staged files were written to the nodes' `/tmp` directories (head, and 1 worker)

raster highest:

- during rasterization at the highest z-level, all 4 nodes were producing geotiff files in their respective `/tmp` directories
- head node was at ~25% CPU, ~16% mem
- worker nodes were at ~25% CPU, ~15% mem, but produced geotiffs more slowly than the head node
- interesting errors were written, which don't make sense to me because the geotiff dir is created initially in each node's `/tmp` and is empty, yet it seems a couple of geotiffs were attempted to be written multiple times?
- total number of geotiffs produced at the highest z-level: 50,184 (so roughly 9 errored / did not get written)
- 6 min

raster lower:

- cleaned up `raster_summary.csv` before executing web tiling, as usual

web tiles:
Final output:
Relics of edge effects are still visible between tiles, but only in the form of clipping to footprint, not where footprints overlap. These edge effects are not present between every tile; you just spot one every once in a while when scrolling around.

When you zoom in on some of the border strips, it's clear that polygons falling on the edge of the footprint were indeed retained, which is what we want (we only want to remove the polygons that fall completely outside the footprint). I wonder if these empty strips between some tiles are unavoidable, no matter the predicate used.
Additional note: `raster_summary.csv` still shows the maximum value of the `iwp_coverage` statistic as >1. I still have to remove 3 or 4 incorrectly formatted rows and re-upload the csv between raster lower and web tiling. The values >1 occur on Datateam (without parallelization) as well.
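That manual cleanup could be scripted along these lines; the path is a placeholder, and treating rows with unparseable or >1 `iwp_coverage` values as the malformed ones is my assumption:

```python
import pandas as pd

# Placeholder path to the summary written during the raster step
path = "raster_summary.csv"
df = pd.read_csv(path)

# Coerce the coverage column to numeric so malformed values become NaN
cov = pd.to_numeric(df["iwp_coverage"], errors="coerce")

# Drop rows that are unparseable or report coverage greater than 1
bad = cov.isna() | (cov > 1)
print(f"Dropping {bad.sum()} rows")
df[~bad].to_csv(path, index=False)
```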
Plotting the footprints for these input files shows the area that was covered by input files (with partial transparency to show where footprints overlap). This shows that the strip of "missing" IWP on the map is a result of the lack of input data for that area, not errors in the visualization workflow.
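For reference, that kind of footprint plot can be produced roughly like this; the footprint directory and file extension are placeholders:

```python
import glob
import geopandas as gpd
import pandas as pd
from matplotlib import pyplot as plt

# Placeholder: directory containing the footprint files for these inputs
footprint_paths = glob.glob("/path/to/footprints/*.gpkg")

# Combine all footprints into one GeoDataFrame
footprints = gpd.GeoDataFrame(
    pd.concat([gpd.read_file(p) for p in footprint_paths], ignore_index=True)
)

# Partial transparency makes overlapping footprint areas appear darker
footprints.plot(alpha=0.5, edgecolor="black")
plt.savefig("footprints.png")
```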
Nice -- let's talk to Elias and Chandi and figure out if this missing data is unexpected from their perspective.
The strip of missing tiles shown above is due to the 180th meridian, which falls on Wrangel Island. See here:
I finalized the changes to the new `viz-staging` deduplication approach today. The new approach runs smoothly on both Delta and Datateam, with identical output. The logging in both the ray workflow and the simple `viz-workflow` now works as well!
The branch `bug-17-clippingToFP` has also been tested to confirm it works when `dedup_method` is set to `None` and `clipping_to_footprint` is set to `False` in the config (the result is what we expect with that config):
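For the record, the config for that test would have looked roughly like the fragment below; the key names are paraphrased from the options mentioned above and may not match the literal viz-staging config keys:

```python
# Assumed config fragment for the "no dedup, no clipping" test (key names are guesses)
config = {
    # ... input/output dirs, TMS, z-range, palette, etc. ...
    "deduplicate_method": None,              # dedup_method set to None
    "deduplicate_clip_to_footprint": False,  # clipping to footprint disabled
}
```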
The `neighbors` deduplication approach has been modified slightly as well (essentially removing `label_duplicates()` as a separate function; the necessary parts have been tailored to and integrated into the `neighbors` method), but the `neighbors` method has not been thoroughly tested yet. We decided that it is higher priority to process all the IWP data before we test it. As a result, the `viz-staging` package with these changes documents that the `neighbors` approach is not to be used until a release after 0.1.0. Release 0.1.0 is tailored to the `footprints` dedup method.
The deduplication / clipping to footprint seems to be working on the `viz-staging` branch `bug-17-clippingToFP` no matter what order the files are staged in (using 3 adjacent shp files as input). Keep in mind that this was not using parallelization. I would like to test whether this branch deduplicates / clips to footprint with parallelization with ray on the Delta server, too, before I jump into making any other major changes to the branch.

Run on Datateam

Web tiles visualized on local Cesium show successful deduplication / clipping to footprint.

Very small strips with no polygons between tiles indicate that clipping is working, but we should change the predicate from "within" to something else so that we retain the polygons that overlap the footprint boundary, rather than removing those along with the polygons that fall completely outside the footprint boundary:
Run on Delta

Major errors with initializing ray. Couldn't get the `IN_PROGRESS_VIZ_WORKFLOW.py` script to start; it hung at the `ray.init()` line.