Got the run on Delta to work. It seems that the `ray.init()` issue I experienced last week was random (on Delta's end), because I did not change the Slurm script or the way I initialized ray, and this time it worked as expected! Wonderful.
`viz-staging` branch: `bug-17-clippingToFP`

Web tiles visualized:
In a way, it's great that the tiles were not deduplicated/clipped correctly with the ray workflow but were with the non-parallelized run on Datateam, because it helps us narrow down what triggers deduplication!
Differences between the runs were:

- `viz-raster` branch: the Datateam run without parallelization, where the tiles looked deduplicated/clipped, used the `main` branch of `viz-raster`, while the Delta run with ray used the `bug-ID-notRecognized` branch of `viz-raster`. That branch was created specifically for the ray workflow to get around an error where the ID object created by `__start_tracking` and used by `__end_tracking` could not be read. I think it is unlikely this is the cause, because all I changed for that branch (see commit here) was replacing that ID object with another string so we could still execute `__end_tracking`.
Ran a script that opens all files in the `staged` directory, checks for any True values in the column we use to identify duplicate polygons in the staging step (`staging_duplicated`), and sums the number of True values. It was run on both the Datateam `staged` dir and the Delta `staged` dir that processed the same 3 adjacent & overlapping input files (see comments above).

In both cases, the number of True values (duplicate polygons) was 469. This is great, because the workflow identifies the same number of duplicates with and without parallelization.
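For reference, a minimal sketch of that kind of check looks like the following; the staged directory path is a placeholder, and the column name is the one used in the staging step:

```python
import glob
import geopandas as gpd

# Placeholder: path to one of the staged directories being checked
staged_dir = "/path/to/staged"

total_dups = 0
# Staged tiles are GeoPackage files; sum the True values in the
# `staging_duplicated` column across every file
for path in glob.glob(f"{staged_dir}/**/*.gpkg", recursive=True):
    gdf = gpd.read_file(path)
    total_dups += int(gdf["staging_duplicated"].sum())

print(f"Total polygons flagged as duplicates: {total_dups}")
```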
Notes:

- Adjusted `clip_gdf()` to report the number of rows it identified as `outside` and `within` the footprint (see here).
- Adjusted `deduplicate_by_footprint()`, after `clip_gdf()` is executed, to report the length of 'keep' and 'remove' (which correspond to 'within' and 'outside', respectively) (see here).

| statement | # of occurrences |
| --- | --- |
| "Clipping to footprint is being executed." | 938 |
| "outside is: Empty GeoDataFrame" | 728 |
| "within is: Empty GeoDataFrame" | 41 |
| "Length of clip_results['keep'] = 0" | 41 |
| "Length of clip_results['removed'] = 0" | 68 |
The most confusing of these counts: how could there be 41 tiles in which 0 polygons fall within the footprint?
It might not be surprising: some tiles could contain just a small portion of a shapefile, and that portion may lie entirely outside the footprint.
If I understand your `check_if_any_true_dups.py` code correctly, it looks like it counts whether there are any True values for each tile. The next step might be comparing how many True values there are for all tiles. Something like...
```python
import geopandas as gpd
import pandas as pd
from pdgstaging import TilePathManager

# The name of the column in the staged files that indicates whether a row is a duplicate
dup_col_name = "staging_duplicated"

# Names for the two directories to compare (just for convenience)
dir1 = "delta"
dir1_path = "/home/jcohen/compare_staged_dt_delta/delta_data/staged/"
dir2 = "datateam"
dir2_path = "/home/jcohen/viz-staging/pre-fix_3AdjFile_run/staged/"

# Create a TilePathManager object in order to easily get the tile indices
# and staged file paths for a given tile.
tpm = TilePathManager(
    tms_id="WorldCRS84Quad",
    base_dirs={
        dir1: {"path": dir1_path, "ext": ".gpkg"},
        dir2: {"path": dir2_path, "ext": ".gpkg"},
    },
)

def get_tile_info(path):
    tile = tpm.tile_from_path(path)
    gdf = gpd.read_file(path)
    dup_col = gdf[dup_col_name]
    return {
        "x": tile.x,
        "y": tile.y,
        "z": tile.z,
        "num_dups": dup_col.sum(),
        "num_rows": len(dup_col),
    }

files1 = tpm.get_filenames_from_dir(dir1)
files2 = tpm.get_filenames_from_dir(dir2)
df1 = pd.DataFrame([get_tile_info(path) for path in files1])
df2 = pd.DataFrame([get_tile_info(path) for path in files2])

df = pd.merge(df1, df2, on=["x", "y", "z"], suffixes=[f"_{dir1}", f"_{dir2}"])
df["num_dups_diff"] = df[f"num_dups_{dir1}"] - df[f"num_dups_{dir2}"]
df["num_rows_diff"] = df[f"num_rows_{dir1}"] - df[f"num_rows_{dir2}"]

print(f'Number of tiles with different number of duplicates: {len(df[df["num_dups_diff"] != 0])}')
print(f'Number of tiles with different number of rows: {len(df[df["num_rows_diff"] != 0])}')
```
If this shows that there are no differences in the number of duplicates identified, maybe the next step would be to look for differences in the max-z GeoTiff tiles.
Thank you, Robyn! I appreciate the insight and script. I'll run it and post the results. You are correct that my script only determined whether a tile contained at least 1 True value; it did not count the number of True values.

I see what you mean about the possibility that a tile covers only a small portion of a shapefile, and that portion falls entirely outside the footprint.
Output:

```
Number of tiles with different number of duplicates: 235
Number of tiles with different number of rows: 0
```
Wow!
It looks like neither Delta nor Datateam consistently identified more or fewer duplicates than the other (there are both positive and negative non-zero values in the difference column): 116 times Datateam identified more duplicates than Delta, and 119 times Delta identified more duplicates than Datateam.
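Those counts can be pulled straight from the merged comparison dataframe; a small sketch, assuming the merged dataframe from the script above was saved as `dups_summary.csv` (the filename read back in the plotting snippet below):

```python
import pandas as pd

# Summary produced by the comparison script above (assumed filename)
df = pd.read_csv("dups_summary.csv")

# num_dups_diff = num_dups_delta - num_dups_datateam
delta_more = (df["num_dups_diff"] > 0).sum()     # Delta flagged more duplicates
datateam_more = (df["num_dups_diff"] < 0).sum()  # Datateam flagged more duplicates

print(f"Delta flagged more duplicates in {delta_more} tiles")
print(f"Datateam flagged more duplicates in {datateam_more} tiles")
```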
Did a quick little plot, and it looks like the differences are indeed concentrated around where there is probably footprint overlap:
```python
import pandas as pd
import geopandas as gpd
from shapely.geometry import box
from pdgstaging import TilePathManager
from matplotlib import pyplot as plt

path = "dups_summary.csv"
df = pd.read_csv(path)

tpm = TilePathManager(tms_id="WorldCRS84Quad")

def make_geom(row):
    tile = tpm.tile(row["x"], row["y"], row["z"])
    bb = tpm.get_bounding_box(tile)
    n, s, e, w = bb["top"], bb["bottom"], bb["right"], bb["left"]
    return box(w, s, e, n)

df["geometry"] = df.apply(make_geom, axis=1)
gdf = gpd.GeoDataFrame(df, geometry="geometry", crs=tpm.crs)
gdf.plot(column="num_dups_diff", legend=True)
plt.savefig("dups_summary.png")
plt.show()
```
Amazing! I was just wrangling the dataframe in R to figure out a trend in the duplicate flagging; plotting it right away was much more clever. Very informative.
`viz-staging`: which files are identified as duplicates depends on the order the files are staged?

Solution:

- Execute `label_duplicates()` in `deduplicate_by_footprint()` after `clip_gdf()` (there is no point in labeling a polygon as a duplicate if we are going to clip it anyway), and execute these functions in the same order regardless of whether one or more input files produced polygons in the geodataframe (basically remove this code?).
- Execute `combine_and_deduplicate()` regardless of whether the tile is a new file or the polygons are being added to an existing file (code to change is here; remove the line `if os.path.isfile(tile_path):`).

Here are details of my suggestion of an easier way to do this: the "clip" step could go in the main `stage` method, here-ish: https://github.com/PermafrostDiscoveryGateway/viz-staging/blob/4f31e951600d54c128f76b48a47ec390261fb548/pdgstaging/TileStager.py#L129-L130
So you would have something like...

```python
def stage(self, path):
    # ...
    gdf = self.set_crs(gdf)
    gdf = self.clip_to_footprint(gdf)  # <--- new step
    self.grid = self.make_tms_grid(gdf)
    # ...

# Add a new method to the class
def clip_to_footprint(self, gdf):
    # - check the config to see if we should clip at all, if not, return the gdf
    # - then find the associated footprint file, you could use
    #   config.footprint_path_from_input(path, check_exists=True)
    # - then run deduplicator.clip_gdf and deduplicator.label_duplicates on the gdf
    # - return the clipped gdf
```
Then you would need to make sure that the workflow doesn't try to clip again at the save tiles stage.
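As a concrete reading of that sketch, the new method might be filled in roughly like the following. Only `config.footprint_path_from_input()`, `clip_gdf`, and `label_duplicates` are named above; everything else here is an assumption, including the extra `path` argument, the config flag accessor, the import location of `deduplicator`, and the shape of `clip_gdf`'s return value (guessed from the `clip_results['keep']` / `clip_results['removed']` log messages earlier):

```python
import geopandas as gpd
from pdgstaging import deduplicator  # assumed import location for the dedup helpers

def clip_to_footprint(self, gdf, path):
    """Illustrative sketch only, not the actual viz-staging implementation."""
    # Skip clipping entirely if the config says not to clip (hypothetical accessor)
    if not self.config.get("deduplicate_clip_to_footprint"):
        return gdf

    # Find the footprint file associated with this input file
    footprint_path = self.config.footprint_path_from_input(path, check_exists=True)
    footprint = gpd.read_file(footprint_path)

    # Split polygons into those inside vs. outside the footprint
    # (assumed: a dict with 'keep' and 'removed' GeoDataFrames is returned)
    clip_results = deduplicator.clip_gdf(gdf, footprint)

    # Flag the polygons that fell outside the footprint as duplicates so a
    # later step removes them (assumed signature for label_duplicates)
    return deduplicator.label_duplicates(gdf, clip_results["removed"])
```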
Thank you, Robyn! I'm making these suggested changes in my branch.
Ran test of the new clipping and deduplication approach on Datateam.
Number of staged tiles: 3214. Number of geotiffs: 4376. Number of web tiles: 4376.

Deduplication where footprints overlap is working. Clipping to footprint was executed according to the log, but it seems those polygons were not removed.
I believe the polygons flagged as duplicates were not removed because, after we clip to footprint just after reading in the file and its footprint, we then execute `add_properties()` before we deduplicate by footprint, and that function still retained a step from the previous deduplication approach:
```python
# Add the column to flag duplicate polygons. This will be set to True
# later if duplicates are found.
dedup_method = self.config.get_deduplication_method()
if dedup_method is not None:
    # Mark all the polygons as not duplicated
    gdf[self.config.polygon_prop('duplicated')] = False
```
This step overwrites the boolean column of the same name that identifies the polygons already labeled as duplicated because they were clipped to the footprint.
Removing the above code chunk did not resolve the issue: the polygons flagged as duplicates because they fell outside the footprint were still not removed. So there must be multiple places in the deduplication approach that overwrite those labels in the 'duplicated' column of the tile with just the duplicates identified from overlapping footprint areas. I adjusted `deduplicate_by_footprint()` so that the `to_remove` list created at the start is immediately populated with the subset of the input gdf representing the polygons that were labeled as duplicates earlier, when we executed `clip_gdf` and then `label_duplicates` the first time.
```python
gdf = gdf.copy()

# `to_remove` list will hold the polygons that fit either of the criteria:
# 1. were previously labeled True for `duplicated` col because poly
#    fell outside footprint
# 2. are labeled as True for `duplicated` col within this function
#    based on overlap of footprints
to_remove = []

# First, add the polygons that were already labeled as duplicates because
# fell outside footprint
known_dups = gdf[gdf['duplicated'] == True]
to_remove.append(known_dups)
logger.info(f"After initially adding the previously identified dups df to list to_remove, length is {len(to_remove)}\nand some values in list are {to_remove[0:10]}.")

# Will hold the polygons that defined the footprint intersections
intersections = []
```
The output web tiles looked the same. The search continues!

See viz-staging commit.
After successfully clipping to footprint and removing duplicates where the footprints overlap, the next step is to change the geopandas `sjoin` predicate used for clipping to footprint, in order to eliminate the light strips we see between some adjacent shapefiles when you zoom in and look really closely. Instead of using 'within', which only retains the polygons that fall completely inside the footprint boundary, we want to retain both the polygons that fall completely within the footprint boundary and those that only partially fall within it but hang over the edge. Changing the predicate to 'intersects' did the trick!
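A toy illustration of the difference between the two predicates (a standalone example with made-up geometries, not the viz-staging code; it uses the `predicate` argument from newer geopandas releases):

```python
import geopandas as gpd
from shapely.geometry import box

# A square "footprint" and two polygons: one fully inside, one hanging over the edge
footprint = gpd.GeoDataFrame(geometry=[box(0, 0, 10, 10)])
polys = gpd.GeoDataFrame(geometry=[box(2, 2, 4, 4), box(9, 9, 11, 11)])

# 'within' keeps only the polygon that falls completely inside the footprint
within = gpd.sjoin(polys, footprint, predicate="within", how="inner")

# 'intersects' also keeps the polygon that overlaps the footprint boundary
intersects = gpd.sjoin(polys, footprint, predicate="intersects", how="inner")

print(len(within), len(intersects))  # 1 2
```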
Wow, that looks fantastic @julietcohen !!!
Thanks Robyn!
Note: there is a noticeable light strip along the border of the bottom 2 tiles (the kind of strip we thought we were no longer seeing after switching the `clip_gdf` predicate from 'within' to 'intersects'). I wonder why the border of the upper 2 tiles does not show this while the bottom one does. Regardless, this is a big improvement in the deduplication using the Ray workflow.
Next run on Delta:

- added statements to state when clipping is happening and which predicate is used (`logging.info()` is still largely not working)

Setup:

- input: the `high/russia/226_227_iwp` dir
- palette (gray-red to bright red): `["#a15e5e", "#aa5555", "#b44b4b", "#be4141", "#c93636", "#d42b2b", "#e31c1c", "#f50a0a", "#ff0000"]`
- `viz-staging` branch `bug-17-clippingToFP`
- `viz-raster` branch `bug-ID-notRecognized`

staging:

- staged files were written to the nodes' `/tmp` directories (head, and 1 worker)

raster highest:

- during rasterization at the highest z-level, all 4 nodes were producing geotiff files in their respective `/tmp` directories
- head node was at ~25% CPU, ~16% mem
- worker nodes were at ~25% CPU, ~15% mem, but produced geotiffs more slowly than the head node
- interesting errors were written, which don't make sense to me because the geotiff dir is created initially in each node's `/tmp` and is empty, yet it seems a couple of geotiffs were attempted to be written multiple times?
- total number of geotiffs produced at the highest z-level: 50,184 (so roughly 9 errored / did not get written)
- 6 min

raster lower:

- cleaned up `raster_summary.csv` before executing web tiling, as usual

web tiles:
Final output:
Relics of edge effects are still visible between tiles, but only in the form of clipping to footprint, not where footprints overlap. These edge effects are not present between every tile; you just spot one every once in a while when scrolling around.

When you zoom in on some of the border strips, it's clear that polygons falling on the edge of the footprint were indeed retained, which is what we want (we only want to remove the polygons that fall completely outside the footprint). I wonder if these empty strips between some tiles are unavoidable, no matter the predicate used.
Additional note: `raster_summary.csv` still shows the maximum value of the `iwp_coverage` statistic as >1. I still have to remove 3 or 4 incorrectly formatted rows and re-upload the csv between raster lower and web tiling. The values >1 occur on Datateam (without parallelization) as well.
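That manual cleanup could be scripted along these lines; the path is a placeholder, and treating rows with unparseable or >1 `iwp_coverage` values as the malformed ones is my assumption:

```python
import pandas as pd

# Placeholder path to the summary written during the raster step
path = "raster_summary.csv"
df = pd.read_csv(path)

# Coerce the coverage column to numeric so malformed values become NaN
cov = pd.to_numeric(df["iwp_coverage"], errors="coerce")

# Drop rows that are unparseable or report coverage greater than 1
bad = cov.isna() | (cov > 1)
print(f"Dropping {bad.sum()} rows")
df[~bad].to_csv(path, index=False)
```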
Plotting the footprints for these input files shows the area that was covered by input files (with partial transparency to show where footprints overlap). This shows that the strip of "missing" IWP on the map is a result of the lack of input data for that area, not errors in the visualization workflow.
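For reference, that kind of footprint plot can be produced roughly like this; the footprint directory and file extension are placeholders:

```python
import glob
import geopandas as gpd
import pandas as pd
from matplotlib import pyplot as plt

# Placeholder: directory containing the footprint files for these inputs
footprint_paths = glob.glob("/path/to/footprints/*.gpkg")

# Combine all footprints into one GeoDataFrame
footprints = gpd.GeoDataFrame(
    pd.concat([gpd.read_file(p) for p in footprint_paths], ignore_index=True)
)

# Partial transparency makes overlapping footprint areas appear darker
footprints.plot(alpha=0.5, edgecolor="black")
plt.savefig("footprints.png")
```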
Nice -- let's talk to Elias and Chandi and figure out if this missing data is unexpected from their perspective.
The strip of missing tiles shown above is due to the 180th meridian, which falls on Wrangel Island. See here:
I finalized the changes to the new `viz-staging` deduplication approach today. The new approach runs smoothly on both Delta and Datateam, with identical output. The logging in both the ray workflow and the simple `viz-workflow` now works as well!
The branch `bug-17-clippingToFP` has also been tested to confirm it works when `dedup_method` is set to `None` and `clipping_to_footprint` is set to `False` in the config (the result is what we expect with that config):
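For the record, the config for that test would have looked roughly like the fragment below; the key names are paraphrased from the options mentioned above and may not match the literal viz-staging config keys:

```python
# Assumed config fragment for the "no dedup, no clipping" test (key names are guesses)
config = {
    # ... input/output dirs, TMS, z-range, palette, etc. ...
    "deduplicate_method": None,              # dedup_method set to None
    "deduplicate_clip_to_footprint": False,  # clipping to footprint disabled
}
```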
The `neighbors` deduplication approach has been modified slightly as well (essentially removing `label_duplicates()` as a separate function; the necessary parts have been tailored to and integrated into the `neighbors` method), but the `neighbors` method has not been thoroughly tested yet. We decided that it is higher priority to process all the IWP data before we test it. As a result, the `viz-staging` package with these changes documents that the `neighbors` approach is not to be used until a release after 0.1.0. Release 0.1.0 is tailored to the `footprints` dedup method.
The deduplication / clipping to footprint seems to be working on the `viz-staging` branch `bug-17-clippingToFP` no matter what order the files are staged in (using 3 adjacent shp files as input). Keep in mind that this was not using parallelization. I would like to test whether this branch deduplicates / clips to footprint with parallelization with ray on the Delta server, too, before I jump into making any other major changes to the branch.

Run on Datateam

Web tiles visualized on local Cesium show successful deduplication / clipping to footprint.

Very small strips with no polygons between tiles indicate that clipping is working, but we should change the predicate from "within" to something else so that we retain the polygons that overlap the footprint boundary, rather than removing those along with the polygons that fall completely outside the footprint boundary:
Run on Delta

Major errors with initializing ray. Couldn't get the `IN_PROGRESS_VIZ_WORKFLOW.py` script to start; it hung at the `ray.init()` line.