Add a deduplication method that is based on file footprints

robyngit commented 2 years ago

To make a spatially consistent product, we would like a deduplication method that keeps polygons from only one file in areas where two or more files overlap. In these areas of file overlap, we should remove all polygons that are not from the preferred file (e.g. the newest file.)

We should think about the following when coming up with a solution:

- To keep the deduplication process generalizable for pipelines where data streams in, we cannot assume that the footprints of all files are known ahead of time.
- When should deduplication happen? Options:
  - As soon as a new file is read in (then we could compare it to the footprints of all other files, if those are known)
  - After the file has been split into tiles (what is considered an area of overlap within a tile?)
- How do we get an accurate "footprint", since we don't have access to the original image files? Options:
  - Use the convex hull of the polygons. This will be smaller than the footprint of the source image (especially if there are large areas at the boundary of an image where no polygons were detected).
  - Use the footprint from the tiff files that generated the shapefiles? (Is this method generalizable? E.g. will future data products also have a raster version?)
  - Require a 'footprint' dataset, generated from the original image files, to be provided along with the shapefiles?
- How should we go about identifying the preferred file? We could use:
  - Properties of the polygons. The property would need to be consistent across all polygons within the same file (e.g. Date in the Ice Wedge Polygons case), or we would need to calculate the mean/max/min/etc. of the property for all polygons within the file and compare. (And in that case, should we compare across the entire file, within tiles, or within areas of overlap?)
  - Properties of the file itself (e.g. file name, file creation date, modification date)
  - A list of filenames sorted according to preference, or a table with filenames matched to properties of those files (provided by the scientist who generated the shapefiles; must be updated each time a file is added to the pipeline)
- Is there a way to avoid an artificially low density/area of polygons at the boundaries of files?
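
To illustrate the convex hull concern raised above, here is a minimal, standard-library-only sketch. The 10 x 10 image footprint and the three detected polygons are made-up values, and `convex_hull` and `polygon_area` are hypothetical helper names, not part of any existing pipeline code. It only shows that a hull built from polygon vertices can cover far less area than the source image did.

```python
def cross(o, a, b):
    """Z component of the cross product of vectors o->a and o->b."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])


def convex_hull(points):
    """Convex hull of 2D points (Andrew's monotone chain), counter-clockwise."""
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts
    lower = []
    for p in pts:
        # Drop points that would make a clockwise (or straight) turn
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    upper = []
    for p in reversed(pts):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    # The last point of each chain is the first point of the other
    return lower[:-1] + upper[:-1]


def polygon_area(ring):
    """Area of a simple polygon given as a list of (x, y) vertices (shoelace formula)."""
    total = 0.0
    for i in range(len(ring)):
        x1, y1 = ring[i]
        x2, y2 = ring[(i + 1) % len(ring)]
        total += x1 * y2 - x2 * y1
    return abs(total) / 2


# Assumed footprint of the source image (hypothetical values)
image_footprint = [(0, 0), (10, 0), (10, 10), (0, 10)]

# Assumed detected polygons, clustered away from the image edges (hypothetical values)
detected_polygons = [
    [(2, 2), (3, 2), (3, 3), (2, 3)],
    [(5, 4), (6, 4), (6, 5), (5, 5)],
    [(3, 6), (4, 6), (4, 7), (3, 7)],
]

vertices = [pt for poly in detected_polygons for pt in poly]
hull = convex_hull(vertices)

# Fraction of the true footprint that the hull-based "footprint" covers
coverage = polygon_area(hull) / polygon_area(image_footprint)
print(f"Hull covers {coverage:.0%} of the image footprint")
```

In this toy case, any overlap with a neighbouring file that falls outside the hull would go undetected, which is the argument for a footprint derived from the imagery itself.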

robyngit commented 2 years ago

Some notes about deduplication in the Ice Wedge Polygon case:

It looks like the footprints of the source satellite imagery have already been recorded, along with other metadata about the imagery, at least for a portion of the data.

For example, here is a plot of the file pdg/data/Arctic-Imagery/high_ice/4519_1May2021/Chandi_Alaska_Imagery_2021apr22.shp from the datateam server:

[Image: plot of Chandi_Alaska_Imagery_2021apr22.shp]

Along with an example of some of the metadata available:

```
strip_id      WV03_1040010051572D00_M1BS_504794575040_01
scene_id      WV03_20190902222531_1040010051572D00_19SEP0222...
status        tape
sensor        WV03
catalog_id    1040010051572D00
order_id      504794575040_01_P001
prod_code     M1BS
country       US
spec_type     Multispectral
acq_time      2019-09-02T22:25:31.987150+00:00
cloudcover    0.001
cent_lat      70.283835
cent_long     -148.518372
bands         8
columns       12288
rows          11264
bits_pixel    16
file_fmt      NITF
off_nadir     15.8
sun_elev      27.5
exposure      0.0002
scan_dir      Reverse
coll_gsd      1.325
prod_gsd      1.323
det_pitch     0.032
line_rate     5000.0
ref_height    9.0
abscalfact    None
bandwidth     None
tdi           None
xtrackva      -12.7
intrackva     7.8
o_filename    None
o_filepath    None
o_drive       None
s_filename    WV03_20190902222531_1040010051572D00_19SEP0222...
s_filepath    http://blackpearl-data2.pgc.umn.edu/scenes/WV0...
previewjpg    None
previewurl    https://api.discover.digitalglobe.com/show?id=...
rcvd_date     2020-11-04
file_sz       0.42301
FID_          0
wkb_geomet    0.0
Shape_Leng    1.01093
geometry      POLYGON ((-148.70355246999998 70.3536017700000...
```

A file like this is exactly what we need to 1) identify the footprint for a given file (since the geometry of the footprint is matched to the shapefile with the s_filename property); and 2) rank files according to preference (we could use one of the properties in the footprint file, e.g. newest acq_time).
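
As a rough sketch of point 2, files could be ranked by parsing the `acq_time` property. The filenames and two of the three timestamps below are invented for illustration (only the 2019 timestamp comes from the metadata above), and `rank_files` is a hypothetical helper, not an existing function.

```python
from datetime import datetime

# One record per footprint; filenames are placeholders, not real IWP files
footprints = [
    {"s_filename": "file_a.shp", "acq_time": "2019-09-02T22:25:31.987150+00:00"},
    {"s_filename": "file_b.shp", "acq_time": "2021-07-15T21:10:05.000000+00:00"},
    {"s_filename": "file_c.shp", "acq_time": "2017-08-20T20:01:44.500000+00:00"},
]


def rank_files(records, prop="acq_time", newest_first=True):
    """Return filenames ordered by preference, most preferred first.

    Assumes `prop` holds an ISO 8601 timestamp string in every record.
    """
    ordered = sorted(
        records,
        key=lambda rec: datetime.fromisoformat(rec[prop]),
        reverse=newest_first,
    )
    return [rec["s_filename"] for rec in ordered]


ranked = rank_files(footprints)
print(ranked)
```

Parsing the timestamps rather than comparing raw strings keeps the ordering correct if the footprint file ever mixes time zones or timestamp precisions.
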
After inspecting this file, I think it might be worthwhile to remove some of the duplicate files before we run everything through the viz pipeline for the first time. Here is a close-up of the footprint file where more than 10 files overlap:

Update:

The boundaries in the Chandi_Alaska_Imagery_2021apr22.shp file don't exactly match the current version of the IWP shapefiles. It looks like the original files may have been clipped. Here are a few examples where the overlapping files are shown in red and blue, and the boundaries from the Chandi_Alaska_Imagery_2021apr22.shp file are shown in green:

[Image: overlap_plot_3]

robyngit commented 2 years ago

The deduplicate-by-footprint method is now working, but it requires a directory of footprint vector files. Once we have these generated for the IWP data, we should be able to avoid all overlap between files. See documentation here and here.
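
For readers who want the general idea, here is a deliberately simplified sketch. It is not the viz-staging implementation: it assumes rectangular footprints given as `(minx, miny, maxx, maxy)`, represents each polygon by a centroid point, and uses hypothetical names (`dedup_by_footprint`, `new.shp`, `old.shp`). A polygon is dropped when its centroid falls inside the footprint of any file ranked higher than its own.

```python
def contains(box, point):
    """True if point (x, y) lies inside the rectangle (minx, miny, maxx, maxy)."""
    minx, miny, maxx, maxy = box
    x, y = point
    return minx <= x <= maxx and miny <= y <= maxy


def dedup_by_footprint(polygons, footprints, ranked_files):
    """Keep a polygon unless a more-preferred file's footprint covers it.

    polygons:     list of dicts with 'id', 'file', and 'centroid' keys
    footprints:   dict mapping filename -> rectangular footprint
    ranked_files: filenames ordered from most to least preferred
    """
    rank = {name: i for i, name in enumerate(ranked_files)}
    kept = []
    for poly in polygons:
        # Files that outrank the file this polygon came from
        preferred = ranked_files[: rank[poly["file"]]]
        covered = any(contains(footprints[f], poly["centroid"]) for f in preferred)
        if not covered:
            kept.append(poly)
    return kept


# Two hypothetical files whose footprints overlap where 5 <= x <= 10
footprints = {"new.shp": (5, 0, 15, 10), "old.shp": (0, 0, 10, 10)}
ranked_files = ["new.shp", "old.shp"]  # newest file is preferred

polygons = [
    {"id": 1, "file": "old.shp", "centroid": (2, 5)},   # only old.shp covers this
    {"id": 2, "file": "old.shp", "centroid": (7, 5)},   # in the overlap: dropped
    {"id": 3, "file": "new.shp", "centroid": (7, 5)},   # in the overlap: kept
    {"id": 4, "file": "new.shp", "centroid": (12, 5)},  # only new.shp covers this
]

kept_ids = [p["id"] for p in dedup_by_footprint(polygons, footprints, ranked_files)]
print(kept_ids)
```

Real footprints are arbitrary polygons, so an actual implementation would need a proper geometric containment or intersection test and a rule for polygons that straddle a footprint boundary.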

PermafrostDiscoveryGateway / viz-staging

Add a deduplication method that is based on file footprints #3