PermafrostDiscoveryGateway / viz-staging

PDG Visualization staging pipeline
Apache License 2.0

Integrate option to deduplicate data before tiling #55

Open julietcohen opened 4 months ago

julietcohen commented 4 months ago

Currently, deduplication in the visualization workflow starts after the input data has been staged and tiled. If deduplication is set to occur at any step in the workflow (staging, rasterization, and/or 3D tiling), the duplicate polygons are first flagged with a boolean attribute, and the polygons for which that attribute is True are then removed at the specified step.
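To make the flag-then-remove behavior concrete, here is a minimal sketch in plain Python. It is not the viz-staging implementation: the function name `remove_flagged_duplicates`, the attribute name `duplicated`, and the step names are assumptions chosen for illustration, and rows are plain dicts rather than a GeoDataFrame.

```python
def remove_flagged_duplicates(rows, step, deduplicate_at, flag="duplicated"):
    """Drop rows flagged as duplicates, but only at a configured step.

    `flag` and the step names are hypothetical; the real attribute and
    option names in viz-staging may differ.
    """
    if step not in deduplicate_at:
        # Deduplication is not configured for this step: keep every row,
        # including the flagged ones, so a later step can still remove them.
        return list(rows)
    return [row for row in rows if not row.get(flag, False)]


rows = [
    {"id": 1, "duplicated": False},
    {"id": 2, "duplicated": True},
    {"id": 3, "duplicated": False},
]

# Deduplication configured for rasterization only.
kept_at_staging = remove_flagged_duplicates(rows, "staging", ["raster"])
kept_at_raster = remove_flagged_duplicates(rows, "raster", ["raster"])
```

In this sketch the flagged row survives staging and is only dropped at the rasterization step, which mirrors the idea that flagging and removal are separate operations.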

For some datasets, deduplicating the data before it is tiled could be beneficial. For example, Ingmar Nitze's Arctic lake change dataset is composed of UTM zones that overlap at the edges, and he prefers to have the data deduplicated before it is input into the viz-workflow. That way, whether users are interested in the viz output (tilesets of lakes) or the input data, they can have access to only the deduplicated data.

This functionality is in the exploratory phase. An example of applying the neighbor deduplication approach to non-tiled data can be found in this issue. One way this functionality could be integrated into the viz-staging package is by accepting additional values for the deduplication options in the config. For example, deduplicate_at could accept a new option like "before_tiling". In addition to this new flexibility in the config, certain pre-deduplication steps would need to happen, such as adding a source_file attribute to the input data.
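A rough sketch of what the two changes above might look like, again in plain Python. Everything here is hypothetical: `"before_tiling"` is only the proposed value, the set of existing step names is an assumption, and `validate_deduplicate_at` and `add_source_file` are illustrative helpers, not functions from viz-staging.

```python
# Assumed names for the steps that already support deduplication.
EXISTING_STEPS = {"staging", "raster", "3dtiles"}
# Proposed: also allow deduplicating the input data before it is tiled.
PROPOSED_STEPS = EXISTING_STEPS | {"before_tiling"}


def validate_deduplicate_at(value):
    """Normalize a deduplicate_at setting to a list of known step names."""
    if value is None:
        return []
    steps = [value] if isinstance(value, str) else list(value)
    unknown = [s for s in steps if s not in PROPOSED_STEPS]
    if unknown:
        raise ValueError(f"Unknown deduplicate_at value(s): {unknown}")
    return steps


def add_source_file(rows, path):
    """Return copies of the rows, each tagged with the file it came from."""
    return [{**row, "source_file": path} for row in rows]


steps = validate_deduplicate_at("before_tiling")
tagged = add_source_file([{"id": 1}, {"id": 2}], "lakes_utm_zone_a.gpkg")
```

Tagging each row with its `source_file` before deduplication would let a rule such as "keep the polygon from a given file" be applied to data that has not been staged yet.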

julietcohen commented 2 months ago

One more consideration for this feature is that any polygons in the input data that intersect the antimeridian will need to be split prior to deduplication, which is consistent with the need for them to be split before we stage the files anyway. This was identified with the lake change data (see here). I included an example of how to do this in R here, and an example in Python is here.
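For readers without access to the linked examples, the idea of splitting at the antimeridian can be sketched in plain Python. This is not the approach from those examples. It assumes longitude/latitude coordinates, a single exterior ring given as `(lon, lat)` tuples without a repeated closing vertex, and a simple heuristic (a longitude span greater than 180 degrees) to detect a crossing. The helper names are hypothetical.

```python
def clip_to_side(ring, x0, keep_left):
    """Clip a ring against the vertical line x = x0 (Sutherland-Hodgman)."""
    def inside(point):
        return point[0] <= x0 if keep_left else point[0] >= x0

    clipped = []
    for i, cur in enumerate(ring):
        prev = ring[i - 1]  # i == 0 wraps around to the last vertex
        if inside(cur) != inside(prev):
            # The edge crosses the line: add the intersection point.
            t = (x0 - prev[0]) / (cur[0] - prev[0])
            clipped.append((x0, prev[1] + t * (cur[1] - prev[1])))
        if inside(cur):
            clipped.append(cur)
    return clipped


def split_at_antimeridian(ring):
    """Split a lon/lat ring that crosses +/-180 into two rings."""
    lons = [lon for lon, _ in ring]
    if max(lons) - min(lons) <= 180:
        return [ring]  # heuristic: assumed not to cross the antimeridian
    # Shift negative longitudes into the 180..360 range so the ring is contiguous.
    shifted = [(lon + 360 if lon < 0 else lon, lat) for lon, lat in ring]
    positive_side = clip_to_side(shifted, 180.0, keep_left=True)
    negative_side = clip_to_side(shifted, 180.0, keep_left=False)
    # Shift the second piece back to the -180..0 range.
    negative_side = [(lon - 360, lat) for lon, lat in negative_side]
    return [positive_side, negative_side]


crossing = [(179, 70), (-179, 70), (-179, 71), (179, 71)]
parts = split_at_antimeridian(crossing)
```

Each resulting piece stays on one side of the antimeridian, so a duplicate check between neighboring files compares geometries with comparable coordinates. Data stored in a projected CRS, such as UTM zones, would first need to be reprojected for this sketch to apply, and concave polygons or polygons with holes need a proper geometry library.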