MeltanoLabs / target-snowflake

Singer Target for the Snowflake cloud data warehouse
https://hub.meltano.com/loaders/target-snowflake--meltanolabs/

spike: explore sorting edge cases #61

Open pnadolny13 opened 1 year ago

pnadolny13 commented 1 year ago

There was a discussion in https://github.com/MeltanoLabs/target-snowflake/pull/47 about the best way to deduplicate within a stream when a key property is set. The PPW variant uses the last-arriving record for a given PK as the one to use when merging; taking that behavior as the base case for this target, we implemented the same behavior using sorting in the merge query itself.

As discussed in that PR thread, we should explore better options for deduplicating and fully think through whether there are edge cases the current implementation might be missing.

visch commented 1 year ago

Saw this late https://github.com/MeltanoLabs/target-snowflake/pull/47 (fixed my email filters now :open_mouth: )

How I handle that in target-postgres is https://github.com/MeltanoLabs/target-postgres/blob/4f677d3f30ceaf8415dd1a401d22ab8aaf06e2a2/target_postgres/sinks.py#L140 : iterate through the records in a dict (similar to the PPW variant, it seems). I have a "test" here https://github.com/MeltanoLabs/target-postgres/blob/4f677d3f30ceaf8415dd1a401d22ab8aaf06e2a2/target_postgres/tests/test_standard_target.py#LL255C35-L255C35 , which should get some expectations on the actual data instead of just checking for a non-failure, but it's definitely something to get added to the "main" test suite.
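The dict-based approach described above can be sketched roughly like this. This is an illustrative sketch, not the actual sink code: the function and variable names (`deduplicate_records`, `key_properties`) are made up for the example; the idea is just that later records overwrite earlier ones in a dict keyed by the PK, so "last arriving wins".

```python
# Hypothetical sketch of dict-based deduplication: keep only the last
# record seen for each primary-key tuple before writing the batch.

def deduplicate_records(records, key_properties):
    """Return records with only the last occurrence kept per primary key."""
    deduped = {}
    for record in records:
        # Build a hashable key from the configured key properties.
        pk = tuple(record[k] for k in key_properties)
        # Later records overwrite earlier ones, so "last arriving wins".
        deduped[pk] = record
    return list(deduped.values())

batch = [
    {"id": 1, "name": "old"},
    {"id": 2, "name": "two"},
    {"id": 1, "name": "new"},  # duplicate PK; this one should win
]
print(deduplicate_records(batch, ["id"]))
```

Since Python dicts preserve insertion order, the surviving records come out in the order their keys were first seen.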

I think a pure SQL approach is better performance-wise (no idea what the actual impact of iterating in Python is, as I haven't performance-tested this stuff yet).

A more "ANSI" SQL style might use rank() over, but I'm not certain whether that's more widely supported across databases.

pnadolny13 commented 1 year ago

@visch yeah, the PPW variant iterates using a Python dictionary in batches too, but idk if the SDK has already iterated that batch and you're doing it again 🤔, worth exploring as an optimization. I agree that sorting using SQL is probably most efficient; we ended up using QUALIFY and ROW_NUMBER in our custom SQL, but it would be nice if we could tell SQLAlchemy to do this in an engine-agnostic way.

We also have a test https://github.com/MeltanoLabs/target-snowflake/blob/86a9ad00cc41b79d12a8eb21a5a726c2eaf00fbf/tests/core.py#L98 where we assert that the correct record out of the duplicates is chosen. It uses duplicate_records.singer as the test data. Adding a default assertion to that SDK test would probably make sense.
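A minimal sketch of the kind of data assertion being suggested, under the assumption that the dedup keeps the last record per PK. `check_last_duplicate_wins`, `loaded`, and the column names are all hypothetical stand-ins for whatever the real test harness reads back from the target:

```python
# Hypothetical assertion helper: given rows read back after loading
# duplicate_records.singer, verify the last record per PK was kept.

def check_last_duplicate_wins(loaded_rows, expected_by_pk):
    by_pk = {row["id"]: row for row in loaded_rows}
    for pk, expected in expected_by_pk.items():
        assert by_pk[pk] == expected, f"PK {pk}: got {by_pk[pk]}, want {expected}"

# Made-up rows standing in for a query of the loaded table.
loaded = [{"id": 1, "name": "new"}, {"id": 2, "name": "two"}]
check_last_duplicate_wins(loaded, {1: {"id": 1, "name": "new"}})
print("ok")
```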