aws-samples / aws-glue-flatten-nested-json

MIT No Attribution

Q: Job Bookmarking #12

Closed shyamal-anadkat closed 6 days ago

shyamal-anadkat commented 2 years ago

@fnapolitano73 @manashdeb quick q: how does the current job work with bookmarking (so it doesn't create duplicate rows for already-processed data)? Thanks!

fnapolitano73 commented 2 years ago

Hi, thank you for the question. As mentioned in the documentation:

Job bookmarks are implemented for JDBC data sources, the Relationalize transform, and some Amazon Simple Storage Service (Amazon S3) sources. The following table lists the Amazon S3 source formats that AWS Glue supports for job bookmarks.

So, if you enable bookmarking and your sources support it, the "surrogate key" (index) for each relationalized nested level is tracked, and when only new data is added the index continues from the last position.
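For reference, here is a minimal sketch of a Glue script wired up for bookmarks. It is not the script from this repository. The database, table, and staging path are hypothetical placeholders, and it assumes the job is started with the `--job-bookmark-option` parameter set to `job-bookmark-enable`. The pieces that matter for bookmarking are `job.init`, a `transformation_ctx` on the source and on the transform, and `job.commit` at the end.

```python
def run_job(argv):
    """Relationalize a catalog table with job bookmarks in play (sketch)."""
    # Imported lazily: awsglue and pyspark only exist inside the Glue runtime.
    from awsglue.context import GlueContext
    from awsglue.job import Job
    from awsglue.transforms import Relationalize
    from awsglue.utils import getResolvedOptions
    from pyspark.context import SparkContext

    args = getResolvedOptions(argv, ["JOB_NAME"])
    glue_context = GlueContext(SparkContext.getOrCreate())

    job = Job(glue_context)
    job.init(args["JOB_NAME"], args)  # initializes the bookmark state for this run

    # transformation_ctx is the key under which bookmark state is stored.
    source = glue_context.create_dynamic_frame.from_catalog(
        database="example_db",              # hypothetical name
        table_name="example_nested_json",   # hypothetical name
        transformation_ctx="source",
    )

    tables = Relationalize.apply(
        frame=source,
        staging_path="s3://example-bucket/tmp/",  # hypothetical path
        name="root",
        transformation_ctx="relationalize",
    )

    # Writing out each frame in `tables` is omitted here.

    job.commit()  # persists the bookmark state for the next run
    return tables
```

Inside a Glue job script this would be called as `run_job(sys.argv)`. If `job.commit()` is never reached, the bookmark is not advanced and the next run sees the same input again.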

Please note that if a nested value was already relationalized into its own table and you are not automatically joining the data, the child table may contain duplicates with different keys, each joined to the parent table through its own key.

If you want a clean environment, some post-processing might be needed to remove the duplicates and correctly update the parent tables' foreign keys.
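One way that post-processing could look, as a rough sketch in plain Python over lists of dicts (in a real job these would be Spark DataFrame operations). The column names `id`, `index`, `items`, and `sku` are assumptions for illustration only. The idea is to treat child groups with identical content as duplicates, keep the smallest key, and repoint the parent rows to it.

```python
from collections import defaultdict


def dedupe_children(parents, children, fk="items", key="id"):
    """Collapse child groups with identical content and repoint parent keys."""
    # Group child rows by surrogate key; the key itself is left out of the
    # content so that groups differing only by key compare equal.
    groups = defaultdict(list)
    for row in children:
        content = tuple(sorted((k, v) for k, v in row.items() if k != key))
        groups[row[key]].append(content)

    # For each distinct group content, keep the smallest key that carries it.
    canonical = {}
    remap = {}
    for k in sorted(groups):
        signature = tuple(sorted(groups[k]))
        canonical.setdefault(signature, k)
        remap[k] = canonical[signature]

    kept_children = [row for row in children if remap[row[key]] == row[key]]
    fixed_parents = [{**p, fk: remap.get(p[fk], p[fk])} for p in parents]
    return fixed_parents, kept_children


# Hypothetical data: keys 1 and 7 carry the same child content.
parents = [
    {"order": "A", "items": 1},
    {"order": "B", "items": 2},
    {"order": "C", "items": 7},
]
children = [
    {"id": 1, "index": 0, "sku": "x"},
    {"id": 2, "index": 0, "sku": "y"},
    {"id": 7, "index": 0, "sku": "x"},
]
fixed_parents, kept_children = dedupe_children(parents, children)
```

Whether two child groups with the same content really are "the same" depends on your data, so treat the equality rule here as a starting point.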

If you join your rows back together, then this is not an issue (after the join you can actually drop the index/surrogate key).
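To illustrate why the join makes the differing keys harmless, here is a small plain-Python sketch (again with assumed column names `id`, `index`, `items`, `sku`; a real job would use a Spark join). Once child rows are attached to their parent, the surrogate key and index columns are dropped, so the same child content looks the same regardless of which key it was stored under.

```python
def join_back(parents, children, fk="items", key="id", drop=("id", "index")):
    """Inner-join children onto parents, then drop the surrogate key columns."""
    by_key = {}
    for row in children:
        by_key.setdefault(row[key], []).append(row)

    joined = []
    for parent in parents:
        base = {k: v for k, v in parent.items() if k != fk}
        # Parents without matching children produce no rows (inner join).
        for child in by_key.get(parent[fk], []):
            payload = {k: v for k, v in child.items() if k not in drop}
            joined.append({**base, **payload})
    return joined


# Hypothetical data: orders A and C reference the same content under keys 1 and 7.
parents = [
    {"order": "A", "items": 1},
    {"order": "B", "items": 2},
    {"order": "C", "items": 7},
]
children = [
    {"id": 1, "index": 0, "sku": "x"},
    {"id": 2, "index": 0, "sku": "y"},
    {"id": 7, "index": 0, "sku": "x"},
]
joined = join_back(parents, children)
```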

Hope this helps, Fabrizio