airbnb / chronon

Chronon is a data platform for data computation and serving for AI/ML applications.
Apache License 2.0

fix: remove topic from semantic_hash and handle migration #801

Closed hzding621 closed 2 months ago

hzding621 commented 2 months ago

Summary

This PR removes topic (both EventSource.topic and EntitySource.mutationTopic) from the semantic_hash calculation used in the offline join backfill logic.
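As a rough illustration of the idea (a Python sketch, not Chronon's actual Scala implementation — the `topic`/`mutationTopic` keys and flat-dict shape are assumptions standing in for EventSource.topic and EntitySource.mutationTopic), excluding the topic before hashing might look like:

```python
import hashlib
import json

TOPIC_KEYS = ("topic", "mutationTopic")  # hypothetical field names

def semantic_hash(source_conf: dict) -> str:
    """Sketch: hash a source conf with topic fields excluded.

    For simplicity this only strips top-level keys; the real logic
    walks the full join conf structure.
    """
    stripped = {k: v for k, v in source_conf.items() if k not in TOPIC_KEYS}
    canonical = json.dumps(stripped, sort_keys=True)
    return hashlib.md5(canonical.encode("utf-8")).hexdigest()

# Two confs differing only in topic now produce the same hash,
# so a topic-format migration does not trigger a backfill:
a = semantic_hash({"table": "db.events", "topic": "old_format_topic"})
b = semantic_hash({"table": "db.events", "topic": "new/format/topic"})
assert a == b
```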

Detailed logic

Migration plan

  1. release the change
  2. let airflow run for a few days with the updated JAR version to migrate the semantic_hash for all production join confs. For production join confs without active airflow runs, manually synchronize the semantic_hash.
  3. update global kafka topic logic (mapping)
  4. for each team, create the codemod PR to update all configs.
    • At this point, the semantic_hash should already have been migrated. If not, the user might run into the manual archive exception, which is fine.
    • Some users may have lingering local configs, which could bring the old topic format back to master; add a CI check in the config repo to prevent that.
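The CI check mentioned in step 4 could be as simple as a linter over config files; a minimal sketch (the regex and the `old_format/` string are hypothetical stand-ins for whatever shape the global topic mapping deprecates):

```python
import re

# Hypothetical pattern for the deprecated kafka topic format; the real
# check would match the exact string shape being migrated away from.
OLD_TOPIC_RE = re.compile(r'"(topic|mutationTopic)"\s*:\s*"old_format/[^"]*"')

def lint_config(text: str) -> list:
    """Return 1-based line numbers of a config that still use the
    old topic format; an empty list means the config is clean."""
    return [
        i
        for i, line in enumerate(text.splitlines(), start=1)
        if OLD_TOPIC_RE.search(line)
    ]

# A config still on the old format fails the check:
assert lint_config('"topic": "old_format/events"') == [1]
# A migrated config passes:
assert lint_config('"topic": "new-format-events"') == []
```

Wiring this into CI on the config repo would block PRs (including ones from stale local checkouts) that reintroduce the old format.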

Why / Goal

  1. topic should NOT be part of semantic_hash because it never affects offline behavior.
  2. this unblocks a future migration of the kafka topic format without changing semantic_hash and triggering a wave of backfills.

Test Plan

Integration Test Plan

✅ test case 1 (also in UT):

  1. create a test join config using old kafka format
  2. run the join job using prod JAR
  3. run the join job again using dev JAR
    • we expect no diffs or archives to happen
    • we expect the semantic_hash to be migrated, and the flag set in hive
  4. run the join job yet again using dev JAR
    • we expect no diffs or archives.
  5. update the test join config with new kafka topic format
  6. rerun the join job using dev JAR
    • we expect no archive to happen

✅ test case 2:

  1. create a test join config using old kafka format
  2. run the join job using prod JAR
  3. update the test join config with new kafka topic format (note that the hash migration is not done)
  4. rerun the join job using dev JAR
    • we expect the job to fail with the manual archive error
  5. rerun the join job using the dev JAR with the unset flag
    • we expect the job to succeed, the hash migrated, and the flag set

✅ test case 3 (also in UT):

  1. create a test join config using old kafka format
  2. run the join job using prod JAR
  3. update the test join config to introduce some real semantic changes
  4. rerun the join job using dev JAR
    • we expect correct diffs to be produced
    • we expect the job to fail with the manual archive error
  5. manually drop the tables
  6. rerun the join job using dev JAR after manual dropping
    • we expect the job to succeed, the hash migrated, and the flag set
  7. update the test join config to use new kafka topic format
  8. rerun the join job using dev JAR
    • we expect no archive to happen

✅ test case 4:

  1. create a test join config using the old kafka format
  2. run the join job using prod JAR
  3. run the join job again using dev JAR
    • we expect the semantic_hash to be migrated, and the flag set in hive
  4. update the test join config to introduce some real semantic changes
    • we expect correct diffs to be produced
    • we expect auto archive to take place

Reviewers

@donghanz @yuli-han @pengyu-hou

yuli-han commented 2 months ago

Hi @hzding621, my understanding is that we want to give users a way to update the topic without triggering an archive. Could you clarify the new workflow for that? Do they need to set the exclude_topic flag in the hive table and update the topic at the same time? Also, when we load the old semantic hash, how do we avoid a diff against the new hash once the topic is updated and the exclude_topic flag is set? I am confused by this line — should we perform an archive or not here? "[auto archive] if exclude_topic did exist, we perform auto archive as usual"