TechAndCheck / tech-and-check-alerts

Daily tip sheet for fact checkers
MIT License

Add CNN deduplication to national newsletter #342

slifty closed 4 years ago

slifty commented 4 years ago

Description

This PR adds logic to prevent certain types of duplicate claims from appearing in the CNN portion of the national newsletter.

Specifically:

  1. Makes sure that our JOIN against known speakers does not accidentally create duplicate claims when the table contains duplicate known speakers.

  2. Adds a clause so that, when a selected claim has duplicates, only the FIRST copy of the claim in the time window is included.

The query for the newsletter is starting to get a bit ridiculous, and may need to be refactored in the near future.
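
For reviewers who want the intent without reading the SQL, the two deduplication steps above can be sketched roughly in plain JavaScript. This is not the query from this PR. It works on in-memory arrays, and the field names (`speakerName`, `content`, `createdAt`, `name`) and function names are assumptions made up for illustration, as is the idea that two claims are "duplicates" when speaker and content match exactly.

```javascript
// Illustrative sketch only: hypothetical in-memory stand-ins for the claims
// and known-speakers tables. All field and function names are assumptions.

// Step 1: collapse duplicate known speakers so that matching a claim to a
// speaker can never yield more than one row per claim.
function uniqueSpeakersByName(knownSpeakers) {
  const byName = new Map();
  for (const speaker of knownSpeakers) {
    if (!byName.has(speaker.name)) {
      byName.set(speaker.name, speaker);
    }
  }
  return byName;
}

// Step 2: within the time window, keep only the earliest copy of each claim.
// "Same claim" is assumed here to mean same speaker name and same content.
function firstCopies(claims, windowStart, windowEnd) {
  const inWindow = claims
    .filter((c) => c.createdAt >= windowStart && c.createdAt < windowEnd)
    .sort((a, b) => a.createdAt - b.createdAt);
  const seen = new Set();
  const result = [];
  for (const claim of inWindow) {
    const key = `${claim.speakerName}\u0000${claim.content}`;
    if (seen.has(key)) continue;
    seen.add(key);
    result.push(claim);
  }
  return result;
}

// Combine both steps: first copies only, restricted to known speakers.
function selectNewsletterClaims(claims, knownSpeakers, windowStart, windowEnd) {
  const speakers = uniqueSpeakersByName(knownSpeakers);
  return firstCopies(claims, windowStart, windowEnd)
    .filter((c) => speakers.has(c.speakerName))
    .map((c) => ({ ...c, speaker: speakers.get(c.speakerName) }));
}
```

In this sketch a claim repeated twice by a speaker who also appears twice in the known-speakers list comes out exactly once, which is the behavior the two changes are meant to guarantee.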

Due Diligence Checklist

Steps to Test

  1. yarn test
  2. yarn newsletter:send-test --national

Deploy Notes

None

Related Issues

Related to #109 -- We might want to mark it as resolved, though this doesn't cover all possible types of duplicates.

reefdog commented 4 years ago

Hey, first thing: could you rebase this onto master so it picks up the new database schema from #331? I'm having trouble running the scrapes otherwise.

…which is a bit of a reviewing 🤔 because ideally we wouldn't have to do a migration/rollback dance when changing branches.