feat: new cron job for duplicates

ludtkemorgan commented 3 months ago

This PR addresses #4208

[x] Addresses the issue in full
[ ] Addresses only certain aspects of the issue

Description

Creates a new and improved duplicates (application flagged set) cron job. This tackles the issues identified here.

How the duplicates are grouped was decided via a Slack poll. Option 2 won: https://exygy.slack.com/archives/C01Q4QG5R8Q/p1724708380283439

Since this will interfere with listings that have already had the old cron job run, there is a new cutoff date that we set via an environment variable. Any listing that closes before that date/time will have the old cron job run against it, whereas any listing that closes after it will have the new one. This way, a partner who has already resolved applications will not have to start from the beginning again when a new application comes through. If the cutoff date is in the future (or not set), only the old job will run.
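For reviewers, the cutoff rule can be sketched roughly as below. This is a minimal illustration, not the code in this PR: `parseCutoff` and `chooseDuplicatesJob` are hypothetical names, the example uses an ISO-8601 string because `Date` parsing of other formats is implementation-dependent, and treating a listing that closes exactly at the cutoff as "new" is an assumption (the description only covers "before" and "after").

```typescript
type DuplicatesJob = "old" | "new";

// Hypothetical helper: reads the cutoff from an env-style value.
// Returns null when the value is unset or cannot be parsed.
function parseCutoff(raw: string | undefined): Date | null {
  if (!raw) return null;
  const parsed = new Date(raw);
  return Number.isNaN(parsed.getTime()) ? null : parsed;
}

// Hypothetical helper: decides which duplicates job applies to a closed listing.
function chooseDuplicatesJob(
  listingClosedAt: Date,
  cutoff: Date | null,
  now: Date
): DuplicatesJob {
  // No cutoff, or a cutoff that is still in the future: only the old job runs.
  if (cutoff === null || cutoff.getTime() > now.getTime()) return "old";
  // Listings closed before the cutoff stay on the old job.
  // Assumption: a listing closed exactly at the cutoff goes to the new job.
  return listingClosedAt.getTime() < cutoff.getTime() ? "old" : "new";
}
```

With a cutoff of `2024-07-28T00:00:00-08:00` that is already in the past, a listing closed on July 1 would stay on the old job and one closed on August 15 would get the new one.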

Questions:

If a new application comes through that matches an existing flagged set, do we need to set the flagged set status back to "pending" so that it needs to be re-resolved?

Notes:

A new view is created that contains all of the possible flagged sets in the system. The job queries this view to find the duplicates.
We previously required the rule_key to be unique, and we ensured that by prefixing each key with the listing id. This seemed unnecessary since we are already storing the listing id. To keep the rows unique, this PR removes the unique constraint on rule_key and adds a compound unique constraint on listing_id and rule_key.
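To make the effect of the compound constraint concrete, here is a small sketch of what "unique on listing_id and rule_key" allows and rejects. The `FlaggedSetRow` shape and `findCompoundKeyViolations` are hypothetical and only mirror the two columns named above; the real enforcement lives in the database, not in application code.

```typescript
// Hypothetical row shape holding only the two columns in the constraint.
interface FlaggedSetRow {
  listingId: string;
  ruleKey: string;
}

// Returns the rows that would violate a compound unique constraint
// on (listingId, ruleKey). The first row with a given pair is accepted.
function findCompoundKeyViolations(rows: FlaggedSetRow[]): FlaggedSetRow[] {
  const seen = new Set<string>();
  const violations: FlaggedSetRow[] = [];
  for (const row of rows) {
    // Serializing the pair as JSON avoids collisions from naive string joining.
    const key = JSON.stringify([row.listingId, row.ruleKey]);
    if (seen.has(key)) {
      violations.push(row);
    } else {
      seen.add(key);
    }
  }
  return violations;
}
```

The same rule_key can now appear under two different listings without a conflict, while a repeated pair within one listing is still rejected.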

How Can This Be Tested/Reviewed?

This can be tested both purely with the backend as well as using the partner site.

Prerequisites for both ways:

Reseed the backend (more applications have been added to the District View listing)
Add new environment variables to .env for backend
- DUPLICATES_CLOSE_DATE="2024-07-28 00:00 -08:00"
- DUPLICATES_PROCESSING_CRON_STRING=0

Using the partner site:

Sign into the partner site and go to the District View listing.
Close the listing and go to the "applications" tab
Notice the pending duplicate sets and resolve the applications as seen fit

Using the backend:

After a reseed either wait for the cron job to run or go to the swagger doc and execute the cron job
Look in the DB and verify that the appropriate application flagged sets exist, along with the connected applications

Author Checklist:

[ ] Added QA notes to the issue with applicable URLs
[x] Reviewed in a desktop view
[x] Reviewed in a mobile view
[x] Reviewed considering accessibility
[x] Added tests covering the changes
[ ] Made corresponding changes to the documentation
[x] Ran yarn generate:client and/or created a migration when required

Review Process:

Read and understand the issue
Ensure the author has added QA notes
Review the code itself from a style point of view
Pull the changes down locally and test that the acceptance criteria are met
Either (1) explicitly ask a clarifying question, (2) request changes, or (3) approve the PR, even if there are very small remaining changes, if you don't need to re-review after the updates

netlify[bot] commented 3 months ago

Deploy Preview for partners-bloom-dev ready!

Name	Link
Latest commit	dd24981b17b119a9fd1db5f977dbddab91bd3fa6
Latest deploy log	https://app.netlify.com/sites/partners-bloom-dev/deploys/66e8963da7555d0008be9295
Deploy Preview	https://deploy-preview-4230--partners-bloom-dev.netlify.app

netlify[bot] commented 3 months ago

Deploy Preview for bloom-exygy-dev ready!

Name	Link
Latest commit	dd24981b17b119a9fd1db5f977dbddab91bd3fa6
Latest deploy log	https://app.netlify.com/sites/bloom-exygy-dev/deploys/66e8963ddd50b60008d9d09c
Deploy Preview	https://deploy-preview-4230--bloom-exygy-dev.netlify.app

ColinBuyck commented 3 months ago

Am I remembering our conversation correctly? Should I expect this to be flagged as nameEmailAndDOB, or something acknowledging that they match on all three cases?

ludtkemorgan commented 3 months ago

@ColinBuyck are you still seeing the above? I pushed up a change with the UI change mentioned via slack

ludtkemorgan commented 3 months ago

@ColinBuyck and team I did some investigation. When submitting a paper application we are setting the reviewStatus to "valid" which I'm assuming is intentional. So when you go to resolve a pending application flagged set that has a paper application the paper application will be preselected as "valid".

ColinBuyck commented 3 months ago

Nit but I'm surprised to see these two references to the combined rule differ

ColinBuyck commented 3 months ago

Do these serve different roles? Also, I keep getting "Traffic not from a known source" despite signing in before clicking execute. Am I missing something?

ColinBuyck commented 3 months ago

I'm also noticing the following behavior and feeling a bit confused:

Reseed and run
Close District View Apartments
Add an application with first3, last3, user3@example.com, and 01/01/1970
Run process_duplicates

I see two separate duplicate sets. Is that expected?

emilyjablonski commented 3 months ago

To your note in the PR description about applications across multiple sets, that seems to be a significant cause of mistakes when resolving sets, as described in the multiple set resolution section in this doc. I don't see a solution to that captured in the TDD, do you have a sense of what a solution for that might be? Is it expected that that work is coming later?

emilyjablonski commented 3 months ago

@ludtkemorgan is this on hold pending the conversations in slack or is it ready for review?

ludtkemorgan commented 3 months ago

@emilyjablonski I'd recommend not reviewing this until we decide which pattern we will use, since it could be a pretty big refactor if we choose the second route.

emilyjablonski commented 2 months ago

This is the data from the first example in the duplicates doc - I think the first and last set could be confusing. They're Email + Name + DOB, with 3/4 of the first set overlapping - is that expected? If they share apps between sets under the same rule, would that not put them in the same set?

emilyjablonski commented 2 months ago

I know we talked about doing it separately, but since we're going to end up with more applications across multiple sets, I think we need to release this with a frontend update as well, even if the work is done separately. This was Em's suggestion from the workshop: (1) See if there are ways we can remove them from the UI if the data would remain accurate. (2) Otherwise, disable/lock resolved duplicates in the UI. I found it really confusing to go into a second set with applications already marked as a duplicate. I knew I had marked them elsewhere, but when trying to see what was a duplicate in a second set, that duplicate application looked valid in that context. I actually went to mark it as valid, which seems to be an error property managers are making too.

ludtkemorgan commented 2 months ago

@emilyjablonski Good catch with that application combo. There should just be 2 sets with those 8 applications. One with Mira and Boba for email and the other with all of the others. Due to the way I'm looping through the applications it's not properly deduping some. I'll work on a fix. If there are only two sets with no overlap between them, do you still think a front-end change is required?

emilyjablonski commented 2 months ago

@ludtkemorgan I think with no overlap, or the same overlap we have now, it would not require a frontend change!

ludtkemorgan commented 2 months ago

This PR is ready to be re-reviewed

@emilyjablonski I have done a refactor to catch the scenario you mentioned. It ended up being a fairly big change, so I'd recommend doing a full e2e test of it. I also added a test called "should create multiple flag sets with chaining of flags" that uses the same setup of apps you created (replacing dog names with numbers).
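For anyone reviewing the chaining behavior, here is one way "chaining of flags" could be modeled: if A matches B and B matches C, all three end up in one set. This is a sketch under assumptions, not the refactor itself. `ChainApp` and `groupChained` are hypothetical names, and this version chains across both the email key and the name/DOB key, which may be broader than what the PR actually does.

```typescript
// Hypothetical application shape: two optional match keys per application.
interface ChainApp {
  id: string;
  email: string | null;
  nameDobKey: string | null;
}

// Groups applications into sets by following shared keys transitively
// (union-find). Only sets with more than one application are returned.
function groupChained(apps: ChainApp[]): string[][] {
  const parent = new Map<string, string>();
  for (const app of apps) parent.set(app.id, app.id);

  // Walks up the parent links until it reaches a root.
  const find = (id: string): string => {
    let current = id;
    let next = parent.get(current);
    while (next !== undefined && next !== current) {
      current = next;
      next = parent.get(current);
    }
    return current;
  };
  const union = (a: string, b: string): void => {
    parent.set(find(a), find(b));
  };

  // Maps each match key to the first application that carried it.
  const firstSeen = new Map<string, string>();
  for (const app of apps) {
    const keys: string[] = [];
    if (app.email !== null) keys.push(`email:${app.email}`);
    if (app.nameDobKey !== null) keys.push(`nameDob:${app.nameDobKey}`);
    for (const key of keys) {
      const owner = firstSeen.get(key);
      if (owner === undefined) firstSeen.set(key, app.id);
      else union(app.id, owner);
    }
  }

  // Collects applications by root and drops singletons.
  const groups = new Map<string, string[]>();
  for (const app of apps) {
    const root = find(app.id);
    const members = groups.get(root) ?? [];
    members.push(app.id);
    groups.set(root, members);
  }
  return Array.from(groups.values()).filter((g) => g.length > 1);
}
```

Because every application ends up with exactly one root, no application can appear in two sets, which is the "no overlap" property discussed earlier in this thread.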

@YazeedLoonat

I added a check in the view to ignore deleted apps and those scenarios are no longer happening.
I was not able to reproduce the scenario of rerunning after resolution and editing an application. Potentially the other changes I made fixed it.
I have fixed the null scenario. It still flags applications that are missing name or email if there is more than one of them, but that should be intended. It just shows up as "null null" on the flagged set view on partners.

emilyjablonski commented 2 months ago

Trying to test what happens with additional runs of the process job after more application submissions. How are you executing the job from the swagger docs with "Traffic not from a known source" errors? To trigger the job through the docs, I just commented that guard out but curious if you have a different method.

emilyjablonski commented 2 months ago

All of the examples from the doc look awesome! I know it would be quite a lift to set up, but I'm wondering what you think about running this on the Fremont Family set? It's a little challenging to test all the edge cases.

ludtkemorgan commented 2 months ago

@emilyjablonski That guard only happens if you have the env variable API_PASS_KEY set. I have just been testing with not having that environment variable.

I tested it with the Albany property. But Fremont is a good idea! I'll do that

YazeedLoonat commented 2 months ago

Hey @ludtkemorgan I'm still seeing some weird behavior if email is missing or date of birth is missing

when I was looking at the database, what I was seeing was that the key became NULL, which was causing more matching and condensing of dupe sets into 1 large dupe set instead of many smaller sets like I was expecting

after reading some documentation I think it's a quirk of postgres where the || operator we are using in the view causes the whole string to become NULL if any operand is null. https://www.postgresqltutorial.com/postgresql-string-functions/postgresql-concat-function/

it may be better for us to switch to the concat function in the view to concat strings together.

I also think that using something like COALESCE() when creating the email key may make the null email case less all-encompassing than it is right now

lemme know what ya think!
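To illustrate the difference described above, here is a small TypeScript emulation of the two Postgres behaviors. It only mimics the semantics for illustration and is not the view's SQL; `pipeConcat` and `concatFn` are hypothetical names.

```typescript
type Nullable = string | null;

// Mimics Postgres `a || b || ...`: any NULL operand makes the whole result NULL.
function pipeConcat(parts: Nullable[]): Nullable {
  let result = "";
  for (const part of parts) {
    if (part === null) return null;
    result += part;
  }
  return result;
}

// Mimics Postgres CONCAT(...): NULL arguments are ignored.
function concatFn(parts: Nullable[]): string {
  return parts.filter((p): p is string => p !== null).join("");
}
```

With a missing DOB, `pipeConcat(["first", "last", null])` yields `null`, so every such application ends up with the same NULL key, while `concatFn` still produces a key built from the name.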

ludtkemorgan commented 2 months ago

@YazeedLoonat I'm open to switching to concat and/or coalesce. But first I want to make sure I'm able to reproduce what you are seeing. I tried creating applications with missing email and with missing name/DOB. I see the null entries in the database, but the frontend appears to be handling everything correctly. Can you lay out the steps you took and what you are seeing after the job runs?

YazeedLoonat commented 2 months ago

Hey @ludtkemorgan okay here are the steps I took:

New Listing (no prio apps)
submit app 1 through the partner portal with Name Match as both the first and last name. No email, no DOB
submit app 2 through partner portal with Name Match as both the first and last name. No email, no DOB
submit app 3 through partner portal with Name 1 as both first and last name. email is emailMatch@gmail.com no DOB
submit app 4 through partner portal with Name 2 as both first and last name. email is emailMatch@gmail.com no DOB

after running the job I would expect there to be 2 different groups of duplicates: 1 for the name match and 1 for the email match. But I suspect that because the null DOB is playing into it, it's creating 1 mega group

ludtkemorgan commented 1 month ago

@YazeedLoonat nice find. I switched to CONCAT for the name/dob and it appears to have fixed the problem.

Also, since we don't require email address in the application flow I don't think we should flag null email duplicates. So I believe what I have now covers all of the null cases correctly.

Let me know if that is not the case
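As a sanity check on the null handling, the four-application scenario from earlier in this thread can be sketched like this. It is a hedged illustration only: `PaperApp`, `emailKey`, `nameDobKey`, and `flagByKey` are hypothetical names, and the `|` separator is an arbitrary choice for the sketch rather than the format the view uses.

```typescript
// Hypothetical application shape for the scenario.
interface PaperApp {
  id: number;
  firstName: string;
  lastName: string;
  email: string | null;
  dob: string | null;
}

// A missing email yields no key, so the application is never flagged on email.
function emailKey(app: PaperApp): string | null {
  return app.email;
}

// CONCAT-like: a missing DOB contributes an empty string instead of
// turning the whole key into null.
function nameDobKey(app: PaperApp): string {
  return [app.firstName, app.lastName, app.dob ?? ""].join("|");
}

// Buckets applications by key and keeps only buckets with more than one app.
function flagByKey(
  apps: PaperApp[],
  keyOf: (app: PaperApp) => string | null
): number[][] {
  const buckets = new Map<string, number[]>();
  for (const app of apps) {
    const key = keyOf(app);
    if (key === null) continue; // nothing to match on
    const ids = buckets.get(key) ?? [];
    ids.push(app.id);
    buckets.set(key, ids);
  }
  return Array.from(buckets.values()).filter((ids) => ids.length > 1);
}
```

Run against the four applications from the repro steps, this gives one name-based set for apps 1 and 2 and one email-based set for apps 3 and 4, rather than a single mega group.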
