bloom-housing / bloom

Bloom is Exygy’s affordable housing platform. Bloom's goal is to be a single entry point for affordable housing seekers and a hub for application and listing management for developers.
https://bloomhousing.com
Apache License 2.0
33 stars 25 forks source link

feat: new cron job for duplicates #4230

Closed ludtkemorgan closed 1 month ago

ludtkemorgan commented 3 months ago

This PR addresses #4208

Description

Creates a new and improved duplicates (application flagged set) cron job. This tackles the issues identified here.

How the duplicates are grouped were decided on via a slack poll. Option 2 won https://exygy.slack.com/archives/C01Q4QG5R8Q/p1724708380283439.

Since this will interfere with the listings that have had the old cron job run there is a new date that we set via env variable. Any listing that closes before that date/time will have the old cron job run against it whereas any listing closed after it will have the new one. This way any partner that has resolved applications but a new application comes through they will not have to start from the beginning again. And if the new close date is in the future (or not set) only the old job will run.

Questions:

Notes:

How Can This Be Tested/Reviewed?

This can be tested both purely with the backend as well as using the partner site.

Prerequisites for both ways:

Using the partner site:

Using the backend:

Author Checklist:

Review Process:

netlify[bot] commented 3 months ago

Deploy Preview for partners-bloom-dev ready!

Name Link
Latest commit dd24981b17b119a9fd1db5f977dbddab91bd3fa6
Latest deploy log https://app.netlify.com/sites/partners-bloom-dev/deploys/66e8963da7555d0008be9295
Deploy Preview https://deploy-preview-4230--partners-bloom-dev.netlify.app
Preview on mobile
Toggle QR Code...

QR Code

Use your smartphone camera to open QR code link.

To edit notification comments on pull requests, go to your Netlify site configuration.

netlify[bot] commented 3 months ago

Deploy Preview for bloom-exygy-dev ready!

Name Link
Latest commit dd24981b17b119a9fd1db5f977dbddab91bd3fa6
Latest deploy log https://app.netlify.com/sites/bloom-exygy-dev/deploys/66e8963ddd50b60008d9d09c
Deploy Preview https://deploy-preview-4230--bloom-exygy-dev.netlify.app
Preview on mobile
Toggle QR Code...

QR Code

Use your smartphone camera to open QR code link.

To edit notification comments on pull requests, go to your Netlify site configuration.

ColinBuyck commented 3 months ago

Am I remembering our conversation that I should expect this to be flagged as nameEmailAndDOB or something acknowledging that they match on all three cases?

Screenshot 2024-07-31 at 1 20 47 PM
ludtkemorgan commented 3 months ago

@ColinBuyck are you still seeing the above? I pushed up a change with the UI change mentioned via slack

ludtkemorgan commented 3 months ago

@ColinBuyck and team I did some investigation. When submitting a paper application we are setting the reviewStatus to "valid" which I'm assuming is intentional. So when you go to resolve a pending application flagged set that has a paper application the paper application will be preselected as "valid".

ColinBuyck commented 3 months ago

Nit but I'm surprised to see these two references to the combined rule differ

Screenshot 2024-08-01 at 10 30 24 AM Screenshot 2024-08-01 at 10 30 18 AM
ColinBuyck commented 3 months ago

Do these serve different roles? Also I keep getting traffic not from a known source despite signing in before clicking execute. Am I missing something?

Screenshot 2024-08-01 at 10 45 11 AM
ColinBuyck commented 3 months ago

I'm also noticing the following behavior and feeling a bit confused Reseed and run, Close District View Apartments Add an application with first3, last3, user3@example.com, and 01/01/1970 Run process_duplicates I see two separate duplicate sets. Is that expected? Screenshot 2024-08-05 at 10 42 00 AM Screenshot 2024-08-05 at 10 41 50 AM

emilyjablonski commented 3 months ago

To your note in the PR description about applications across multiple sets, that seems to be a significant cause of mistakes when resolving sets, as described in the multiple set resolution section in this doc. I don't see a solution to that captured in the TDD, do you have a sense of what a solution for that might be? Is it expected that that work is coming later?

emilyjablonski commented 3 months ago

@ludtkemorgan is this on hold pending the conversations in slack or is it ready for review?

ludtkemorgan commented 3 months ago

@ludtkemorgan is this on hold pending the conversations in slack or is it ready for review?

@emilyjablonski I'd recommend to not review this until it is decided which pattern we will be doing since it could be a pretty big refactor if we choose the second route

emilyjablonski commented 2 months ago
Screenshot 2024-08-28 at 12 33 20 PM Screenshot 2024-08-28 at 12 33 32 PM

This is the data from the first example in the duplicates doc - I think the first and last set could be confusing. They're Email + Name + DOB, with 3/4 of the first set overlapping - is that expected? If they share apps between sets under the same rule, would that not put them in the same set?

emilyjablonski commented 2 months ago

I know we talked about doing it separately, but since we're going to end up with more across multiple sets I think we need to release this with a frontend update as well, even if done separately. This was Em's suggestion from the workshop: (1) See if there are ways we can remove them from the UI if the data would remain accurate (2) Otherwise, disable/lock resolved duplicates in the UI. I found it to be really confusing to go into a second set with applications already marked as a duplicate. I knew I had marked them elsewhere, but when trying to see what was a duplicate in a second set, that duplicate application looked valid in that context. I actually went to mark it as valid, which seems to be an error property managers are making too.

ludtkemorgan commented 2 months ago

@emilyjablonski Good catch with that application combo. There should just be 2 sets with those 8 applications. One with Mira and Boba for email and the other with all of the others. Due to the way I'm looping through the applications it's not properly deduping some. I'll work on a fix. If there are only two sets with no overlap between them, do you still think a front-end change is required?

emilyjablonski commented 2 months ago

@ludtkemorgan I think w no overlap or the same we have now, would not require a frontend change!

ludtkemorgan commented 2 months ago

This PR is ready to be re-reviewed

@emilyjablonski I have done a refactor to catch the scenario you mentioned. It ended up being a fairly big change so I'd recommend doing a full e2e test of it. I also added a test called should create multiple flag sets with chaining of flags that uses the same setup of apps you created (replacing dog names with numbers).

@YazeedLoonat

emilyjablonski commented 2 months ago

Trying to test what happens with additional runs of the process job after more application submissions. How are you executing the job from the swagger docs with "Traffic not from a known source" errors? To trigger the job through the docs, I just commented that guard out but curious if you have a different method.

emilyjablonski commented 2 months ago

All of the examples from the doc look awesome! I know it would be quite a lift to set up, but I'm wondering what you think about running this on the Fremont Family set? It's a little challening to test all the edge cases

ludtkemorgan commented 2 months ago

@emilyjablonski That guard only happens if you have the env variable API_PASS_KEY set. I have just been testing with not having that environment variable.

I tested it with the Albany property. But Fremont is a good idea! I'll do that

YazeedLoonat commented 2 months ago

Hey @ludtkemorgan I'm still seeing some weird behavior if email is missing or birth day is missing

when I was looking at the database what I was seeing was the the key became NULL which was causing more matching and condensing of dupe sets into 1 large dupe set instead of many smaller sets like I was expecting

after reading some documentation I think its a quirk of postgres where the || operator we are using in the view causes the string to go to NULL if a null value is found. https://www.postgresqltutorial.com/postgresql-string-functions/postgresql-concat-function/

it may be better for us to switch to the concat function in the view to concat strings together.

I also think for the emails doing something like COALESCE() for the email creating the email key may make the null email case less all encompassing like it is right now

lemme know what ya think!

ludtkemorgan commented 2 months ago

@YazeedLoonat I'm open to switching to concat and/or coalesce. But first I want to make sure I'm able to reproduce what you are seeing. I tried creating applications with both missing email and name/dob, I see the null entries in the database however on the frontend it appears to be handling everything correctly. Can you lay out the steps you took and what you are seeing after the job runs?

YazeedLoonat commented 2 months ago

Hey @ludtkemorgan okay here are the steps I took:

after running the job I would expect there to be 2 different groups of duplicates. 1 for the name match and 1 for the email match, but I suspect because of the null dob playing into it its creating 1 mega group

ludtkemorgan commented 1 month ago

@YazeedLoonat nice find. I switched to CONCAT for the name/dob and it appears to have fixed the problem.

Also, since we don't require email address in the application flow I don't think we should flag null email duplicates. So I believe what I have now covers all of the null cases correctly.

Let me know if that is not the case