Talent-Catalog / talentcatalog

https://tctalent.org
GNU Affero General Public License v3.0
11 stars 4 forks source link

Release v2.2.2 planning #921

Open camerojo opened 5 months ago

camerojo commented 5 months ago

As the tech team we need to capture somewhere special manual tasks that have to be carried out for a release so that we don't forget

Typically we will carry out these tasks progressively over time on staging, but all at once in production.

We should add the tasks as a task list as we carry them out on staging, then when we release into production, we just copy the task list, and mark them off as we complete them on production.

This type of issue starts in Review/QA so that we can always easily find it. It is only moved to Done when the production release is completed.

Staging task list

Post-release:

New Translations

Cloud Cache

Build Elasticsearch Index with Infra scripts

Run SystemAdminAPI fix_null_case_sfids

Salesforce tidy up

See #1335

Instructions: Production task list

To be copied from staging task list above, then worked through during the production release. Translation text can be copied from Staging.

Production task list

Post-release:

Cloud Cache

camerojo commented 1 week ago

Release problems

Github Jib build

Flyway failed

Fix 1 involved deleting visa job check records which were causing the SQL to return multiple values when just one was expected. The deleted records were stored in this TSV file - changed to txt because GitHub does not support uploads of tsv files. It has the candidate_visa_job_checks that we deleted to get prod to start.

tctalent_public_candidate_visa_job_check_6.txt

That helped - but the Flyway was still failing.

Fix 2. There was other bad data which caused the update to fail: multiple visa job checks for the same job. We executed queries to identify the bad data, then manually deleted the bad entries using the TC.

Screenshot 2024-09-28 at 10 27 36 AM

Timeout problem on the Elasticsearch scripts

camerojo commented 1 week ago

I reimported the deleted visa job checks from the above dump file to see if I could fix the data.

There was just one multiple value being returned from update select.

It turns out that this last problem was not due to bad data but a problem in the update SQL.

The problem came from the fact that a candidate can have more than one visa check (NOT visa JOB check) carried out based on country. For example a candidate may have been checked for basic eligibility to both Australia and Canada. For example see https://tctalent.org/admin-portal/candidate/76864

In that the case the Update SQL from Flyway will fail because multiple values will be returned in the select - one for the visa check for Australia and one for the visa check for Canada. The correct update SQL needs to add an extra restriction to the select's where clause specifying that the visa check country should match the job country.

update candidate_opportunity co set relocating_dependant_ids =
    (select relocating_dependant_ids from candidate_visa_job_check cvjc
          join salesforce_job_opp jo on cvjc.job_opp_id = jo.id
          join candidate_visa_check cvc on cvjc.candidate_visa_check_id = cvc.id
          join candidate c on cvc.candidate_id = c.id
     where co.candidate_id = c.id and co.job_opp_id = cvjc.job_opp_id 
-->     and jo.country_object_id = cvc.country_id);    <--

I manually ran the above update against the production database (with the previously deleted data reimported) and it ran without error.

@cazcam34 @sadatmalik

In summary

There were actually two reasons that the Flyway failed:

  1. Bad data in the production database left over from a previous bug. This was duplicate visa job checks for the same job. Removing the duplicates fixed the problem.
  2. An error in the Flyway SQL which did not account for the fact that there can be multiple candidate visa checks - for different countries. Manually running corrected SQL on the production database fixed that problem.
camerojo commented 1 week ago

Forgot to add password for new email noreply@talentcatalog.net to prod Task deployment. Should have been in these release planning notes but I forgot to add it.

Updated now in prod task definition (40) - and restarted prod.

camerojo commented 1 week ago

Two showstopper bugs surfaced today which have been fixed and pushed into production.

sadatmalik commented 1 week ago

Candidate synch started:

Screenshot 2024-10-01 at 16.31.09.png

For 119469 candidates:

Screenshot 2024-10-01 at 16.31.40.png

Process 1 - 598 pages:

Screenshot 2024-10-01 at 16.32.42.png

This is the last log entry - for page 597:

Screenshot 2024-10-01 at 16.40.57.png

We seem to be missing the last page and the summary log message at the end of the sync.

sadatmalik commented 1 week ago

These are the 10,000 we tested with - they completed fully with the corresponding log completion:

Screenshot 2024-10-01 at 16.54.12.png

This is the scheduled sync which seems to have stopped at page 597 of 598:

Screenshot 2024-10-01 at 16.55.08.png
sadatmalik commented 1 week ago

Found the culprit - Out of Heap space:

Screenshot 2024-10-01 at 19.51.09.png