Anthony-Nolan / Atlas

A free & open-source Donor Search Algorithm Service
GNU General Public License v3.0
9 stars 5 forks source link

Data Refresh - Change the way LastUpdateDateTime column in the DonorManagementLogs table (A/B databases) is populated during data reshresh #1318

Closed daria-sorokina-da closed 2 months ago

daria-sorokina-da commented 2 months ago

Context

Currently, during Data Refresh, the LastUpdateDateTime column in the DonorManagementLogs in transient (A/B) databases is set to the date and time of Date Refresh processing.

Example: in the WMDA Live environment the last Data Refresh happened on 2024-04-26, and in the DonorManagementLogs we see lots of records with LastUpdateDateTime 2024-04-26, see: image

Here "-1" in the SequenceNumberOfLastUpdate indicates that the Data Refresh process created this record.

Issue description

This approach causes issues during Repeat Searches run after the data refresh. That is because all the donors added during Data Refresh now have LastUpdateDateTime after the last cut-off date of the search. So basically Repeat Searches perform the search from scratch (return all the same donors as the original search, as if they changed).

This causes performance issues on Atlas side and increases the costs in Azure, and causes performance issues on the client side as well, as the client does not expect the Repeat Searches to return lots of results.

Proposed fix

We need to start filling the LastUpdateDateTime with the date when the donor was last updated, instead of the Data Refresh date. We should take the date from the persistent database Donors.Donors table - LastUpdated.

Solution details

  1. In Atlas.DonorImport.Data.Repositories DonorReadRepository.StreamAllDonors update the returned model - add LastUpdated date to it.
  2. Start using the LastUpdated in Atlas.MatchingAlgorithm.Services.DataRefresh.DonorImport DonorImporter.InsertDonorBatch - instead of UpdateDateTime = batchFetchTime we should now have UpdateDateTime = d.LastUpdated
daria-sorokina-da commented 2 months ago

Testing Results

In Anthony Nolan DEV environment:

  1. a new Data Refresh with “forcedDataRefresh“: true
  2. It finished successfully, seeing results in [MatchingAlgorithmPersistent].[DataRefreshHistory] table: Id Database WasSuccessful RefreshRequestedUtc
    113 DatabaseA 1 2024-06-26 10:09:58.8216793
  3. Checking transient database A [dbo].[DonorManagementLogs]

Expected: The value in [LastUpdateDateTime] column is filled for all donors, and it is the same as in the persistent [Donors].[Donors] database [LastUpdated] column for that donor.

Actual: The value in [LastUpdateDateTime] column is "0001-01-01 00:00:00.0000000 +00:00" for all donors. See: Image

Additional Notes: Records produced by subsequent Donor Updates (outside of Data Refresh process) have the same issue: Image

$\textcolor{red}{\textsf{Testing failed}}$

daria-sorokina-da commented 2 months ago

Re-Testing Results after Fix

  1. a new Data Refresh with “forcedDataRefresh“: true
  2. It finished successfully, seeing results in [MatchingAlgorithmPersistent].[DataRefreshHistory] table: Id Database WasSuccessful RefreshRequestedUtc
    114 DatabaseB 1 2024-07-01 14:46:27.9961012
  3. Checking transient database B [dbo].[DonorManagementLogs]

Expected: There are entities in the [dbo].[DonorManagementLogs] table. The table is empty.

Actual: There are no entities in the [dbo].[DonorManagementLogs] table. The entities are present in the [dbo].[Donors] table though.

$\textcolor{red}{\textsf{Testing failed}}$

Investigation Result

This happened because when a nomenclature version does not changes since the last successful Data Refresh, then the Data Refresh is run with ShouldMarkAllDonorsAsUpdated = false setting, and the [dbo].[DonorManagementLogs] table does NOT get populated. This is incorrect behavior, because in this case Repeat Searches will not be returning donors that changed between the last search, and the Data Refresh (several days may pass between them).

Suggested Fix

Always set ShouldMarkAllDonorsAsUpdated = true for all data refreshes as part of 2.2.0. In the future (Atlas 3+), ShouldMarkAllDonorsAsUpdated flag should be removed from the Data Refresh parameters and from the [MatchingAlgorithmPersistent].[DataRefreshHistory] database.

daria-sorokina-da commented 2 months ago

Re-Testing Results after Fix

Scenario 1

  1. A new Data Refresh with “forcedDataRefresh“: true
  2. It finished successfully, seeing results in [MatchingAlgorithmPersistent].[DataRefreshHistory] table: Id Database WasSuccessful RefreshRequestedUtc
    115 DatabaseA 1 2024-07-02 11:16:05.1224643
  3. Checking transient database A [dbo].[DonorManagementLogs]

Result: There are entities in the [dbo].[DonorManagementLogs] table. The donors have LastUpdateDateTime set from persistent database Donors.Donors table.✔️

Scenario 2

  1. A donor is created, and an update is sent to Atlas via Azure Storage -> devatlasstorage -> Blob Containers -> donors
  2. The update is processed, there's log in persistent database [Donors].[DonorImportHistory]
  3. Checking transient database A [dbo].[DonorManagementLogs]
    SELECT TOP(1000) dml.*, d.ExternalDonorCode
    FROM dbo.DonorManagementLogs dml
    JOIN dbo.Donors d on d.DonorId = dml.DonorId
    ORDER BY dml.Id DESC

Expected: An entry about updated donor is added with LastUpdatedDateTime - now.

Actual: An entry about updated donor is added with LastUpdatedDateTime "0001-01-01 00:00:00.0000000 +00:00" ❌

Image

$\textcolor{red}{\textsf{Testing failed}}$