IARC-CSU / CanReg5

CanReg5 is a multi user, multi platform, open source tool to input, store, check and analyse cancer registry data.
http://www.iacr.com.fr/CanReg5
GNU General Public License v3.0

C202304 - Deduplication search performance improvements #126

Open rlichainfotel opened 1 year ago

rlichainfotel commented 1 year ago

CanReg5 already has a matching algorithm that uses weights on different variables to compute a matching score between records that are near, but not exact, duplicates. For example, Soundex is used for name variables.
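
To make the idea concrete, here is a minimal Java sketch of a weighted matching score that compares name variables with Soundex and other variables exactly. It is not the CanReg5 implementation: the class name `WeightedMatchSketch`, the variable names, the weights, and the normalisation to a percentage are all assumptions made for illustration.

```java
import java.util.Locale;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch, not CanReg5 code.
class WeightedMatchSketch {

    // Soundex digit for each letter A..Z; '0' marks letters that are not coded.
    private static final String CODES = "01230120022455012623010202";

    /** Classic American Soundex: first letter followed by three digits. */
    static String soundex(String name) {
        StringBuilder out = new StringBuilder();
        char lastCode = '0';
        for (char c : name.toUpperCase(Locale.ROOT).toCharArray()) {
            if (c < 'A' || c > 'Z') continue;      // skip anything outside A..Z
            char code = CODES.charAt(c - 'A');
            if (out.length() == 0) {
                out.append(c);                      // keep the first letter as-is
            } else if (code != '0' && code != lastCode) {
                out.append(code);
                if (out.length() == 4) break;
            }
            // H and W do not separate two letters that share a code.
            if (c != 'H' && c != 'W') lastCode = code;
        }
        if (out.length() == 0) return "";
        while (out.length() < 4) out.append('0');   // pad short codes
        return out.toString();
    }

    /**
     * Returns the share (0..100) of the total weight carried by variables
     * that agree. Variables listed in phoneticVariables are compared by
     * Soundex code, all others by case-insensitive equality.
     */
    static double score(Map<String, String> a, Map<String, String> b,
                        Map<String, Double> weights, Set<String> phoneticVariables) {
        double matched = 0;
        double total = 0;
        for (Map.Entry<String, Double> entry : weights.entrySet()) {
            String variable = entry.getKey();
            total += entry.getValue();
            String va = a.get(variable);
            String vb = b.get(variable);
            if (va == null || vb == null) continue; // a missing value never matches
            boolean same = phoneticVariables.contains(variable)
                    ? soundex(va).equals(soundex(vb))
                    : va.equalsIgnoreCase(vb);
            if (same) matched += entry.getValue();
        }
        return total == 0 ? 0 : 100.0 * matched / total;
    }
}
```

With this sketch, "Robert" and "Rupert" share the Soundex code R163, so a name variable would still contribute its full weight even though the spellings differ.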

The objective is to improve the algorithm performance:

Be sure to take measurements before and after the improvements to demonstrate the progress.
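
One simple way to take such measurements is to time the same search several times and compare the median before and after a change. The helper below is only a sketch; `SearchTimer` and its parameters are hypothetical names, and the task passed in would be whatever call launches the duplicate search.

```java
import java.util.Arrays;

// Hypothetical timing helper, not part of CanReg5.
class SearchTimer {

    /**
     * Runs the task `warmups` times without measuring, then `runs` times
     * with measuring, and returns the (upper) median elapsed time in ms.
     */
    static double medianMillis(Runnable task, int warmups, int runs) {
        for (int i = 0; i < warmups; i++) {
            task.run();                       // let the JIT and caches settle
        }
        long[] nanos = new long[runs];
        for (int i = 0; i < runs; i++) {
            long start = System.nanoTime();
            task.run();
            nanos[i] = System.nanoTime() - start;
        }
        Arrays.sort(nanos);
        return nanos[runs / 2] / 1_000_000.0;
    }
}
```

Using the median rather than a single run makes the comparison less sensitive to one-off pauses such as garbage collection.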

wcheninfotel commented 1 year ago

The branch is: https://github.com/infotel4iarc/CanReg5/tree/C202304_duplication_search

A script that generates random records has been created to populate the database with a large number of records. With that many records, differences in execution time become easy to distinguish, which makes the timing comparison easier.
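
The script itself is not shown in this thread. As a rough illustration of the approach, a seeded generator along the following lines produces reproducible test records; the class name, the field layout (id, first name, family name, birth date as yyyyMMdd, sex) and the name lists are assumptions, not the actual script.

```java
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

// Hypothetical sketch of a random record generator.
class RandomPatientGenerator {

    // Deliberately includes similar-sounding names so near-duplicates occur.
    private static final String[] FIRST_NAMES = {"ANNA", "JOHN", "MARIE", "PAUL", "SARA", "LUC"};
    private static final String[] FAMILY_NAMES = {"SMITH", "SMYTH", "MARTIN", "MARTEN", "DUPONT", "DUPOND"};

    /** Returns `count` records as {id, firstName, familyName, birthDate, sex}. */
    static List<String[]> generate(int count, long seed) {
        Random random = new Random(seed);     // fixed seed gives repeatable data
        LocalDate earliest = LocalDate.of(1920, 1, 1);
        List<String[]> records = new ArrayList<>(count);
        for (int i = 0; i < count; i++) {
            String birthDate = earliest
                    .plusDays(random.nextInt(36_500))
                    .format(DateTimeFormatter.BASIC_ISO_DATE); // yyyyMMdd
            records.add(new String[] {
                String.valueOf(i + 1),
                FIRST_NAMES[random.nextInt(FIRST_NAMES.length)],
                FAMILY_NAMES[random.nextInt(FAMILY_NAMES.length)],
                birthDate,
                String.valueOf(1 + random.nextInt(2))
            });
        }
        return records;
    }
}
```

Reusing the same seed for the "before" and "after" runs keeps the data set identical, so any difference in timing comes from the code change rather than from the data.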

A check box has been added to the search variable panel to lock (block) a variable during the duplicate search.

wcheninfotel commented 1 year ago

The blocking feature is now functional for person search. The search variables can be changed in tools -> database structure; however, a restart is necessary for the modification to take effect. The blocking feature keeps only the records that exactly match the original record's value for the blocked variable.
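
The effect of blocking can be sketched as a pre-filter applied before any score is computed. The code below is an in-memory illustration only; `BlockingSketch`, the record representation as a map, and the variable names used in the usage note are hypothetical, and the real feature may well apply the restriction in the database query instead.

```java
import java.util.List;
import java.util.Map;
import java.util.Objects;
import java.util.Set;
import java.util.stream.Collectors;

// Hypothetical sketch of blocking as an exact-match pre-filter.
class BlockingSketch {

    /**
     * Keeps only the records whose value equals the original record's
     * value for every blocked variable. Unblocked variables are ignored
     * here and left to the weighted comparison.
     */
    static List<Map<String, String>> candidates(Map<String, String> original,
                                                List<Map<String, String>> records,
                                                Set<String> blockedVariables) {
        return records.stream()
                .filter(record -> blockedVariables.stream()
                        .allMatch(v -> Objects.equals(original.get(v), record.get(v))))
                .collect(Collectors.toList());
    }
}
```

For example, blocking on a "Sex" variable would discard every record with a different sex up front, so the more expensive Soundex comparison only runs on the remaining candidates.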

An inconvenience was noticed during the test:

cchenginfotel commented 1 year ago

The issue with the loading modal not being displayed when the user clicks the "person search" button has been fixed. The problem was that a single Action triggered both the search and the waitFrame. Because everything happens within that Action, the waitFrame could not be displayed before the database search was executed: "showing the waitFrame" is itself an action, so it was added to the execution queue after the action that searches for duplicates. As a result, the waitFrame showed up only after the search had completed, and disappeared instantly.

Using a SwingWorker solved the issue, since a SwingWorker runs its work in a background thread, concurrently with the "person search" action.
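
The general pattern looks roughly like the sketch below. It is not the code from the branch: `BackgroundSearchSketch` and its four callbacks are hypothetical names standing in for showing the waitFrame, running the database search, displaying the results, and hiding the waitFrame.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.function.Consumer;
import javax.swing.SwingWorker;

// Hypothetical sketch of the SwingWorker pattern described above.
class BackgroundSearchSketch {

    /**
     * Meant to be called from the button's Action on the event dispatch
     * thread (EDT). It returns immediately, so the EDT is free to paint
     * the wait indicator while the search runs on a worker thread.
     */
    static <T> SwingWorker<T, Void> start(Runnable showWait, Callable<T> search,
                                          Consumer<T> showResults, Runnable hideWait) {
        showWait.run();
        SwingWorker<T, Void> worker = new SwingWorker<T, Void>() {
            @Override
            protected T doInBackground() throws Exception {
                return search.call();          // runs off the EDT
            }

            @Override
            protected void done() {            // runs on the EDT afterwards
                hideWait.run();
                try {
                    showResults.accept(get());
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                } catch (ExecutionException e) {
                    // The search failed; real code would report e.getCause().
                }
            }
        };
        worker.execute();
        return worker;
    }
}
```

Hiding the wait indicator in `done()` before handing over the results keeps both UI updates on the EDT, in a fixed order.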

cchenginfotel commented 1 year ago

Update: The loading frame is now closed before the results of the PersonSearch are displayed. Some refactoring was needed: the runPersonSearch() method was split in two, and the Java documentation was updated.

cchenginfotel commented 1 year ago

Update: It is now possible to specify a margin of error, in years, on the date of birth. Instead of a default range fixed at 1 year for an unblocked personSearch, the user can select the margin of error themselves. If the date has been blocked, only the dates around the selected date are fetched from the database.
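
As an illustration of how such a margin can translate into a date window, the sketch below computes the lower and upper bounds around a birth date. The class name, the yyyyMMdd string format, and the choice of a symmetric window of plus or minus N years around the exact date are assumptions; the branch may define the range differently.

```java
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;

// Hypothetical sketch of a birth-date window with a user-chosen margin.
class BirthDateWindow {

    private static final DateTimeFormatter FORMAT = DateTimeFormatter.BASIC_ISO_DATE; // yyyyMMdd

    /** Returns {from, to}: the birth date minus and plus marginYears. */
    static String[] bounds(String birthDate, int marginYears) {
        LocalDate date = LocalDate.parse(birthDate, FORMAT);
        return new String[] {
            date.minusYears(marginYears).format(FORMAT),
            date.plusYears(marginYears).format(FORMAT)
        };
    }

    /** True if the candidate date lies inside the window, bounds included. */
    static boolean within(String candidate, String birthDate, int marginYears) {
        String[] window = bounds(birthDate, marginYears);
        // yyyyMMdd strings sort in the same order as the dates they encode.
        return candidate.compareTo(window[0]) >= 0 && candidate.compareTo(window[1]) <= 0;
    }
}
```

The two bounds could serve as the parameters of a range condition in the database query, so that only records inside the window are fetched.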

cchenginfotel commented 1 year ago

updated: