ShammyLevva / FTAnalyzer

Family Tree Analyzer - Finds hidden details in your family tree. Install at
http://www.ftanalyzer.com/install
Apache License 2.0

[BUG] Duplicate search is not thread safe #219

Closed ennoborg closed 2 years ago

ennoborg commented 3 years ago

8.1.2.0

Duplicate search quits at 99 %.

In the article https://www.tamurajones.net/DuplicateCheckingWithFTAnalyzer.xhtml, Tamura Jones describes an unhandled exception when searching for duplicates in a very large GEDCOM. Code inspection suggests that this is probably caused by duplicates.Add() being called from different threads.

The reason to think so is that the exception shown in the attached screenshot can only occur when a null element is found in the duplicates list. Since these elements are known to be valid when they are added, and the adding occurs in two tasks, the most probable cause of the null elements is corruption of that list.

For more information, please read https://stackoverflow.com/questions/5589315/list-add-thread-safety
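
For illustration, a minimal, self-contained sketch of the race; the list and loop here are stand-ins, not FTAnalyzer's actual code. List<T>.Add reads the current count, writes to that slot, and increments the count; two threads interleaving those steps can skip slots, leaving null holes that only surface later:

    using System;
    using System.Collections.Generic;
    using System.Threading.Tasks;

    class ListRaceDemo
    {
        static void Main()
        {
            var duplicates = new List<string>();
            // Two tasks adding concurrently, as in the male/female duplicate scans.
            Task a = Task.Run(() => { for (int i = 0; i < 100_000; i++) duplicates.Add("M" + i); });
            Task b = Task.Run(() => { for (int i = 0; i < 100_000; i++) duplicates.Add("F" + i); });
            Task.WaitAll(a, b);

            // With unsynchronised adds the count is often short and null slots appear
            // (the Add itself may even throw). Those nulls crash later sorting/display.
            Console.WriteLine($"Count: {duplicates.Count}, nulls: {duplicates.FindAll(d => d is null).Count}");
        }
    }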

To Reproduce
Steps to reproduce the behavior:

  1. Load a very large GEDCOM
  2. Click on Errors/Fixes, Duplicates?
  3. Wait a few hours for the duplicate search to complete
  4. See the error, or not; whether it occurs is a matter of chance.

Expected behavior
The duplicate search completes without errors.

Screenshots
[screenshot of the unhandled exception dialog]

Additional context
The Stack Overflow thread suggests several ways to add items safely.

ShammyLevva commented 3 years ago

Thanks for the report. Another wildly negative Tamura rant; he moans without ever actually bothering to raise issues or suggest improvements. It's only because thoughtful users like yourself want to see the program improve that issues get raised.

I program for the learning experience and the mental challenge rather than as a professional developer. So whilst I'm aware of the advantages of threading, I'm not fully up to speed on the drawbacks, and I fall into these sorts of easy-to-forget traps.

Optimisation is another area where I know the program could do with a lot of work, but it's one of those things that takes time to get round to and is incredibly tedious to debug, so it gets back-burnered. Which means people like Tamura who have enormous trees (although I've got examples of far bigger trees that people have supplied) find that processing times for some things take forever.

I'd get round to looking at these things sooner if such people bothered to engage positively rather than just snipe from the sidelines and criticise negatively. Programs don't improve if people don't engage and report issues, so thanks for reporting this.

The core issue, as you identified, is the list. I've changed this to use a thread-safe ConcurrentBag for the processing, converting to a sortable list for the display. I'd caught the exception thrown when someone cancelled but hadn't noticed that I wasn't catching other exceptions, hence the crash. It should now finish gracefully if it errors prematurely.
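
A minimal sketch of that pattern, with DuplicateResult as a stand-in type (the real FTAnalyzer type will differ): collect matches in a thread-safe ConcurrentBag while the tasks run, then snapshot into a List for sorting and display.

    using System.Collections.Concurrent;
    using System.Collections.Generic;
    using System.Linq;

    var duplicates = new ConcurrentBag<DuplicateResult>();

    // ...both scanning tasks can now call duplicates.Add(result) safely...

    // Once processing finishes, take a snapshot and sort it for the grid.
    List<DuplicateResult> sortable = duplicates.ToList();
    sortable.Sort((x, y) => y.Score.CompareTo(x.Score)); // highest score first

    // Hypothetical result type, for illustration only.
    record DuplicateResult(string IndA, string IndB, int Score);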

ShammyLevva commented 3 years ago

I've replied on the thread on Facebook. My main criticism is that he never once reached out to say there was a problem, nor asked if I could fix the issue. Had he done that, I could have looked, noticed the exception-catching issue, and issued a fix for him there and then. That would have saved him hours of frustration.

I appreciate that software developers often don't respond; I do try to, so that users don't have these frustrations. However, I CANNOT fix what I'm not informed about. My frustration and anger was at being criticised with no attempt made to put things right, not at him criticising the program; those were valid criticisms. If only he'd bothered to reach out, he'd have saved himself and myself a load of heartache.

Your observations have been extremely helpful. The exception throw was just a careless mistake, the thread-safe handling was an error of knowledge, and the processing inefficiencies you note in the Facebook thread come from erring on the side that works efficiently for most people, who have small trees. You are absolutely right, though: it needs a rework to be more efficient.

I am especially interested in your comment re: jumping the list and my method not working. Can you confirm whether I've understood the issue you're suggesting? I believe you mean it's reading the whole list each time rather than jumping to the index, so it gets exponentially worse as the list grows?

ennoborg commented 3 years ago

When it comes to thread safety, my knowledge goes back to the '80s, when I learned about operating systems in theory and in practice. That's when I learned that even with cooperative multi-tasking, as in old Windows versions, things can go wrong. I was working with C at the time, and now that we have multi-threading built in, as with C#, I'm even more careful.

You can see my hack to speed up the search in this commit on my fork:

https://github.com/ennoborg/FTAnalyzer.Shared/commit/8175a4359298ab5bdfafc390e9de32ac097d947b

Note that there's still a duplicates.Add in that, because I simply removed the tasks instead of switching to the ConcurrentBag, and sent that code to Tamura for testing. I did that partly because it was the fastest option and I had other things to do that day, and partly because Tamura thought it would be nice to have a single list, instead of males and females, so that people with different genders but matching names and dates can be detected as duplicates. That lets you spot mistakes made on data entry, which can indeed be useful with studies like his.

Like I wrote on Facebook, for small trees the biggest speed-up comes from moving the cancel and progress code to the outer loop, which results in an enormous reduction in the number of calls to ThrowIfCancellationRequested and in progress updates, which in my code are only done for 1 in 10 persons. You may of course tune that to make the code more responsive, but personally I see no rationale for that code in the inner loop, since I assume that a duplicate search for one person is fast enough to detect a cancellation in a reasonable time.
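
The shape of that change, as a sketch with stand-in names (ct is a CancellationToken, progress an IProgress<int>, and CheckForDuplicate a placeholder for the actual comparison):

    // Cancellation and progress are checked once per person in the outer loop,
    // not once per comparison in the inner loop; progress only every 10th person.
    for (int i = 0; i < individuals.Count; i++)
    {
        ct.ThrowIfCancellationRequested();
        if (i % 10 == 0)
            progress.Report(100 * i / individuals.Count);

        for (int j = i + 1; j < individuals.Count; j++)
            CheckForDuplicate(individuals[i], individuals[j]); // hot inner loop stays lean
    }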

Please note that I work like a hacker, meaning that I try to repair things when they're broken, and also try to break things for fun, but I don't have the stamina to create a perfect UI. I had to do the latter for my job, but currently I'm in early retirement.

ShammyLevva commented 3 years ago

I think the biggest discovery from your tweaks is that, for some mind-numbingly stupid reason, I put the progress bar checking inside the inner loop. Moving that to the outer loop will be orders of magnitude faster; it's just so blindingly obvious. Thanks. Indeed, the performance profiler suggested that my test file of 4.6 million individuals spent 24% of its time calculating the progress bar percentage, which is utterly stupid.

The main issue is that it was only late last year that a helpful user provided his file of 4.6 million individuals. Prior to that, the biggest I had was around the 40k individuals mark. Clearly there are going to be dramatic differences in performance when there is 100 times more data in the larger file.

I've also noted that changing the progress bar to update only when the percentage has actually changed, rather than every few records, will speed things up again too.
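
A sketch of that guard, with illustrative names; the UI is then touched at most 100 times regardless of the record count:

    int lastPercent = -1;
    for (int i = 0; i < total; i++)
    {
        int percent = (int)(100L * i / total);
        if (percent != lastPercent) // only report when the displayed value changes
        {
            lastPercent = percent;
            progress.Report(percent);
        }
        // ...process record i...
    }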

I had a brief discussion with my lecturer mate and he's going to have a look at the code this evening. His initial suggestion is to create a list of comparable keys and then pass that to a large task-scheduling routine that splits the job up automatically based on the number of processor cores the machine has. This means that instead of the arbitrary male/female split, it will do its own split. So on my new machine with 24 cores I'll get 24 threads running at 4.7 GHz rather than 2 threads, which will make a substantial difference too. I may have to throttle performance, though, to say 60-80% of CPU so it doesn't entirely hog a machine. Obviously thread safety there will be a huge issue.
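
For the throttling, .NET's ParallelOptions can cap the degree of parallelism directly, which is one way to get the 60-80% figure without a custom scheduler. A sketch, with keys and ProcessKey as stand-ins:

    using System;
    using System.Threading.Tasks;

    var options = new ParallelOptions
    {
        // Leave roughly a quarter of the cores free for the rest of the machine.
        MaxDegreeOfParallelism = Math.Max(1, (int)(Environment.ProcessorCount * 0.75))
    };
    Parallel.For(0, keys.Count, options, i => ProcessKey(keys[i]));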

He will also look at the string comparison performance, to see if it can be optimised or the checks made more efficient so they fail (i.e. reject as not a duplicate) sooner.

So hopefully we can get things made dramatically more efficient.

Tamura has reached out to me but was, as you say, still very blunt. I have no problem with criticism if it's constructive (which does NOT mean positive). Your criticism, for instance, was negative in that it pointed out gaping flaws in my code, but it was constructive in how it was presented, offering suggestions. My main concern is that I cannot fix a problem I am unaware of, and I similarly cannot read a blog I am unaware of. He seemed to be suggesting that I should have been aware of his blog (I've no idea who he is) or should have Googled to find it, which makes no sense; should I be spending every day searching for comments by people I've never heard of? So I got angry and frustrated and lashed out, as I thought he was another random on the internet posting "see, I've found a problem, this software's crap, I'm so clever", like sadly so many people do these days, making posts with no engagement.

Hopefully I can fix the issues and demonstrate I'm happy to engage with users and address their concerns so that the product is better for everyone. He's made some excellent suggestions and I'd rather implement things and fix his issues than have an unhappy user. However it's a two way street, I cannot help if I don't have the info to help.

ennoborg commented 3 years ago

I just did some reading in four-year-old course material that I have for Parallel Computing, and found that there may be quite a simple solution, which is Parallel.For; see

https://docs.microsoft.com/en-us/dotnet/standard/parallel-programming/how-to-write-a-simple-parallel-for-loop

I never tried this myself, probably because that course packed too much info into one single day, but looking at the examples on the Microsoft site, it should be easy to test once you translate the outer loop to a Parallel.For. Most of the code is already thread safe now that you use the ConcurrentBag for the duplicates; the only other change you need to make is to protect the progress counter, in the same manner as is done for totalSize in the directory-size example. You need to do that because otherwise the progress bar will not always reach 100 %.
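
Putting those pieces together, a sketch under the same stand-in names as before; the Interlocked call mirrors what the Microsoft example does for totalSize:

    using System.Threading;
    using System.Threading.Tasks;

    int processed = 0; // shared counter, so it must be updated atomically
    Parallel.For(0, individuals.Count, i =>
    {
        CheckIndividualForDuplicates(individuals[i]); // adds hits to the ConcurrentBag
        int done = Interlocked.Increment(ref processed);
        if (done % 10 == 0)
            progress.Report(100 * done / individuals.Count);
    });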

ennoborg commented 3 years ago

Although there's also a Parallel.ForEach, my hacks with that were unsuccessful, so I held on to Parallel.For and made this:

https://github.com/ennoborg/FTAnalyzer.Shared/commit/dcec39abc53b4d8b7feb516d986fac8fa8c4d011

With that, a full duplicate check for my personal tree, with 12,996 persons, takes less than 2 seconds. And that's in Debug mode, inside VirtualBox, with 4 GB RAM, and 4 of 8 cores of my i7. The other 4 GB and 4 cores are for the host, Linux Mint.

ShammyLevva commented 3 years ago

Separately from your efforts, I refactored the duplicates code. It now parallelises at the individual level, exploiting the fact that two individuals with different Soundex surnames can never be candidate matches. This massively reduces the compute time, by several orders of magnitude. It also means performance jumps on machines with more cores available, so it was a great test of my new 24-core machine.
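
The partitioning idea, as a sketch (Soundex, IsCandidateMatch, and DuplicateResult are stand-ins for whatever the real code uses):

    using System.Collections.Concurrent;
    using System.Linq;
    using System.Threading.Tasks;

    // Individuals whose surnames encode to different Soundex codes can never match,
    // so group by Soundex first and only compare pairs within each bucket, in parallel.
    var buckets = individuals.GroupBy(ind => Soundex(ind.Surname));
    var duplicates = new ConcurrentBag<DuplicateResult>();

    Parallel.ForEach(buckets, bucket =>
    {
        var group = bucket.ToList();
        for (int i = 0; i < group.Count; i++)
            for (int j = i + 1; j < group.Count; j++)
                if (IsCandidateMatch(group[i], group[j], out DuplicateResult result))
                    duplicates.Add(result);
    });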

I've a meeting tomorrow so had to head to bed, but I committed the work I'd done to date. It needs more testing and tidy-up; in particular, the display code takes time on large datasets while still showing the calculation percentage at 100%, so it needs to refresh the progress bar if the number of displayed records is large.

The 4.6 million records file took under 20 minutes to process, compared with 7 hours before. My own file of around 9,000 individuals completes instantaneously.

It still needs UI work too: cancelling just finishes rather than displaying what has been found so far, and there's the ignore-twins option to implement. I'll have another look over the weekend.

ennoborg commented 3 years ago

I don't know whether you like this as an addition to this issue, but because it's about duplicates, I mention it here:

When the program is configured to use CHR or BAPM facts if BIRT data are not available, I can see that these facts are taken into account for persons born before French rule, but the dates used for comparison don't show in the duplicates list. I can see them when I double-click a row in the duplicates list, in the person details, but I'd prefer to see them in the list too.

In Gramps, these facts are shown in italics under such circumstances, but I don't know whether that sort of formatting is available in your duplicates list.

ShammyLevva commented 3 years ago

Sorry I'm not sure I understand "before French rule"?

Do you mean the birth date displayed for duplicate individuals is unknown when it should be the CHR or BAPM date?

BTW, have you tried my latest build? It's faster by a massive margin.

ennoborg commented 3 years ago

Before French rule means before Napoleon took over our government and introduced the civil registry, around 1810. In my tree that means that before that time, my sources are church books with baptisms/christenings and burials, and sometimes a birth date added as a note.

And yes, that's what I mean. When I select duplicates found in the 18th century, I can see that they have a valid CHR or BAPM, but the birth date is still shown as UNKNOWN in the overview. And I want to see the date that was actually used in the process.

I don't see a new build in the Windows store. Do you mean that I must pull a new version from your repositories and build that myself? I haven't done that in days, because I run most experiments in my own fork.

ennoborg commented 3 years ago

I'm having a build problem with your latest code, in the ColumnDetails class.

ShammyLevva commented 3 years ago

I've not released a new build yet as there are still outstanding issues, e.g. the baptism date one. So yes, I meant grab the upstream code to a branch.

What’s the issue in ColumnDetails class? Did I forget to commit the changes in the shared project?

ennoborg commented 3 years ago

The current build is ok, but I'm seeing null ref exceptions caused by mismatching column names.

ShammyLevva commented 3 years ago

I had a weekly online gaming session with mates at 8pm UK time, so I committed what I'd done so far. Is there a particular class that you noticed a problem in?

The classes should build names based on properties, so I'm not sure where it's mismatching. I did notice that Visual Studio's WinForms designer seemed to go rogue adding text box cells, so there's probably something that needs to be set for when the designer calls the constructor rather than the runtime calling it. I've not investigated what is required for that, so its current state is very much in flux.

ennoborg commented 3 years ago

Today it seems to be solved by a Clean Rebuild, but when double-clicking a row in the duplicates list, I get another exception, in

    void DgDuplicates_CellDoubleClick(object sender, DataGridViewCellEventArgs e)
    {
        if (pbDuplicates.Visible || e.RowIndex < 0 || e.ColumnIndex < 0)
            return; // do nothing if progress bar still visible
        // Throws here: the grid no longer has a column named "DuplicateIndividualID".
        string indA_ID = (string)dgDuplicates.CurrentRow.Cells["DuplicateIndividualID"].Value;
        string indB_ID = (string)dgDuplicates.CurrentRow.Cells["MatchIndividualID"].Value;

in the MainForm class. It says that the value for DuplicateIndividualID can't be found, and when I click Continue in the debugger, the program exits.

ShammyLevva commented 3 years ago

Thanks, yes, checking all the event handlers was on my list of things to do.

The core change here is from using data-bound grids to virtually bound grids. To do that programmatically, I've used the class attributes of the display interfaces rather than hard-coding the field names. So I need to revisit the old event handlers to change the cell column names to use the property name rather than a hard-coded one.
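
One way to do that, sketched with an assumed display interface name (the real one may differ), is to derive the column key with nameof so it always tracks the property:

    // nameof keeps the handler in sync with the property the column was built from.
    string indA_ID = (string)dgDuplicates.CurrentRow
        .Cells[nameof(IDisplayDuplicateIndividual.DuplicateIndividualID)].Value;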

This change is separate from, but related to, the parallelising of the duplicates. What I found on the enormous database was that it finished the duplicate scan very quickly but then took ages to display all the records, when actually all it needed to do was display the visible records. Thus I changed from fixed displays to virtual ones. This will benefit not only display handling but also memory handling of large files.

Downside is that making the change breaks a lot of things until they have all been gone through and tested.

ShammyLevva commented 3 years ago

Fixed that type of error for the grids already done, and also fixed the designer load issue, but there are other event errors; e.g. now that the grids are unbound, the cell double-clicks will fail when trying to read the data-bound item. Plus there are 11 more grids to make virtual, so a good bit of tweaking, checking, and re-checking before it's back to being somewhat stable.

ennoborg commented 3 years ago

I just found another thing that can be improved, now that you don't have separate male and female lists. I found a couple named something like Mr. & Mrs. John Smith, who scored pretty high because the names were (almost) identical, and the duplicate list did not show their gender, nor did the details windows, so I had to check my tree in Gramps to see what type of duplicate this was.

So, if you have room, please add gender columns to the duplicate list.

ShammyLevva commented 3 years ago

Gender added to the display. Surnames changed to a virtual report. Still a lot of tweaks and checks needed to verify all the functionality of the event handlers for these grids; in particular, the double-clicks that open windows will likely fail. Too tired tonight though, so going to watch some telly and head to bed.

ShammyLevva commented 2 years ago

Should be fixed for v8.4.0.0. Thanks so much for all the help in identifying issues with duplicates.