Inconsistent behaviour when trying to search for duplicates

kessa commented 3 years ago

I recently needed to run a massive de-duplication of emails (over 16k duplicates in 20k emails, due to a technical issue). FYI: this is a newly installed Thunderbird (v78.7) running on Windows 10.

The 1st attempt worked brilliantly and did a good job of clearing out the duplicates.

I then needed to re-download a few emails which were still being held on our ISP's servers. I knew this would re-introduce a few duplicates, but I wasn't concerned as the 1st clean-up had gone so well.

However, when trying to run the de-duplication a 2nd time (to remove these newly downloaded duplicates), it said there were "No duplicates found".

This is odd, as I can literally see them on screen (see the screenshot below), and I haven't changed any settings since the last scan was run.

I tried restarting Thunderbird just in case it was a glitch, but the add-on still says it can't find any duplicates... even though I'm looking directly at them.

Everything is exactly the same, including the time.

Any ideas?

kessa commented 3 years ago

Update:

In case it helps, I've just checked the headers from 1 of the duplicates and have provided them below.

From - Mon Feb 22 16:21:10 2021
X-Account-Key: account5
X-UIDL: AD/1LuJ9cFciX6RMOw5IQKTO+Zo
X-Mozilla-Status: 0001
X-Mozilla-Status2: 00000000
X-Mozilla-Keys:
Received: from 10.196.217.15 by atlas103.aol.mail.bf1.yahoo.com with HTTP; Thu, 5 Nov 2020 19:02:18 +0000

and....

From - Mon Feb 22 14:27:22 2021
X-Mozilla-Status: 0001
X-Mozilla-Status2: 00000000
X-Mozilla-Keys:
Received: from 10.196.217.15 by atlas103.aol.mail.bf1.yahoo.com with HTTP; Thu, 5 Nov 2020 19:02:18 +0000

It looks like the "Received: from" date & time (i.e. Thu 5th Nov 2020 at 19:02) match in both, so the add-on should detect a duplicate.

However, there's an additional/different "From" date and time (at the very start of each header) which doesn't match, so perhaps that's the source of the problem?
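For anyone wanting to compare two header excerpts like these field by field, here is a minimal sketch using Python's standard `email` module. It relies on the leading `From - ...` line being the mbox separator that Thunderbird writes when it stores a message locally, rather than a header of the message itself. The helper name `split_headers` is made up for illustration, and nothing here reflects how the add-on actually compares messages.

```python
from email import message_from_string

# The two header excerpts quoted above, one header per line.
first = """\
From - Mon Feb 22 16:21:10 2021
X-Account-Key: account5
X-UIDL: AD/1LuJ9cFciX6RMOw5IQKTO+Zo
X-Mozilla-Status: 0001
X-Mozilla-Status2: 00000000
X-Mozilla-Keys:
Received: from 10.196.217.15 by atlas103.aol.mail.bf1.yahoo.com with HTTP; Thu, 5 Nov 2020 19:02:18 +0000
"""

second = """\
From - Mon Feb 22 14:27:22 2021
X-Mozilla-Status: 0001
X-Mozilla-Status2: 00000000
X-Mozilla-Keys:
Received: from 10.196.217.15 by atlas103.aol.mail.bf1.yahoo.com with HTTP; Thu, 5 Nov 2020 19:02:18 +0000
"""


def split_headers(raw):
    """Hypothetical helper: return (mbox separator line, dict of real headers)."""
    msg = message_from_string(raw)
    # The parser stores a leading "From " line separately from the headers.
    return msg.get_unixfrom(), dict(msg.items())


from_a, headers_a = split_headers(first)
from_b, headers_b = split_headers(second)

# Does the mbox separator line differ between the two copies?
separator_differs = from_a != from_b
# Headers present in the first copy only.
only_in_first = sorted(set(headers_a) - set(headers_b))
# Headers present in both copies but with different values.
mismatched = sorted(
    name for name in set(headers_a) & set(headers_b)
    if headers_a[name] != headers_b[name]
)

print("separator differs:", separator_differs)
print("only in first copy:", only_in_first)
print("shared headers with different values:", mismatched)
```

In this excerpt the separator lines differ and the first copy carries two extra account-related headers, while every header the two copies share is identical.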

eyalroz commented 3 years ago

You are likely in the very common situation of seeing "almost-dupes".

Start gradually removing comparison criteria until dupes are found. A common culprit is number of lines, as sometimes messages are downloaded with an extra empty line at the bottom.
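The idea of relaxing criteria can be sketched roughly as follows. This is only an illustration, not the add-on's code: the message records, field names (`subject`, `author`, `date`, `num_lines`) and both functions are hypothetical. Messages are grouped by a key built from the active criteria, and criteria are dropped one at a time until some group contains more than one message.

```python
from collections import defaultdict

# Hypothetical message records; messages 1 and 2 differ only in line count.
messages = [
    {"id": 1, "subject": "Invoice", "author": "a@example.com",
     "date": "Thu, 5 Nov 2020 19:02:18 +0000", "num_lines": 42},
    {"id": 2, "subject": "Invoice", "author": "a@example.com",
     "date": "Thu, 5 Nov 2020 19:02:18 +0000", "num_lines": 43},
    {"id": 3, "subject": "Hello", "author": "b@example.com",
     "date": "Fri, 6 Nov 2020 08:00:00 +0000", "num_lines": 10},
]


def find_dupes(msgs, criteria):
    """Group message ids by the chosen criteria; return groups with 2+ members."""
    groups = defaultdict(list)
    for msg in msgs:
        key = tuple(msg[name] for name in criteria)
        groups[key].append(msg["id"])
    return [ids for ids in groups.values() if len(ids) > 1]


def relax_until_found(msgs, criteria):
    """Drop criteria one at a time (last first) until duplicates appear."""
    active = list(criteria)
    while active:
        dupes = find_dupes(msgs, active)
        if dupes:
            return active, dupes
        active.pop()  # remove the most recently listed criterion
    return [], []


all_criteria = ["subject", "author", "date", "num_lines"]
strict = find_dupes(messages, all_criteria)
used, relaxed = relax_until_found(messages, all_criteria)
print("strict match:", strict)
print("criteria that found dupes:", used, "->", relaxed)
```

With all four criteria nothing matches, because one copy has an extra line; once `num_lines` is dropped, messages 1 and 2 are reported as a duplicate set.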

kessa commented 3 years ago

Hi @eyalroz

Yes, that seems to be the case.

Having done more research, I think part of the problem is that some have (as you rightly say) gained an extra line or two.

In other cases, it seems we've somehow managed to gain/download plain text versions of emails which we'd previously only had in HTML format.

Would it be possible to add a new feature where you can select if you want to keep/delete an email based on whether it's plain text or HTML?

(I've tried to do this by sorting on file size/number of lines, on the assumption that the HTML version would normally be larger, but weirdly that isn't always the case.)

Thanks :-)
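As a rough illustration of telling plain-text and HTML copies apart outside the add-on, the MIME content type of each body part can be read directly instead of guessing from size. The sketch below uses Python's standard `email` module; the sample messages and the `body_formats` helper are invented for the example, and this is not a feature of the add-on.

```python
from email.message import EmailMessage


def body_formats(msg):
    """Hypothetical helper: return the set of text content types in a message."""
    return {
        part.get_content_type()
        for part in msg.walk()
        if part.get_content_maintype() == "text"
    }


# Made-up sample messages: plain only, HTML only, and both as alternatives.
plain_only = EmailMessage()
plain_only["Subject"] = "Invoice"
plain_only.set_content("Your invoice is attached.")

html_only = EmailMessage()
html_only["Subject"] = "Invoice"
html_only.set_content("<p>Your invoice is attached.</p>", subtype="html")

both = EmailMessage()
both["Subject"] = "Invoice"
both.set_content("Your invoice is attached.")
both.add_alternative("<p>Your invoice is attached.</p>", subtype="html")

for name, msg in [("plain_only", plain_only), ("html_only", html_only), ("both", both)]:
    print(name, sorted(body_formats(msg)))
```

A copy that reports only `text/plain` and a copy that reports `text/html` can then be treated differently, whichever of the two happens to be larger.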

eyalroz commented 3 years ago

> Would it be possible to add a new feature where you can select if you want to keep/delete an email based on whether it's plain text or HTML?

Please open a new issue about that.

eyalroz / removedupes

Inconsistent behaviour when trying to search for duplicates #43