Update:
In case it helps, I've just checked the headers from one pair of duplicates and have provided them below.
From - Mon Feb 22 16:21:10 2021
X-Account-Key: account5
X-UIDL: AD/1LuJ9cFciX6RMOw5IQKTO+Zo
X-Mozilla-Status: 0001
X-Mozilla-Status2: 00000000
X-Mozilla-Keys:
Received: from 10.196.217.15 by atlas103.aol.mail.bf1.yahoo.com with HTTP; Thu, 5 Nov 2020 19:02:18 +0000

and:

From - Mon Feb 22 14:27:22 2021
X-Mozilla-Status: 0001
X-Mozilla-Status2: 00000000
X-Mozilla-Keys:
Received: from 10.196.217.15 by atlas103.aol.mail.bf1.yahoo.com with HTTP; Thu, 5 Nov 2020 19:02:18 +0000
It looks like the "Received:" date and time (Thu, 5 Nov 2020 at 19:02) match in both copies, so the add-on should find a duplicate.
However, there's an additional/different "From" date and time (at the very start of each header) which doesn't match, so perhaps that's the source of the problem?
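To check that by hand, here is a minimal Python sketch. It assumes the leading "From - ..." line is the mbox separator that Thunderbird writes when it stores a message locally (so it records the download time and is not a header of the email itself). The helper name `parse_stored_message` is made up for illustration, and this is not how the add-on works internally; it only shows that the two copies can differ in the separator while the "Received:" header is identical.

```python
from email import message_from_string


def parse_stored_message(raw):
    """Split off a leading mbox 'From ' separator line, then parse the headers.

    Returns (separator_line_or_None, email.message.Message).
    """
    lines = raw.splitlines()
    separator = None
    if lines and lines[0].startswith("From "):
        separator = lines[0]
        lines = lines[1:]
    return separator, message_from_string("\n".join(lines))


# The two header blocks quoted above, with an empty body.
copy_a = (
    "From - Mon Feb 22 16:21:10 2021\n"
    "X-Account-Key: account5\n"
    "X-UIDL: AD/1LuJ9cFciX6RMOw5IQKTO+Zo\n"
    "X-Mozilla-Status: 0001\n"
    "X-Mozilla-Status2: 00000000\n"
    "X-Mozilla-Keys:\n"
    "Received: from 10.196.217.15 by atlas103.aol.mail.bf1.yahoo.com "
    "with HTTP; Thu, 5 Nov 2020 19:02:18 +0000\n"
    "\n"
)
copy_b = (
    "From - Mon Feb 22 14:27:22 2021\n"
    "X-Mozilla-Status: 0001\n"
    "X-Mozilla-Status2: 00000000\n"
    "X-Mozilla-Keys:\n"
    "Received: from 10.196.217.15 by atlas103.aol.mail.bf1.yahoo.com "
    "with HTTP; Thu, 5 Nov 2020 19:02:18 +0000\n"
    "\n"
)

sep_a, msg_a = parse_stored_message(copy_a)
sep_b, msg_b = parse_stored_message(copy_b)

separators_differ = sep_a != sep_b
received_matches = msg_a["Received"] == msg_b["Received"]
print("separator lines differ:", separators_differ)
print("Received headers match:", received_matches)
```

If that assumption holds, the differing timestamp at the top would only reflect when each copy was downloaded, which suggests the mismatch lies somewhere else in the message.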
You are likely in the very common situation of seeing "almost-dupes".
Start gradually removing comparison criteria until dupes are found. A common culprit is number of lines, as sometimes messages are downloaded with an extra empty line at the bottom.
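The idea of relaxing criteria can be sketched roughly as follows. This is only an illustration in Python, not the add-on's actual code: the field names (`subject`, `author`, `send_time`, `num_lines`), the sample messages, and the helpers `find_dupes` and `culprit_criteria` are all hypothetical. Messages are grouped by the chosen criteria, and each criterion is dropped in turn to see which one was hiding the duplicates.

```python
def find_dupes(messages, criteria):
    """Group message indices whose values match on every criterion."""
    groups = {}
    for index, message in enumerate(messages):
        key = tuple(message[name] for name in criteria)
        groups.setdefault(key, []).append(index)
    # Only groups with more than one member are duplicate sets.
    return [indices for indices in groups.values() if len(indices) > 1]


def culprit_criteria(messages, criteria):
    """Return the criteria whose removal (one at a time) reveals duplicates."""
    culprits = []
    for name in criteria:
        relaxed = [c for c in criteria if c != name]
        if find_dupes(messages, relaxed):
            culprits.append(name)
    return culprits


# Hypothetical data: two copies of one email, one with an extra trailing line.
messages = [
    {"subject": "Invoice", "author": "a@example.com",
     "send_time": "Thu, 5 Nov 2020 19:02:18 +0000", "num_lines": 42},
    {"subject": "Invoice", "author": "a@example.com",
     "send_time": "Thu, 5 Nov 2020 19:02:18 +0000", "num_lines": 43},
]
all_criteria = ["subject", "author", "send_time", "num_lines"]

strict_result = find_dupes(messages, all_criteria)
culprits = culprit_criteria(messages, all_criteria)
print("dupes with all criteria:", strict_result)
print("criteria hiding the dupes:", culprits)
```

In the add-on itself, the equivalent is to untick one comparison criterion at a time in its settings and rerun the search.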
Hi @eyalroz
Yes, that seems to be the case.
Having done more research, I think part of the problem is that some have (as you rightly say) gained an extra line or two.
In other cases, it seems we've somehow managed to gain/download plain text versions of emails which we'd previously only had in HTML format.
Would it be possible to add a new feature where you can select if you want to keep/delete an email based on whether it's plain text or HTML?
(I've tried to do this by sorting on file size or number of lines, on the assumption that the HTML version would normally be larger, but oddly that isn't always the case.)
Thanks :-)
> Would it be possible to add a new feature where you can select if you want to keep/delete an email based on whether it's plain text or HTML?
Please open a new issue about that.
I recently needed to run a massive de-duplication of emails (over 16k duplicates in 20k emails, due to a technical issue). FYI: it's a newly installed version of Thunderbird (v78.7) running on Windows 10.
The 1st attempt worked brilliantly and did a good job of clearing out the duplicates.
I then needed to re-download a few emails which were still being held on our ISP's servers. I knew this would re-introduce a few duplicates, but I wasn't concerned as the 1st clean-up had gone so well.
However, when trying to run the de-duplication a 2nd time (to remove these newly downloaded duplicates), it said there were "No duplicates found".
This is odd, as I can literally see them on screen (see the screenshot below), and I haven't changed any settings since the last scan was run.
I tried restarting Thunderbird just in case it was a glitch, but the add-on still says it can't find any duplicates... even though I'm looking directly at them.
Everything is exactly the same, including the time.
Any ideas?