eyalroz / removedupes

Remove Duplicate Messages
https://addons.thunderbird.net/en-US/thunderbird/addon/removedupes/

The `Body` comparison fails to find duplicates. #202

Open iago-lito opened 11 months ago

iago-lito commented 11 months ago

I was surprised to get "No duplicates were found" in pretty much any situation involving the Body comparison criterion. So I closed Thunderbird, went into its Mail folder, grabbed a raw archive mbox file, and tried the following:

$ cat archive > dupes && cat archive >> dupes

So, this artefactual dupes mail file is twice the size of archive and contains only duplicates, right?

Opening Thunderbird again, right-clicking on the new dupes mail folder and searching for duplicates yielded "No duplicates were found." again. I therefore suspect there is a bug in the Body comparison.

eyalroz commented 11 months ago

So, this artefactual dupes mail file is twice the size of archive and contains only duplicates, right?

You're assuming TB properly recognizes the duplicated messages + meta-data as two distinct messages. That may not be the case. Also, TB may be failing to retrieve the message bodies properly when you manipulate mbox files like that.

Still, if you can send me a compressed mbox file you've generated this way (via email or even here with an attachment), with 2x2 messages, which is supposed to have 2 dupe sets of size 2, but is not found to have them - I could try to reproduce and work on a fix.
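For anyone wanting to produce such a test file, here is a minimal sketch in Python. It is only an illustration, not part of removedupes: the two dummy messages, the addresses, and the helper names are all made up, and it assumes that writing the same mbox bytes twice is an acceptable way to get 2 dupe sets of size 2.

```python
import gzip
import os
import tempfile

# Two hand-written dummy messages in mbox format. Every address and
# header value here is invented for illustration.
SAMPLE_MBOX = (
    b"From alice@example.com Mon Jan  1 00:00:00 2024\n"
    b"From: alice@example.com\n"
    b"Subject: First dummy\n"
    b"\n"
    b"Body of the first dummy message.\n"
    b"\n"
    b"From bob@example.com Mon Jan  1 00:00:01 2024\n"
    b"From: bob@example.com\n"
    b"Subject: Second dummy\n"
    b"\n"
    b"Body of the second dummy message.\n"
    b"\n"
)


def make_compressed_dupes(source_path, target_gz_path):
    """Write the source mbox twice, back to back, into a gzip file."""
    with open(source_path, "rb") as src:
        data = src.read()
    with gzip.open(target_gz_path, "wb") as dst:
        dst.write(data)
        dst.write(data)


def count_from_lines(gz_path):
    """Count lines starting with 'From ', the mbox message separator."""
    with gzip.open(gz_path, "rb") as f:
        return sum(1 for line in f if line.startswith(b"From "))


def demo():
    """Build the 2x2 file in a temporary directory and count its messages."""
    with tempfile.TemporaryDirectory() as tmp:
        source = os.path.join(tmp, "archive")
        target = os.path.join(tmp, "dupes.gz")
        with open(source, "wb") as f:
            f.write(SAMPLE_MBOX)
        make_compressed_dupes(source, target)
        return count_from_lines(target)


if __name__ == "__main__":
    print(demo())
```

The separator count only confirms that the file contains four `From ` lines; whether Thunderbird treats them as four distinct messages is exactly the open question.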

Please note that my availability until late this month is rather low.

Did you remove all other criteria?

iago-lito commented 11 months ago

Did you remove all other criteria?

Not when I wrote the OP, but I have tested now with only Body selected and the same happens indeed.

You're assuming TB properly recognizes the duplicated messages + meta-data as two distinct messages. That may not be the case. Also, TB may be failing to retrieve the message bodies properly when you manipulate mbox files like that.

From what I understand, mbox files are just text files containing all messages in sequence, each one starting at a line that begins with "From " (^From), so I think it does make sense to concatenate two files like this. I was also convinced when I saw that TB correctly interpreted the result.
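The concatenation argument can be checked outside Thunderbird. The sketch below is an illustration rather than anything taken from removedupes: it uses Python's standard `mailbox` module, writes two made-up messages to a temporary mbox, doubles the file byte for byte as the `cat` commands do, and groups the parsed messages by a hash of their bodies. The helper names and sample addresses are hypothetical, and the result says nothing about how Thunderbird itself indexes the folder.

```python
import hashlib
import mailbox
import os
import tempfile
from collections import defaultdict


def write_sample_mbox(path):
    """Write two small, distinct dummy messages to a new mbox file."""
    box = mailbox.mbox(path, create=True)
    try:
        for n in (1, 2):
            msg = mailbox.mboxMessage()
            msg["From"] = "sender@example.com"        # made-up address
            msg["To"] = "recipient@example.com"       # made-up address
            msg["Subject"] = f"Dummy message {n}"
            msg.set_payload(f"This is the body of dummy message {n}.\n")
            box.add(msg)
        box.flush()
    finally:
        box.close()


def group_by_body(path):
    """Map a SHA-256 digest of each body to the message keys sharing it."""
    groups = defaultdict(list)
    box = mailbox.mbox(path, create=False)
    try:
        for key, msg in box.items():
            # Non-multipart messages only; decode=True returns bytes.
            body = msg.get_payload(decode=True) or b""
            groups[hashlib.sha256(body).hexdigest()].append(key)
    finally:
        box.close()
    return dict(groups)


def demo():
    """Double a sample mbox and return the sorted sizes of the body groups."""
    with tempfile.TemporaryDirectory() as tmp:
        archive = os.path.join(tmp, "archive")
        dupes = os.path.join(tmp, "dupes")
        write_sample_mbox(archive)
        with open(archive, "rb") as src:
            data = src.read()
        # Same effect as: cat archive > dupes && cat archive >> dupes
        with open(dupes, "wb") as dst:
            dst.write(data + data)
        groups = group_by_body(dupes)
        return sorted(len(keys) for keys in groups.values())


if __name__ == "__main__":
    print(demo())
```

With the sample above, the doubled file parses into four messages forming two groups of two identical bodies. If a real archive behaves the same way, the mbox itself is probably fine and the problem more likely lies in how the bodies are retrieved or compared.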

if you can send me a compressed mbox file you've generated this way

There you go. This is not compressed but very small. I have crafted a toy example from only two dummy messages. The second file is just the concatenation of twice the first file so it contains no extra information. This is a rather minimal example that I have been able to reproduce the bug with:

I would prefer that these two URLs not linger online for too long. Can you please tell me when you have the files on your side so I can remove them?

Please note that my availability until late this month is rather low.

No worries, thank you for removedupes <3

eyalroz commented 11 months ago

I'll try to find time to look at this next week; if I haven't, please poke me again. With work, plus anti-war activities, plus other repositories of mine (cuda-api-wrappers), I'm kind of swamped.

iago-lito commented 11 months ago

Take your time :) Do you have the files on your side so I can take them offline?

iago-lito commented 7 months ago

Friendly ping @eyalroz, but maybe you're not out of the swamp yet..

lkasdj9 commented 5 months ago

Joining iago-lito. Same issue for a while now (115). Another friendly ping @eyalroz