Arraying / EmailExporter

See all domains you interacted with at a glance!

Added very (blazingly) fast implementation (🦀) #3

Open AntoniosBarotsis opened 10 months ago

AntoniosBarotsis commented 10 months ago

Following the extensive discussion outlined in #2, I decided to implement the issue.

Benchmarks

Bit more seriously

There weren't many libraries for parsing mbox files, so I wrote the parser myself. I tested it on my Gmail archive, which was 2 GB, and the only place it falls short of the Python version is in cases that look like this:

From: =?utf-8?Q?=CE=9D=CE=99=CE=9A=CE=9F=CE=9B=CE=91=CE=9F=CE=A3_=CE=9C?=
 =?utf-8?Q?=CE=A0=CE=91=CE=A1=CE=9F=CE=A4=CE=A3=CE=97=CE=A3_=CE=BC=CE?=
 =?utf-8?Q?=AD=CF=83=CF=89_=CF=84=CE=B7=CF=82_=CF=85=CF=80=CE=B7=CF=81?=
 =?utf-8?Q?=CE=B5=CF=83=CE=AF=CE=B1=CF=82_=CF=86=CF=89=CE=BD=CE=B7=CF?=
 =?utf-8?Q?=84=CE=B9=CE=BA=CE=BF=CF=8D_=CF=84=CE=B1=CF=87=CF=85=CE=B4?=
 =?utf-8?Q?=CF=81=CE=BF=CE=BC=CE=B5=CE=AF=CE=BF=CF=85?=
 <sbvmsvc9@microsoft.com>

I was too lazy to make the parsing work for these, so I just skip them. In the entire file there were 2 instances of this (both from Microsoft), so this doesn't seem like a big deal.
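For reference, the skipped headers are RFC 2047 encoded words. Below is a minimal sketch of how they could be decoded, not what this PR does: the names `hex_val`, `decode_q`, `q_payload` and `decode_header_value` are made up for illustration, and it only handles UTF-8 with the Q encoding (anything else is passed through untouched). The one subtlety in the sample above is that some words end in the middle of a multi-byte character (`=CE?=` followed by `=AD`), so the sketch concatenates the raw bytes of adjacent words before decoding them as UTF-8.

```rust
/// Value of one ASCII hex digit, or None if the byte is not a hex digit.
fn hex_val(b: u8) -> Option<u8> {
    match b {
        b'0'..=b'9' => Some(b - b'0'),
        b'A'..=b'F' => Some(b - b'A' + 10),
        b'a'..=b'f' => Some(b - b'a' + 10),
        _ => None,
    }
}

/// Decode a Q-encoded payload into raw bytes:
/// "=XX" is a hex-escaped byte and "_" stands for a space.
fn decode_q(payload: &str) -> Vec<u8> {
    let b = payload.as_bytes();
    let mut out = Vec::with_capacity(b.len());
    let mut i = 0;
    while i < b.len() {
        if b[i] == b'=' && i + 2 < b.len() {
            if let (Some(hi), Some(lo)) = (hex_val(b[i + 1]), hex_val(b[i + 2])) {
                out.push(hi * 16 + lo);
                i += 3;
                continue;
            }
        }
        // Anything that is not a valid escape is copied through as-is.
        out.push(if b[i] == b'_' { b' ' } else { b[i] });
        i += 1;
    }
    out
}

/// If `token` has the shape "=?utf-8?Q?payload?=", return the payload.
/// Other charsets and the B (base64) encoding are NOT handled in this sketch.
fn q_payload(token: &str) -> Option<&str> {
    let inner = token.strip_prefix("=?")?.strip_suffix("?=")?;
    let mut parts = inner.splitn(3, '?');
    let charset = parts.next()?;
    let encoding = parts.next()?;
    let payload = parts.next()?;
    if charset.eq_ignore_ascii_case("utf-8") && encoding.eq_ignore_ascii_case("q") {
        Some(payload)
    } else {
        None
    }
}

/// Decode a (possibly folded) header value. Bytes of adjacent encoded words
/// are joined before UTF-8 decoding, so a character split across two words
/// still comes out whole.
fn decode_header_value(raw: &str) -> String {
    let mut bytes: Vec<u8> = Vec::new();
    let mut prev_encoded = false;
    // split_whitespace also swallows the line folds between words.
    for token in raw.split_whitespace() {
        match q_payload(token) {
            Some(payload) => {
                // Whitespace between two encoded words is dropped.
                if !bytes.is_empty() && !prev_encoded {
                    bytes.push(b' ');
                }
                bytes.extend(decode_q(payload));
                prev_encoded = true;
            }
            None => {
                if !bytes.is_empty() {
                    bytes.push(b' ');
                }
                bytes.extend_from_slice(token.as_bytes());
                prev_encoded = false;
            }
        }
    }
    String::from_utf8_lossy(&bytes).into_owned()
}
```

Calling `decode_header_value` on everything after `From:` should give the display name followed by `<sbvmsvc9@microsoft.com>`, at which point the address could be extracted like any other.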

On the other hand, somewhat surprisingly, the Rust version detected around 20 more emails than the Python version. Some make sense and I'm not sure why they aren't in the Python version, but some just look weird:

From: d2lsupport@tudelft.brightspace.com <d2lsupport@tudelft.brightspace.co=
m>

This, for example, should not be split over multiple lines, but here we are. The Python version completely ignores it (as it is technically broken) and the Rust one prints d2lsupport@tudelft.brightspace.co. I did not read through the RFC long enough to figure out if newlines are permitted in the addr-spec element 😎
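A trailing `=` right before a line break is how quoted-printable marks a soft line break, so one possible way to recover the full address is to join such lines before extracting it. This is only a sketch under that assumption (the archive does not say how the header ended up this way), and `join_soft_breaks`, `angle_addr` and `domain_of` are hypothetical helpers, not functions from this PR.

```rust
/// Remove quoted-printable style soft line breaks: an "=" directly
/// followed by a line ending (CRLF or LF).
fn join_soft_breaks(raw: &str) -> String {
    raw.replace("=\r\n", "").replace("=\n", "")
}

/// Return the text between the last '<' and the first '>' after it, if any.
fn angle_addr(header_value: &str) -> Option<&str> {
    let start = header_value.rfind('<')? + 1;
    let len = header_value[start..].find('>')?;
    Some(&header_value[start..start + len])
}

/// Return the part after the last '@' of an address, if any.
fn domain_of(addr: &str) -> Option<&str> {
    addr.rsplit_once('@').map(|(_, domain)| domain)
}
```

Applying `join_soft_breaks` to every header would be risky, because encoded words legitimately end in `?=` right before a fold. It probably only makes sense as a fallback for values whose address fails to parse otherwise.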

But anyway, the inconsistencies were in 20 out of the 1200 total entries, so for the most part it's fine. You might get some extra junk here and there and, except for whatever that long thing Microsoft used, I at least didn't get anything less in my data file. I've also added a test that essentially diffs the Python and Rust outputs in case you're interested 👍
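The idea behind such a diff can be sketched as a pair of set differences. This is an illustration rather than the test in this PR: it assumes both tools write one entry per line, and `diff_outputs` is a hypothetical name.

```rust
use std::collections::BTreeSet;

/// Compare two outputs line by line (order and blank lines ignored).
/// Returns (entries only in `py`, entries only in `rs`), each sorted.
fn diff_outputs<'a>(py: &'a str, rs: &'a str) -> (Vec<&'a str>, Vec<&'a str>) {
    let py_set: BTreeSet<&str> = py
        .lines()
        .map(str::trim)
        .filter(|l| !l.is_empty())
        .collect();
    let rs_set: BTreeSet<&str> = rs
        .lines()
        .map(str::trim)
        .filter(|l| !l.is_empty())
        .collect();
    let only_py = py_set.difference(&rs_set).copied().collect();
    let only_rs = rs_set.difference(&py_set).copied().collect();
    (only_py, only_rs)
}
```

With this shape, the skipped Microsoft headers would show up in the first list and the extra entries such as the truncated `.co` address in the second.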

AntoniosBarotsis commented 10 months ago

Also, to the surprise of no one, it was quite a bit faster: 3sec 768ms 800µs 700ns (rs) vs 3min 28sec 250ms 390µs 500ns (py), though I doubt anyone's student email has gigabytes of data. I used my Gmail because you can't export the mbox archive from Outlook web.