ArniDagur / python-adblock

Brave's adblock library in Python
https://pypi.org/project/adblock/
Apache License 2.0

Loading adblock cache takes 2-4s with huge cache file #62

Open The-Compiler opened 2 years ago

The-Compiler commented 2 years ago

Before I explain the issue, let me note that I'm not sure if this is the right place - chances are, the answer is just "don't do that...", or perhaps this is something that can be improved somehow in the adblocking library rather than this wrapper. However, I lack the Rust knowledge to properly report it there, and I'd like to hear your opinion on this first.

Apparently, there have been posts suggesting to add many more filter lists to qutebrowser, namely:

c.content.blocking.adblock.lists = [
    'https://raw.githubusercontent.com/uBlockOrigin/uAssets/master/filters/annoyances.txt',
    'https://raw.githubusercontent.com/uBlockOrigin/uAssets/master/filters/badlists.txt',
    'https://raw.githubusercontent.com/uBlockOrigin/uAssets/master/filters/badware.txt',
    'https://raw.githubusercontent.com/uBlockOrigin/uAssets/master/filters/filters-2020.txt',
    'https://raw.githubusercontent.com/uBlockOrigin/uAssets/master/filters/filters-2021.txt',
    'https://raw.githubusercontent.com/uBlockOrigin/uAssets/master/filters/filters.txt',
    'https://raw.githubusercontent.com/uBlockOrigin/uAssets/master/filters/privacy.txt',
    'https://raw.githubusercontent.com/uBlockOrigin/uAssets/master/filters/resource-abuse.txt',
    'https://raw.githubusercontent.com/uBlockOrigin/uAssets/master/thirdparties/easylist-downloads.adblockplus.org/easyprivacy.txt',
    'https://raw.githubusercontent.com/uBlockOrigin/uAssets/master/thirdparties/pgl.yoyo.org/as/serverlist',
    'https://raw.githubusercontent.com/StevenBlack/hosts/master/alternates/fakenews-gambling/hosts',
    'https://raw.githubusercontent.com/AdAway/adaway.github.io/master/hosts.txt',
    'https://fanboy.co.nz/fanboy-problematic-sites.txt',
    'https://easylist.to/easylist/easylist.txt',
    'https://raw.githubusercontent.com/bogachenko/fuckfuckadblock/master/fuckfuckadblock.txt'
]

It looks like there are people who blindly copy that, because more clearly must be better or something...

However, running :adblock-update with those lists results in an adblock-cache.dat file of around 130 MB, and qutebrowser hangs for about 2-4s at startup with it (when calling self._engine.deserialize_from_file). Some questions/ideas:

ArniDagur commented 2 years ago

Do you know if the performance has always been this bad for large blocklists, or if it only started happening after the recent 0.5.2 update? In the latter case, I may have an inkling of what is wrong.

The-Compiler commented 2 years ago

Looks like it takes a similar time on 0.5.0 too for me.

ArniDagur commented 2 years ago

Ah, I think I know why. The underlying adblock library used to compress the cache file with brotli, but doesn't do that anymore. I guess I should fix that by compressing with brotli in python-adblock. Alternatively, qutebrowser could use serialize (as opposed to serialize_to_file), and do its own compression using zlib.
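
A minimal sketch of the second option (the caller doing its own zlib compression), in case it helps the discussion. The helper names `save_compressed` and `load_compressed` are hypothetical, and the commented usage assumes that `serialize` returns a `bytes` object and that a matching `deserialize` accepts one; neither is confirmed here. Note that this mainly shrinks the file on disk; whether it shortens load time would have to be measured.

```python
import zlib
from pathlib import Path


def save_compressed(data: bytes, path: Path, level: int = 6) -> None:
    """Compress an already-serialized blob with zlib and write it to disk."""
    path.write_bytes(zlib.compress(data, level))


def load_compressed(path: Path) -> bytes:
    """Read a zlib-compressed blob from disk and return the original bytes."""
    return zlib.decompress(path.read_bytes())


# Hypothetical usage with an adblock engine (assumed API, not executed here):
#   save_compressed(engine.serialize(), cache_path)
#   engine.deserialize(load_compressed(cache_path))
```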

The-Compiler commented 2 years ago

Both would be fine by me. Though from reading https://github.com/brave/adblock-rust/commit/85063aa56c11a8c4d6b7f99abc55d1a9ae4b7002 (which was then reverted in https://github.com/brave/adblock-rust/commit/d61c0e53a798042531429cea39b51e0e78f94a54), it should still be doing gzip compression, no?
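
One quick way to check this locally might be to look at the first two bytes of the cache file, since a plain gzip stream always starts with the magic bytes `1f 8b`. This is only a sketch: the helper name `looks_gzipped` is hypothetical, and it assumes the cache is written as a bare gzip stream with no custom header in front of it, which may not hold for every version of the library.

```python
from pathlib import Path

# Every gzip stream starts with these two magic bytes (RFC 1952).
GZIP_MAGIC = b"\x1f\x8b"


def looks_gzipped(path: Path) -> bool:
    """Return True if the file begins with the gzip magic bytes."""
    with path.open("rb") as f:
        return f.read(2) == GZIP_MAGIC


# Hypothetical usage; the actual cache location depends on the setup:
#   looks_gzipped(Path("adblock-cache.dat"))
```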