cloudveiltech / Filter-Windows

HTTP/S Content Filter for Windows 7 and newer
Mozilla Public License 2.0

Improvements to text trigger scan performance #144

Closed (kfreezen closed this 6 years ago)

kfreezen commented 6 years ago

1) All content inside <script></script> and <style></style> tags is now stripped and is not scanned.
2) All opening and closing tags are stripped from the token list, except for alt, title, and href attributes, which are kept.
3) Added a second table to the SQLite database. This table stores the first word of every loaded trigger.
4) Changed the algorithm to scan for multi-word triggers only when necessary.
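
The four changes above can be sketched roughly as follows. This is an illustration in Python, not the project's actual code. Every name here (VisibleTextExtractor, tokenize, build_index, find_triggers, and the triggers and first_words tables) is hypothetical, and the table layout is an assumption. The idea is that most tokens are rejected after a single lookup in the first-word table, and multi-word phrases are only assembled and checked when a token is known to start some trigger.

```python
import re
import sqlite3
from html.parser import HTMLParser


class VisibleTextExtractor(HTMLParser):
    """Collects text outside <script>/<style>, plus alt/title/href values."""

    SKIPPED = {"script", "style"}
    KEPT_ATTRS = {"alt", "title", "href"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1
            return
        # The tag itself is dropped; only selected attribute values survive.
        for name, value in attrs:
            if name in self.KEPT_ATTRS and value:
                self.parts.append(value)

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.parts.append(data)


def tokenize(html):
    """Return lowercase word tokens from the scannable parts of the page."""
    extractor = VisibleTextExtractor()
    extractor.feed(html)
    extractor.close()
    return re.findall(r"[a-z0-9']+", " ".join(extractor.parts).lower())


def build_index(triggers):
    """Create the trigger table and a second table keyed by first word."""
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE triggers (phrase TEXT PRIMARY KEY)")
    db.execute("CREATE TABLE first_words (word TEXT PRIMARY KEY, max_words INTEGER)")
    longest = {}  # first word -> length of the longest trigger starting with it
    for phrase in triggers:
        words = phrase.lower().split()
        if not words:
            continue
        longest[words[0]] = max(longest.get(words[0], 0), len(words))
        db.execute("INSERT OR IGNORE INTO triggers VALUES (?)", (" ".join(words),))
    db.executemany("INSERT INTO first_words VALUES (?, ?)", longest.items())
    return db


def find_triggers(db, tokens):
    """Check multi-word phrases only where a token starts a known trigger."""
    hits = []
    for i, token in enumerate(tokens):
        row = db.execute(
            "SELECT max_words FROM first_words WHERE word = ?", (token,)
        ).fetchone()
        if row is None:
            continue  # most tokens stop here after one lookup
        for length in range(1, row[0] + 1):
            if i + length > len(tokens):
                break
            phrase = " ".join(tokens[i:i + length])
            found = db.execute(
                "SELECT 1 FROM triggers WHERE phrase = ?", (phrase,)
            ).fetchone()
            if found:
                hits.append(phrase)
    return hits
```

In this sketch a token that starts no trigger costs one query, while a token that does start one costs additional queries only up to the length of the longest trigger beginning with that word.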

TechnikEmpire commented 6 years ago

Was something wrong with the FastHtmlTextExtractor, which already did all of this text extraction? There was also already a policy-based system for configuring the scanning window size, which is where the performance bottleneck is. I'm also curious how 2 queries per trigger lookup is 3x faster than 1 query. Lastly, I saw a commit, either here or in the merge of 145, that will stop JSON from being filtered.

TechnikEmpire commented 6 years ago

Oh yeah, and the performance impact you guys are seeing comes down entirely to a bug that's currently open with Microsoft for the HttpClient class in .NET Standard, which the engine uses for upstream connections. In many cases, connecting and parsing HTML responses can take several times longer than with the full .NET Framework version. There's also some impact from built-in Microsoft telemetry, related to this issue, that needs to be disabled. It's all slated to be fixed for the 2.1 release.