Improve list parsing error handling

0xERR0R commented 1 year ago

Currently, if a single list file (or inline definition) contains more than 5 errors (list_cache.go#maxErrorsPerFile), the parser stops the import process. This means the file ends up only partially in the cache (all entries parsed until the 5th error occurs).

I think the idea was to stop processing if the input file contains only "garbage" (for example, the user references an HTML page instead of a plain text file).

If a user imports big files (> 1M entries), a threshold of 5 errors can be reached very quickly. In this case, only a part of the file is in the cache and the user may not notice it.

I think we can improve it, some ideas here:

Threshold value should depend on the file's line count (maybe as a percentage? 5% of the row count, but min. 5?)
Log all errors for a single list together as a summary like: List XXX import finished in XX ms, XXX rows imported, XXX rows ignored, errors: (XXX list of errors). Currently we log each error as a separate log entry
New Prometheus metrics for ignored/malformed import list rows, so the user can define alerts or monitor them
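
A minimal sketch of the first two ideas, for discussion only. The names `errorThreshold`, `minErrors` and `importSummary` are hypothetical and do not exist in blocky; the 5% / min. 5 rule and the summary wording are taken from the suggestions above. The Prometheus metric is left out.

```go
package main

import (
	"fmt"
	"strings"
	"time"
)

// minErrors mirrors the current fixed limit of 5 (assumption: kept as the lower bound).
const minErrors = 5

// errorThreshold returns how many errors are tolerated for a file with
// lineCount lines: 5% of the line count, but never fewer than minErrors.
func errorThreshold(lineCount int) int {
	t := lineCount * 5 / 100
	if t < minErrors {
		return minErrors
	}
	return t
}

// importSummary is a hypothetical container for the result of one list import.
type importSummary struct {
	List     string
	Imported int
	Ignored  int
	Errors   []string
	Duration time.Duration
}

// String renders the whole import result as a single log line.
func (s importSummary) String() string {
	return fmt.Sprintf(
		"List %s import finished in %d ms, %d rows imported, %d rows ignored, errors: (%s)",
		s.List, s.Duration.Milliseconds(), s.Imported, s.Ignored,
		strings.Join(s.Errors, "; "),
	)
}

func main() {
	fmt.Println(errorThreshold(40))        // small file: falls back to the minimum
	fmt.Println(errorThreshold(1_000_000)) // large file: 5% of the line count

	s := importSummary{
		List:     "ads",
		Imported: 998,
		Ignored:  2,
		Errors:   []string{"line 12: invalid entry", "line 40: invalid entry"},
		Duration: 15 * time.Millisecond,
	}
	fmt.Println(s)
}
```

With a rule like this, a 1M-entry list would tolerate up to 50,000 bad rows, while a small file full of HTML would still hit the minimum of 5 quickly.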

github-actions[bot] commented 1 year ago

This issue is stale because it has been open 90 days with no activity. Remove stale label or comment or this will be closed in 5 days.

MaKeG0 commented 1 year ago

Hello,

I've encountered a similar issue due to the current parsing error limitation in Blocky. I'm running Blocky in a container within my router, following the setup guide provided here: https://xaizone.eu/post/setup-blocky-on-mikrotik-routeros/

The setup works well, and I've successfully integrated blocklists from the following sources:

However, I've run into problems when trying to incorporate lists from these sources:

These lists were found on the uBlock project page: https://github.com/gorhill/uBlock

The issue seems to stem from the parsing error limit in Blocky. If the maximum number of allowed parsing errors could be increased, or even better, made configurable, it would likely resolve this issue.

Additionally, it would be beneficial to have a default parsing mechanism that skips HTML tags. This would be particularly useful for lists generated by PHP pages, which often include HTML tags that contribute to parsing errors.

I believe these enhancements would significantly improve the flexibility and robustness of Blocky. I look forward to hearing your thoughts on this.

ThinkChaos commented 1 year ago

It is configurable in the dev version already, next release will have it :)
See #986.

For easylist.txt, it is in adblock format which is not supported by blocky ATM. See #971 and #950.

For the second one you're using an HTML page, which I guess you know based on the rest of the comment. There's a plaintext version available though: Using the plaintext version of the hosts format (link) works fine for me.
Ideally you'd use the version that's not in hosts format but just a plain domain list, because that avoids having 127.0.0.1 on each line, which saves download size and memory during list loading. That one also works fine (link).
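
To illustrate the difference between the two formats, here is a rough sketch of how a line from either one reduces to the same domain. `domainFromLine` is a hypothetical helper, not blocky's actual parser, and it assumes at most one hostname per hosts line.

```go
package main

import (
	"fmt"
	"strings"
)

// domainFromLine extracts the domain from one list line (hypothetical helper).
// It accepts "domain" (plain list) and "ip domain" (hosts format) and
// returns "" for blank lines, comments, and anything else.
func domainFromLine(line string) string {
	// Drop trailing or full-line comments.
	if i := strings.IndexByte(line, '#'); i >= 0 {
		line = line[:i]
	}

	fields := strings.Fields(line)
	switch len(fields) {
	case 1: // plain domain list
		return fields[0]
	case 2: // hosts format: the address comes first and is discarded
		return fields[1]
	default:
		return ""
	}
}

func main() {
	fmt.Println(domainFromLine("127.0.0.1 ads.example.com")) // hosts format
	fmt.Println(domainFromLine("ads.example.com"))           // plain domain list
}
```

Both calls yield the same domain; the hosts line just carries an extra address field per entry that has to be downloaded and then thrown away.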

> Additionally, it would be beneficial to have a default parsing mechanism that skips HTML tags. This would be particularly useful for lists generated by PHP pages, which often include HTML tags that contribute to parsing errors.

If you have any links to HTML-only lists, please share them in a new issue, but I've never seen that personally and AFAIK no other software that uses these lists expects HTML.

MaKeG0 commented 1 year ago

@ThinkChaos I appreciate your help, I totally missed the plain text box option, I guess I was too focused on other parts of the page. I'm glad to hear that the improvements are in progress and already committed, that's awesome news for me and the project.

Keep up the good work!

0xERR0R / blocky

Improve list parsing error handling #966