brave-experiments / ad-block

Ad block engine used in the Brave browser for ABP filter syntax based lists like EasyList.
https://www.brave.com
Mozilla Public License 2.0
239 stars 95 forks source link

About AdBlockClient::parse() function execution time #158

Closed ymh8416 closed 5 years ago

ymh8416 commented 5 years ago

edit bbondy

See below, parse is not meant to be used from clients, just from server CI: https://github.com/brave/ad-block/issues/158#issuecomment-464350923


I used your ad-block project in my android project. I found that using AdBlockClient::parse() to parse a 3M Abp rule file takes 80 seconds. Can this be optimized? thanks very much

daoan1412 commented 5 years ago

I think u can create a background service to parse data.

daoan1412 commented 5 years ago

Instead create a background service. You can parse data, serialize data to a file then deserialize data to match faster.

#include "ad_block_client.h"
#include <algorithm>
#include <iostream>
#include <fstream>
#include <sstream>
#include <iostream>
#include <string>

using namespace std;

string getFileContents(const char *filename)
{
  ifstream in(filename, ios::in);
  if (in) {
    ostringstream contents;
    contents << in.rdbuf();
    in.close();
    return(contents.str());
  }
  throw(errno);
}

void writeFile(const char *filename, const char *buffer, int length)
{
  ofstream outFile(filename, ios::out | ios::binary);
  if (outFile) {
    outFile.write(buffer, length);
    outFile.close();
    return;
  }
  throw(errno);
}

int main(int argc, char**argv) {
  std::string &&easyListTxt = getFileContents("./test/data/easylist.txt");
  const char *urlsToCheck[] = {
    // ||pagead2.googlesyndication.com^$~object-subrequest
    "http://pagead2.googlesyndication.com/pagead/show_ads.js",
    // Should be blocked by: ||googlesyndication.com/safeframe/$third-party
    "http://tpc.googlesyndication.com/safeframe/1-0-2/html/container.html",
    // Should be blocked by: ||googletagservices.com/tag/js/gpt_$third-party
    "http://www.googletagservices.com/tag/js/gpt_mobile.js",
    // Shouldn't be blocked
    "http://www.brianbondy.com"
  };

  // This is the site who's URLs are being checked, not the domain of the URL being checked.
  const char *currentPageDomain = "slashdot.org";

  // Parse easylist
  AdBlockClient client;
  client.parse(easyListTxt.c_str());

  // Do the checks
  std::for_each(urlsToCheck, urlsToCheck + sizeof(urlsToCheck) / sizeof(urlsToCheck[0]), [&client, currentPageDomain](std::string const &urlToCheck) {
    if (client.matches(urlToCheck.c_str(), FONoFilterOption, currentPageDomain)) {
      cout << urlToCheck << ": You should block this URL!" << endl;
    } else {
      cout << urlToCheck << ": You should NOT block this URL!" << endl;
    }
  });

  int size;
  // This buffer is allocate on the heap, you must call delete[] when you're done using it.
  char *buffer = client.serialize(size);
  writeFile("./ABPFilterParserData.dat", buffer, size);

  AdBlockClient client2;
  // Deserialize uses the buffer directly for subsequent matches, do not free until all matches are done.
  client2.deserialize(buffer);
  // Prints the same as client.matches would
  std::for_each(urlsToCheck, urlsToCheck + sizeof(urlsToCheck) / sizeof(urlsToCheck[0]), [&client2, currentPageDomain](std::string const &urlToCheck) {
    if (client2.matches(urlToCheck.c_str(), FONoFilterOption, currentPageDomain)) {
      cout << urlToCheck << ": You should block this URL!" << endl;
    } else {
      cout << urlToCheck << ": You should NOT block this URL!" << endl;
    }
  });
  delete[] buffer;
  return 0;
}
ymh8416 commented 5 years ago

I am very sorry, I understand what you mean. I am now in the background to execute AdBlockClient::parse(), but 80 seconds after the program starts, the ad blocking will start working normally. This time is a bit long.

bbondy commented 5 years ago

Note that parse intentionally does as much work as possible. Clients are not meant to parse lists directly. They should instead serialize an already parsed list and then the clients should use the deserialized list.

ymh8416 commented 5 years ago

Note that parse intentionally does as much work as possible. Clients are not meant to parse lists directly. They should instead serialize an already parsed list and then the clients should use the deserialized list.

Thank you very much, I understand.