daijro / browserforge

🎭 Intelligent browser header & fingerprint generator
https://pypi.org/project/browserforge
Apache License 2.0

Make orjson optional #6

Closed: deedy5 closed this issue 5 months ago

deedy5 commented 5 months ago

Can you please make orjson optional?

daijro commented 5 months ago

Would it be better if I used msgspec instead? It seems to benchmark faster than orjson and supports many more platforms, including Android arm64.

deedy5 commented 5 months ago

msgspec makes sense if you need to validate data or extract specific fields from JSON, but for that you need to define a msgspec.Struct data structure. In our case that generally doesn't apply, so msgspec won't be faster than orjson.

There are some benchmarks here; you can try them on the downloaded JSON files to test the performance: https://gist.github.com/jcrist/80b84817e9c53a63222bd905aa607b43

daijro commented 5 months ago

msgspec can also decode JSON into plain dictionaries without a msgspec.Struct, using msgspec.json.decode. Here is my benchmark for reference:

Benchmarks:
- stdlib json: 0.60 ± 0.03
- orjson: 0.29 ± 0.03
- simdjson: 0.49 ± 0.03
- msgspec-dict: 0.27 ± 0.01  # Not using msgspec.Struct
- msgspec-struct: 0.21 ± 0.01  # Using msgspec.Struct
- pydantic-v2: 1.82 ± 0.51
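
A comparison of this kind can be sketched roughly as follows. This is only an illustration, not the script that produced the numbers above: the payload is a made-up stand-in for a real data file, and orjson and msgspec are timed only if they happen to be installed.

```python
import json
import timeit

# Hypothetical payload standing in for a real data file (an assumption).
payload = json.dumps(
    {"browsers": [{"name": f"browser-{i}", "version": i} for i in range(1000)]}
).encode()

# The stdlib decoder is always available; the others are added if importable.
decoders = {"stdlib json": json.loads}

try:
    import orjson
    decoders["orjson"] = orjson.loads
except ImportError:
    pass

try:
    import msgspec
    # Decoding without a type argument returns plain dicts and lists.
    decoders["msgspec-dict"] = msgspec.json.decode
except ImportError:
    pass


def bench(decoders, data, repeat=3, number=10):
    """Return the best observed seconds per decode for each decoder."""
    results = {}
    for name, decode in decoders.items():
        timings = timeit.repeat(lambda: decode(data), repeat=repeat, number=number)
        results[name] = min(timings) / number
    return results


results = bench(decoders, payload)
for name, seconds in sorted(results.items(), key=lambda item: item[1]):
    print(f"{name}: {seconds * 1e6:.1f} microseconds per decode")
```

Absolute numbers depend heavily on the payload and the machine, so only the relative ordering on your own data files is meaningful.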

BrowserForge's runtime speed relies pretty heavily on JSON encoding/decoding, so I worry that requiring users to install a specific extra, like pip install browserforge[orjson], adds an extra layer of complexity for something so crucial.

Would you be open to considering an alternative to orjson?

deedy5 commented 5 months ago

HeaderGenerator

from browserforge.headers import HeaderGenerator
headers = HeaderGenerator()
headers.generate()

orjson

Total time: 0.00861115 s
File: /browserforge/browserforge/bayesian_network.py
Function: extract_json at line 269

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
   269                                           @profile
   270                                           def extract_json(path: Path) -> dict:
   271                                               """
   272                                               Unzips a zip file if the path points to a zip file, otherwise directly loads a JSON file.
   273                                           
   274                                               Parameters:
   275                                                   path: The path to the zip file or JSON file.
   276                                           
   277                                               Returns:
   278                                                   A dictionary representing the JSON content.
   279                                               """
   280         2          0.7      0.4      0.0      network_definition = {}
   281                                           
   282         2         10.5      5.3      0.1      if path.suffix != '.zip':
   283                                                   # Directly load the JSON file
   284                                                   with open(path, 'rb') as file:
   285                                                       return orjson.loads(file.read())
   286                                               # Unzip the file and load the JSON content
   287         4        567.2    141.8      6.6      with zipfile.ZipFile(path, 'r') as zf:
   288         2          4.1      2.0      0.0          for filename in zf.namelist():
   289         2          1.1      0.6      0.0              if filename.endswith('.json'):
   290         4        118.8     29.7      1.4                  with zf.open(filename) as file:
   291         2       7907.4   3953.7     91.8                      network_definition = orjson.loads(file.read())
   292         2          1.0      0.5      0.0                      break  # Assuming only one JSON file is needed
   293                                           
   294         2          0.3      0.1      0.0      return network_definition

json

Total time: 0.0119073 s
File: /browserforge/browserforge/bayesian_network.py
Function: extract_json at line 269

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
   269                                           @profile
   270                                           def extract_json(path: Path) -> dict:
   271                                               """
   272                                               Unzips a zip file if the path points to a zip file, otherwise directly loads a JSON file.
   273                                           
   274                                               Parameters:
   275                                                   path: The path to the zip file or JSON file.
   276                                           
   277                                               Returns:
   278                                                   A dictionary representing the JSON content.
   279                                               """
   280         2          0.7      0.4      0.0      network_definition = {}
   281                                           
   282         2         10.7      5.3      0.1      if path.suffix != '.zip':
   283                                                   # Directly load the JSON file
   284                                                   with open(path, 'rb') as file:
   285                                                       return json.loads(file.read())
   286                                               # Unzip the file and load the JSON content
   287         4        736.5    184.1      6.2      with zipfile.ZipFile(path, 'r') as zf:
   288         2          3.7      1.9      0.0          for filename in zf.namelist():
   289         2          1.0      0.5      0.0              if filename.endswith('.json'):
   290         4        108.6     27.2      0.9                  with zf.open(filename) as file:
   291         2      11045.2   5522.6     92.8                      network_definition = json.loads(file.read())
   292         2          0.5      0.3      0.0                      break  # Assuming only one JSON file is needed
   293                                           
   294         2          0.3      0.2      0.0      return network_definition

FingerprintGenerator

from browserforge.fingerprints import FingerprintGenerator
fingerprints = FingerprintGenerator()
fingerprints.generate()

orjson

Total time: 0.0538554 s
File: /browserforge/browserforge/bayesian_network.py
Function: extract_json at line 269

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
   269                                           @profile
   270                                           def extract_json(path: Path) -> dict:
   271                                               """
   272                                               Unzips a zip file if the path points to a zip file, otherwise directly loads a JSON file.
   273                                           
   274                                               Parameters:
   275                                                   path: The path to the zip file or JSON file.
   276                                           
   277                                               Returns:
   278                                                   A dictionary representing the JSON content.
   279                                               """
   280         3          1.9      0.6      0.0      network_definition = {}
   281                                           
   282         3         16.4      5.5      0.0      if path.suffix != '.zip':
   283                                                   # Directly load the JSON file
   284                                                   with open(path, 'rb') as file:
   285                                                       return orjson.loads(file.read())
   286                                               # Unzip the file and load the JSON content
   287         6        925.1    154.2      1.7      with zipfile.ZipFile(path, 'r') as zf:
   288         3          6.1      2.0      0.0          for filename in zf.namelist():
   289         3          1.3      0.4      0.0              if filename.endswith('.json'):
   290         6        176.5     29.4      0.3                  with zf.open(filename) as file:
   291         3      52725.6  17575.2     97.9                      network_definition = orjson.loads(file.read())
   292         3          2.0      0.7      0.0                      break  # Assuming only one JSON file is needed
   293                                           
   294         3          0.5      0.1      0.0      return network_definition

json

Total time: 0.118536 s
File: /browserforge/browserforge/bayesian_network.py
Function: extract_json at line 269

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
   269                                           @profile
   270                                           def extract_json(path: Path) -> dict:
   271                                               """
   272                                               Unzips a zip file if the path points to a zip file, otherwise directly loads a JSON file.
   273                                           
   274                                               Parameters:
   275                                                   path: The path to the zip file or JSON file.
   276                                           
   277                                               Returns:
   278                                                   A dictionary representing the JSON content.
   279                                               """
   280         3          1.8      0.6      0.0      network_definition = {}
   281                                           
   282         3         16.1      5.4      0.0      if path.suffix != '.zip':
   283                                                   # Directly load the JSON file
   284                                                   with open(path, 'rb') as file:
   285                                                       return json.loads(file.read())
   286                                               # Unzip the file and load the JSON content
   287         6       1163.3    193.9      1.0      with zipfile.ZipFile(path, 'r') as zf:
   288         3          5.6      1.9      0.0          for filename in zf.namelist():
   289         3          1.6      0.5      0.0              if filename.endswith('.json'):
   290         6        166.5     27.8      0.1                  with zf.open(filename) as file:
   291         3     117179.2  39059.7     98.9                      network_definition = json.loads(file.read())
   292         3          1.6      0.5      0.0                      break  # Assuming only one JSON file is needed
   293                                           
   294         3          0.5      0.2      0.0      return network_definition

deedy5 commented 5 months ago

extract_json is executed only once, when the class is initialized. In the other functions that call json.loads, the execution time is very small and there is almost no difference between the two libraries.

So I don't think that even the standard json module will affect performance, because the JSON files are loaded only once, when the class is initialized.
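
To put rough numbers on this, the "Total time" values from the extract_json profiles above can be compared directly. The snippet below only does arithmetic on those four reported totals; they come from single profiled runs, so treat the result as indicative rather than precise.

```python
# "Total time" values (seconds) copied from the extract_json profiles above.
totals = {
    "HeaderGenerator": {"orjson": 0.00861115, "json": 0.0119073},
    "FingerprintGenerator": {"orjson": 0.0538554, "json": 0.118536},
}

summary = {}
for name, t in totals.items():
    ratio = t["json"] / t["orjson"]              # how many times slower stdlib json is
    extra_ms = (t["json"] - t["orjson"]) * 1000  # one-time extra cost in milliseconds
    summary[name] = (ratio, extra_ms)
    print(f"{name}: json is {ratio:.2f}x slower, costing {extra_ms:.1f} ms extra once")
```

In these runs the standard library is about 1.4x slower for HeaderGenerator and about 2.2x slower for FingerprintGenerator, but the absolute one-time cost is only about 3 ms and 65 ms respectively.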

~If, of course, you initialize the class every time, the difference is noticeable. I mean, you shouldn't do that:~ The JSON files are stored in class variables, so even if you do it this way, they are loaded only once.

from browserforge.fingerprints import FingerprintGenerator
for i in range(10):
    FingerprintGenerator().generate()
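
The reason repeated instantiation stays cheap can be shown with a minimal sketch. The class and loader below are hypothetical stand-ins, not BrowserForge's actual code; they only demonstrate that a class-level attribute is evaluated once, when the class is created, no matter how many instances are built afterwards.

```python
import json

load_calls = 0  # counts how often the (hypothetical) loader actually runs


def load_definition():
    """Hypothetical loader standing in for reading a JSON data file."""
    global load_calls
    load_calls += 1
    return json.loads('{"nodes": ["userAgent", "platform"]}')


class SketchGenerator:
    # Class variable: evaluated once, when the class body is executed.
    network = load_definition()

    def generate(self):
        # Every instance reads the shared, already-parsed data.
        return self.network["nodes"][0]


for _ in range(10):
    SketchGenerator().generate()

print(load_calls)  # 1
```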

deedy5 commented 5 months ago

HeaderGenerator - small functions

orjson

Total time: 0.000210845 s
File: /browserforge/browserforge/headers/generator.py
Function: _load_headers_order at line 424

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
   424                                               @profile
   425                                               def _load_headers_order(self) -> Dict[str, List[str]]:
   426                                                   """
   427                                                   Loads the headers order from the headers-order.json file.
   428                                           
   429                                                   Returns:
   430                                                       Dict[str, List[str]]: Dictionary of headers order for each browser.
   431                                                   """
   432         1         10.8     10.8      5.1          headers_order_path = DATA_DIR / "headers-order.json"
   433         1        200.0    200.0     94.9          return orjson.loads(headers_order_path.read_bytes())

Total time: 0.000840155 s
File: /browserforge/browserforge/headers/generator.py
Function: _load_unique_browsers at line 434

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
   434                                               @profile
   435                                               def _load_unique_browsers(self) -> List[HttpBrowserObject]:
   436                                                   """
   437                                                   Loads the unique browsers from the browser-helper-file.json file.
   438                                           
   439                                                   Returns:
   440                                                       List[HttpBrowserObject]: List of HttpBrowserObject instances.
   441                                                   """
   442         1         33.6     33.6      4.0          browser_helper_path = DATA_DIR / 'browser-helper-file.json'
   443         1        308.7    308.7     36.7          unique_browser_strings = orjson.loads(browser_helper_path.read_bytes())
   444         2        497.8    248.9     59.3          return [
   445                                                       self._prepare_http_browser_object(browser_str)
   446         1          0.1      0.1      0.0              for browser_str in unique_browser_strings
   447                                                       if browser_str != MISSING_VALUE_DATASET_TOKEN
   448                                                   ]

json

Total time: 0.000167574 s
File: /browserforge/browserforge/headers/generator.py
Function: _load_headers_order at line 425

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
   425                                               @profile
   426                                               def _load_headers_order(self) -> Dict[str, List[str]]:
   427                                                   """
   428                                                   Loads the headers order from the headers-order.json file.
   429                                           
   430                                                   Returns:
   431                                                       Dict[str, List[str]]: Dictionary of headers order for each browser.
   432                                                   """
   433         1         11.8     11.8      7.0          headers_order_path = DATA_DIR / "headers-order.json"
   434         1        155.8    155.8     93.0          return json.loads(headers_order_path.read_bytes())

Total time: 0.000848061 s
File: /browserforge/browserforge/headers/generator.py
Function: _load_unique_browsers at line 435

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
   435                                               @profile
   436                                               def _load_unique_browsers(self) -> List[HttpBrowserObject]:
   437                                                   """
   438                                                   Loads the unique browsers from the browser-helper-file.json file.
   439                                           
   440                                                   Returns:
   441                                                       List[HttpBrowserObject]: List of HttpBrowserObject instances.
   442                                                   """
   443         1         33.4     33.4      3.9          browser_helper_path = DATA_DIR / 'browser-helper-file.json'
   444         1        312.1    312.1     36.8          unique_browser_strings = json.loads(browser_helper_path.read_bytes())
   445         2        502.4    251.2     59.2          return [
   446                                                       self._prepare_http_browser_object(browser_str)
   447         1          0.1      0.1      0.0              for browser_str in unique_browser_strings
   448                                                       if browser_str != MISSING_VALUE_DATASET_TOKEN
   449                                                   ]

daijro commented 5 months ago

That's fair, thanks for the benchmarks! I'll go ahead and make it optional.
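
For readers wondering what "optional" usually looks like in practice, one common pattern is a guarded import with a standard-library fallback. The sketch below is an assumption about the general approach, not the change that was actually made in BrowserForge; the names loads and BACKEND are illustrative.

```python
import json

try:
    import orjson  # optional accelerator; used only if it is installed

    def loads(data):
        """Decode JSON bytes or str with orjson."""
        return orjson.loads(data)

    BACKEND = "orjson"
except ImportError:

    def loads(data):
        """Decode JSON bytes or str with the standard library."""
        return json.loads(data)

    BACKEND = "json"

print(BACKEND, loads(b'{"ok": true}'))
```

Call sites then use loads(...) everywhere, so the rest of the code does not need to know which backend is active.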