cdgriffith / puremagic

Pure python implementation of identifying files based off their magic numbers
MIT License
158 stars, 34 forks

Faster than a libmagic wrapper? #95

Closed by mara004 3 weeks ago

mara004 commented 1 month ago

The Readme claims:

> Advantages over using a wrapper for 'file' or 'libmagic':
>
>   • Faster

Do you have any actual evidence for that (a reproducible benchmark or similar)? Pure-Python re-implementations are typically slower than C library bindings, unless the pure-Python package uses significantly more efficient algorithms or the binding involves a lot of object transfer or FFI overhead.

cdgriffith commented 3 weeks ago

Here's a quick test:

python-magic (libmagic wrapper)

import magic
print(magic.from_buffer("#!/usr/bin/env python"))
$ time python speed_test_pm.py
a /usr/bin/env python script, ASCII text executable, with no line terminators

real    0m0.108s
user    0m0.018s
sys     0m0.008s

puremagic

import puremagic
print(puremagic.from_string("#!/usr/bin/env python"))
$ time python speed_test_pure.py
.py

real    0m0.068s
user    0m0.015s
sys     0m0.000s

mara004 commented 3 weeks ago

For one thing, a single invocation isn't exactly reliable. For another, the above always includes import-time tasks, where libmagic is at a disadvantage because it has to locate and load the DLL.

A more reliable benchmark would be needed to actually support the "Faster" claim.
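A minimal sketch of what such a benchmark could look like, assuming the goal is to time only the detection calls, repeated several times, with imports done beforehand. The names `bench` and `stub_detect` and the sample inputs are hypothetical and exist only so the sketch runs without either library installed:

```python
import statistics
import time
from typing import Callable, Iterable


def bench(detect: Callable[[bytes], object], samples: Iterable[bytes],
          repeats: int = 5) -> dict:
    """Time `detect` over all samples, `repeats` times; return stats in seconds."""
    samples = list(samples)
    detect(samples[0])  # warm-up call so one-time setup is not counted
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        for sample in samples:
            detect(sample)
        timings.append(time.perf_counter() - start)
    return {
        "best": min(timings),
        "median": statistics.median(timings),
        "per_call": min(timings) / len(samples),
    }


def stub_detect(data: bytes) -> str:
    """Hypothetical stand-in for a real detector; swap in the call under test."""
    return ".py" if data.startswith(b"#!") else "unknown"


# Hypothetical inputs; a real comparison would use a varied corpus of files.
samples = [b"#!/usr/bin/env python", b"%PDF-1.7", b"\x89PNG\r\n\x1a\n"]
stats = bench(stub_detect, samples, repeats=5)
print(stats)
```

To compare the two libraries, `stub_detect` would be replaced by `puremagic.from_string` and `magic.from_buffer` respectively, each imported before `bench` is called so that import cost stays out of the measurement.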

cdgriffith commented 3 weeks ago

The whole point is it's faster because it doesn't need to load in an external library? That's the point of the claim.

mara004 commented 3 weeks ago

> The whole point is it's faster because it doesn't need to load in an external library? That's the point of the claim.

Well, that should be clarified in the Readme (e.g. "Faster to import" rather than just "Faster"). I took it to mean the from_*(...) calls would be claimed faster. 😅 If only importing is supposed to be faster, that will be true, but the primary concern is runtime, not startup time. The 0.04s import-time difference may not be relevant to most users.
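Import time can be measured on its own, separately from call time. The following is a rough sketch, assuming each measurement should run in a fresh interpreter so that already-loaded modules do not skew the result. The helper name `import_time` is hypothetical, and `json` is used only as a stand-in module so the example runs anywhere:

```python
import subprocess
import sys


def import_time(module: str, runs: int = 5) -> float:
    """Return the best-of-`runs` import time in seconds, each in a fresh interpreter."""
    snippet = (
        "import time; t = time.perf_counter(); "
        f"import {module}; print(time.perf_counter() - t)"
    )
    results = []
    for _ in range(runs):
        out = subprocess.run(
            [sys.executable, "-c", snippet],
            capture_output=True, text=True, check=True,
        )
        # Take the last line in case the module prints something on import.
        results.append(float(out.stdout.strip().splitlines()[-1]))
    return min(results)


# `json` is only a stand-in; substitute "puremagic" or "magic" where installed.
print(f"json import: {import_time('json', runs=3):.4f}s")
```

CPython's `-X importtime` option gives a per-module breakdown of the same cost if more detail is needed.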

cdgriffith commented 3 weeks ago

Yes, I can add that note in the Readme!

cdgriffith commented 3 weeks ago

I decided to test this further because it was bothering me: I knew puremagic was faster in the past (~10 years ago).

Testing on develop branch for 1.27 using just my computer's downloads folder.

puremagic Test File

```python
import time
from pathlib import Path
import tracemalloc

download_files = list(x for x in Path("Downloads").glob("*") if x.is_file())

tracemalloc.start()
import_time_start = time.perf_counter()
import puremagic
print("Import time:", time.perf_counter() - import_time_start)
current, peak = tracemalloc.get_traced_memory()
print(f"Current memory usage: {current / 10**6}MB")

print(f"\nTesting {len(download_files)} files")
download_start_time = time.perf_counter()
unknown_results_types = set()
unknown_total = 0

for file in download_files:
    try:
        puremagic.from_file(file)
    except puremagic.PureError:
        unknown_results_types.add(file.suffix.lower() if file.suffix else file.stem)
        unknown_total += 1
    except Exception as e:
        print(f"Error: {file} - {e}")

print("\nDownload file time:", time.perf_counter() - download_start_time)
print(f"Unknown results types: {unknown_results_types}")
print(f"Unknown total: {unknown_total}")
current, peak = tracemalloc.get_traced_memory()
print(f"Current memory usage: {current / 10**6}MB")
print(f"\nPeak memory usage: {peak / 10**6}MB")
tracemalloc.stop()
```

python-magic Test File

```python
import time
from pathlib import Path
import tracemalloc

download_files = list(x for x in Path("Downloads").glob("*") if x.is_file())

tracemalloc.start()
import_time_start = time.perf_counter()
import magic
print("Import time:", time.perf_counter() - import_time_start)
current, peak = tracemalloc.get_traced_memory()
print(f"Current memory usage: {current / 10**6}MB")

print(f"\nTesting {len(download_files)} files")
download_start_time = time.perf_counter()
unknown_results_types = set()
unknown_total = 0

for file in download_files:
    try:
        result = magic.from_file(file)
    except Exception as e:
        print(f"Error: {file} - {e}")
    else:
        if result in ("ASCII text", "data"):
            unknown_results_types.add(file.suffix.lower() if file.suffix else file.stem)
            unknown_total += 1

print("Download file time:", time.perf_counter() - download_start_time)
print(f"Unknown results types: {unknown_results_types}")
print(f"Unknown total: {unknown_total}")
current, peak = tracemalloc.get_traced_memory()
print(f"Current memory usage: {current / 10**6}MB")
print(f"\nPeak memory usage: {peak / 10**6}MB")
tracemalloc.stop()
```

puremagic results

$ time python speed_test_pure.py

Import time: 0.030981435003923252
Current memory usage: 1.009179MB

Testing 1131 files

Download file time: 2.9169464290025644
Unknown results types: {'.ovpn', '.docx', '.img', 'README'}
Unknown total: 4
Current memory usage: 1.022658MB

Peak memory usage: 1.355427MB

real    0m3.892s
user    0m1.001s
sys     0m0.134s

python-magic results

$ time python speed_test_pm.py

Import time: 0.061987199005670846
Current memory usage: 1.83419MB

Testing 1131 files
Download file time: 4.262647944997298
Unknown results types: {'.docx', '.vba', '.y4m', '.txt', '.stl', '.pem', '.json', '.bvr', 'README', '.ovpn', '.log', '.p7b'}
Unknown total: 30
Current memory usage: 1.849338MB

Peak memory usage: 1.88779MB

real    0m5.383s
user    0m0.301s
sys     0m0.290s

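In this instance, puremagic was faster over the file pass (about 2.9 s vs 4.3 s), showed lower traced memory usage (peak about 1.36 MB vs 1.89 MB), and left fewer files unidentified (4 vs 30).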

I also made sure that the overhead for checking unknown types was the same in both cases, and that removing it produced the same speed differences.
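For a rough sense of scale, the two "Download file time" figures reported above can be turned into throughput and a ratio. This is plain arithmetic on a single run on one machine, so it is an illustration and not a general result:

```python
# Figures copied from the two runs reported above.
files = 1131
pure_seconds = 2.9169464290025644      # puremagic "Download file time"
magic_seconds = 4.262647944997298      # python-magic "Download file time"

pure_rate = files / pure_seconds       # files identified per second
magic_rate = files / magic_seconds
ratio = magic_seconds / pure_seconds   # how many times longer python-magic took

print(f"puremagic:    {pure_rate:.0f} files/s")
print(f"python-magic: {magic_rate:.0f} files/s")
print(f"ratio:        {ratio:.2f}x")
```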

The only time I saw the python-magic wrapper come out faster was when doing 1000+ iterations over a small test string. I don't have 1000+ different strings to test with, so I don't know whether that's because it really is faster or because it just cached the results. That has me thinking I should maybe add an LRU cache with a configurable size.

Overall, this gave me lots to think about, and I'm happy with my findings. Thanks for the inspiration @mara004

Going to keep it as just "Faster" in the Readme, now with proof :tm: