cdgriffith / puremagic

Pure python implementation of identifying files based off their magic numbers
MIT License

Speed Improvements #71

Open cdgriffith opened 3 months ago

cdgriffith commented 3 months ago

Talk about ideas to make PureMagic faster!

Initial thoughts:

How much does JSON slow us down? (Putting the data directly in code looks to be a large speedup for repeated initialization, possibly 30%.)

How much does iteration vs. a graph slow us down?

Are namedtuples the fastest way to store the data internally?
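A rough way to probe the first question is to time parsing a JSON string against evaluating pre-compiled Python source that holds the same data. This is only a sketch under stated assumptions: `ROWS` is synthetic stand-in data, not puremagic's real signature table, and evaluating a pre-compiled code object only approximates what happens when a module is loaded from a cached `.pyc` file.

```python
import json
import timeit

# Synthetic stand-in for a signature table (NOT puremagic's real data).
ROWS = [
    [f"{i:08x}", 0, f".e{i}", "application/octet-stream", f"format {i}"]
    for i in range(2000)
]

JSON_TEXT = json.dumps(ROWS)

# Compile the literal once, roughly as the .pyc cache would for a module.
CODE = compile(repr(ROWS), "<signatures>", "eval")


def load_from_json():
    """Parse the table from its JSON text."""
    return json.loads(JSON_TEXT)


def load_from_code():
    """Rebuild the table by evaluating the pre-compiled literal."""
    return eval(CODE)


if __name__ == "__main__":
    runs = 20
    json_time = timeit.timeit(load_from_json, number=runs)
    code_time = timeit.timeit(load_from_code, number=runs)
    print(f"json.loads:     {json_time / runs:.6f} s per load")
    print(f"compiled code:  {code_time / runs:.6f} s per load")
```

The numbers depend on the machine and the Python version, so they are only meaningful relative to each other on the same system.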

Optimizations in progress:

cdgriffith commented 3 months ago

Quick test script that runs a lookup 1000 times to compare speed differences. Absolute times will vary by computer, but results from the same machine can always be compared against each other.

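# Record the start time in seconds with nanosecond precision (%N needs GNU date)
start=$( date +"%s.%N" )

# Launch a fresh Python interpreter for each of the 1000 lookups
for _ in $(seq 1 1000);
do
  python3 -m puremagic test/resources/media/test.iso > /dev/null
done

end=$( date +"%s.%N" )

# Print the elapsed wall-clock time in seconds
python3 -c "print(${end} - ${start})"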

cdgriffith commented 3 months ago

Tested the difference between using named tuples and classes with slots for the PureMagic internal structure.

class PureMagic:
    __slots__ = ["byte_match", "offset", "extension", "mime_type", "name"]

    def __init__(self, byte_match, offset, extension, mime_type, name):
        self.byte_match = byte_match
        self.offset = offset
        self.extension = extension
        self.mime_type = mime_type
        self.name = name

    def _asdict(self):
        return {
            "byte_match": self.byte_match,
            "offset": self.offset,
            "extension": self.extension,
            "mime_type": self.mime_type,
            "name": self.name,
        }

class PureMagicWithConfidence(PureMagic):
    # Only the new attribute is declared; "name" is already a slot on PureMagic
    __slots__ = ["confidence"]

    def __init__(self, byte_match, offset, extension, mime_type, name, confidence):
        super().__init__(byte_match, offset, extension, mime_type, name)
        self.confidence = confidence

vs current

from collections import namedtuple

PureMagic = namedtuple(
    "PureMagic",
    (
        "byte_match",
        "offset",
        "extension",
        "mime_type",
        "name",
    ),
)

PureMagicWithConfidence = namedtuple(
    "PureMagicWithConfidence",
    (
        "byte_match",
        "offset",
        "extension",
        "mime_type",
        "name",
        "confidence",
    ),
)

Named tuples still win: 42.329 seconds vs 43.922 seconds for the classes.
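The totals above include interpreter startup for every run. To compare only the cost of the two storage types, an in-process micro-benchmark along these lines could be used. It is a sketch, not the test that produced the numbers above: `TupleMagic`, `SlotMagic` and the `ARGS` values are illustrative names and data, not part of puremagic.

```python
import timeit
from collections import namedtuple

FIELDS = ("byte_match", "offset", "extension", "mime_type", "name")

# Illustrative stand-ins for the two candidate record types.
TupleMagic = namedtuple("TupleMagic", FIELDS)


class SlotMagic:
    __slots__ = FIELDS

    def __init__(self, byte_match, offset, extension, mime_type, name):
        self.byte_match = byte_match
        self.offset = offset
        self.extension = extension
        self.mime_type = mime_type
        self.name = name


# Illustrative field values only.
ARGS = (b"CD001", 32769, ".iso", "application/x-iso9660-image", "ISO image")


def build_tuples(n=1000):
    """Create n namedtuple records."""
    return [TupleMagic(*ARGS) for _ in range(n)]


def build_slots(n=1000):
    """Create n slotted class instances."""
    return [SlotMagic(*ARGS) for _ in range(n)]


if __name__ == "__main__":
    for label, func in (("namedtuple", build_tuples), ("slots", build_slots)):
        seconds = timeit.timeit(func, number=200)
        print(f"{label:>10}: {seconds:.4f} s for 200 x 1000 constructions")
```

This measures construction only; attribute access during matching could be timed the same way.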

NebularNerd commented 3 months ago

Speed-wise, I think it's much of a muchness; modern CPUs are fast enough that there's little difference to be made.

On low-power hardware there might be a more measurable difference, say on a Pi or a low-end x86 system where the sheer horsepower is lacking.

I was worried when I suggested Multi-Match or Regex searches that we would see a noticeable increase in search times. However, on my main desktop, whatever difference there is turns out to be negligible at worst.

Could multi-threading the searches be another way to speed up matching? Once the data is in memory, every thread can have a go at identifying it and add to the results pool. This may benefit lower-spec systems by utilising their cores rather than relying on sheer horsepower.
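As a sketch of what that might look like, the signature list can be split into chunks and each chunk checked against the in-memory data by a worker thread. Everything here is hypothetical: `SIGNATURES`, `match_chunk` and `threaded_match` are illustrative names, not puremagic APIs, and the table is a tiny hand-written sample. Note that in standard CPython the global interpreter lock serializes pure-Python byte comparisons, so this layout may well not be faster than a plain loop.

```python
from concurrent.futures import ThreadPoolExecutor

# Tiny illustrative table of (magic bytes, offset, extension).
SIGNATURES = [
    (b"\x89PNG", 0, ".png"),
    (b"GIF8", 0, ".gif"),
    (b"%PDF", 0, ".pdf"),
    (b"PK\x03\x04", 0, ".zip"),
]


def match_chunk(data, chunk):
    """Return the extensions in this chunk whose magic bytes match data."""
    return [
        ext
        for magic, offset, ext in chunk
        if data[offset:offset + len(magic)] == magic
    ]


def threaded_match(data, signatures=SIGNATURES, workers=2):
    """Split the signatures across worker threads and pool the matches."""
    size = max(1, -(-len(signatures) // workers))  # ceiling division
    chunks = [signatures[i:i + size] for i in range(0, len(signatures), size)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = pool.map(match_chunk, [data] * len(chunks), chunks)
    # pool.map preserves chunk order, so matches stay in table order.
    return [ext for part in results for ext in part]


if __name__ == "__main__":
    print(threaded_match(b"%PDF-1.7 example"))
```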

NebularNerd commented 3 months ago

A thought I just had: would switching to a monolithic file cause issues of its own once it grows beyond a certain point, both from a code maintenance and a physical size standpoint?

cclauss commented 3 months ago

Almost all of the time in the benchmark above (https://github.com/cdgriffith/puremagic/issues/71#issuecomment-2106389548) is spent restarting Python over and over again.

Once Python is launched, performance is quite quick. See 0.6 sec for 74 string and file tests:

% python -m pytest --cov=puremagic test/

============================= test session starts ==============================
platform linux -- Python 3.12.3, pytest-8.2.0, pluggy-1.5.0
rootdir: /home/runner/work/puremagic/puremagic
plugins: cov-5.0.0
collected 74 items

test/test_common_extensions.py .....................                     [ 28%]
test/test_main.py .....................................................  [100%]

---------- coverage: platform linux, python 3.12.3-final-0 -----------
Name                    Stmts   Miss  Cover
-------------------------------------------
puremagic/__init__.py       2      0   100%
puremagic/__main__.py       0      0   100%
puremagic/main.py         167      0   100%
-------------------------------------------
TOTAL                     169      0   100%

============================== 74 passed in 0.60s ==============================
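To see how much of a repeated-launch benchmark is interpreter startup, the average cost of starting a fresh interpreter can be measured on its own. The sketch below assumes nothing about puremagic: `time_fresh_interpreters` is a hypothetical helper, and `import json` is used as a standard-library stand-in for a module import.

```python
import subprocess
import sys
import time


def time_fresh_interpreters(code, runs=5):
    """Average wall-clock seconds to start Python and run `code` once."""
    start = time.perf_counter()
    for _ in range(runs):
        subprocess.run(
            [sys.executable, "-c", code],
            check=True,
            stdout=subprocess.DEVNULL,
        )
    return (time.perf_counter() - start) / runs


if __name__ == "__main__":
    bare = time_fresh_interpreters("pass")
    with_import = time_fresh_interpreters("import json")
    print(f"bare startup:         {bare:.4f} s")
    print(f"startup + import:     {with_import:.4f} s")
    print(f"import cost (approx): {with_import - bare:.4f} s")
```

If puremagic is installed, passing `"import puremagic"` as the code string would give a rough figure for its load cost. CPython's `-X importtime` option prints a per-module breakdown of the same thing.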

cdgriffith commented 2 months ago

@cclauss Yes, this is specifically targeting fast multi-run speed, where every run includes full Python initialization and load.

There are many cases where this will be used from the command line, and it may be called repeatedly by other non-Python scripts, much like the file command, so we want it to be faster.